Manifold Learning: The Price of Normalization

We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims.


1 Introduction

Many seemingly complex systems described by high-dimensional data sets are in fact governed by a surprisingly low number of parameters. Revealing the low-dimensional representation of such high-dimensional data sets not only leads to a more compact description of the data, but also enhances our understanding of the system. Dimension-reducing algorithms attempt to simplify the system’s representation without losing significant structural information. Various dimension-reduction algorithms were developed recently to perform embeddings for manifold-based data sets. These include the following algorithms: Locally Linear Embedding (LLE, Roweis and Saul, 2000), Isomap (Tenenbaum et al., 2000), Laplacian Eigenmaps (LEM, Belkin and Niyogi, 2003), Local Tangent Space Alignment (LTSA, Zhang and Zha, 2004), Hessian Eigenmap (HLLE, Donoho and Grimes, 2004), Semi-definite Embedding (SDE, Weinberger and Saul, 2006) and Diffusion Maps (DFM, Coifman and Lafon, 2006).

These manifold-learning algorithms compute an embedding for some given input. It is assumed that this input lies on a low-dimensional manifold, embedded in some high-dimensional space. Here a manifold is defined as a topological space that is locally equivalent to a Euclidean space. It is further assumed that the manifold is the image of a low-dimensional domain. In particular, the input points are the image of a sample taken from the domain. The goal of the manifold-learning algorithms is to recover the original domain structure, up to some scaling and rotation. The non-linearity of these algorithms allows them to reveal the domain structure even when the manifold is not linearly embedded.

The central question that arises when considering the output of a manifold-learning algorithm is whether the algorithm reveals the underlying low-dimensional structure of the manifold. The answer to this question is not simple. First, one should define what “revealing the underlying low-dimensional description of the manifold” actually means. Ideally, one could measure the degree of similarity between the output and the original sample. However, the original low-dimensional data representation is usually unknown. Nevertheless, if the low-dimensional structure of the data is known in advance, one would expect it to be approximated by the dimension-reducing algorithm, at least up to a rotation, translation, and global scaling factor. Furthermore, it would be reasonable to expect the algorithm to succeed in recovering the original sample’s structure asymptotically, namely, when the number of input points tends to infinity. Finally, one would hope that the algorithm would be robust in the presence of noise.

Previous papers have addressed the central question posed earlier. Zhang and Zha (2004) presented some bounds on the error estimation of the local neighborhoods for LTSA. However, their analysis says nothing about the global embedding. Huo and Smith (2006) proved that, asymptotically, LTSA recovers the original sample up to an affine transformation. They assume in their analysis that the level of noise tends to zero when the number of input points tends to infinity. Bernstein et al. (2000) proved that, asymptotically, the embedding given by the Isomap algorithm (Tenenbaum et al., 2000) recovers the geodesic distances between points on the manifold.

In this paper we develop theoretical results regarding the performance of a class of manifold-learning algorithms, which includes the following five algorithms: Locally Linear Embedding (LLE), Laplacian Eigenmap (LEM), Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps (DFM).

We refer to this class of algorithms as the normalized-output algorithms. The normalized-output algorithms share a common scheme for recovering the domain structure of the input data set. This scheme is constructed in three steps. In the first step, the local neighborhood of each point is found. In the second step, a description of these neighborhoods is computed. In the third step, a low-dimensional output is computed by solving some convex optimization problem under some normalization constraints. A detailed description of the algorithms is given in Section 2.

In Section 3 we discuss informally the criteria for determining the success of manifold-learning algorithms. We show that one should not expect the normalized-output algorithms to recover geodesic distances or local structures. A more reasonable criterion for success is a high degree of similarity between the output of the algorithms and the original sample, up to some affine transformation; the definition of similarity will be discussed later. We demonstrate that under certain circumstances, this high degree of similarity does not occur. In Section 4 we find necessary conditions for the successful performance of LEM and DFM on the two-dimensional grid. This section serves as an explanatory introduction to the more general analysis that appears in Section 5. Some of the ideas that form the basis of the analysis in Section 4 were discussed independently by both Gerber et al. (2007) and ourselves (Goldberg et al., 2007). Section 5 finds necessary conditions for the successful performance of all the normalized-output algorithms on general two-dimensional manifolds. In Section 6 we discuss the performance of the algorithms in the asymptotic case. Concluding remarks appear in Section 7. The detailed proofs appear in the Appendix.

Our paper has two main results. First, we give well-defined necessary conditions for the successful performance of the normalized-output algorithms. Second, we show that there exist simple manifolds that do not fulfill the necessary conditions for the success of the algorithms. For these manifolds, the normalized-output algorithms fail to generate output that recovers the structure of the original sample. We show that these results hold asymptotically for LEM and DFM. Moreover, when noise, even of small variance, is introduced, LLE, LTSA, and HLLE will fail asymptotically on some manifolds. Throughout the paper, we present numerical results that demonstrate our claims.

2 Description of the Normalized-output Algorithms

In this section we briefly describe the normalized-output algorithms. The presentation of these algorithms is not in the form presented by the respective authors. The form used in this paper emphasizes the similarities between the algorithms and is better suited for further derivations. In Appendix A.1 we show the equivalence between our representation of the algorithms and the representations that appear in the original papers.

Let X = [x_1, …, x_N], x_i ∈ ℝ^D, be the input data, where D is the dimension of the ambient space and N is the size of the sample. The normalized-output algorithms attempt to recover the underlying structure of the input data in three steps.

In the first step, the normalized-output algorithms assign neighbors to each input point x_i based on the Euclidean distances in the high-dimensional space. (The neighborhoods are not mentioned explicitly by Coifman and Lafon (2006); however, since a sparse optimization problem is considered, it is assumed implicitly that neighborhoods are defined; see Sec. 2.7 therein.) This can be done, for example, by choosing all the input points in an ε-ball around x_i, or alternatively by choosing x_i’s K-nearest-neighbors. The neighborhood of x_i is given by the matrix X_i = [x_{i,1}, …, x_{i,K}], where x_{i,j} are the neighbors of x_i. Note that K can be a function of i, the index of the neighborhood, yet we omit this index to simplify the notation. For each neighborhood, we define the radius of the neighborhood as

 r(i) = max_{j,k ∈ {0,…,K}} ‖x_{i,j} − x_{i,k}‖ (1)

where we define x_{i,0} = x_i. Finally, we assume throughout this paper that the neighborhood graph is connected.
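The first step can be sketched in code. The following is an illustrative implementation only (the function name `knn_neighborhoods` and the brute-force distance computation are our own choices, not part of any of the algorithms' reference code); it returns, for each point, its K nearest neighbors and the neighborhood radius r(i) of Eq. 1, computed over the K+1 points x_{i,0} = x_i, x_{i,1}, …, x_{i,K}:

```python
import numpy as np

def knn_neighborhoods(X, K):
    """For each row x_i of X (an N x D array), return the indices of its
    K nearest neighbors and the radius r(i) of Eq. 1, i.e. the largest
    pairwise distance among x_{i,0} = x_i and its K neighbors."""
    N = X.shape[0]
    # brute-force pairwise Euclidean distances (fine for a small sketch)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    neighbors = np.empty((N, K), dtype=int)
    radius = np.empty(N)
    for i in range(N):
        order = np.argsort(dist[i])          # order[0] is i itself
        idx = order[1:K + 1]                 # the K nearest neighbors
        neighbors[i] = idx
        pts = np.vstack([X[i], X[idx]])      # x_{i,0}, ..., x_{i,K}
        diffs = pts[:, None, :] - pts[None, :, :]
        radius[i] = np.sqrt((diffs ** 2).sum(-1)).max()
    return neighbors, radius
```

An ε-ball variant would simply threshold the same distance matrix instead of sorting it.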

In the second step, the normalized-output algorithms compute a description of the local neighborhoods that were found in the previous step. The description of the i-th neighborhood is given by some weight matrix W_i. The matrices W_i for the different algorithms are presented below.

• LEM and DFM: W_i is a K × (K+1) matrix,

 W_i = ( w_{i,1}^{1/2}  −w_{i,1}^{1/2}  0        ⋯        0
         w_{i,2}^{1/2}  0        −w_{i,2}^{1/2}  ⋱        ⋮
         ⋮              ⋮         ⋱        ⋱        0
         w_{i,K}^{1/2}  0         ⋯        0   −w_{i,K}^{1/2} ).

For LEM the choice w_{i,j} ≡ 1 is natural, yet it is also possible to define the weights as w_{i,j} = e^{−‖x_i − x_{i,j}‖²/ε}, where ε is the width parameter of the kernel. For the case of DFM,

 w_{i,j} = k_ε(x_i, x_{i,j}) / (q_ε(x_i)^α q_ε(x_{i,j})^α) , (2)

where k_ε is some rotation-invariant kernel, q_ε(x_i) = Σ_j k_ε(x_i, x_{i,j}), and ε is again a width parameter. We will use α = 1 in the normalization of the diffusion kernel, yet other values of α can be considered (see details in Coifman and Lafon, 2006). For both LEM and DFM, we define the matrix D to be a diagonal N × N matrix where D_{ii} = Σ_j w_{i,j}.
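The structure of W_i for LEM/DFM can be made concrete with a small sketch (the function name is ours). Note that when W_i multiplies the stacked neighborhood [y_i; y_{i,1}; …; y_{i,K}], the squared Frobenius norm reduces to the familiar weighted sum Σ_j w_{i,j}‖y_i − y_{i,j}‖²:

```python
import numpy as np

def lem_weight_matrix(w):
    """Build the K x (K+1) matrix W_i of LEM/DFM from the neighbor weights
    w = (w_{i,1}, ..., w_{i,K}): row j is (w_{i,j}^{1/2}, 0, ...,
    -w_{i,j}^{1/2}, ..., 0), with the negative entry in position j+1."""
    K = len(w)
    s = np.sqrt(np.asarray(w, dtype=float))
    Wi = np.zeros((K, K + 1))
    Wi[:, 0] = s
    Wi[np.arange(K), np.arange(1, K + 1)] = -s
    return Wi
```

With Y_i = [y_i, y_{i,1}, …, y_{i,K}]′ stacked as rows, ‖W_i Y_i‖²_F = Σ_j w_{i,j}‖y_i − y_{i,j}‖².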

• LLE: W_i is a 1 × (K+1) matrix,

 W_i = (1, −w_{i,1}, …, −w_{i,K}).

The weights w_{i,j} are chosen so that x_i can be best linearly reconstructed from its neighbors. Specifically, the weights minimize the reconstruction error function

 Δ_i(w_{i,1}, …, w_{i,K}) = ‖x_i − Σ_j w_{i,j} x_{i,j}‖² (3)

under the constraint Σ_j w_{i,j} = 1. In the case where there is more than one solution that minimizes Δ_i, regularization is applied to force a unique solution (for details, see Saul and Roweis, 2003).
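The constrained minimization of Eq. 3 has a standard closed form via the local Gram matrix; the sketch below (function name and ridge-style regularization constant are our own choices, following the regularization idea of Saul and Roweis, 2003) solves G w = 1 and rescales so that the weights sum to one:

```python
import numpy as np

def lle_weights(xi, Xi, reg=1e-9):
    """Minimize Eq. 3 subject to sum_j w_{i,j} = 1.
    xi: the point x_i (length-D vector); Xi: K x D array of its neighbors.
    A small ridge term makes the solution unique when G is singular."""
    K = Xi.shape[0]
    Z = xi - Xi                       # differences x_i - x_{i,j}
    G = Z @ Z.T                       # G_{jk} = <x_i - x_{i,j}, x_i - x_{i,k}>
    G = G + reg * max(np.trace(G), 1.0) * np.eye(K)
    w = np.linalg.solve(G, np.ones(K))
    return w / w.sum()                # enforce the sum-to-one constraint
```
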

• LTSA: W_i is a (K+1) × (K+1) matrix,

 W_i = (I − P_iP_i′)H.

Here, let U_iL_iV_i′ be the SVD of the centered neighborhood [x_i, X_i]H, where H = I − (1/(K+1))11′ is the (K+1) × (K+1) centering matrix and 1 is a vector of ones (for details about SVD, see, for example, Golub and Loan, 1983). Let P_i be the (K+1) × d matrix that holds the first d columns of V_i, where d is the output dimension. See also Huo and Smith (2006) regarding this representation of the algorithm.
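The LTSA weight matrix can be sketched as follows (an illustrative implementation; we store the K+1 neighborhood points as rows, so the top-d left singular vectors of the row-centered neighborhood play the role of P_i). For a neighborhood that is exactly d-dimensional, W_i annihilates it, which is also the property used in Section 4.3:

```python
import numpy as np

def ltsa_weight_matrix(Xi, d):
    """W_i = (I - P_i P_i') H for a neighborhood Xi whose K+1 rows are
    x_{i,0} = x_i and its K neighbors; P_i spans the local tangent space."""
    k1 = Xi.shape[0]                          # K + 1 points
    H = np.eye(k1) - np.ones((k1, k1)) / k1   # centering matrix
    U, _, _ = np.linalg.svd(H @ Xi, full_matrices=False)
    P = U[:, :d]                              # local tangent-space basis
    return (np.eye(k1) - P @ P.T) @ H
```

For points lying exactly on a d-dimensional affine subspace, W_i Xi = 0, so a flat sample incurs zero cost.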

• HLLE: W_i is a (d(d+1)/2) × (K+1) matrix,

 W_i = (0, H_i)

where 0 is a vector of zeros and H_i is the (d(d+1)/2) × K Hessian estimator.
The estimator can be calculated as follows. Let U_iL_iV_i′ be the SVD of the centered and transposed neighborhood (X_i − x̄_i1′)′, so that the columns U_i^(1), …, U_i^(d) hold the tangent coordinates of the K neighbors. Let

 M_i = [1, U_i^(1), …, U_i^(d), diag(U_i^(1)U_i^(1)′), diag(U_i^(1)U_i^(2)′), …, diag(U_i^(d)U_i^(d)′)],

where the operator diag returns a column vector formed from the diagonal elements of the matrix. Let M̃_i be the result of the Gram-Schmidt orthonormalization on M_i. Then H_i is defined as the transpose of the last d(d+1)/2 columns of M̃_i.
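The Hessian-estimator recipe above translates almost line by line into code; this is an illustrative sketch (function name ours, QR used in place of explicit Gram-Schmidt, which yields the same orthonormalization up to signs). By construction H_i is orthogonal to the constant and linear columns of M_i:

```python
import numpy as np

def hlle_hessian_estimator(Xi, d):
    """Hessian estimator H_i for a neighborhood Xi (K rows = the neighbors).
    Returns a (d(d+1)/2) x K matrix."""
    K = Xi.shape[0]
    Xc = Xi - Xi.mean(axis=0)                    # center the K neighbors
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    U = U[:, :d]                                 # tangent coordinates
    cols = [np.ones(K)] + [U[:, a] for a in range(d)]
    for a in range(d):
        for b in range(a, d):
            cols.append(U[:, a] * U[:, b])       # diag(U^(a) U^(b)') elementwise
    M = np.column_stack(cols)
    Q, _ = np.linalg.qr(M)                       # Gram-Schmidt orthonormalization
    h = d * (d + 1) // 2
    return Q[:, -h:].T                           # last d(d+1)/2 columns, transposed
```
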

The third step of the normalized-output algorithms is to find a set of points y_1, …, y_N ∈ ℝ^d, where d is the dimension of the manifold. The output Y = [y_1, …, y_N]′ is found by minimizing a convex function under some normalization constraints, as follows. Let Y be any N × d matrix. We define the i-th neighborhood matrix Y_i = [y_i, y_{i,1}, …, y_{i,K}]′ using the same pairs of indices as in X_i. The cost function for all of the normalized-output algorithms is given by

 Φ(Y) = Σ_{i=1}^N φ(Y_i) = Σ_{i=1}^N ‖W_iY_i‖²_F , (4)

under the normalization constraints

 Y′DY = I, Y′D1 = 0 for LEM and DFM ;  Cov(Y) = I, Y′1 = 0 for LLE, LTSA and HLLE, (5)

where ‖·‖_F stands for the Frobenius norm, and W_i is algorithm-dependent.

Define the output matrix Y to be the N × d matrix that achieves the minimum of Φ under the normalization constraints of Eq. 5 (Y is defined up to rotation). Then we have the following: the embeddings of LEM and LLE are given by the corresponding output matrices Y; the embeddings of LTSA and HLLE are likewise given by the corresponding output matrices; and the embedding of DFM is given by a linear transformation of Y, as discussed in Appendix A.1. The discussion of the algorithms’ output in this paper holds for any affine transformation of the output (see Section 3). Thus, without loss of generality, we prefer to discuss the output matrix Y directly, rather than the different embeddings. This allows a unified framework for all five normalized-output algorithms.
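The three steps combine into a single quadratic form: Eq. 4 equals trace(Y′MY) for a positive semidefinite matrix M assembled from the local W_i, so minimization under the constraints reduces to an eigenproblem. The sketch below (function name and dense eigensolver are our own simplifications) handles the Cov(Y) = I, Y′1 = 0 case of LLE/LTSA/HLLE, and assumes M1 = 0 so that the constant vector is the trivial bottom eigenvector:

```python
import numpy as np

def normalized_output_embedding(W_list, nbr_idx, N, d):
    """Assemble Phi(Y) = sum_i ||W_i Y_i||_F^2 = trace(Y' M Y) and minimize
    under Cov(Y) = I, Y'1 = 0: take the d eigenvectors of M with smallest
    nonzero eigenvalues. W_list[i] acts on the rows of Y listed in nbr_idx[i]."""
    M = np.zeros((N, N))
    for Wi, idx in zip(W_list, nbr_idx):
        M[np.ix_(idx, idx)] += Wi.T @ Wi
    vals, vecs = np.linalg.eigh(M)
    Y = vecs[:, 1:d + 1]          # skip the trivial (constant) eigenvector
    return Y * np.sqrt(N)         # rescale so that Cov(Y) = I
```

A production implementation would use a sparse matrix and a sparse eigensolver; the structure of the computation is the same.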

3 Embedding quality

In this section we discuss possible definitions of “successful performance” of manifold-learning algorithms. To open our discussion, we present a numerical example. We chose to work with LTSA rather arbitrarily. Similar results can be obtained using the other algorithms.

The example we consider is a uniform sample from a two-dimensional strip, shown in Fig. 1A. Note that in this example the input data is identical to the original data. Fig. 1B presents the output of LTSA on the input in Fig. 1A. The most obvious difference between input and output is that while the input is a strip, the output is roughly square. While this may seem to be of no importance, note that it means that the algorithm, like all the normalized-output algorithms, does not preserve geodesic distances even up to a scaling factor. By definition, the geodesic distance between two points on a manifold is the length of the shortest path on the manifold between the two points. Preservation of geodesic distances is particularly relevant when the manifold is isometrically embedded. In this case, assuming the domain is convex, the geodesic distance between any two points on the manifold is equal to the Euclidean distance between the corresponding domain points. Geodesic distances are conserved, for example, by the Isomap algorithm (Tenenbaum et al., 2000).

Figs. 1E and 1F present closeups of Figs. 1A and 1B, respectively. Here, a less obvious phenomenon is revealed: the structure of the local neighborhood is not preserved by LTSA. By local structure we refer to the angles and distances (at least up to a scale) between all points within each local neighborhood. Mappings that preserve local structures up to a scale are called conformal mappings (see for example de Silva and Tenenbaum, 2003; Sha and Saul, 2005). In addition to the distortion of angles and distances, the K-nearest-neighbors of a given point on the manifold do not necessarily correspond to the K-nearest-neighbors of the respective output point, as shown in Figs. 1E and 1F. Accordingly, we conclude that the original structure of the local neighborhoods is not necessarily preserved by the normalized-output algorithms.

The above discussion highlights the fact that one cannot expect the normalized-output algorithms to preserve geodesic distances or local neighborhood structure. However, it seems reasonable to demand that the output of the normalized-output algorithms resemble an affine transformation of the original sample. In fact, the output presented in Fig. 1B is an affine transformation of the input, which is the original sample, presented in Fig. 1A. A formal similarity criterion based on affine transformations is given by Huo and Smith (2006). In the following, we will claim that a normalized-output algorithm succeeds (or fails) based on the existence (or lack thereof) of resemblance between the output and the original sample, up to an affine transformation.

Fig. 1D presents the output of LTSA on a noisy version of the input, shown in Fig. 1C. In this case, the algorithm prefers an output that is roughly a one-dimensional curve embedded in ℝ². While this result may seem incidental, the results of all the other normalized-output algorithms for this example are essentially the same.

Using the affine transformation criterion, we can state that LTSA succeeds in recovering the underlying structure of the strip shown in Fig. 1A. However, in the case of the noisy strip shown in Fig. 1C, LTSA fails to recover the structure of the input. We note that all the other normalized-output algorithms perform similarly.

For practical purposes, we will now generalize the definition of failure of the normalized-output algorithms. This definition is more useful when it is necessary to decide whether an algorithm has failed, without actually computing the output. This is useful, for example, when considering the outputs of an algorithm for a class of manifolds.

We now present the generalized definition of failure of the algorithms. Let X be the original sample. Assume that the input is given by ψ(X), where ψ is some smooth function and D is the dimension of the input. Let Ỹ be an affine transformation of the original sample X, such that the normalization constraints of Eq. 5 hold. Note that Ỹ is algorithm-dependent, and that for each algorithm, Ỹ is unique up to rotation and translation. When the algorithm succeeds it is expected that the output will be similar to a normalized version of X, namely to Ỹ. Let Z be any matrix that satisfies the same normalization constraints. We say that the algorithm has failed if Φ(Z) < Φ(Ỹ) and Z is substantially different from Ỹ, and hence also from X. In other words, we say that the algorithm has failed when a substantially different embedding Z has a lower cost than the most appropriate embedding Ỹ. A precise definition of “substantially different” is not necessary for the purposes of this paper. It is enough to consider Z substantially different from Ỹ when Z is of lower dimension than Ỹ, as in Fig. 1D.

We emphasize that the matrix Z is not necessarily similar to the output of the algorithm in question. It is a mathematical construction that shows when the output of the algorithm is not likely to be similar to Ỹ, the normalized version of the true manifold structure. The following lemma shows that if Φ(Z) < Φ(Ỹ), a similar inequality also holds for any small perturbation of Ỹ. Hence, it is not likely that an output that resembles Ỹ will occur when Φ(Z) < Φ(Ỹ) and Z is substantially different from Ỹ.

Lemma 3.1

Let Y be an N × d matrix. Let Ỹ = Y + εR be a perturbation of Y, where R is an N × d matrix of bounded norm and ε > 0 is small. Let S be the maximum number of neighborhoods to which a single input point belongs. Then for LLE with positive weights w_{i,j}, LEM, DFM, LTSA, and HLLE, we have

 Φ(Ỹ) > (1 − 4ε)Φ(Y) − 4εC_aS,

where C_a is a constant that depends on the algorithm.

The use of positive weights in LLE is discussed in Saul and Roweis (2003, Section 5); a similar result for LLE with general weights can be obtained if one allows a bound on the values of the weights. The proof of Lemma 3.1 is given in Appendix A.2.

4 Analysis of the two-dimensional grid

In this section we analyze the performance of LEM on the two-dimensional grid. In particular, we argue that LEM cannot recover the structure of a two-dimensional grid in the case where the aspect ratio of the grid is greater than 2. Instead, LEM prefers a one-dimensional curve in ℝ². Implications also follow for DFM, as explained in Section 4.3, followed by a discussion of the other normalized-output algorithms. Finally, we present empirical results that demonstrate our claims.

In Section 5 we prove a more general statement regarding any two-dimensional manifold. Necessary conditions for successful performance of the normalized-output algorithms on such manifolds are presented. However, the analysis in this section is important in itself for two reasons. First, the conditions for the success of LEM on the two-dimensional grid are more limiting. Second, the analysis is simpler and points out the reasons for the failure of all the normalized-output algorithms when the necessary conditions do not hold.

4.1 Possible embeddings of a two-dimensional grid

We consider the input data set to be the two-dimensional grid X = {(i, j)′ : i = −m, …, m; j = −q, …, q}, where q ≤ m. We denote N = (2m+1)(2q+1). For convenience, we regard X as an N × 2 matrix, where N is the number of points in the grid. Note that in this specific case, the original sample and the input are the same.

In the following we present two different embeddings, Y and Z. Embedding Y is the grid itself, normalized so that Cov(Y) = I. Embedding Z collapses each column of the grid to a point and positions the resulting points in the two-dimensional plane in a way that satisfies the constraint Cov(Z) = I (see Fig. 2 for both). The embedding Z is a curve in ℝ² and clearly does not preserve the original structure of the grid.

We first define the embeddings more formally. We start by defining Ŷ, the only linear transformation of X (up to rotation) that satisfies the conditions Ŷ′DŶ = I and Ŷ′D1 = 0, which are the normalization constraints for LEM (see Eq. 5). However, the embedding Ŷ depends on the matrix D, which in turn depends on the choice of neighborhoods. Recall that the matrix D is a diagonal matrix, where D_{ii} equals the number of neighbors of the i-th point. Choose r to be the radius of the neighborhoods. Then, for all inner points, the number of neighbors is a constant, which we denote by K. We shall call all points with less than K neighbors boundary points. Note that the definition of boundary points depends on the choice of r. For inner points of the grid we have D_{ii} = K. Thus, when the number of boundary points is negligible we have D ≈ KI.

We define Y = X diag(1/σ, 1/τ), where σ and τ are given below. Note that Cov(Y) = I, and for D = KI, Y is proportional to Ŷ. In this section we analyze the embedding Y instead of Ŷ, thereby avoiding the dependence on the matrix D and hence simplifying the notation. This simplification does not significantly change the problem and does not affect the results we present. Similar results are obtained in the next section for general two-dimensional manifolds, using the exact normalization constraints (see Section 5.2).

Note that Y can be described as the set of points y_{ij} = (i/σ, j/τ)′, where i = −m, …, m and j = −q, …, q. The constants σ and τ ensure that the normalization constraint Cov(Y) = I holds. Straightforward computation (see Appendix A.3) shows that

 σ² = (m+1)m/3 ;  τ² = (q+1)q/3 . (6)

The definition of the embedding Z is as follows:

 z_{ij} = (i/σ, −2i/ρ − z̄^(2))′ for i ≤ 0 ;  z_{ij} = (i/σ, 2i/ρ − z̄^(2))′ for i ≥ 0 , (7)

where z̄^(2) ensures that the mean of the second coordinate is zero, and σ (the same as before; see below) and ρ are chosen so that the sample variance of each of the two coordinates of Z is equal to one. The symmetry of Z about the origin implies that the two coordinates of Z are uncorrelated, hence the normalization constraint Cov(Z) = I holds. σ is as defined in Eq. 6, since the first coordinate of Z is identical to the first coordinate of Y (both defined by the first coordinate of X). Finally, note that the definition of Z does not depend on j.
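The two candidate embeddings are easy to construct explicitly; the sketch below (our own construction, with the centering z̄^(2) and the scale ρ fixed implicitly by standardizing the second coordinate) also lets one check the normalization numerically:

```python
import numpy as np

def grid_embeddings(m, q):
    """Y_ij = (i/sigma, j/tau) with sigma^2 = m(m+1)/3, tau^2 = q(q+1)/3
    (Eq. 6), and the collapsed embedding Z of Eq. 7 with slope +-2/rho."""
    sigma = np.sqrt(m * (m + 1) / 3.0)
    tau = np.sqrt(q * (q + 1) / 3.0)
    I, J = np.meshgrid(np.arange(-m, m + 1), np.arange(-q, q + 1), indexing="ij")
    i, j = I.ravel(), J.ravel()
    Y = np.column_stack([i / sigma, j / tau])   # the normalized grid itself
    v = 2.0 * np.abs(i).astype(float)           # piecewise-linear second coordinate
    v = v - v.mean()                            # subtract z-bar^(2)
    Z = np.column_stack([i / sigma, v / v.std()])  # dividing by the std fixes rho
    return Y, Z
```

By symmetry of the grid, the first coordinate of Z (odd in i) and the second (even in i) are uncorrelated, so Cov(Z) = I as required.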

4.2 Main result for LEM on the two-dimensional grid

We estimate Φ(Y) by Nφ(Y_{ij}) (see Eq. 4), where y_{ij} is an inner point of the grid and Y_{ij} is the neighborhood of y_{ij}; likewise, we estimate Φ(Z) by Nφ(Z_{ij}) for an inner point z_{ij}. For all inner points, the value of φ(Y_{ij}) is equal to some constant value. For boundary points, φ(Y_{ij}) is bounded by this constant multiplied by some factor that depends only on the number of neighbors. Hence, for large m and q, the difference between Φ(Y) and Nφ(Y_{ij}) is negligible.

The main result of this section states:

Theorem 4.1

Let y_{ij} be an inner point and let the ratio m/q be greater than 2. Then

 φ(Y_{ij}) > φ(Z_{ij})

for a neighborhood radius r that is small relative to the grid dimensions, or similarly, for K-nearest neighborhoods with K = 4, 8, 12.

This indicates that for aspect ratios greater than 2, the mapping Z, which is essentially one-dimensional, is preferred to Y, which is a linear transformation of the grid. The case of general r-ball neighborhoods is discussed in Appendix A.4 and indicates that similar results should be expected.

The proof of the theorem is as follows. It can be shown analytically (see Fig. 3) that

 φ(Y_{ij}) = F(K)(1/σ² + 1/τ²) , (8)

where

 F(4) = 2 ;  F(8) = 6 ;  F(12) = 14 . (9)

For higher K, F(K) can be approximated for any r-ball neighborhood (see Appendix A.4).

It can be shown (see Fig. 3) that

 φ(Z_{ij}) = F̃(K)(1/σ² + 4/ρ²) , (10)

where F̃(K) = F(K) for K = 4, 8, 12. For higher K, it can be shown (see Appendix A.4) that F̃(K) ≤ F(K) for any r-ball neighborhood.

A careful computation (see Appendix A.5) shows that

 ρ > σ , (11)

and therefore

 φ(Z_{ij}) < 5F(K)/σ² . (12)

Assume that m/q > 2. Since both m and q are integers, we have that m ≥ 2q + 1. Hence, using Eq. 6 we have

 σ² = m(m+1)/3 > 4q(q+1)/3 = 4τ² .

Combining this result with Eqs. 8 and 12 we have

 m/q > 2 ⟹ φ(Y_{ij}) = F(K)(1/σ² + 1/τ²) > 5F(K)/σ² > φ(Z_{ij}) ,

which proves Theorem 4.1.
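The ingredients of this proof can be checked numerically for the 4-neighbor case. The following sketch is our own construction: the mean shift z̄^(2) is omitted since only differences between embedded points enter φ, and ρ is set to an arbitrary value above σ, as guaranteed by Eq. 11. It verifies F(4) = F̃(4) = 2 and the conclusion of the theorem for a grid with m/q = 4:

```python
import numpy as np

m, q = 20, 5                                   # aspect ratio m/q = 4 > 2
sigma2 = m * (m + 1) / 3.0                     # Eq. 6
tau2 = q * (q + 1) / 3.0
rho = np.sqrt(sigma2) * 1.1                    # any rho > sigma (Eq. 11)

def y(i, j): return np.array([i / np.sqrt(sigma2), j / np.sqrt(tau2)])
def z(i, j): return np.array([i / np.sqrt(sigma2), 2 * abs(i) / rho])

i, j = 3, 1                                    # an inner point, away from the fold
nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
phi_Y = sum(np.sum((y(i, j) - y(a, b)) ** 2) for a, b in nbrs)
phi_Z = sum(np.sum((z(i, j) - z(a, b)) ** 2) for a, b in nbrs)
assert np.isclose(phi_Y, 2 * (1 / sigma2 + 1 / tau2))      # Eq. 8 with F(4) = 2
assert np.isclose(phi_Z, 2 * (1 / sigma2 + 4 / rho ** 2))  # Eq. 10 with F~(4) = 2
assert phi_Y > phi_Z                           # m/q > 2: Z is cheaper
```

Note that the vertical neighbors of z_{ij} coincide with z_{ij} itself (the columns are collapsed), which is exactly why the collapsed embedding is cheap.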

4.3 Implications to other algorithms

We start with implications regarding DFM. There are two main differences between LEM and DFM. The first difference is the choice of the kernel. LEM chooses w_{i,j} ≡ 1, which can be referred to as the “window” kernel (a Gaussian weight function was also considered by Belkin and Niyogi, 2003). DFM allows a more general rotation-invariant kernel, which includes the “window” kernel of LEM. The second difference is that DFM renormalizes the weights (see Eq. 2). However, for all the inner points of the grid with neighbors that are also inner points, the renormalization factor is a constant. Therefore, if DFM chooses the “window” kernel, it is expected to fail, like LEM. In other words, when DFM using the “window” kernel is applied to a grid with aspect ratio greater than 2, DFM will prefer the embedding Z over the embedding Y (see Fig. 2). For a more general choice of kernel, the discussion in Appendix A.4 indicates that a similar failure should occur. This is because the relation between the estimates of φ(Y_{ij}) and φ(Z_{ij}) presented in Eqs. 8 and 10 holds for any rotation-invariant kernel (see Appendix A.4). This observation is also evident in numerical examples, as shown in Figs. 4 and 5.

In the cases of LLE with no regularization, LTSA, and HLLE, it can be shown that Φ(Y) = 0 on the two-dimensional grid. Indeed, for LTSA and HLLE, the weight matrix W_i projects on a space that is perpendicular to the subspace spanned by the centered neighborhood X_i, thus W_iX_i = 0. Since Y is a linear transformation of X, we have W_iY_i = 0 as well, and, therefore, Φ(Y) = 0. For the case of LLE with no regularization, when the neighbors of each point affinely span the plane, each point can be reconstructed perfectly from its neighbors, and the result follows. Hence, a linear transformation of the original data should be the preferred output. However, the fact that Φ(Y) = 0 relies heavily on the assumption that both the input and the output are of the same dimension (see Theorem 5.1 for manifolds embedded in higher dimensions), which is typically not the case in dimension-reducing applications.

4.4 Numerical results

For the following numerical results, we used the Matlab implementation written by the respective algorithms’ authors as provided by Wittman (retrieved Jan. 2007) (a minor correction was applied to the code of HLLE).

We ran the LEM algorithm on data sets with aspect ratios above and below 2. We present results for both a grid and a uniformly sampled strip. The neighborhoods were chosen using K-nearest neighbors for several values of K. We present the results for one representative value of K; the results for the other values are similar. The results for the grid and the random sample are presented in Figs. 4 and 5, respectively.

We ran the DFM algorithm on the same data sets. We used the normalization constant α = 1 and several values of the kernel width ε; the results for the different widths are similar. The results for the grid and the random sample are presented in Figures 4 and 5, respectively.

Both examples clearly demonstrate that for aspect ratios sufficiently greater than 2, both LEM and DFM prefer a solution that collapses the input data to a nearly one-dimensional output in ℝ². This is exactly as expected, based on our theoretical arguments.

Finally, we ran LLE, HLLE, and LTSA on the same data sets. In the case of the grid, both LLE and LTSA (roughly) recovered the grid shape for all choices of K considered, while HLLE failed to produce any output due to large memory requirements. In the case of the random sample, both LLE and HLLE succeeded for the larger values of K but failed for the smaller ones; LTSA succeeded for most values of K but failed for the smallest. The reasons for the failure at lower values of K are not clear, but may be due to roundoff errors. In the case of LLE, the failure may also be related to the use of regularization in LLE’s second step.

5 Analysis for general two-dimensional manifolds

The aim of this section is to present necessary conditions for the success of the normalized-output algorithms on general two-dimensional manifolds embedded in high-dimensional space. We show how this result can be further generalized to manifolds of higher dimension. We demonstrate the theoretical results using numerical examples.

5.1 Two different embeddings for a two-dimensional manifold

We start with some definitions. Let X = [x_1, …, x_N]′ be the original sample. Without loss of generality, we assume that

 x̄ = 0 ;  Cov(X) ≡ Σ = diag(σ², τ²) .

As in Section 4, we assume that τ ≤ σ. Assume that the input for the normalized-output algorithms is given by ψ(X), where ψ is a smooth function and D is the dimension of the input. When the mapping ψ is an isometry, we expect Φ(X) to be small. We now take a close look at Φ(X).

 Φ(X) = Σ_{i=1}^N ‖W_iX_i‖²_F = Σ_{i=1}^N ‖W_iX_i^(1)‖² + Σ_{i=1}^N ‖W_iX_i^(2)‖² ,

where X_i^(j) is the j-th column of the neighborhood X_i. Define e_i^(j) = ‖W_iX_i^(j)‖², and note that e_i^(j) depends on the different algorithms through the definition of the matrices W_i. The quantity e_i^(j) is the portion of error obtained by using the j-th column of the i-th neighborhood when using the original sample as output. Denote ē^(j) = (1/N) Σ_i e_i^(j), the average error originating from the j-th column.

We define two different embeddings for X, following the logic of Sec. 4.1. Let

 Y = XΣ^{−1/2} (13)

be the first embedding. Note that Y is just the original sample up to a linear transformation that ensures that the normalization constraints Cov(Y) = I and Ȳ = 0 hold. Moreover, Y is the only transformation of X (up to rotation) that satisfies these conditions, which are the normalization constraints for LLE, HLLE, and LTSA. In Section 5.2 we discuss the modified embeddings for LEM and DFM.

The second embedding, Z, is given by

 z_i = (x_i^(1)/σ, −x_i^(1)/ρ − z̄^(2))′ for x_i^(1) < 0 ;  z_i = (x_i^(1)/σ, κx_i^(1)/ρ − z̄^(2))′ for x_i^(1) ≥ 0 . (14)

Here

 κ = (Σ_{i: x_i^(1)<0} (x_i^(1))²)^{1/2} (Σ_{i: x_i^(1)≥0} (x_i^(1))²)^{−1/2} (15)

ensures that the two branches of the piecewise-linear map contribute equally to the variance of the second coordinate, and z̄^(2) and ρ are chosen so that the sample mean and variance of the second coordinate of Z are equal to zero and one, respectively. We assume without loss of generality that κ ≥ 1.

Note that Z depends only on the first column of X. Moreover, each point z_i is just a linear transformation of x_i^(1). In the case of neighborhoods, the situation can be different. If the first column of the neighborhood X_i is either non-negative or non-positive, then Z_i is indeed a linear transformation of X_i^(1). However, if X_i^(1) has entries on both sides of zero, Z_i is not a linear transformation of X_i^(1). Denote by N_0 the set of indices of neighborhoods that are not linear transformations of the respective X_i^(1). The size of N_0 depends on the number of nearest neighbors K. Recall that for each neighborhood, we defined the radius r(i) (see Eq. 1). Define r_max = max_{i∈N_0} r(i), the maximum radius over the neighborhoods in N_0.
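The construction of Eqs. 14 and 15 can be sketched directly (an illustrative implementation; the function name is ours, and ρ is passed in rather than solved for, since only the branch structure and κ matter here):

```python
import numpy as np

def kappa_and_z2(x1, rho):
    """Second coordinate of the embedding Z of Eq. 14, built from the first
    column x1 of the original sample, with kappa as in Eq. 15."""
    neg, pos = x1[x1 < 0], x1[x1 >= 0]
    kappa = np.sqrt((neg ** 2).sum()) / np.sqrt((pos ** 2).sum())
    z2 = np.where(x1 < 0, -x1 / rho, kappa * x1 / rho)  # the two branches
    z2 = z2 - z2.mean()               # subtract z-bar^(2) so the mean is zero
    return kappa, z2
```

For a sample that is symmetric about zero, κ = 1 and the construction reduces to the grid embedding of Section 4.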

5.2 The embeddings for LEM and DFM

So far we have claimed that given the original sample X, we expect the output to resemble Y (see Eq. 13). However, Y does not satisfy the normalization constraints of Eq. 5 for the cases of LEM and DFM. Define Ŷ to be the only affine transformation of X (up to rotation) that satisfies the normalization constraints of LEM and DFM. When the original sample is given by X, we expect the output of LEM and DFM to resemble Ŷ. We note that unlike the matrix Y, which was defined in terms of the matrix X only, Ŷ depends also on the choice of neighborhoods through the matrix D that appears in the normalization constraints.

We define Ŷ more formally. Denote by X̂ the translation of X that ensures that X̂′D1 = 0. The matrix X̂′DX̂ is positive definite and therefore can be presented as X̂′DX̂ = QΣ̂Q′, where Q is a 2 × 2 orthogonal matrix and

 Σ̂ = diag(σ̂², τ̂²) ,

where σ̂ ≥ τ̂. Define Ŷ = X̂QΣ̂^{−1/2}; then Ŷ is the only affine transformation of X (up to rotation) that satisfies the normalization constraints of LEM and DFM; namely, we have Ŷ′DŶ = I and Ŷ′D1 = 0.

We define similarly to Eq. 14,

 \hat z_i = \begin{cases} \left( \dfrac{\hat x_i^{(1)}}{\hat\sigma},\; -\dfrac{\hat x_i^{(1)}}{\hat\rho} - \bar{\hat z}^{(2)} \right) & \hat x_i^{(1)} < 0 \\[2mm] \left( \dfrac{\hat x_i^{(1)}}{\hat\sigma},\; \dfrac{\hat\kappa \hat x_i^{(1)}}{\hat\rho} - \bar{\hat z}^{(2)} \right) & \hat x_i^{(1)} \ge 0, \end{cases}

where κ̂ is defined by Eq. 15 with respect to x̂^{(1)}, and ρ̂ and \bar{\hat z}^{(2)} are chosen so that Ẑ satisfies the normalization constraints of LEM and DFM.

A similar analysis to that of Y and Z can be performed for Ŷ and Ẑ. The same necessary conditions for success are obtained, with κ, ρ, and τ replaced by κ̂, ρ̂, and τ̂, respectively. In the case where the distribution of the original points is uniform, the ratio σ̂/τ̂ is close to the ratio σ/τ, and thus the necessary conditions for the success of LEM and DFM are similar to the conditions in Corollary 5.2.

5.3 Characterization of the embeddings

The main result of this section provides necessary conditions for the success of the normalized-output algorithms. Following Section 3, we say that the algorithms fail if Φ(Y) > Φ(Z), where Y and Z are defined in Eqs. 13 and 14, respectively. Thus, a necessary condition for the success of the normalized-output algorithms is that Φ(Y) ≤ Φ(Z).

Theorem 5.1

Let X be a sample from a two-dimensional domain and let Y be its embedding in the high-dimensional space. Let Z, κ, ρ, τ, ē^{(1)}, ē^{(2)}, N_0, and r_max be defined as above. Then

 \frac{\kappa^2}{\rho^2}\left( \bar e^{(1)} + \frac{|N_0|}{N}\,c_a r_{\max}^2 \right) < \bar e^{(2)}\tau^2 \;\Longrightarrow\; \Phi(Y) > \Phi(Z),

where c_a is a constant that depends on the specific algorithm. For the algorithms LEM and DFM a more restrictive condition can be defined:

 \frac{\kappa^2}{\rho^2}\,\bar e^{(1)} < \bar e^{(2)}\tau^2 \;\Longrightarrow\; \Phi(Y) > \Phi(Z).

For the proof, see Appendix A.6.

Adding some assumptions, we can obtain a simpler criterion. First note that, in general, ē^{(1)} and ē^{(2)} should be of the same order, since it can be assumed that, locally, the neighborhoods are uniformly distributed. Second, following Lemma A.2 (see Appendix A.8), when x^{(1)} is a sample from a symmetric unimodal distribution, it can be assumed that κ² ≈ 1 and ρ² ≈ 1/(2√2). Then we have the following corollary:

Corollary 5.2

Let Y and Z be as in Theorem 5.1. Let τ² be the ratio between the variances of the first and second columns of Y. Assume that ē^{(1)} ≈ √2 ē^{(2)}, that κ² ≈ 1, and that ρ² ≈ 1/(2√2). Then

 \tau^2 > 4\left(1 + \frac{|N_0|\,c_a r_{\max}^2}{\sqrt{2}\,N\,\bar e^{(2)}}\right) \;\Longrightarrow\; \Phi(Y) > \Phi(Z).

For LEM and DFM, we can write

 \tau^2 > 4 \;\Longrightarrow\; \Phi(Y) > \Phi(Z).

We emphasize that neither Theorem 5.1 nor Corollary 5.2 states that Z is the output of the normalized-output algorithms. However, when the difference between the right-hand side and the left-hand side of the inequalities is large, one cannot expect the output to resemble the original sample (see Lemma 3.1). In these cases we say that the algorithms fail to recover the structure of the original domain.
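As a numerical illustration of the LEM/DFM criterion, the variance ratio τ² of a uniform sample from a strip can be estimated directly. The function below is our own construction, with the strip length L as a free parameter; for a strip of aspect ratio L : 1, τ² concentrates near L², so the criterion τ² > 4 is met once L exceeds 2.

```python
import numpy as np

rng = np.random.default_rng(2)

def tau_squared(L, n=20000):
    """Estimate tau^2 for a uniform sample from the strip [0, L] x [0, 1]."""
    y = np.column_stack([rng.uniform(0, L, n), rng.uniform(0, 1, n)])
    return y[:, 0].var() / y[:, 1].var()

ratio = tau_squared(3.0)   # concentrates near L^2 = 9, well above 4
```

For L slightly above 2 the estimate already exceeds 4, matching the claim that LEM and DFM are expected to collapse such strips.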

5.4 Generalization of the results to manifolds of higher dimensions

The discussion above introduced necessary conditions for the normalized-output algorithms' success on two-dimensional manifolds embedded in \mathbb{R}^D. Necessary conditions for success on general d-dimensional manifolds, d > 2, can also be obtained. We present here a simple criterion to demonstrate the fact that there are d-dimensional manifolds that the normalized-output algorithms cannot recover.

Let X be a sample from a d-dimensional domain. Assume that the input for the normalized-output algorithms is given by y_i = ψ(x_i), where ψ is a smooth function and D is the dimension of the input space. We assume without loss of generality that the columns of Y have sample mean zero and variance one, and that YᵀY is a diagonal matrix. Let Y^{(j)} denote the j-th column of Y. We define the matrix Z as follows. The first column of Z equals the first column of Y, namely Z^{(1)} = Y^{(1)}. We define the second column similarly to the definition in Eq. 14:

 Z_i^{(2)} = \begin{cases} -\dfrac{x_i^{(1)}}{\rho} - \bar z^{(2)} & x_i^{(1)} < 0 \\[2mm] \dfrac{\kappa x_i^{(1)}}{\rho} - \bar z^{(2)} & x_i^{(1)} \ge 0, \end{cases}

where κ is defined as in Eq. 15, and ρ and \bar z^{(2)} are chosen so that the sample mean and variance of Z^{(2)} are equal to zero and one, respectively. We define the remaining columns of Z by

 Z^{(j)} = \frac{Y^{(j)} - \sigma_{2j} Z^{(2)}}{\sqrt{1-\sigma_{2j}^2}}\,;\qquad j = 3,\ldots,d,

where σ_{2j} is the sample covariance of Z^{(2)} and Y^{(j)}. Note that the columns of Z have sample mean zero and variance one. Denote σ_max = max_{3 ≤ j ≤ d} |σ_{2j}|.
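The column-by-column construction of Z can be sketched as follows (a toy illustration with independent uniform columns standing in for a normalized input Y; all names are ours). Each Z^{(j)}, j ≥ 3, removes the component of Y^{(j)} along Z^{(2)} and rescales, so that the columns of Z again have sample mean zero and variance one.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical normalized input Y: zero-mean, unit-variance columns that
# are (empirically) uncorrelated.
N, d = 2000, 4
Y = rng.uniform(-1, 1, size=(N, d))
Y -= Y.mean(axis=0)
Y /= Y.std(axis=0)

# Z^(2): fold the first coordinate as in Eq. 14.
x1 = Y[:, 0]
neg = x1 < 0
kappa = np.sqrt(np.sum(x1[neg] ** 2) / np.sum(x1[~neg] ** 2))
s = np.where(neg, -x1, kappa * x1)
z2 = (s - s.mean()) / s.std()

# Remaining columns: subtract the component along Z^(2), then rescale so
# each Z^(j) keeps mean zero and variance one.
Z = np.empty_like(Y)
Z[:, 0], Z[:, 1] = x1, z2
for j in range(2, d):                  # code index j corresponds to j+1 in the text
    s2j = np.mean(z2 * Y[:, j])        # sigma_2j = Cov(Z^(2), Y^(j))
    Z[:, j] = (Y[:, j] - s2j * z2) / np.sqrt(1 - s2j ** 2)
```

By construction, Z^{(2)} is also uncorrelated with each of the higher columns of Z.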

We bound Φ(Z) from above:

 \begin{aligned} \Phi(Z) &= \Phi(Y^{(1)}) + \Phi(Z^{(2)}) + \sum_{i=1}^N \sum_{j=3}^d \frac{1}{1-\sigma_{2j}^2} \left\| W_i\left(Y_i^{(j)} - \sigma_{2j} Z_i^{(2)}\right) \right\|^2 \\ &\le \Phi(Y^{(1)}) + \Phi(Z^{(2)}) + \frac{1}{1-\sigma_{\max}^2} \sum_{i=1}^N \sum_{j=3}^d \left\| W_i Y_i^{(j)} \right\|^2 + \frac{\sigma_{\max}^2}{1-\sigma_{\max}^2} \sum_{i=1}^N \sum_{j=3}^d \left\| W_i Z_i^{(2)} \right\|^2 \\ &= \Phi(Y^{(1)}) + \frac{1+(d-3)\sigma_{\max}^2}{1-\sigma_{\max}^2}\,\Phi(Z^{(2)}) + \frac{1}{1-\sigma_{\max}^2} \sum_{j=3}^d \Phi(Y^{(j)}). \end{aligned}

Since we may write \Phi(Y) = \sum_{j=1}^d \Phi(Y^{(j)}), we have

 \frac{1+(d-3)\sigma_{\max}^2}{1-\sigma_{\max}^2}\,\Phi(Z^{(2)}) < \Phi(Y^{(2)}) - \frac{\sigma_{\max}^2}{1-\sigma_{\max}^2} \sum_{j=3}^d \Phi(Y^{(j)}) \;\Longrightarrow\; \Phi(Z) < \Phi(Y).

When the sample is taken from a distribution that is symmetric with respect to the axes, one can expect σ_max to be small. In the specific case of the d-dimensional grid, σ_max = 0. Indeed, x^{(1)} is symmetric around zero, and all values of Y^{(j)} appear for each given value of x^{(1)}. Hence, both LEM and DFM are expected to fail whenever the ratio between the lengths of the grid in the first and second coordinates is slightly greater than 2, regardless of the lengths of the grid in the other coordinates, similarly to the result presented in Theorem 4.1. Corresponding results for the other normalized-output algorithms can also be obtained, similarly to the derivation of Corollary 5.2.
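The claim that σ_max vanishes on a grid can be checked directly. In the sketch below (a hypothetical 3-dimensional grid of our choosing), every value of the third coordinate appears for each value of x^{(1)}, so the folded coordinate Z^{(2)} is empirically uncorrelated with the remaining columns.

```python
import numpy as np

# A 3-dimensional grid, centered so each coordinate has mean zero.
axes = [np.arange(n) - (n - 1) / 2.0 for n in (21, 9, 5)]
g1, g2, g3 = np.meshgrid(*axes, indexing="ij")
Y = np.column_stack([g.ravel() for g in (g1, g2, g3)])

# Fold the first coordinate as in Eq. 14; by symmetry kappa = 1 here.
x1 = Y[:, 0]
neg = x1 < 0
kappa = np.sqrt(np.sum(x1[neg] ** 2) / np.sum(x1[~neg] ** 2))
s = np.where(neg, -x1, kappa * x1)
z2 = (s - s.mean()) / s.std()

# For each fixed value of x^(1), the third coordinate sweeps its full
# centered range, so the covariance sigma_23 cancels exactly.
sigma_23 = np.mean(z2 * (Y[:, 2] - Y[:, 2].mean()))
```

The same cancellation holds for every column beyond the second, which is why σ_max = 0 for the full grid.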

5.5 Numerical results

We ran all five normalized-output algorithms, along with Isomap, on three data sets. We used the Matlab implementations written by the algorithms’ authors as provided by Wittman (retrieved Jan. 2007).

The first data set is a sample from the swissroll, as obtained from Wittman (retrieved Jan. 2007). The results for the swissroll are given in Fig. 7, A1-F1. The results for the same swissroll, after its first dimension was stretched, are given in Fig. 7, A2-F2. The original and stretched swissrolls are presented in Fig. 6A. The results in Fig. 7 were computed for a single choice of the number of neighbors K. We also experimented with other values of K; however, "short-circuits" occur (see Balasubramanian et al., 2002, for a definition and discussion of "short-circuits").

The second data set consists of points sampled uniformly from a "fishbowl", where a "fishbowl" is a two-dimensional sphere minus a neighborhood of the north pole (see Fig. 6B for both the "fishbowl" and its stretched version). The results for one choice of K are given in Fig. 8. We also checked other values of K; the results are roughly similar. Note that the "fishbowl" is a two-dimensional manifold embedded in three-dimensional space whose embedding is not an isometry. While our theoretical results were proved under the assumption of isometry, this example shows that the normalized-output algorithms prefer to collapse their output even in more general settings.
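A "fishbowl" sample of this kind is easy to generate. The sketch below uses rejection sampling from the uniform distribution on the unit sphere, with a hypothetical cap height z_cut (the value 0.8 is our choice, not necessarily the one used in the experiments).

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_fishbowl(n, z_cut=0.8):
    """Uniform sample from the unit sphere with the cap z > z_cut removed."""
    pts = []
    while len(pts) < n:
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)   # normalized Gaussian: uniform on the sphere
        if v[2] <= z_cut:        # reject points in the polar cap
            pts.append(v)
    return np.array(pts)

X = sample_fishbowl(500)
```

Since the fishbowl is not isometric to a planar domain, any two-dimensional embedding of such a sample necessarily distorts it; the question studied here is only whether the algorithms collapse it entirely.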

The third data set consists of images of a globe (see Fig. 6C). The images, provided by Hamm et al. (2005), were obtained by changing the globe's azimuthal and elevation angles; the parameters of the variations form an array of azimuth and elevation values. We checked the algorithms both on the entire set of images and on a strip of angular variations. The results for one choice of K are given in Fig. 9. We also checked other values of K; the results are roughly similar.

These three examples, in addition to the noisy version of the two-dimensional strip discussed in Section 3 (see Fig. 1), clearly demonstrate that when the aspect ratio is sufficiently large, all the normalized-output algorithms prefer to collapse their output.

6 Asymptotics

In the previous sections we analyzed the global distortion introduced by the normalized-output algorithms on a finite input sample. However, it is of great importance to explore the limit behavior of the algorithms, i.e., their behavior as the number of input points tends to infinity. We consider the question of convergence for input that consists of a sample from a d-dimensional manifold embedded in a higher-dimensional Euclidean space, where the manifold is isometric to a convex subset of Euclidean space. By convergence we mean recovery of that original convex subset, up to a non-singular affine transformation.

Some previous theoretical works have presented results related to the convergence question. Huo and Smith (2006) proved convergence of LTSA under some conditions. However, to the best of our knowledge, no proof of convergence, or counterexample to it, has been given for any of the other normalized-output algorithms. In this section we discuss the various algorithms separately. We also discuss the influence of noise on convergence. Using the results of the previous sections, we show that there are classes of manifolds on which the normalized-output algorithms cannot be expected to recover the original sample, not even asymptotically.

6.1 LEM and DFM

Let X_N be a uniform sample of size N from a two-dimensional strip. Let K be the number of nearest neighbors. Then, provided the neighborhood radii tend to zero as N tends to infinity, there exists with probability one an N_0 such that for all N > N_0 the assumptions of Corollary 5.2 hold. Thus, if the aspect ratio of the strip is greater than 2, we do not expect either LEM or DFM to recover the structure of the strip as the number of points in the sample tends to infinity. Note that this result does not depend on the number of neighbors K or on the width of the kernel, which may be changed as functions of the number of points N, as long as the neighborhood radii tend to zero. Hence, we conclude that LEM and DFM generally do not converge, regardless of the choice of parameters.

In the rest of this subsection we present further explanations for the failure of LEM and DFM, based on the asymptotic behavior of the graph Laplacian (see Belkin and Niyogi, 2003, for details). Although it was not mentioned explicitly in this paper, it is well known that the outputs of LEM and DFM are closely related to the eigenvectors corresponding to the smallest non-zero eigenvalues of the normalized graph Laplacian matrix (see Appendix A.1). It was shown by Belkin and Niyogi (2005), Hein et al. (2005), and Singer (2006)

Our working example is the two-dimensional strip, which can be considered the continuous counterpart of the grid defined in Section 4. Following Coifman and Lafon (2006), we impose the Neumann boundary condition (see details therein). The eigenfunctions