# Local Procrustes for Manifold Embedding: A Measure of Embedding Quality and Embedding Algorithms

We present the Procrustes measure, a novel measure based on Procrustes rotation that enables quantitative comparison of the output of manifold-based embedding algorithms (such as LLE (Roweis and Saul, 2000) and Isomap (Tenenbaum et al, 2000)). The measure also serves as a natural tool when choosing dimension-reduction parameters. We also present two novel dimension-reduction techniques that attempt to minimize the suggested measure, and compare the results of these techniques to the results of existing algorithms. Finally, we suggest a simple iterative method that can be used to improve the output of existing algorithms.


## 1 Introduction

Technological advances constantly improve our ability to collect and store large sets of data. The main difficulty in analyzing such high-dimensional data sets is that the number of observations required to estimate functions at a fixed level of accuracy grows exponentially with the dimension. This problem, often referred to as the curse of dimensionality, has led to various techniques that attempt to reduce the dimension of the original data.

Historically, the main approach to dimension reduction has been linear. This is the approach used by principal component analysis (PCA) and factor analysis (see Mardia et al, 1979, for both). While these algorithms are largely successful, the assumption that a linear projection describes the data well is incorrect for many data sets. A more realistic assumption than that of an underlying linear structure is that the data lie on, or close to, a manifold of low dimension embedded in the high-dimensional space. Here a manifold is defined as a topological space that is locally equivalent to a Euclidean space. Locally, the manifold can be estimated by linear approximations based on small neighborhoods of each point. Many algorithms have been developed to perform embedding for manifold-based data sets, including the algorithms suggested by Roweis and Saul (2000); Tenenbaum et al (2000); Belkin and Niyogi (2003); Zhang and Zha (2004); Donoho and Grimes (2004); Weinberger and Saul (2006); Dollar et al (2007). Indeed, these algorithms have been shown to succeed even where the assumption of linear structure does not hold. However, to date there exists no good tool to estimate the quality of the results of these algorithms.

Ideally, the quality of an output embedding could be judged based on a comparison to the structure of the original manifold. Indeed, a measure based on the idea that the manifold structure is known to a good degree was recently suggested by Dollar et al (2007). However, in the general case, the manifold structure is not given, and is difficult to estimate accurately. Since such ideal measures of quality cannot be used in the general case, an alternative quantitative measure is required.

In this work we suggest an easily computed function that measures the quality of any given embedding. We believe that a faithful embedding is an embedding that preserves the structure of the local neighborhood of each point. Therefore the quality of an embedding is determined by the success of the algorithm in preserving these local structures. The function we present, based on the Procrustes analysis, compares each neighborhood in the high-dimensional space and its corresponding low-dimensional embedding. Theoretical results regarding the convergence of the proposed measure are presented.

We further suggest two new algorithms for discovering the low-dimensional embedding of a high-dimensional data set, based on minimization of the suggested measure function. The first algorithm performs the embedding one neighborhood at a time. This algorithm is extremely fast, but may suffer from incremental distortion. The second algorithm, based on simulated annealing (Kirkpatrick et al, 1983), performs the embedding of all local neighborhoods simultaneously. A simple iterative procedure that improves on an existing output is also presented.

The paper is organized as follows. The problem of manifold learning is presented in Section 2. A discussion regarding the quality of embeddings in general and the suggested measure of quality are presented in Section 3. The embedding algorithms are presented in Section 4. In Section 5 we present numerical examples. All proofs are presented in the Appendix.

## 2 Manifold-learning problem setting and definitions

In this section we provide a formal definition of the manifold-learning dimension-reduction problem.

Let $\mathcal{M}$ be a $d$-dimensional manifold embedded in $\mathbb{R}^q$. Assume that a sample $X = \{x_1,\dots,x_n\}$ is taken from $\mathcal{M}$. The goal of manifold learning is to find a faithful embedding of the sample in $\mathbb{R}^d$. The assumption that the sample is taken from a manifold translates to the fact that small distances on the manifold can be approximated well by the Euclidean distance in $\mathbb{R}^q$. Therefore, to find an embedding, one first needs to approximate the structure of small neighborhoods on the manifold using the Euclidean metric in $\mathbb{R}^q$. Then one must find a unified embedding of the sample in $\mathbb{R}^d$ that preserves the structure of local neighborhoods on $\mathcal{M}$.

In order to adhere to this scheme, we need two more assumptions. First, we assume that $\mathcal{M}$ is isometrically embedded in $\mathbb{R}^q$; that is, it is the image of a domain in $\mathbb{R}^d$ under an isometric mapping. By definition, an isometric mapping between two manifolds preserves the inner product on the tangent bundle at each point. Less formally, this means that distances and angles are preserved by the mapping. This assumption is needed because we are interested in an embedding that everywhere preserves the local structure of distances and angles between neighboring points. If this assumption does not hold, close points on the manifold may originate from distant points in $\mathbb{R}^d$ and vice versa. In this case, the structure of the local neighborhoods on the manifold will not reveal the structure of the original neighborhoods in $\mathbb{R}^d$. We remark here that the assumption that the embedding is isometric is strong but can be relaxed. One may assume instead that the embedding mapping is conformal. This means that the inner products on the tangent bundle at each point are preserved up to a scalar that may change continuously from point to point. Note that the class of isometric embeddings is included in the class of conformal embeddings. While our main discussion regards isometric embeddings, we will also discuss the conformal embedding problem, which is the framework of algorithms such as c-Isomap (de Silva and Tenenbaum, 2003) and Conformal Embeddings (CE) (Sha and Saul, 2005).

The second assumption is that the sample taken from the manifold is dense. We need to prevent the situation in which the local neighborhood of a point, computed according to the Euclidean metric in $\mathbb{R}^q$, includes points that are geodesically distant on the manifold. This can happen, for example, if the manifold is twisted. The result of having geodesically distant points in the same local neighborhood is that these distant points will be embedded close to each other instead of preserving the true geodesic distance between them.

To define the problem formally, we require some definitions.

The neighborhood $X_i$ of a point $x_i$ is a set of points close to $x_i$ with respect to the Euclidean metric in $\mathbb{R}^q$. For example, the neighbors can be the $k$-nearest neighbors of $x_i$ or all the points in an $\varepsilon$-ball around $x_i$.

The minimum radius of curvature $r_0$ is defined as follows:

$$\frac{1}{r_0} = \max_{\gamma,t}\bigl\{\|\ddot\gamma(t)\|\bigr\},$$

where $\gamma$ varies over all unit-speed geodesics in $\mathcal{M}$ and $t$ is in a domain of $\gamma$.

The minimum branch separation $s_0$ is defined as the largest positive number for which $\|x - \tilde{x}\| < s_0$ implies $d_{\mathcal{M}}(x,\tilde{x}) \le \pi r_0$, where $x, \tilde{x} \in \mathcal{M}$ and $d_{\mathcal{M}}(x,\tilde{x})$ is the geodesic distance between $x$ and $\tilde{x}$ (see Bernstein et al, 2000, for both definitions).

We define the radius of neighborhood $X_i$ to be

$$r(i) = \max_{j\in\{1,\dots,k(i)\}}\|x_i - x_{i_j}\|,$$

where $x_{i_j}$ is the $j$-th of the $k(i)$ neighbors of $x_i$. Finally, we define $r_{\max}$ to be the maximum of $r(i)$ over $i = 1,\dots,n$.

We say that the sample is dense with respect to the chosen neighborhoods if $r_{\max}$ is small with respect to the minimum branch separation $s_0$ and the minimum radius of curvature $r_0$. Note that this condition depends on the manifold structure, the given sample, and the choice of neighborhoods. However, for a given compact manifold, if the distribution that produces the sample is supported throughout the entire manifold, this condition holds with probability tending to one as the size of the sample is increased and the radius of the neighborhoods is decreased.

We now state the manifold-learning problem more formally. Let $\mathcal{D}\subset\mathbb{R}^d$ be a compact set and let $\psi:\mathcal{D}\to\mathbb{R}^q$ be a smooth and invertible isometric mapping. Let $\mathcal{M} = \psi(\mathcal{D})$ be the $d$-dimensional image of $\mathcal{D}$ in $\mathbb{R}^q$. Let $X = \{x_1,\dots,x_n\}$ be a sample taken from $\mathcal{M}$. Define neighborhoods $X_i$ for each of the points $x_i$. Assume that the sample is dense with respect to the choice of $X_1,\dots,X_n$. Find $y_1,\dots,y_n\in\mathbb{R}^d$ that approximate $z_i = \psi^{-1}(x_i)$, $i = 1,\dots,n$, up to rotation and translation.

## 3 Faithful embedding

As discussed in Section 2, a faithful embedding should preserve the structure of local neighborhoods on the manifold, while finding a global embedding mapping.

In this section, we will attempt to answer the following two questions:

1. How do we define “preservation of the local structure of a neighborhood”?

2. How do we find a global embedding that preserves the local structure?

We now address the first question. Under the assumption of isometry, it seems reasonable to demand that neighborhoods on the manifold and their corresponding embeddings be closely related. A neighborhood on the manifold and its embedding can be compared using the Procrustes statistic. The Procrustes statistic measures the distance between two configurations of points. The statistic computes the sum of squares between pairs of corresponding points after one of the configurations is rotated and translated to best match the other.

In the remainder of this paper we will represent any set of points $\{x_1,\dots,x_k\}$ as a matrix $X$; i.e., the $i$-th row of the matrix $X$ corresponds to $x_i$.

Let $X$ be a $k$-point set in $\mathbb{R}^q$ and let $Y$ be a $k$-point set in $\mathbb{R}^d$, where $d \le q$. We define the Procrustes statistic as

$$G(X,Y) = \inf_{\{A,b\,:\,A'A=I,\,b\in\mathbb{R}^q\}}\sum_{i=1}^k \|x_i - Ay_i - b\|^2 = \inf_{\{A,b\,:\,A'A=I,\,b\in\mathbb{R}^q\}}\operatorname{tr}\bigl((X - YA' - \mathbf{1}b')'(X - YA' - \mathbf{1}b')\bigr),$$

where the rotation matrix $A$ is a $q\times d$ columns-orthogonal matrix, $A'$ is the transpose of $A$, and $\mathbf{1}$ is a $k$-dimensional vector of ones.

The Procrustes rotation matrix $A$ and the Procrustes translation vector $b$ that minimize $G(X,Y)$ can be computed explicitly, as follows. Let $Z = X'HY$, where $H = I - \frac{1}{k}\mathbf{1}\mathbf{1}'$ is the centering matrix. Let

$$Z = U\Lambda V'$$

be the singular-value decomposition (SVD) of $Z$, where $U$ is a $q\times d$ columns-orthogonal matrix, $\Lambda$ is a $d\times d$ non-negative diagonal matrix, and $V$ is a $d\times d$ orthogonal matrix (Mardia et al, 1979). Then the Procrustes rotation matrix is given by $A = UV'$ (Sibson, 1978). The Procrustes translation vector is given by $b = \bar{x} - A\bar{y}$, where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$, respectively. Due to the last fact, we may write $G$ without the translation vector as

$$G(X,Y) = \inf_{\{A\,:\,A'A=I\}}\operatorname{tr}\bigl((X-YA')'H(X-YA')\bigr) = \inf_{\{A\,:\,A'A=I\}}\bigl\|H(X-YA')\bigr\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm.
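The closed-form solution above is easy to reproduce. The following is a minimal NumPy sketch (the function name and interface are ours, not from the paper): it centers both configurations, takes the SVD of the cross-product $X'HY$, and returns the residual $G$ together with the optimal rotation and translation.

```python
import numpy as np

def procrustes(X, Y):
    """Procrustes statistic G(X, Y): the residual sum of squares after
    the k x d configuration Y is optimally rotated (by a q x d
    columns-orthogonal matrix A) and translated to match the k x q
    configuration X.  A sketch of the SVD solution."""
    Xc = X - X.mean(axis=0)                      # H X (centering)
    Yc = Y - Y.mean(axis=0)                      # H Y (centering)
    # SVD of the q x d cross-product X' H Y
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    A = U @ Vt                                   # Procrustes rotation, q x d
    b = X.mean(axis=0) - Y.mean(axis=0) @ A.T    # Procrustes translation
    G = np.sum((Xc - Yc @ A.T) ** 2)             # residual after alignment
    return G, A, b
```

For data that are exactly a rotated and translated copy of a low-dimensional configuration, the returned residual vanishes up to floating-point error.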

Given $X$, the minimum of $G(X,Y)$ over all $k$-point configurations $Y$ in $\mathbb{R}^d$ can be computed explicitly, and is achieved by the first $d$ principal components of $X$. This result is a consequence of the following lemma.

###### Lemma 1

Let $X$ be a centered $k\times q$ matrix of rank $q$ and let $d < q$. Then

$$\inf_{\{\tilde{X}\,:\,\operatorname{rank}(\tilde{X})=d\}}\|X - \tilde{X}\|_F^2 \qquad (2)$$

is obtained when $\tilde{X}$ equals the projection of $X$ on the subspace spanned by the first $d$ principal components of $X$.

For proof, see Mardia et al (1979, page 220).
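Lemma 1 can be checked numerically in a few lines (a sketch under our own variable names): the projection of a centered matrix onto its first $d$ principal components coincides with the truncated SVD, which is the optimal rank-$d$ approximation in Frobenius norm.

```python
import numpy as np

# Numerical check of Lemma 1: the projection of a centered matrix X
# onto its first d principal components equals the truncated SVD,
# i.e. the best rank-d approximation of X in Frobenius norm.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
X -= X.mean(axis=0)                        # center the data
d = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:d].T                               # 5 x d PCA projection matrix
X_pca = X @ P @ P.T                        # projection on first d PCs
X_svd = (U[:, :d] * s[:d]) @ Vt[:d]        # truncated (rank-d) SVD
assert np.allclose(X_pca, X_svd)
```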

Returning to the questions posed at the beginning of this section, we will define how well an embedding preserves the local neighborhoods using the Procrustes statistic of each neighborhood-embedding pair $(X_i, Y_i)$. Therefore, a global embedding that preserves the local structure can be found by minimizing the sum of the Procrustes statistics of all neighborhood-embedding pairs.

More formally, let $X = \{x_1,\dots,x_n\}$ be the $q$-dimensional sample from the manifold and let $Y = \{y_1,\dots,y_n\}$ be a $d$-dimensional embedding of $X$. Let $X_i$ be the neighborhood of $x_i$ ($i = 1,\dots,n$) and let $Y_i$ be its embedding. Define

$$R(X,Y) = \frac{1}{n}\sum_{i=1}^n G(X_i,Y_i). \qquad (3)$$

The function $R$ measures the average quality of the neighborhood embeddings. Embedding $Y$ is considered better than embedding $\tilde{Y}$ in the local-neighborhood-preserving sense if $R(X,Y) < R(X,\tilde{Y})$. This means that, on average, $Y$ preserves the structure of the local neighborhoods better than $\tilde{Y}$ does.

The function $R$ is sensitive to scaling, and therefore normalization should be considered. A reasonable normalization is

$$R_N(X,Y) = \frac{1}{n}\sum_{i=1}^n G(X_i,Y_i)\big/\|HX_i\|_F^2. \qquad (4)$$

The $i$-th summand of $R_N$ examines how well the rotated and translated $Y_i$ "explains" $X_i$, independent of the size of $X_i$. This normalization solves the problem of increased weighting for larger neighborhoods that exists in the unnormalized $R$. It also allows comparison of embeddings for data sets of different sizes. Hence, this normalized version is used to compare the results of different outputs (see Section 5).
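As an illustration, the normalized measure of Eq. 4 over $k$-nearest-neighbor neighborhoods can be sketched as follows (a simplified implementation under our own names; the quadratic-time neighbor search is for clarity only):

```python
import numpy as np

def embedding_quality(X, Y, k=5):
    """Normalized Procrustes measure R_N(X, Y): the average over all
    points of the Procrustes residual between each k-NN neighborhood
    of the input X (n x q) and its embedding in Y (n x d), divided by
    the squared Frobenius norm of the centered input neighborhood."""
    n = X.shape[0]
    # pairwise distances for k-NN neighborhoods (the point itself is included)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    total = 0.0
    for i in range(n):
        idx = np.argsort(D[i])[:k + 1]
        Xc = X[idx] - X[idx].mean(axis=0)        # H X_i
        Yc = Y[idx] - Y[idx].mean(axis=0)        # H Y_i
        U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
        A = U @ Vt                               # optimal rotation for pair i
        G = np.sum((Xc - Yc @ A.T) ** 2)         # Procrustes residual
        total += G / np.sum(Xc ** 2)             # normalize by ||H X_i||_F^2
    return total / n
```

A perfectly isometric embedding yields a value near zero, while unrelated configurations yield values bounded away from zero.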

In the remainder of this section, we will justify our choice of the Procrustes measure for a quantitative comparison of embeddings. We will also present two additional, closely related measures. One measure is $R_{PCA}$, which can ease computation when the input space is of high dimension. The second measure is $R_C$, a statistic designed for conformal mappings (see Section 2). Finally, we will discuss the relation between the measures suggested in this work and the objective functions of other algorithms, namely LTSA (Zhang and Zha, 2004) and SDE (Weinberger and Saul, 2006).

We now justify the use of the Procrustes statistic as a measure of the quality of the local embedding. First, $G$ estimates the relation between the entire input neighborhood and its embedding as one entity, instead of comparing angles and distances within the neighborhood with those within its embedding. Second, the Procrustes statistic is not highly sensitive to small perturbations of the embedding. More formally, $G(X, Y + \epsilon Z) = G(X,Y) + O(\epsilon)$, where $Z$ is a general matrix of bounded norm (see Sibson, 1979). Finally, the function $G$ is $L_2$-norm-based and therefore prefers small differences at many points to big differences at fewer points. This is preferable in our context, as the local embedding of the neighborhood should be compatible with the embeddings of nearby neighborhoods.

The usage of $R$ as a measure of the quality of the global embedding of the manifold is justified by Theorem 3.1. Theorem 3.1 states that as the number of input points increases, $R$ computed between the high-dimensional sample and the original low-dimensional points tends to zero. This implies that the minimizer of $R$ should be close to the original data set (up to rotation and translation).

###### Theorem 3.1

Let $\mathcal{D}\subset\mathbb{R}^d$ be a compact connected set. Let $\psi:\mathcal{D}\to\mathbb{R}^q$ be an isometry. Let $Z^{(n)} = \{z_1,\dots,z_n\}$ be an $n$-point sample taken from $\mathcal{D}$, and let $X^{(n)} = \psi(Z^{(n)})$. Assume that the sample is dense with respect to the choice of neighborhoods for all $n$. Then for all $n$,

$$R\bigl(X^{(n)},Z^{(n)}\bigr) = O(r_{\max}^4). \qquad (5)$$

See Appendix A.1 for proof.

Replacing $R$ with the normalized version $R_N$ (see Eq. 4) and noting that $\|HX_i\|_F^2 = O(r_{\max}^2)$, we obtain

###### Corollary 1
$$R_N\bigl(X^{(n)},Z^{(n)}\bigr) = O(r_{\max}^2).$$

To avoid heavy computations, a slightly different version of $R$ can be considered. Instead of measuring the difference between the original neighborhoods on the manifold and their embeddings, one can compare the local PCA projections (Mardia et al, 1979) of the original neighborhoods with their embeddings. We therefore define

$$R_{PCA}(X,Y) = \frac{1}{n}\sum_{i=1}^n G(X_iP_i, Y_i), \qquad (6)$$

where $P_i$ is the $q\times d$ PCA projection matrix of the neighborhood $X_i$.

The convergence theorem for $R_{PCA}$ is similar to Theorem 3.1, but the convergence is slower.

###### Theorem 3.2

Let $\mathcal{D}\subset\mathbb{R}^d$ be a compact connected set. Let $\psi:\mathcal{D}\to\mathbb{R}^q$ be an isometry. Let $Z^{(n)}$ be an $n$-point sample taken from $\mathcal{D}$, and let $X^{(n)} = \psi(Z^{(n)})$. Assume that the sample is dense with respect to the choice of neighborhoods for all $n$. Then for all $n$,

$$R_{PCA}\bigl(X^{(n)},Z^{(n)}\bigr) = O(r_{\max}^3).$$

See Appendix A.2 for proof.

We now present another version of the Procrustes measure, suitable for conformal mappings. Here we want to compare each original neighborhood $X_i$ and its corresponding embedding $Y_i$, where we allow $Y_i$ not only to be rotated and translated but also to be rescaled. Define

$$G_C(X,Y) = \inf_{\{c,A,b\,:\,c>0,\,A'A=I,\,b\in\mathbb{R}^q\}}\sum_{i=1}^k\|x_i - cAy_i - b\|^2.$$

Note that the scalar $c$ was introduced here to allow scaling of $Y$. Let $Z = X'HY$ and let $Z = U\Lambda V'$ be the SVD of $Z$. The minimizing rotation matrix is again given by $A = UV'$. The minimizing constant is given by $c = \operatorname{tr}(\Lambda)/\operatorname{tr}(Y'HY)$ (see Sibson, 1978). The (normalized) conformal Procrustes measure is given by

$$R_C(X,Y) = \frac{1}{n}\sum_{i=1}^n G_C(X_i,Y_i)\big/\|HX_i\|_F^2. \qquad (7)$$

Note that $R_C(X,Y) \le R_N(X,Y)$, since the constraints are relaxed in the definition of $G_C$. However, the lower bound in both cases is equal (see Lemma 1).

We present a convergence theorem, similar to those for $R$ and $R_{PCA}$.

###### Theorem 3.3

Let $\mathcal{D}\subset\mathbb{R}^d$ be a compact connected set. Let $\psi:\mathcal{D}\to\mathbb{R}^q$ be a conformal mapping. Let $Z^{(n)}$ be an $n$-point sample taken from $\mathcal{D}$, and let $X^{(n)} = \psi(Z^{(n)})$. Assume that the sample is dense with respect to the choice of neighborhoods for all $n$. Then for all $n$ we have

$$R_C\bigl(X^{(n)},Z^{(n)}\bigr) = O(r_{\max}^2).$$

See Appendix A.3 for proof. Note that this result has the same convergence rate as that of $R_N$ (see Corollary 1).

A cost function somewhat similar to the measure $R_{PCA}$ was presented by Zhang and Zha (2004) in a slightly different context. The local PCA projection of neighborhoods was used as an approximation of the tangent plane at each point. The resulting algorithm, local tangent space alignment (LTSA), is based on minimizing the function

$$\sum_{i=1}^n\bigl\|(I - P_iP_i')HY_i\bigr\|_F^2,$$

where the matrices $P_i$ are as in $R_{PCA}$. The minimization performed by LTSA is carried out under a normalization constraint on $Y$. This means that, as a measure, LTSA's objective function is designed to compare normalized outputs (otherwise $Y = 0$ would be the trivial minimizer) and is therefore unsuitable as a measure.

Another algorithm worth mentioning here is SDE (Weinberger and Saul, 2006). The constraints on the output required by this algorithm are that all the distances and angles within each neighborhood be preserved. Therefore the output of this algorithm should always be close to the minimum of $R$. The maximization of the objective function of this algorithm,

$$\sum_{i,j}\|y_i - y_j\|^2,$$

is reasonable when the aforementioned constraints are enforced. However, it is not relevant as a measure for comparison of general outputs that do not fulfill these constraints.

In summary, we return to the questions posed at the beginning of this section. We choose to define preservation of local neighborhoods as minimization of the Procrustes measure $R$ (see Eq. 3). We therefore find a global embedding that best preserves local structure by minimizing $R$. For computational reasons, minimization of $R_{PCA}$ (see Eq. 6) may be preferred. For conformal maps we suggest the measure $R_C$ (see Eq. 7), which allows separate scaling of each neighborhood.

## 4 Algorithms

In Section 3 we showed that a faithful embedding should yield low $R$ and $R_{PCA}$ values. Therefore we may attempt to find a faithful embedding by minimizing these functions. However, $R$ and $R_{PCA}$ are not necessarily convex functions and may have more than one local minimum. In this section we present two algorithms for minimization of $R$ or $R_{PCA}$. In addition, we present an iterative method that can improve the output of the two algorithms, as well as that of other existing algorithms.

The first algorithm, greedy Procrustes (GP), performs the embeddings one neighborhood at a time. The first neighborhood is embedded using the PCA projection (see Lemma 1). At each stage, the following neighborhood is embedded by finding the best embedding with respect to the embeddings already found. This greedy algorithm is presented in Section 4.1.

The second algorithm, Procrustes subspaces alignment (PSA), is based on an alignment of the local PCA projection subspaces. The use of local PCA produces a good low-dimensional description of the neighborhoods, but the description of each of the neighborhoods is in an arbitrary coordinate system. PSA performs the global embedding by finding the local embeddings and then aligning them. PSA is described in Section 4.2. Simulated annealing (SA) is used to find the alignment of the subspaces (see Section 4.3).

After an initial solution is found using either GP or PSA, the solution can be improved using an iterative procedure until there is no improvement (see Section 4.4).

### 4.1 Greedy Procrustes (GP)

GP finds the neighborhoods’ embeddings one by one. The flow of the algorithm is described below.

1. Initialization:

• Find the neighbors of each point $x_i$ and let $N(i)$ be the set of indices of the neighbors of $x_i$.

• Initialize the list of embedded points' indices to $J = \emptyset$.

2. Embedding of the first neighborhood:

• Choose an index $i$ (randomly).

• Calculate the embedding $Y_i = HX_iP_i$, where $P_i$ is the PCA projection matrix of $HX_i$.

• Update the list of indices of embedded points: $J = \{i\}\cup N(i)$.

3. Find all other embeddings (iteratively):

• Find the unembedded point $x_i$ with the largest number of embedded neighbors,
$i = \arg\max_{j\notin J}|N(j)\cap J|$.

• Define $X_i^{\,in}$, the points in $X_i$ that are already embedded.
Define $Y_i^{\,in}$ as the (previously calculated) embedding of $X_i^{\,in}$.
Define $X_i^{\,out}$, the points in $X_i$ that are not embedded yet.

• Compute the Procrustes rotation matrix $A$ and translation vector $b$ between $X_i^{\,in}$ and $Y_i^{\,in}$.

• Define the embedding of each point $x_j\in X_i^{\,out}$ as $y_j = A'(x_j - b)$.

• Update the list of indices of embedded points:
$J = J\cup\{i\}\cup N(i)$.

4. Stopping condition:
Stop when $|J| = n$, i.e., when all points are embedded.

The main advantage of GP is that it is fast. The embedding of each neighborhood $X_i$ is computed in time that depends only on $k(i)$, the size of $X_i$, and on the dimensions $q$ and $d$. Therefore, the overall computation time grows linearly with the number of points $n$. While GP does not claim to find the global minimum, it does find an embedding that preserves the local neighborhoods' structure. The main disadvantage of GP is that it suffers from incremental errors.
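The core step of GP, extending a partial embedding through a Procrustes alignment of the already-embedded neighbors, can be sketched as follows (a hypothetical helper under our own naming, not code from the paper):

```python
import numpy as np

def extend_embedding(X_known, Y_known, X_new):
    """One GP step (sketch): given input points X_known (m x q) with a
    known embedding Y_known (m x d), embed the new input points X_new
    using the Procrustes rotation and translation between X_known and
    Y_known, i.e. y = A'(x - mean_x) + mean_y."""
    mx, my = X_known.mean(axis=0), Y_known.mean(axis=0)
    Xc, Yc = X_known - mx, Y_known - my
    # Procrustes rotation from the SVD of the cross-product
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    A = U @ Vt                        # q x d columns-orthogonal map
    return (X_new - mx) @ A + my      # rows: A'(x - mean_x) + mean_y
```

On exactly isometric data this step places new points without error; in general each step introduces a small residual, which is the source of GP's incremental distortion.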

### 4.2 Procrustes Subspaces Alignment (PSA)

$R$ can be written in terms of the Procrustes rotation matrices $A_i$ as

$$R(X,Y) = \frac{1}{n}\sum_{i=1}^n\inf_{\{A_i\,:\,A_i'A_i=I\}}\bigl\|H(X_i - Y_iA_i')\bigr\|_F^2,$$

where $H$ is the centering matrix. $R(X,Y)$ can be calculated, given $X$ and $Y$. However, as $Y$ is not given, one way to find $Y$ is to first guess the matrices $A_1,\dots,A_n$ and then find the $Y$ that minimizes

$$R(X,Y\,|\,A_1,\dots,A_n) = \frac{1}{n}\sum_{i=1}^n\bigl\|H(X_i - Y_iA_i')\bigr\|_F^2. \qquad (8)$$

This $Y$ can be found by taking derivatives of Eq. 8.

We therefore need to choose $A_1,\dots,A_n$ wisely. The choice of the Jacobian matrices $J_i$ of the mapping $\psi$ at the points $z_i$ as a guess for the $A_i$ is justified by the following, as shown in the proof of Theorem 3.1 (see Eq. A.1). As the size of the sample is increased, $R(X,Z\,|\,J_1,\dots,J_n)\to 0$. This means that choosing $A_i = J_i$ will lead to a solution that is close to the minimizer of $R$.

The PCA projection matrices $P_i$ approximate the unknown Jacobian matrices $J_i$ up to a rotation. To use them in place of the $J_i$, we must first align the projections correctly. Therefore, our guess for $A_i$ is of the form $A_i = P_iO_i$, where $O_1,\dots,O_n$ are $d\times d$ rotation matrices that minimize

$$f(A_1,\dots,A_n) = \sum_{i=1}^n\sum_{j\in N(i)}\|A_i - A_j\|_F^2. \qquad (9)$$

The rotation matrices can be found using simulated annealing, as described in Section 4.3.

Once the matrices $A_i$ are found, we need to minimize $R(X,Y\,|\,A_1,\dots,A_n)$. We first write it in matrix notation as

$$R(X,Y\,|\,A_1,\dots,A_n) = \frac{1}{n}\sum_{i=1}^n\operatorname{tr}\bigl((X - YA_i')'H_i(X - YA_i')\bigr).$$

Here $H_i$ is the $n\times n$ centering matrix of neighborhood $X_i$,

$$H_i(k,l) = \begin{cases}1 - \frac{1}{k(i)} & k = l \text{ and } k\in N(i)\\[2pt] -\frac{1}{k(i)} & k\neq l \text{ and } k,l\in N(i)\\[2pt] 0 & \text{elsewhere.}\end{cases}$$

The rows of the matrix $H_iX$ at the indices $N(i)$ form the centered neighborhood of $x_i$, while all the other rows equal zero, and similarly for $H_iY$.

Taking the derivative of $R(X,Y\,|\,A_1,\dots,A_n)$ (Eq. 8) with respect to $Y$ (see Mardia et al, 1979), and using the facts that $H_i$ is symmetric and idempotent and that $A_i'A_i = I$, we obtain

$$\frac{\partial}{\partial Y}R(X,Y\,|\,A_1,\dots,A_n) = \frac{2}{n}\sum_{i=1}^n H_iY - \frac{2}{n}\sum_{i=1}^n H_iXA_i. \qquad (10)$$

Setting the derivative to zero and using the generalized inverse of $\sum_{i=1}^n H_i$, we can write $Y$ as

$$Y = \Bigl(\sum_{i=1}^n H_i\Bigr)^{+}\,\sum_{i=1}^n H_iXA_i. \qquad (11)$$

Summarizing, we present the PSA algorithm.

1. Initialization:

• Find the neighbors of each point $x_i$.

• Find the PCA projection matrix $P_i$ of each neighborhood $X_i$.

2. Alignment of the projection matrices:
Find $A_1,\dots,A_n$ that minimize Eq. 9 using, for example, simulated annealing (see Section 4.3).

3. Find the embedding:
Compute $Y$ according to Eq. 11.

The advantage of this algorithm is that it is global. The computation time of this algorithm (assuming that the matrices $A_i$ are already known) depends on multiplying by the generalized inverse of the sparse, symmetric, positive semi-definite matrix $\sum_i H_i$, which can be very costly. However, this matrix need not be inverted explicitly. Instead, one may solve $d$ systems of linear equations of the form $\bigl(\sum_i H_i\bigr)v = u$, which can be computed much faster (see, for example, Munksgaard, 1980).
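Eq. 11 can be sketched directly in NumPy (our own naming; for clarity this dense version uses a pseudo-inverse, whereas for large data sets one should solve sparse linear systems instead, as noted above):

```python
import numpy as np

def psa_embedding(X, nbrs, A_list, d):
    """Eq. 11 sketch: given guessed rotations A_i (q x d) for each
    neighborhood, find Y minimizing R(X, Y | A_1..A_n) by solving the
    normal equations (sum_i H_i) Y = sum_i H_i X A_i with a
    pseudo-inverse (the centering matrices H_i are singular)."""
    n = X.shape[0]
    H_sum = np.zeros((n, n))
    rhs = np.zeros((n, d))
    for i in range(n):
        idx = np.asarray(nbrs[i])
        k = len(idx)
        # centering matrix of neighborhood i, embedded in an n x n matrix
        Hi = np.zeros((n, n))
        Hi[np.ix_(idx, idx)] = np.eye(k) - np.ones((k, k)) / k
        H_sum += Hi
        rhs += Hi @ X @ A_list[i]
    return np.linalg.pinv(H_sum) @ rhs
```

When the true Jacobians are supplied as the $A_i$ and the neighborhood graph is connected, this recovers the original low-dimensional coordinates up to translation (the all-ones direction lies in the null space of $\sum_i H_i$).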

### 4.3 Simulated Annealing (SA) alignment procedure

In step 2 of PSA (see Section 4.2), it is necessary to align the PCA projection matrices $P_i$. In the following we suggest an alignment method based on simulated annealing (SA) (Kirkpatrick et al, 1983). The aim of the suggested algorithm is to find a set of columns-orthonormal matrices $A_i = P_iO_i$ that minimize Eq. 9. A number of closely related algorithms, designed to find embeddings using alignment of some local dimensionally-reduced descriptions, were suggested previously. Roweis et al (2001) and Verbeek et al (2002) introduced algorithms based on probabilistic mixtures of local FA and PCA structures, respectively. As Eq. 9 and these two algorithms suffer from local minima, the use of simulated annealing may be beneficial. Another algorithm, suggested by Teh and Roweis (2003), uses a convex objective function to find the alignment. However, the output matrices of that algorithm are not necessarily columns-orthonormal, as is required in our case.

Minimizing Eq. 9 is similar to solving the Ising model problem (see, for example, Cipra, 1987). The Ising model consists of a neighbor-graph and a configuration space that is the set of all possible assignments of $+1$ or $-1$ to each vertex of the graph. A low-energy state is one in which neighboring vertices have the same sign. Our problem consists of a neighbor-graph with a configuration space that includes all rotations of the projection matrices $P_i$ at each point. Minimizing the function $f$ of Eq. 9 is similar to finding a low-energy state of the Ising model. As solutions to the Ising model usually involve algorithms such as simulated annealing, we take the same path here.

We present the SA algorithm, following the algorithm suggested by Siarry et al (1997), modified for our problem.

1. Initialization:

• Choose an initial configuration of matrices $A_1,\dots,A_n$ (for example, using GP).

• Compute the initial temperature (see Siarry et al, 1997).

• Define the cooling scheme (see Siarry et al, 1997).

2. Single SA step:

• Choose an index $i$ randomly.
Generate a random rotation matrix $O$ (see Stewart, 1980).
Define $A_i^{New} = A_iO$.

• Compute the change in $f$ (see Eq. 9).
Note that it is enough to compute the summands that involve $A_i$.

• Accept $A_i^{New}$ if either

$$f(A_1,\dots,A_i^{New},\dots,A_n) < f(A_1,\dots,A_i,\dots,A_n)$$

or with some probability depending on the current temperature.

• Decrease the temperature and stop if the lowest temperature is reached.

3. Outer iterations:

• First iteration: Perform a run of SA on all the matrices $A_1,\dots,A_n$.
Find the largest cluster of aligned matrices (for example, use BFS (Cormen et al, 1990) and define an alignment criterion).

• Other iterations: Apply SA to the matrices that are not in the largest cluster. Update the largest cluster after each run.

• Repeat until the largest cluster includes almost all of the matrices $A_i$.

Using SA is complicated. The cooling scheme requires the estimation of many parameters, and the run-time depends heavily on the correct choice of these parameters. For output of large dimension, the alignment is difficult, and the output of SA can be poor. Although SA is a time-consuming algorithm, each iteration is very simple, involving only $O(kqd)$ operations, where $k$ is the maximum number of neighbors, and $q$ and $d$ are the input and output dimensions, respectively. In addition, the memory requirements are modest. Therefore, SA can run even when the number of points is large.
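The random rotation used in a single SA step can be generated, in the spirit of Stewart (1980), by orthogonalizing a Gaussian matrix (a standard construction, sketched here under our own naming):

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal d x d matrix: QR decomposition of a Gaussian
    matrix, with column signs fixed by the diagonal of R so that the
    resulting distribution is uniform (Haar) over the orthogonal
    group.  A sketch of the perturbation step used in the SA search."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))   # flip columns where diag(R) < 0
```

In the SA step above, the candidate $A_i^{New} = A_iO$ is obtained by multiplying the current matrix by such a random $O$; in practice one would restrict the size of the perturbation as the temperature decreases.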

### 4.4 Iterative procedure

Given a solution $Y$ of GP, PSA, or any other technique, it is usually possible to modify $Y$ so that $R(X,Y)$ is further decreased. The idea of the iterative procedure we present here was suggested independently by Zhang and Zha (2004), but no details were supplied.

In Section 4.2, we showed that given $Y$, the optimal matrices $A_i$ are obtained by finding the Procrustes rotation matrices between the $X_i$ and the $Y_i$. Given the matrices $A_i$, the embedding $Y$ can be found using Eq. 11. An iterative procedure would require finding first the new matrices $A_i$ and then a new embedding $Y$ at each stage. This would be repeated until the change in the value of $R(X,Y)$ was small.

The problem with this scheme is that it involves the computation of the generalized inverse of the matrix $\sum_i H_i$ (see end of Section 4.2). We therefore suggest a modified version of this iterative procedure, which is easier to compute. Recall that

$$R(X,Y) = \frac{1}{n}\sum_{i=1}^n\sum_{j\in N(i)}\|x_j - A_iy_j - b_i\|^2.$$

The least-squares solution for $b_i$ is

$$b_i = \frac{1}{|N(i)|}\sum_{j\in N(i)}(x_j - A_iy_j). \qquad (12)$$

The least-squares solution for $y_j$ is

$$y_j = \frac{1}{|\{i:j\in N(i)\}|}\sum_{\{i:\,j\in N(i)\}}A_i'(x_j - b_i). \qquad (13)$$

Note that we obtain a solution different from that of Eq. 10. The reason is that here we consider the $b_i$ as constants when we look for a solution for the $y_j$. In fact, the $y_j$ appear in the definition of the $b_i$ (Eq. 12). However, as each $y_j$ appears there multiplied by $1/|N(i)|$, these terms make only a small contribution.

We suggest performing the iterations as follows. First, find the Procrustes rotation matrices $A_i$ and the translation vectors $b_i$ using Eq. 12. Then find the $y_j$ using Eq. 13. Repeat until $R(X,Y)$ no longer decreases significantly.
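The iteration of Eqs. 12 and 13 can be sketched as follows (our own naming; the rotations $A_i$ are recomputed by the Procrustes SVD at every pass, and the $b_i$ follow Eq. 12):

```python
import numpy as np

def refine(X, Y, nbrs, n_iter=10):
    """Iterative refinement sketch (Section 4.4): alternate between the
    least-squares rotations/translations (A_i, b_i) for fixed Y, and
    the least-squares update of each y_j for fixed (A_i, b_i)."""
    Y = Y.copy()
    n = X.shape[0]
    for _ in range(n_iter):
        A, b = [], []
        for i in range(n):
            idx = nbrs[i]
            Xc = X[idx] - X[idx].mean(axis=0)
            Yc = Y[idx] - Y[idx].mean(axis=0)
            U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
            Ai = U @ Vt                                  # Procrustes rotation
            A.append(Ai)
            b.append(X[idx].mean(axis=0) - Y[idx].mean(axis=0) @ Ai.T)  # Eq. 12
        # Eq. 13: update each y_j from all neighborhoods containing j
        newY = np.zeros_like(Y)
        counts = np.zeros(n)
        for i in range(n):
            for j in nbrs[i]:
                newY[j] += (X[j] - b[i]) @ A[i]          # A_i'(x_j - b_i)
                counts[j] += 1
        Y = newY / counts[:, None]
    return Y
```

An exact isometric solution is a fixed point of this iteration; for an imperfect initial $Y$ the procedure decreases $R(X,Y)$ step by step.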

## 5 Numerical Examples

In this section we evaluate the algorithms GP and PSA on data sets that we assume to be sampled from underlying manifolds. We compare the results to those obtained by LLE (Roweis and Saul, 2000), Isomap (Tenenbaum et al, 2000), and LTSA (Zhang and Zha, 2004), both visually and using the measures $R_N$ and $R_C$ (see Table 2).

The algorithms GP and PSA were implemented in the Matlab environment, running on a Pentium 4 with a 3 GHz CPU and 0.98 GB of RAM. The alignment stage of PSA was implemented using SA (see Section 4.3). The runs of both GP and PSA were followed by the iterative procedure described in Section 4.4 to improve the minimization of $R$. LLE, Isomap, and LTSA were evaluated using the Matlab code taken from the websites of the respective authors. The algorithm SDE (Weinberger and Saul, 2006), whose minimization is closest in spirit to ours, was also tested; however, it suffers from heavy computational demands, and the results of this algorithm could not be obtained using the code provided on the authors' site.

The data sets are described in Table 1. We ran all five algorithms with two different numbers of nearest neighbors. The minimum values of $R_N$ and $R_C$ are presented in Table 2. The results in all cases were qualitatively the same; therefore, in the images we show the results for a single neighborhood size only.

Overall, GP and PSA perform satisfactorily, as shown both in the figures and in Table 2. The fact that in most of the examples GP and PSA attain lower $R_N$ values than LLE, Isomap, and LTSA is perhaps not surprising, as GP and PSA are designed to minimize $R$. The run-times of the algorithms other than PSA are on the order of seconds, while PSA takes a few hours to run. The memory requirements of GP and PSA are lower than those of the other algorithms. As a consequence of the memory requirements, results could not be obtained for LLE, Isomap, and LTSA on the largest data set.

Use of the measure $R_N$ allows a quantitative comparison of visually similar outputs. Regarding the output of the cylinder (see Fig. 3), for example, PSA and Isomap both give topologically sound results; however, $R_N$ shows that locally, PSA does a better job. In addition, $R_N$ can be used to optimize embedding parameters such as neighborhood size (see Fig. 5).

## 6 Discussion

In this section, we emphasize the main results of this work and indicate possible directions for future research.

We demonstrated that, overall, the measure provides a good estimate of the quality of an embedding. It allows a quantitative comparison of the outputs of various embedding algorithms. Moreover, it is quickly and easily computed. However, two points should be noted.

First, the measure quantifies only the local quality of the embedding. As emphasized in Fig. 3, even outputs that do not preserve the global structure of the input may yield relatively low values if the local neighborhood structure is generally preserved. This problem is shared by all manifold-embedding techniques that minimize only local attributes of the data. It can be circumvented by adding a penalty for outputs that embed geodesically distant points close to each other, where distant points can be defined, for example, as points whose distance on the neighborhood graph exceeds a given threshold.
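One way such a penalty could be implemented is sketched below, under our own illustrative choices: the threshold `tau` (hop distance on the neighborhood graph) and the `margin` within which distant points are penalized are hypothetical parameters, not values from the paper.

```python
from collections import deque
import numpy as np

def graph_hop_distances(neighbors):
    """All-pairs hop distances on the neighborhood graph via BFS.
    `neighbors[i]` lists the indices adjacent to point i."""
    n = len(neighbors)
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors[u]:
                if D[s, v] == np.inf:
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def distant_pair_penalty(Y, neighbors, tau=5, margin=1.0):
    """Penalize embeddings Y (n x d) that place graph-distant points
    (hop distance > tau) within `margin` of each other."""
    D = graph_hop_distances(neighbors)
    far = D > tau
    gaps = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.maximum(0.0, margin - gaps)[far].sum()
```

An embedding that folds the manifold onto itself brings graph-distant points together and so incurs a positive penalty, while a topologically sound embedding incurs none.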

Second, the measure is not ideal for assessing the quality of embedding for algorithms that normalize their output, such as LLE (Roweis and Saul, 2000), Laplacian Eigenmap (Belkin and Niyogi, 2003), and LTSA (Zhang and Zha, 2004). This is because normalization of the output distorts the structure of the local neighborhoods and therefore yields high values even when the output appears to recover the underlying structure of the input. This point (see also the discussion in Sha and Saul, 2005, Section 2) raises the question of which qualities are preserved by these algorithms, and how those qualities can be quantified. It is clear, however, that these algorithms do not perform a faithful embedding in the sense defined in Section 3. The scaling-invariant version of the measure addresses this problem to some degree by allowing a separate scaling of each neighborhood (see Table 2). One could consider an even more relaxed measure that allows not only rotation, translation, and scaling, but a general linear transformation of each neighborhood; however, it is not clear what exactly such a measure would quantify.

Two new embedding algorithms were introduced. We discuss some aspects of these algorithms below.
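The separate-scaling relaxation mentioned above can be sketched as follows: besides the rotation and translation, each neighborhood may be rescaled by a factor before the residual is measured. This is our own illustrative code, not the authors' implementation.

```python
import numpy as np

def procrustes_distance_scaled(X, Z):
    """Procrustes residual when, besides a rotation A and translation b,
    each neighborhood may be rescaled by a factor c:
        min over A, b, c of sum_j ||x_j - c * A z_j - b||^2."""
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    s = np.linalg.svd(Xc.T @ Zc, compute_uv=False)
    # For a fixed rotation the objective is quadratic in c; the optimal scale
    # is c = (sum of singular values) / ||Zc||^2, which gives the residual below.
    return (Xc ** 2).sum() - s.sum() ** 2 / (Zc ** 2).sum()
```

Uniformly rescaled rigid motions now score zero, so a normalized output is not penalized merely for a global shrinkage of each neighborhood.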

PSA, in the form suggested in this work, uses SA to align the tangent subspaces at all points. While PSA works reasonably well for small input sets and low output dimensions, it is not suitable for large data sets. However, the algorithm should not be rejected as a whole. Rather, a different or modified technique for subspace alignment, for example the use of landmarks (de Silva and Tenenbaum, 2003), is required to make this algorithm truly useful.

GP is very fast, can work on very large input sets, embedding even the largest of our data sets in less than an hour, and obtains good results, as shown both in Figs. 1-4 and in Table 2. This algorithm is therefore an efficient alternative to the existing algorithms. It may also be used to choose optimal parameters, such as neighborhood size and output dimension, before other algorithms are applied; the measure can then be used to compare GP outputs for the varied parameters.

An important issue that was not considered in depth in this paper is that of noisy input data. The main problem with noisy data is that, locally, the data appears to be of full dimension, even when the underlying manifold is of lower dimension. To overcome this problem, one should choose neighborhoods that are large relative to the magnitude of the noise, but not too large with respect to the curvature of the manifold. If the neighborhood size is chosen wisely, both PSA and GP should overcome the noise and perform the embedding well (see Fig. 6), because both algorithms are based on Procrustes analysis and PCA, which are relatively robust against noise. Further study is required to define a method for choosing the optimal neighborhood size.
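This trade-off can be inspected empirically by looking at the local PCA eigenvalue spectrum as a function of neighborhood size. The helper below is our own illustrative sketch, not part of the paper's method.

```python
import numpy as np

def local_pca_spectrum(X, center_idx, radius):
    """Eigenvalues (largest first) of the covariance of the neighborhood of
    X[center_idx] of the given radius. A sharp drop after d eigenvalues
    suggests the data looks d-dimensional at that scale; when the radius is
    small relative to the noise, the drop disappears."""
    dists = np.linalg.norm(X - X[center_idx], axis=1)
    nbhd = X[dists <= radius]
    nbhd = nbhd - nbhd.mean(axis=0)
    return np.linalg.svd(nbhd, compute_uv=False) ** 2 / len(nbhd)
```

On a noisy plane in three dimensions, for example, a large radius yields two dominant eigenvalues, while a radius comparable to the noise magnitude yields three eigenvalues of similar size, so the neighborhood no longer looks two-dimensional.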

## Appendix A Proofs

### A.1 Proof of Theorem 3.1

In this section, we denote the points of the neighborhood $X_i$ by $x_{i1},\dots,x_{ik(i)}$, where $k(i)$ is the number of neighbors in $X_i$; the corresponding low-dimensional points are denoted $z_{i1},\dots,z_{ik(i)}$.

###### Proof

In order to prove that the embedding is faithful, it is enough to show that for each $i$, $G(X_i,Z_i) = O\bigl(k(i)\,r_i^4\bigr)$, where $r_i$ is the radius of the $i$-th neighborhood. The proof consists of replacing the Procrustes rotation matrix $A_i$ by $J_i$, the Jacobian of the mapping at $z_i$. Note that the fact that the mapping is an isometry ensures that $J_i$ has orthonormal columns, so it is a valid Procrustes rotation. The Procrustes translation vector $b_i$ is replaced by $x_i - J_i z_i$.

Recall that, by definition, $x_{ij}$ is the image of $z_{ij}$ under the isometry; a Taylor expansion around $z_i$ therefore gives $x_{ij}-x_i = J_i(z_{ij}-z_i)+O\bigl(\|z_{ij}-z_i\|^2\bigr)$. Hence,

$$G(X_i,Z_i) = \inf_{A_i,\,b_i}\sum_{j=1}^{k(i)}\bigl\|x_{ij}-A_i z_{ij}-b_i\bigr\|^2 \le \sum_{j=1}^{k(i)}\bigl\|x_{ij}-J_i z_{ij}-(x_i-J_i z_i)\bigr\|^2 = \sum_{j=1}^{k(i)}\bigl\|(x_{ij}-x_i)-J_i(z_{ij}-z_i)\bigr\|^2 = \sum_{j=1}^{k(i)}O\bigl(\|z_{ij}-z_i\|^4\bigr).$$

The mapping is an isometry; therefore $\|z_{ij}-z_i\| = d_M(x_{ij},x_i)$, where $d_M$ is the geodesic metric. The sample is assumed to be dense; hence $\|x_{ij}-x_i\|$ is smaller than the minimum branch separation (see Section 2). Using Bernstein et al (2000, Lemma 3) we conclude that

$$\|z_{ij}-z_i\| = d_M(x_{ij},x_i) < \frac{\pi}{2}\,\|x_{ij}-x_i\|.$$

We can therefore write $\|z_{ij}-z_i\| = O(\|x_{ij}-x_i\|) = O(r_i)$, which completes the proof.
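Combining the two preceding bounds explicitly (our restatement, using the fact that $\|x_{ij}-x_i\|\le r_i$ for every point of the neighborhood):

$$G(X_i,Z_i)\;\le\;\sum_{j=1}^{k(i)}O\bigl(\|z_{ij}-z_i\|^4\bigr)\;\le\;\sum_{j=1}^{k(i)}O\Bigl(\bigl(\tfrac{\pi}{2}\,\|x_{ij}-x_i\|\bigr)^{4}\Bigr)\;=\;O\bigl(k(i)\,r_i^4\bigr).$$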

### A.2 Proof of Theorem 3.2

The proof of Theorem 3.2 is based on the idea that the PCA projection matrix $P_i$ is usually a good approximation of the span of the Jacobian $J_i$. The structure of the proof is as follows. First we quantify the relations between the projections of the $i$-th neighborhood obtained using the PCA projection matrix and the Jacobian, respectively. Then we follow the lines of the proof of Theorem 3.1, using the bounds obtained previously.

To compare the PCA projection subspace and the tangent subspace at $x_i$ we use the notion of angles between subspaces. Note that both subspaces are $d$-dimensional and are embedded in the Euclidean space $\mathbb{R}^D$. The columns of the matrices $P_i$ and $J_i$ consist of orthonormal bases of the PCA projection space and of the tangent space, respectively. Denote these subspaces by $\mathcal{P}_i$ and $\mathcal{J}_i$, respectively. Surprisingly, the angle between $\mathcal{P}_i$ and $\mathcal{J}_i$ can be arbitrarily large. However, in the following we show that even if the angle between the subspaces is large, the projected neighborhoods are close.

We start with some definitions. The principal angles $\sigma_p$ between $\mathcal{P}_i$ and $\mathcal{J}_i$ may be defined recursively for $p=1,\dots,d$ as (see Golub and Van Loan, 1983)

$$\cos(\sigma_p)=\max_{v\in\mathcal{P}_i}\;\max_{w\in\mathcal{J}_i} v'w,$$

subject to

$$\|v\|=\|w\|=1,\qquad v'v_k=0,\qquad w'w_k=0;\qquad k=1,\dots,p-1.$$

The maximizing vectors $v_p$ and $w_p$ are called principal vectors.

The fact that $P_i$ and $J_i$ have orthonormal columns leads to a simple way to calculate the principal vectors and angles explicitly. Let $USV'$ be the SVD of $P_i'J_i$. Then (see Golub and Van Loan, 1983)

1. The principal vectors $v_1,\dots,v_d$ are given by the columns of $P_iU$.

2. The principal vectors $w_1,\dots,w_d$ are given by the columns of $J_iV$.

The relations between the two sets of vectors play an important role in our computations. Write $w_p = \alpha_p v_p + \beta_p v_p^{\perp}$, where $\alpha_p = v_p'w_p$, $\beta_p = \sqrt{1-\alpha_p^2}$, and $v_p^{\perp}$ is a unit vector orthogonal to $v_p$. Note that $\alpha_p$ is the cosine of the $p$-th principal angle between $\mathcal{P}_i$ and $\mathcal{J}_i$. The angle between the subspaces is defined as the largest principal angle, and the distance between the two subspaces is defined to be the sine of this angle.

We now prove some basic claims related to the principal vectors.

###### Lemma 2

Let $P_i$ be the PCA projection matrix of the $i$-th neighborhood and let $J_i$ be the Jacobian of the mapping at $z_i$. Let $USV'$ be the SVD of $P_i'J_i$, and let $v_p$ and $w_p$ be the columns of $P_iU$ and $J_iV$, respectively. Then

1. $v_1,\dots,v_d$ are an orthonormal vector system.

2. $w_1,\dots,w_d$ are an orthonormal vector system.

3. $v_p'w_q = \delta_{pq}\alpha_p$ for all $p,q$.

4. $(\beta_p v_p^{\perp})'(\beta_q v_q^{\perp}) = 0$ for $p\neq q$.

5. $(\beta_p v_p^{\perp})'v_q = 0$ for all $p,q$.

###### Proof

1. True, since $P_iU$ is a matrix with orthonormal columns.

2. True, since $J_iV$ is a matrix with orthonormal columns.

3. Note that $(P_iU)'(J_iV) = U'(P_i'J_i)V = U'(USV')V = S$, where $S$ is a diagonal non-negative matrix whose diagonal entries are the $\alpha_p$.

4. For $p\neq q$,
$$(\beta_p v_p^{\perp})'(\beta_q v_q^{\perp}) = (w_p-\alpha_p v_p)'(w_q-\alpha_q v_q) = w_p'w_q - \alpha_q w_p'v_q - \alpha_p v_p'w_q + \alpha_p\alpha_q v_p'v_q = 0.$$

5. $$(\beta_p v_p^{\perp})'v_q = (w_p-\alpha_p v_p)'v_q = w_p'v_q - \alpha_p v_p'v_q = \delta_{pq}\alpha_p - \delta_{pq}\alpha_p = 0.$$
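The claims of Lemma 2 are easy to verify numerically for random subspaces. Below is a self-contained NumPy check under our notation, with the cosines $\alpha_p$ stored in `alphas` and the residuals $\beta_p v_p^{\perp}$ stacked as the columns of `R`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 2

# Two random d-dimensional subspaces of R^D, given by orthonormal bases.
P, _ = np.linalg.qr(rng.normal(size=(D, d)))
J, _ = np.linalg.qr(rng.normal(size=(D, d)))

# Principal vectors and cosines via the SVD of P'J.
U, alphas, Vt = np.linalg.svd(P.T @ J)
V_vecs = P @ U          # v_1, ..., v_d (columns)
W_vecs = J @ Vt.T       # w_1, ..., w_d (columns)

# Claims 1-3: both systems are orthonormal, and v_p' w_q = delta_pq * alpha_p.
assert np.allclose(V_vecs.T @ V_vecs, np.eye(d))
assert np.allclose(W_vecs.T @ W_vecs, np.eye(d))
assert np.allclose(V_vecs.T @ W_vecs, np.diag(alphas))

# Claims 4-5: the residuals beta_p * v_p^perp = w_p - alpha_p * v_p are
# mutually orthogonal and orthogonal to every v_q.
R = W_vecs - V_vecs * alphas
assert np.allclose(R.T @ R, np.diag(np.diag(R.T @ R)))
assert np.allclose(V_vecs.T @ R, 0)
```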

Using the relations between the principal vectors, we can compare the description of the neighborhood in the local PCA coordinates with its description in the tangent-space coordinates. Here we need to exploit two main facts. The first is that the local PCA projection of a neighborhood is the best approximation, in the least-squares sense, to the original neighborhood; in particular, it is a better approximation than the projection onto the tangent space. The second is that in a small neighborhood of $x_i$, the tangent space itself is a good approximation to the original neighborhood. Formally, the first assertion means that

$$\sum_{j=1}^{k(i)}\bigl\|x_{ij}-\bar{x}_i\bigr\|^2 \;\ge\; \sum_{j=1}^{k(i)}\bigl\|P_i'(x_{ij}-\bar{x}_i)\bigr\|^2 \;\ge\; \sum_{j=1}^{k(i)}\bigl\|J_i'(x_{ij}-\bar{x}_i)\bigr\|^2 \tag{15}$$

while the second assertion means that

$$\sum_{j=1}^{k(i)}\bigl\|x_{ij}-\bar{x}_i\bigr\|^2-\sum_{j=1}^{k(i)}\bigl\|J_i'(x_{ij}-\bar{x}_i)\bigr\|^2=O(r_i^4). \tag{16}$$

The proof of Eq. 16 is straightforward. First note that

 (xij−¯xi) = (xij−xi)−(¯xi−