## 1 Introduction

Many high-dimensional data in real-world applications can be modeled as data points lying close to a low-dimensional nonlinear manifold. Discovering the structure of the manifold from a set of data points sampled from the manifold possibly with noise represents a very challenging unsupervised learning problem

[2, 3, 4, 8, 9, 10, 13, 14, 15, 17, 18]. The discovered low-dimensional structures can be further used for classification, clustering, outlier detection and data visualization. Example low-dimensional manifolds embedded in high-dimensional input spaces include image vectors representing the same 3D objects under different camera views and lighting conditions, a set of document vectors in a text corpus dealing with a specific topic, and a set of 0-1 vectors encoding the test results on a set of multiple choice questions for a group of students

[13, 14, 18]. The key observation is that the dimensions of the embedding spaces can be very high (e.g., the number of pixels for each images in the image collection, the number of terms (words and/or phrases) in the vocabulary of the text corpus, or the number of multiple choice questions in the test), the intrinsic dimensionality of the data points, however, are rather limited due to factors such as physical constraints and linguistic correlations. Traditional dimension reduction techniques such as principal component analysis and factor analysis usually work well when the data points lie close to a

linear (affine) subspace in the input space [7]. They can not, in general, discover nonlinear structures embedded in the set of data points.Recently, there have been much renewed interests in developing efficient algorithms for constructing nonlinear low-dimensional manifolds from sample data points in high-dimensional spaces, emphasizing simple algorithmic implementation and avoiding optimization problems prone to local minima [14, 18]. Two lines of research of manifold learning and nonlinear dimension reduction have emerged: one is exemplified by [2, 3, 18] where pairwise geodesic

distances of the data points with respect to the underlying manifold are estimated, and the classical multi-dimensional scaling is used to project the data points into a low-dimensional space that best preserves the geodesic distances. Another line of research follows the long tradition starting with self-organizing maps (SOM)

[10], principal curves/surfaces [6] and topology-preserving networks [11]. The key idea is that the information about the global structure of a nonlinear manifold can be obtained from a careful analysis of the interactions of the overlapping local structures. In particular, the local linear embedding (LLE) method constructs a local geometric structure that is invariant to translations and orthogonal transformations in a neighborhood of each data points and seeks to project the data points into a low-dimensional space that best preserves those local geometries [14, 16].Our approach draws inspiration from and improves upon the work in [14, 16] which opens up new directions in nonlinear manifold learning with many fundamental problems requiring to be further investigated. Our starting point is not to consider nonlinear dimension reduction in isolation as merely constructing a nonlinear projection, but rather to combine it with the process of reconstruction of the nonlinear manifold, and we argue that the two processes interact with each other in a mutually reinforcing way. In this paper, we address two inter-related objectives of nonlinear structure finding: 1) to construct the so-called principal manifold [6] that goes through “the middle” of the data points; and 2) to find the global coordinate system (the natural parametrization space) that characterizes the set of data points in a low-dimensional space. The basic idea of our approach is to use the tangent space in the neighborhood of a data point to represent the local geometry, and then align those local tangent spaces to construct the global coordinate system for the nonlinear manifold.

The rest of the paper is organized as follows: in section 2, we formulate the problem of manifold learning and dimension reduction in more precise terms, and illustrate the intricacy of the problem using the linear case as an example. In section 3, we discuss the issue of learning local geometry using tangent spaces, and in section 4 we show how to align those local tangent spaces in order to learn the global coordinate system of the underlying manifold. Section 5 discusses how to construct the manifold once the global coordinate system is available. We call the new algorithm local tangent space alignment (LTSA) algorithm. In section 6, we present an error analysis of LTSA, especially illustrating the interactions among curvature information embedded in the Hessian matrices, local sampling density and noise level, and the regularity of the Jacobi matrix. In section 7, we show how the partial eigendecomposition used in global coordinate construction can be efficiently computed. We then present a collection of numerical experiments in section 8. Section 9 concludes the paper and addresses several theoretical and algorithmic issues for further research and improvements.

## 2 Manifold Learning and Dimension Reduction

We assume that a -dimensional manifold embedded in an -dimensional space can be represented by a function

where is a compact subset of with open interior. We are given a set of data points , where are sampled possibly with noise from the manifold, i.e.,

where represents noise. By dimension reduction we mean the estimation of the unknown lower dimensional feature vectors ’s from the ’s, i.e., the ’s which are data points in is (nonlinearly) projected to ’s which are points in , with we realize the objective of dimensionality reduction of the data points. By manifold learning we mean the reconstruction of from the ’s, i.e., for an arbitrary test point , we can provide an estimate of . These two problems are inter-related, and the solution of one leads to the solution of the other. In some situations, dimension reduction can be the means to an end by itself, and it is not necessary to learn the manifold. In this paper, however, we promote the notion that both problems are really the two sides of the same coin, and the best approach is not to consider each in isolation. Before we tackle the algorithmic details, we first want to point out that the key difficulty in manifold learning and nonlinear dimension reduction from a sample of data points is that the data points are unorganized, i.e., no adjacency relationship among them are known beforehand. Otherwise, the learning problem becomes the well-researched nonlinear regression problem (for a more detailed discussion, see [4] where techniques from computational geometry was used to solve error-free manifold learning problems). To ease discussion, in what follows we will call the space where the data points live the input space, and the space into which the data points are projected the feature space.

To illustrate the concepts and problems we have introduced, we consider the example of linear manifold learning and linear dimension reduction. We assume that the set of data points are sampled from a -dimensional affine subspace, i.e.,

where and represents noise. is a matrix forms an orthonormal basis of the affine subspace. Let

Then in matrix form, the data-generation model can be written as

here is an -dimensional column vector of all ones. The problem of linear manifold learning amounts to seeking and to minimize the reconstruction error , i.e.,

where

stands for the Frobenius norm of a matrix. This problem can be readily solved by the singular value decomposition (SVD) based upon the following two observations.

1) The norm of the error matrix can be reduced by removing the mean of the columns of from each column of , and hence one can assume that the optimal has zero mean. This requirement can be fulfilled if is chosen as the mean of , i.e., .

2) The low-rank matrix is the optimal rank- approximation to the centered data matrix . Hence the the optimal solution is given by the SVD of ,

i.e., where with the largest singular values of , and are the matrices of the corresponding left and right singular vectors, respectively. The optimal is then given by and the learned linear manifold is represented by the linear function

In this model, the coordinate matrix corresponding to the data matrix is given by

Ideally, the dimension of the learned linear manifold should be chosen such that .

The function is not unique in the sense that it can be reparametrized, i.e., the coordinate can be replaced by with a global affine transformation , if we change the basis matrix to

. What we are interested in with respect to dimension reduction is the low-dimensional representation of the linear manifold in the feature space. Therefore, without loss of generality, we can assume that the feature vectors are uniformly distributed. For a given data set, this amounts to assuming that the coordinate matrix

is orthonormal in row, i.e., . Hence we we can take and the linear function is now the followingFor the linear case we just discussed, the problem of dimension reduction is solved by computing the right singular vectors , and this can be done without the help of the linear function . Similarly, the construction of the linear function is done by computing which are just the largest left singular vectors of .

The case for nonlinear manifolds is more complicated. In general, the global nonlinear structure will have to come from local linear analysis and alignment [14, 17]. In [14], local linear structure of the data set are extracted by representing each point as a weighted linear combination of its neighbors, and the local weight vectors are preserved in the feature space in order to obtain a global coordinate system. In [17], a linear alignment strategy was proposed for aligning a general set of local linear structures. The type of local geometric information we use is the tangent space at a given point which is constructed from a neighborhood of the given point. The local tangent space provides a low-dimensional linear approximation of the local geometric structure of the nonlinear manifold. What we want to preserve are the local coordinates of the data points in the neighborhood with respect to the tangent space. Those local tangent coordinates will be aligned in the low dimensional space by different local affine transformations to obtain a global coordinate system. Our alignment method is similar in spirit to that proposed in [17]. In the next section we will discuss the local tangent space and global alignment which will then be applied to data points sampled with noise in Section 4.

## 3 Local Tangent Space and Its Global Alignment

We assume that is a -dimensional manifold in a -dimensional space with unknown generating function , and we are given a data set consists of -dimensional vectors generated from the following noise-free model,

where with . The objective as we mentioned before for nonlinear dimension reduction is to reconstruct ’s from the corresponding function values ’s without explicitly constructing . Assume that the function is smooth enough, using first-order Taylor expansion at a fixed , we have

(3.1) |

where is the Jacobi matrix of at . If we write the components of as

The tangent space of at is spanned by the column vectors of and is therefore of dimension at most , i.e., . The vector gives the coordinate of in the affine subspace . Without knowing the function , we can not explicitly compute the Jacobi matrix . However, if we know in terms of , a matrix forming an orthonormal basis of , we can write

Furthermore,

The mapping from to represents a local affine transformation. This affine transformation is unknown because we do not know the function . The vector , however, has an approximate that orthogonally projects onto ,

(3.2) |

provided is known at each . Ignoring the second-order term, the global coordinate satisfies

Here defines the neighborhood of . Therefore, a natural way to approximate the global coordinate is to find a global coordinate and a local affine transformation that minimize the error function

(3.3) |

This represents a nonlinear alignment approach for the dimension reduction problem (this idea will be picked up at the end of section 4).

On the other hand, a linear alignment approach can be devised as follows. If is of full column rank, the matrix should be non-singular and

The above equation shows that the affine transformation should align this local coordinate with the global coordinate for . Naturally we should seek to find a global coordinate and a local affine transformation to minimize

(3.4) |

The above amounts to matching the local geometry in the feature space. Notice that is defined by the “known” function value and the “unknown” orthogonal basis matrix of the tangent space. It turns out, however, can be approximately determined by certain function values. We will discuss this approach in the next section. Clearly, this linear approach is more readily applicable than (3.3). Obviously, If the manifold is not regular, i.e., the Jacobi matrix is not of full column rank at some points , then the two minimization problems (3.4) and (3.3) may lead to quite different solutions.

As is discussed in the linear case, the low-dimensional feature vector is not uniquely determined by the manifold . We can reparametrize using where is a smooth -to- onto mapping of to itself. The parameterization of can be fixed by requiring that has a uniform distribution over . This will come up as a normalization issue in the next section.

## 4 Feature Extraction through Alignment

Now we consider how to construct the global coordinates and local affine transformation when we are given a data set sampled with noise from an underlying nonlinear manifold,

where , with . For each , let be a matrix consisting of its -nearest neighbors including , say in terms of the Euclidean distance. Consider computing the best -dimensional affine subspace approximation for the data points in ,

where is of columns and is orthonormal, and . As is discussed in section 2, the optimal is given by , the mean of all the ’s and the optimal is given by , the left singular vectors of corresponding to its largest singular values, and is given by defined as

(4.5) |

Therefore we have

(4.6) |

where denotes the reconstruction error.

We now consider constructing the global coordinates , in the low-dimensional feature space based on the local coordinates which represents the local geometry. Specifically, we want to satisfy the following set of equations, i.e., the global coordinates should respect the local geometry determined by the ,

(4.7) |

where is the mean of . In matrix form,

where and is the local reconstruction error matrix, and we write

(4.8) |

To preserve as much of the local geometry in the low-dimensional feature space, we seek to find and the local affine transformations to minimize the reconstruction errors , i.e.,

(4.9) |

Obviously, the optimal alignment matrix that minimizes the local reconstruction error for a fixed , is given by

where is the Moor-Penrose generalized inverse of . Let and be the - selection matrix such that . We then need to find to minimize the overall reconstruction error

where , and with

(4.10) |

To uniquely determine , we will impose the constraints , it turns out that the vector

of all ones is an eigenvector of

(4.11) |

corresponding to a zero eigenvalue, therefore, the optimal is given by the eigenvectors of the matrix , corresponding to the nd to +1st smallest eigenvalues of .

Remark. We now briefly discuss the nonlinear alignment idea mentioned in (3.3). In particular, in a neighborhood of a data point consisting of data points , by first order Taylor expansion, we have

Let be the neighborhood selection matrix as defined before, we seek to find and to minimize

where . The LTSA algorithm can be considered as an approach to find an approximate solution to the above minimization problem. We can, however, seek to find the optimal solution of using an alternating least squares approach: fix minimize with respect to , and fix minimize with respect to , and so on. As an initial value to start the alternating least squares, we can use the obtained from the LTSA algorithm. The details of the algorithm will be presented in a separate paper.

Remark. The minimization problem (4.9) needs certain constraints (i.e., normalization conditions) to be well-posed, otherwise, one can just choose both and to be zero. However, there are more than one way to impose the normalization conditions. The one we have selected, i.e., , is just one of the possibilities. To illustrate the issue we look at the following minimization problem,

The approach we have taken amounts to substituting , and minimize with the normalization condition . However,

and we can minimize the above by imposing the normalization condition

This nonuniqueness issue is closely related to to nonuniqueness of the parametrization of the nonlinear manifold , which can reparametrized as with a 1-to-1 mapping .

## 5 Constructing Principal Manifolds

Once the global coordinates are computed for each of the data points , we can apply some non-parametric regression methods such as local polynomial regression to to construct the principal manifold underlying the set of points . Here each of the component functions can be constructed separately, for example, we have used the simple loess function [19] in some of our experiments for generating the principal manifolds.

In general, when the low-dimensional coordinates are available, we can construct an mapping from the -space (feature space) to the -space (input space) as follows.

1. For each fixed , let be the nearest neighbor (i.e., for ). Define

where be the mean of the feature vectors in a neighbor to which belong.

2. Back in the input space, we define

Let us define by the resulted mapping,

(5.12) |

To distinguish the computed coordinates from the exact ones, in the rest of this paper, we denote by the exact coordinate, i.e.,

(5.13) |

Obviously, the errors of the reconstructed manifold represented by depend on the sample errors , the local tangent subspace reconstruction errors , and the alignment errors . The following result show that this dependence is linear. Let , , and . Then

Substituting into (5.12) gives

Because , we obtain that

Therefore we have

completing the proof.

In the next section, we will give a detailed error analysis to estimate the errors of alignment and tangent space approximation in terms of the noise, the geometric properties of the generating function and the density of the generating coordinates .

## 6 Error Analysis

As is mentioned in the previous section, we assume that that the data points are generated by

For each , let be a matrix consisting of its -nearest neighbors including in terms of the Euclidean distance. Similar to defined in (4.8), we denote by the corresponding local noise matrix, . The low-dimensional embedding coordinate matrix computed by the LTSA algorithm is denoted by . We first present a result that bounds in terms of

Assume satisfies . Let be the mean of , Denote and the Hessian matrix of the -th component function of . If the ’s are nonsingular, then

where is defined by

Furthermore, if each neighborhood is of size , then . First by (4.8), we have

(6.14) |

To represent in terms of the Jacobi matrix of , we assume that is smooth enough and use Taylor expansion at , the mean of the neighbors of , we have

where and represents the remainder term beyond the first order expansion, in particular, its -th components can be approximately written as (using second order approximation),

with the Hessian matrix of the -th component function of at . We have in matrix form,

with . Multiplying by the centering matrix gives

(6.15) |

Substituting (6.15) into (6.14) and denoting , we obtain that

(6.16) |

For any satisfying the orthogonal condition and any , we also have the similar expression of (6.16) for and . Note that and , , minimize the overall reconstruction error, . Setting and , we obtain the upper bound

We estimate the norm by ignoring the higher order terms, and obtain that

completing the proof.

The non-singularity of the matrix requires that the Jacobi matrix be of full column rank and the two subspaces and the largest left singular vector space are not orthogonal to each other. We now give a quantitative measurement of the non-singularity of . Let be the -th singular value of , and denote with defined in Theorem 6. Then

The proof is simple. Let be the SVD of the matrix . By (6.15) and perturbation bounds for singular subspaces [5, Theorem 8.6.5], the singular vector matrix can be expressed as

(6.17) |

with

where is the -largest singular value of . On the other hand, from the SVD of , we have , which gives

It follows that

Therefore we have

completing the proof.

The degree of non-singularity of is determined by the curvature of the manifold and the rotation of the singular subspace is mainly affected by the sample noises ’s and the neighborhood structure of ’s. The above error bounds clearly show that reconstruction accuracy will suffer if the manifold underlying the data set has singular or near-singular points. This phenomenon will be illustrated in the numerical examples in section 8. Finally, we give an error upper bound for the tangent subspace approximation. Let be the spectrum condition number of the -column matrix . Then

By (6.15), we write

with . To estimate , we use the expression (6.17) to obtain

Taking norms gives that

The result required follows. The above results show that the accurate determination of the local tangent space is dependent on several factors: curvature information embedded in the Hessian matrices, local sampling density and noise level, and the regularity of the Jacobi matrix.

## 7 Numerical Computation Issues

One major computational cost of LTSA involves the computation of the smallest eigenvectors of the symmetric positive semi-defined matrix defined in (4.11). in general will be quite sparse because of the local nature of the construction of the neighborhoods. Algorithms for computing a subset of the eigenvectors for large and/or sparse matrices are based on computing projections of onto a sequence of Krylov subspaces of the form

for some initial vectors [5]. Hence the computation of matrix-vector multiplications needs to be done efficiently. Because of the special nature of , can be computed neighborhood by neighborhood without explicitly forming ,

where as defined in (4.10),

Each term in the above summation only involves the ’s in one neighborhood.

The matrix in the right factor of is the orthogonal projector onto the subspace spanned by the rows of . If the SVD of is available, the orthogonal projector is given by , where is the submatrix of first columns of

. Otherwise, one can compute the QR decomposition

of and obtain [5]. Clearly, we have because . Then we can rewrite as
Comments

There are no comments yet.