# Avoiding unwanted results in locally linear embedding: A new understanding of regularization

We demonstrate that locally linear embedding (LLE) inherently admits some unwanted results when no regularization is used, even in cases where regularization is not supposed to be needed by the original algorithm. The existence of one special type of result, which we call a "projection pattern", is mathematically proved in the situation where an exact local linear relation is achieved in each neighborhood of the data. These special patterns, as well as some other bizarre results that may occur in more general situations, are shown by numerical examples on the Swiss roll with a hole embedded in a high dimensional space. It is observed that all of these bad results can be effectively prevented by using regularization.


## 1 Introduction

Let $X=\{x_i\}_{i=1}^N$ be a collection of points in some high dimensional space $\mathbb{R}^D$. The general goal of manifold learning (or nonlinear dimensionality reduction) is to find for $X$ a representation $Y=\{y_i\}_{i=1}^N$ in some lower dimensional $\mathbb{R}^d$, under the assumption that $X$ lies on some unknown submanifold of $\mathbb{R}^D$.

Locally linear embedding (LLE) [6], due to its simplicity in idea as well as its efficiency in implementation, is a popular manifold learning method; it has been studied extensively and has many variants and modified versions [4, 3, 5, 1, 9, 2, 8, 7]. The goal of this paper is to point out a fundamental problem, which seems not to have been addressed yet, and then to propose a simple solution. Precisely, we will demonstrate that LLE inherently admits some unwanted results when no regularization is used, even in cases where regularization is not supposed to be needed by the original algorithm. The solution is simply to use regularization in every case.

The existence of one special type of unwanted result, which we call "projection patterns", will be mathematically proved in the situation where an exact local linear relation is achieved in each neighborhood of the dataset $X$. Such projection patterns, as indicated by their name, are basically direct projections of $X$ from $\mathbb{R}^D$ onto a $d$-dimensional hyperplane, totally ignoring the geometry of the data. Thanks to the use of regularization, they do not appear for most artificial datasets, which reside in some low dimensional $\mathbb{R}^D$ (mostly $D=3$). The problem is that for general data in a high dimensional $\mathbb{R}^D$, regularization is supposed to be unnecessary and is not employed. We will show by numerical examples that this practice is risky. The idea is simple: we perform some embeddings of the Swiss roll with a hole into a high dimensional $\mathbb{R}^D$ such that regularization is not triggered in the original LLE algorithm, and then compare the results with those obtained when regularization is used. It turns out that, if regularization is not used, the projection phenomenon always occurs as predicted when an isometric embedding is applied, while if some further perturbation is added to the embedding, more bizarre results may appear. By contrast, with regularization, all the bad results are effectively prevented.

## 2 Projection patterns in LLE

Let us begin with a quick review of the LLE procedure. As in the introduction, the dataset $X=\{x_i\}_{i=1}^N$ is a subset of $\mathbb{R}^D$ and we want to find for it a representation $Y=\{y_i\}_{i=1}^N$ in some lower dimensional $\mathbb{R}^d$.

1. For each $x_i$, let $\{x_{i_1},\dots,x_{i_k}\}$ be a $k$-nearest neighborhood of $x_i$. (For simplicity we consider $k$ to be fixed for all $i$.)

2. Set $w^{(i)}=(w^{(i)}_1,\dots,w^{(i)}_k)$ to be a solution of the problem

$$\operatorname*{argmin}_{(w_1,\dots,w_k)\in\mathbb{R}^k}\Big\|x_i-\sum_{j=1}^k w_j x_{i_j}\Big\|^2\quad\text{s.t.}\quad\sum_{j=1}^k w_j=1. \tag{P1}$$

3. With $w^{(i)}$ given from Step 2, set $\{y_i\}_{i=1}^N$ to be a solution of the problem

$$\operatorname*{argmin}_{\{y_i\}_{i=1}^N\subset\mathbb{R}^d}\sum_{i=1}^N\Big\|y_i-\sum_{j=1}^k w^{(i)}_j y_{i_j}\Big\|^2\quad\text{s.t.}\quad YY^T=I, \tag{P2}$$

where $Y$ denotes the matrix $[y_1\ \cdots\ y_N]\in\mathbb{R}^{d\times N}$, and $I$ is the $d\times d$ identity matrix.

For convenience, in the following we reserve $i$ to be the index that runs from $1$ to $N$, and $i_1,\dots,i_k$ to be the indices of the corresponding $k$-nearest points chosen in Step 1.
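
The three steps above can be sketched as follows. This is a minimal NumPy illustration for orientation only, not the implementation used in the paper; the neighborhood size $k$ and the regularization $\epsilon$ used in Step 2 are arbitrary choices of ours.

```python
# Minimal sketch of the LLE pipeline described in Steps 1-3 above.
# Input: point cloud X of shape (N, D); output: embedding of shape (N, d).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def lle(X, d=2, k=8, eps=1e-3):
    N, D = X.shape
    # Step 1: k-nearest neighbors of each x_i (excluding x_i itself).
    nbrs = np.argsort(cdist(X, X), axis=1)[:, 1:k + 1]
    # Step 2: reconstruction weights, problem (P1), with regularization.
    W = np.zeros((N, N))
    for i in range(N):
        Z = (X[nbrs[i]] - X[i]).T          # local matrix Z_i, shape D x k
        C = Z.T @ Z                        # local Gram matrix C_i
        w = np.linalg.solve(C + eps * np.eye(k), np.ones(k))
        W[i, nbrs[i]] = w / w.sum()        # enforce sum_j w_j = 1
    # Step 3: problem (P2); bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = eigh(M)                   # ascending eigenvalues
    return vecs[:, 1:d + 1]                # skip the constant eigenvector

Y = lle(np.random.default_rng(0).normal(size=(200, 3)))
print(Y.shape)                             # (200, 2)
```

The discarding of the first (smallest-eigenvalue) eigenvector in the last line is exactly the convention whose failure mode is discussed in Remark 1 below.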

The following remark suggests a possible problem in the LLE algorithm, which turns out to be exactly what happens in the "unwanted results" discussed later.

###### Remark 1.

Step 3 is stated in accordance with the original idea. It is not exactly what is executed in the algorithm. To be precise, write

$$\sum_{i=1}^N\Big\|y_i-\sum_{j=1}^k w^{(i)}_j y_{i_j}\Big\|^2=\|(I-W)Y^T\|^2,$$

where $W$ is the $N\times N$ matrix defined by

$$W_{is}=\begin{cases}w^{(i)}_j & \text{if }s=i_j\ (j=1,\dots,k),\\ 0 & \text{if }s\notin\{i_1,\dots,i_k\},\end{cases}$$

and $\|\cdot\|$ for matrices denotes the Frobenius norm. Then, it is known that solutions of (P2) are given by

$$Y^T=[g_1\ \cdots\ g_d], \tag{1}$$

where $g_1,\dots,g_d$ are any selection of orthonormal eigenvectors of $(I-W)^T(I-W)$ corresponding to its $d$ smallest eigenvalues. However, in this way one of the $g_\ell$'s might be $\frac{1}{\sqrt N}\mathbf{1}_N$, corresponding to the eigenvalue $0$. Here $\mathbf{1}_N$ denotes the vector in $\mathbb{R}^N$ whose components all equal $1$. This eigenvector is redundant for our purpose, since the corresponding coordinate is constant over all the $y_i$'s, and the dimension reduced data given by (1) is trivially isometric to that given by the remaining $d-1$ eigenvectors. Therefore, to find an effective $d$-dimensional representation for $X$, we would like to exclude $\frac{1}{\sqrt N}\mathbf{1}_N$ from our selection of eigenvectors. What the LLE algorithm executes is to find the first $d+1$ eigenvectors (ordered by increasing eigenvalue), and to set $g_1,\dots,g_d$ to be the second through the $(d+1)$-th ones respectively, in the anticipation that the first eigenvector should be $\frac{1}{\sqrt N}\mathbf{1}_N$. However, note that this approach may break down when the multiplicity of the eigenvalue $0$ is greater than one, so that more than one eigenvector corresponds to it. In this situation it is not guaranteed that the first eigenvector found by the computer is $\frac{1}{\sqrt N}\mathbf{1}_N$. Indeed, it is not necessary that any one of the eigenvectors found equals $\frac{1}{\sqrt N}\mathbf{1}_N$.
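
The breakdown described above is easy to reproduce numerically. The following toy example uses a hypothetical block-structured weight matrix of our own construction (not data from the paper's experiments): two groups of points that only reference each other make the eigenvalue $0$ of $(I-W)^T(I-W)$ have multiplicity two, so an eigensolver need not return $\frac{1}{\sqrt N}\mathbf{1}_N$ first.

```python
# Toy demonstration: when the neighborhood graph is disconnected, the
# indicator vectors of both groups are 0-eigenvectors of (I-W)^T(I-W),
# so eigenvalue 0 has multiplicity two.
import numpy as np

def block_W(n):
    # each point takes equal weights on the other n-1 points of its group,
    # so every row sums to 1 as required by the constraint in (P1)
    B = np.full((n, n), 1.0 / (n - 1))
    np.fill_diagonal(B, 0.0)
    return B

n = 4
W = np.zeros((2 * n, 2 * n))
W[:n, :n] = block_W(n)          # group 1 only references group 1
W[n:, n:] = block_W(n)          # group 2 only references group 2
M = (np.eye(2 * n) - W).T @ (np.eye(2 * n) - W)
vals = np.linalg.eigvalsh(M)
print(np.sum(vals < 1e-10))     # 2: eigenvalue 0 has multiplicity two
```

Disconnection is only the crudest mechanism; the experiments in Section 3 show that a nearly singular $(I-W)^T(I-W)$ causes the same trouble for connected data.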

Now we turn to the resolution of (P1). It can be rewritten as

$$\operatorname*{argmin}_{w\in\mathbb{R}^k}\|Z_iw\|^2\quad\text{s.t.}\quad\mathbf{1}_k^Tw=1,$$

where

$$Z_i=[x_{i_1}-x_i\ \cdots\ x_{i_k}-x_i]\in\mathbb{R}^{D\times k}.$$

Let $C_i=Z_i^TZ_i$; then $\|Z_iw\|^2=w^TC_iw$. By the method of Lagrange multipliers, any minimizer of (P1) satisfies

$$\begin{cases}C_iw=\lambda\mathbf{1}_k\\ \mathbf{1}_k^Tw=1\end{cases} \tag{2}$$

for some $\lambda\in\mathbb{R}$. If $C_i$ is invertible, (2) has a unique solution

$$w^{(i)}=\frac{C_i^{-1}\mathbf{1}_k}{\mathbf{1}_k^TC_i^{-1}\mathbf{1}_k}.$$

On the other hand, if $C_i$ is singular, (2) may have multiple solutions. In this situation it is suggested that a small regularization term $\epsilon I$ be added to $C_i$, setting

$$w^{(i)}=w^{(i)}(\epsilon)=\frac{(C_i+\epsilon I)^{-1}\mathbf{1}_k}{\mathbf{1}_k^T(C_i+\epsilon I)^{-1}\mathbf{1}_k}. \tag{3}$$

For efficiency, in the LLE algorithm whether the regularization term is used is determined not by the invertibility of $C_i$, but by the following rule: it is used if $k>D$, and not used if $k\le D$. Note that this rule is only a convenient one and is not in exact accordance with the original idea. Precisely, if $k>D$ then $C_i$ must indeed be singular, but $k\le D$ does not guarantee that $C_i$ is non-singular. Nevertheless, since we will advocate using regularization (namely formula (3)) whether or not $C_i$ is singular, this discrepancy does not concern us.
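
A quick numerical check of formula (3), on random synthetic data of our own with $k>D$ so that $C_i$ is necessarily singular: for every $\epsilon>0$ the weight vector is well defined and satisfies the affine constraint, and the reconstruction residual $\|Z_iw^{(i)}(\epsilon)\|$ does not increase as $\epsilon$ decreases.

```python
# Formula (3) in action for a singular local Gram matrix C_i.
# Since k > D, the columns of Z_i cannot be linearly independent.
import numpy as np

rng = np.random.default_rng(1)
D, k = 3, 8
xi = rng.normal(size=D)
nbrs = xi + 0.1 * rng.normal(size=(k, D))   # k synthetic neighbors of x_i
Z = (nbrs - xi).T                           # Z_i, shape D x k
C = Z.T @ Z                                 # rank <= D < k: singular

def w_eps(eps):
    # (C_i + eps I)^{-1} 1_k, normalized so that the weights sum to 1
    s = np.linalg.solve(C + eps * np.eye(k), np.ones(k))
    return s / s.sum()

for eps in (1e-1, 1e-4, 1e-7):
    w = w_eps(eps)
    # constraint 1^T w = 1 holds; residual ||Z w|| shrinks with eps
    print(eps, w.sum(), np.linalg.norm(Z @ w))
```

This is the standard Tikhonov picture: $w^{(i)}(\epsilon)$ minimizes $\|Z_iw\|^2+\epsilon\|w\|^2$ subject to the constraint, so the residual is monotone in $\epsilon$.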

A common perception of (3) is that it provides a stable way to solve (2), at the cost of a small amount of error. In fact,

$$w^{(i)}(0^+):=\lim_{\epsilon\to 0^+}w^{(i)}(\epsilon)$$

exists and is an exact solution of (2). In this light, it seems natural to expect better performance of LLE if $w^{(i)}(\epsilon)$ is replaced by $w^{(i)}(0^+)$. However, that is not the case. In Figure 1, we give some numerical examples on the Swiss roll with a hole for different values of $\epsilon$, and we see that when $\epsilon$ is too small, the resulting $Y$ becomes far from what it ought to be. Actually, as $\epsilon$ approaches zero, we see that $Y$ converges to a "projection pattern". That is, it looks as if $Y$ were simply some projection of the Swiss roll onto the plane, ignoring the rolled-up nature of the original data. We are going to explain that this phenomenon is not a coincidence, and is not due to any instability in solving (2). In fact, such projection mappings are inherently allowed by the LLE procedure when no regularization is used.

To explain the phenomenon, we formulate precisely two assumptions:

- (A1) $x_i-\sum_{j=1}^k w^{(i)}_j x_{i_j}=0$ for all $i$.

- (A2) $\operatorname{rank}(X)\ge d$, where $X=[x_1\ \cdots\ x_N]\in\mathbb{R}^{D\times N}$.

Assumption (A1) is the same as saying that zero is achieved as the minimum in (P1) for all $i$. Note that this is the normal situation for cases with $k>D$ – the same cases for which regularization is employed in the LLE algorithm. As for (A2), it is a natural assumption which holds in all cases of interest. To see this, suppose $\operatorname{rank}(X)<d$; then some $d-1$ of the columns of $X$ span all the others, and hence $X$ lies entirely on some $(d-1)$-dimensional subspace. In this case there is no point in pursuing an embedding of $X$ in $\mathbb{R}^d$.

Recall that $\operatorname{rank}(XX^T)=\operatorname{rank}(X)$, and hence (A2) implies that $XX^T$ has at least $d$ nonzero eigenvalues (counting multiplicity). Moreover, since $XX^T$ is positive semi-definite, all of its nonzero eigenvalues are positive. With these in mind, we can now state our key observation.

###### Theorem 1.

Assume (A1) and (A2). Let $u_1,\dots,u_d$ be any collection of orthonormal eigenvectors of $XX^T$ whose corresponding eigenvalues, denoted $\lambda_1,\dots,\lambda_d$, are all positive. Then $Y=AX$ is a solution of (P2), where $A\in\mathbb{R}^{d\times D}$ is given by

$$A=\begin{bmatrix}\lambda_1^{-1/2}&&\\&\ddots&\\&&\lambda_d^{-1/2}\end{bmatrix}\begin{bmatrix}u_1^T\\\vdots\\u_d^T\end{bmatrix}. \tag{4}$$
###### Proof.

The local linear relations in (A1) are preserved under $A$, that is,

$$Ax_i-\sum_{j=1}^k w^{(i)}_j Ax_{i_j}=0\qquad\text{for all }i.$$

As a consequence, by setting $y_i=Ax_i$, we have

$$\sum_{i=1}^N\Big\|y_i-\sum_{j=1}^k w^{(i)}_j y_{i_j}\Big\|^2=0,$$

and hence $Y=AX$ is a minimizer of the cost function in Problem (P2). It remains to show that the constraint is also satisfied. Note that $Y=AX$, and hence the constraint $YY^T=I$ reads

$$AXX^TA^T=I.$$

The validity of this equality is easy to check from the definition (4) of $A$ and the fact that $XX^Tu_\ell=\lambda_\ell u_\ell$ for $\ell=1,\dots,d$. ∎

Now, what does $Y=AX$ in Theorem 1 look like? Note that

$$x\mapsto\sum_{\ell=1}^d(x\cdot u_\ell)\,u_\ell \tag{5}$$

is the orthogonal projection of $\mathbb{R}^D$ onto the $d$-dimensional subspace spanned by $u_1,\dots,u_d$. By endowing this subspace with its own coordinate system with respect to $u_1,\dots,u_d$, we can regard the projection as a mapping from $\mathbb{R}^D$ into $\mathbb{R}^d$, expressed by the coefficients in (5):

$$x\mapsto\begin{bmatrix}x\cdot u_1\\\vdots\\x\cdot u_d\end{bmatrix}=\begin{bmatrix}u_1^T\\\vdots\\u_d^T\end{bmatrix}x.$$

This is the first half of the action of $A$. The second half is a further multiplication by the diagonal matrix $\operatorname{diag}(\lambda_1^{-1/2},\dots,\lambda_d^{-1/2})$, which is nothing but a rescaling of the coordinates. In summary, $A$ is an orthogonal projection from $\mathbb{R}^D$ to $\mathbb{R}^d$ followed by some rescaling of coordinates in $\mathbb{R}^d$. This explains the projection phenomenon observed.
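
Theorem 1's construction is easy to verify numerically. The sketch below uses random data of our own and only checks the constraint part (a generic random $X$ does not satisfy (A1); when it does, linearity of $A$ gives the zero cost for free): it builds $A$ from formula (4) and confirms $AXX^TA^T=I$.

```python
# Numerical check of the constraint in Theorem 1: with A built from
# formula (4), Y = AX satisfies Y Y^T = A X X^T A^T = I.
import numpy as np

rng = np.random.default_rng(2)
D, d, N = 5, 2, 100
X = rng.normal(size=(D, N))              # columns are the data points
vals, U = np.linalg.eigh(X @ X.T)        # ascending eigenvalues
lam, u = vals[-d:], U[:, -d:]            # d positive eigenvalues/eigenvectors
A = np.diag(lam ** -0.5) @ u.T           # formula (4): rescaled projection
Y = A @ X
print(np.allclose(Y @ Y.T, np.eye(d)))   # True: constraint of (P2) holds
```

The check mirrors the proof: $u^T(XX^T)u=\operatorname{diag}(\lambda_1,\dots,\lambda_d)$, and the diagonal factors $\lambda_\ell^{-1/2}$ cancel it to the identity.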

###### Remark 2.

What Theorem 1 tells us is that, when (A1) and (A2) hold, some projection patterns are solutions of (P2), and hence are candidates for being the result of LLE. Logically speaking, it does not preclude the possibility that there are other kinds of results (which we do not know).

## 3 Numerical examples for high dimensional data

The projection phenomenon does not pertain exclusively to cases with $k>D$. For one thing, even if $k\le D$, if $X$ lies on some $m$-dimensional hyperplane with $m<k$, then (A1) still holds in general (we take (A2) for granted and will no longer mention it), and Theorem 1 applies. Of course, such a degenerate case is of little interest. But for another, it is conceivable that the projection effect may still exert its influence when (A1) is only approximately true, provided the approximation is accurate enough. This statement is partly supported by the third and fourth images in Figure 1, where $\epsilon$ is very small and (A1) is almost true. However, those images correspond to the situation $k>D$.

To acquire some direct evidence about what could happen to high dimensional data, we have performed several experiments in which the Swiss roll with a hole was first embedded in a high dimensional $\mathbb{R}^D$, and LLE was then applied to it with $k$ chosen to be smaller than $D$. For our purpose, we only considered isometric embeddings and perturbed isometric embeddings, so that the embedded datasets are basically the same Swiss roll with a hole. It turns out that for isometrically embedded data (which lie entirely on some three dimensional hyperplane in $\mathbb{R}^D$), the projection phenomenon is obvious (see however remark (a) at the end of this section). Somewhat surprising to us is that when rather small perturbations are added to the isometric embeddings, the outcomes become quite unpredictable and are not merely perturbations of projection patterns. It is observed that all these unwanted results – projection patterns and others – are associated with the problem mentioned in Remark 1. That is, the matrix $(I-W)^T(I-W)$ corresponding to them is very singular (i.e., the eigenvalue $0$ has high multiplicity). In any case, by using regularization we see that this problem can be effectively avoided, and considerably improved results can be obtained. See Figure 2 for some examples, the details of which are given in the next paragraph.

Let $X$ represent a Swiss roll with a hole dataset in $\mathbb{R}^3$. Figure 2 shows, from left to right, the results of applying LLE to the three different datasets $E_1X$, $E_2X$ and $E_3X$. The top row shows the results of the original LLE (no regularization), and the bottom row shows the corresponding results when the regularized weight vector (3) is used for each $i$ with a small $\epsilon>0$. The mappings $E_1$, $E_2$, $E_3$ are given as follows:

• $E_1\in\mathbb{R}^{18\times 3}$ is a randomly generated matrix whose columns form an orthonormal set of vectors. This gives rise to a linear isometric mapping from $\mathbb{R}^3$ into $\mathbb{R}^{18}$.

• $E_2$ is the mapping from $\mathbb{R}^3$ to $\mathbb{R}^{19}$ obtained by perturbing $E_1$ in an extra dimension:

$$E_2x=\Big(E_1x,\ 0.1\sin\Big(\sum_{j=1}^{18}(E_1x)_j\Big)\Big),$$

where $(E_1x)_j$ denotes the $j$-th component of $E_1x$.

• $E_3$ is another perturbation of $E_1$, given by

$$E_3x=E_1x+0.1\big(\sin((E_1x)_1),\,\sin((E_1x)_2),\,\dots,\,\sin((E_1x)_{18})\big).$$
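
For readers who wish to experiment, the three embeddings can be sketched as follows. The Swiss-roll-with-a-hole sampler and the particular random orthonormal matrix are our own arbitrary choices; the paper does not specify them.

```python
# Sketches of E1 (linear isometry), E2 (extra perturbed coordinate) and
# E3 (componentwise sine perturbation), acting on rows of an n x 3 array.
import numpy as np

rng = np.random.default_rng(3)

def swiss_roll_with_hole(n=2000):
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=4 * n)
    h = rng.uniform(0.0, 20.0, size=4 * n)
    # punch a rectangular hole (in parameter space) out of the sheet
    keep = ~((np.abs(t - 3 * np.pi) < 0.8) & (np.abs(h - 10) < 4))
    t, h = t[keep][:n], h[keep][:n]
    return np.stack([t * np.cos(t), h, t * np.sin(t)], axis=1)  # n x 3

# E1: random 18 x 3 matrix with orthonormal columns (isometry R^3 -> R^18)
Q, _ = np.linalg.qr(rng.normal(size=(18, 3)))
E1 = lambda X: X @ Q.T
# E2: E1 plus a perturbed extra 19th coordinate
E2 = lambda X: np.hstack([E1(X), 0.1 * np.sin(E1(X).sum(axis=1, keepdims=True))])
# E3: componentwise sine perturbation of E1 inside R^18
E3 = lambda X: E1(X) + 0.1 * np.sin(E1(X))

X = swiss_roll_with_hole()
print(E1(X).shape, E2(X).shape, E3(X).shape)  # (2000, 18) (2000, 19) (2000, 18)
```

Note how the extra coordinate of $E_2$ keeps $E_2X$ inside a 4-dimensional subspace of $\mathbb{R}^{19}$, whereas $E_3X$ spans no proper subspace of $\mathbb{R}^{18}$ – the distinction raised in remark (b) below.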

Here are some remarks:

- (a) For isometrically embedded data, the results without regularization are not always typical projection patterns like the top left image in Figure 2. Sometimes additional deformation or distortion is also present (see Figure 3). We do not know whether this reflects the existence of possible results other than projection patterns (cf. Remark 2), or is simply caused by errors in the numerical computations.

- (b) The use of the sine function in $E_2$ and $E_3$ is an arbitrary choice and bears no significance. We can still see the projection effect for $E_2X$ without regularization (top middle image of Figure 2), while the corresponding result for $E_3X$ (top right) is totally inexplicable. One difference between $E_2$ and $E_3$ that might be important is that $E_2X$ lies on a $4$-dimensional hyperplane in $\mathbb{R}^{19}$, while $E_3X$ does not lie on any lower dimensional hyperplane in $\mathbb{R}^{18}$.

- (c) The above examples show clearly that the condition $k\le D$ does not imply that regularization is unnecessary. However, as mentioned, the criterion based on the size relation between $k$ and $D$ is only a matter of convenience. What about following the original idea, performing regularization or not according to whether $C_i$ is singular or not? Experiments (not shown here) show that this does not help either. In fact, for $E_3X$ the majority of the $C_i$'s are already non-singular, and even if all the singular $C_i$'s are regularized (with the non-singular ones unchanged), the result still looks as bizarre as the original one.

## 4 Conclusion

We have demonstrated that LLE inherently admits some unwanted results if no regularization is used, even in cases where regularization is supposed to be unnecessary in the original algorithm. The true merit of regularization is hence not (or at least not merely) to solve (2) stably at the cost of a small amount of error. On the contrary, by deliberately distorting the local linear relations, it protects LLE from some bad results. As a consequence, we suggest that regularization be used in every case when applying LLE. Of course, our investigation is far from comprehensive. More examples, especially high dimensional real world data, should be examined. Moreover, using regularization alone is in no way a guarantee of good results.

## Acknowledgment

This work was supported by the Ministry of Science and Technology of Taiwan under grant number MOST110-2636-M-110-005-. The author would also like to thank Chih-Wei Chen for valuable discussions.

## References

• [1] Hong Chang and Dit-Yan Yeung. Robust locally linear embedding. Pattern Recognition, 39(6):1053–1065, 2006.
• [2] Jing Chen and Yang Liu. Locally linear embedding: a survey. Artificial Intelligence Review, 36(1):29–48, 2011.
• [3] Dick De Ridder, Olga Kouropteva, Oleg Okun, Matti Pietikäinen, and Robert P. W. Duin. Supervised locally linear embedding. In Artificial Neural Networks and Neural Information Processing — ICANN/ICONIP 2003, pages 333–341. Springer, 2003.
• [4] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
• [5] Olga Kouropteva, Oleg Okun, and Matti Pietikäinen. Incremental locally linear embedding. Pattern Recognition, 38(10):1764–1767, 2005.
• [6] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
• [7] Xiang Wang, Yuan Zheng, Zhenzhou Zhao, and Jinping Wang. Bearing fault diagnosis based on statistical locally linear embedding. Sensors, 15(7):16225–16247, 2015.
• [8] Hau-Tieng Wu and Nan Wu. Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. The Annals of Statistics, 46(6B):3805–3837, 2018.
• [9] Zhenyue Zhang and Jing Wang. MLLE: Modified locally linear embedding using multiple weights. In Advances in Neural Information Processing Systems, pages 1593–1600, 2007.