 # On the convergence of maximum variance unfolding

Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent.


## 1 Introduction

One of the basic tasks in unsupervised learning, aka multivariate statistics, is that of dimensionality reduction. While the celebrated Principal Components Analysis (PCA) and Multidimensional Scaling (MDS) assume that the data lie near an affine subspace, modern approaches postulate that the data are in the vicinity of a submanifold. Many such algorithms have been proposed in the past decade, for example, ISOMAP (Tenenbaum et al., 2000), Local Linear Embedding (LLE) (Roweis and Saul, 2000), Laplacian Eigenmaps (Belkin and Niyogi, 2003), Manifold Charting (Brand, 2003), Diffusion Maps (Coifman and Lafon, 2006), Hessian Eigenmaps (HLLE) (Donoho and Grimes, 2003), Local Tangent Space Alignment (LTSA) (Zhang and Zha, 2004), Maximum Variance Unfolding (Weinberger et al., 2004), and many others, some reviewed in (Van der Maaten et al., 2008; Saul et al., 2006).

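Although some variants exist, the basic setting is that of a connected domain $D \subset \mathbb{R}^d$ isometrically embedded in Euclidean space as a submanifold $M \subset \mathbb{R}^p$, with $p > d$. We are provided with data points $x_1, \dots, x_n \in \mathbb{R}^p$ sampled from (or near) $M$ and our goal is to output $y_1, \dots, y_n \in \mathbb{R}^d$ that can be isometrically mapped to (or close to) $x_1, \dots, x_n$.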

A number of consistency results exist in the literature. For example, Bernstein et al. (2000) show that, with proper tuning, geodesic distances may be approximated by neighborhood graph distances when the submanifold is geodesically convex, implying that ISOMAP asymptotically recovers the isometry when the underlying domain is convex. When the domain is not convex, it fails in general (Zha and Zhang, 2003). To justify HLLE, Donoho and Grimes (2003) show that the null space of the (continuous) Hessian operator yields an isometric embedding. See also (Ye and Zhi, 2012) for related results in a discrete setting. Smith et al. (2008) prove that LTSA is able to recover the isometry, but only up to an affine transformation. We also mention other results in the literature which show that, as the sample size increases, the output of the algorithm converges to an explicit continuous embedding. For instance, a number of papers analyze how well the discrete graph Laplacian based on a sample approximates the continuous Laplace-Beltrami operator on a submanifold (Belkin and Niyogi, 2005; von Luxburg et al., 2008; Singer, 2006; Hein et al., 2005; Giné and Koltchinskii, 2006; Coifman and Lafon, 2006), which is intimately related to Laplacian Eigenmaps. However, such convergence results do not guarantee that the algorithm is successful at recovering the isometry when one exists. In fact, as discussed in detail by Goldberg et al. (2008) and Perrault-Joncas and Meila (2012), many of them fail in very simple settings.

In this paper, we analyze Maximum Variance Unfolding (MVU) in the large-sample limit. We are only aware of a very recent work of Paprotny and Garcke (2012) that establishes that, under the assumption that the underlying domain is convex, MVU recovers a distance matrix that approximates the geodesic distance matrix of the data. Our contribution is the following. In Section 2, we prove a convergence result, showing that the optimization problem that MVU solves converges (both in solution space and value) to a continuous version defined on the whole submanifold. The basic assumption here is that the submanifold is compact. In Section 3, we derive quantitative convergence rates, with mild additional regularity assumptions. In Section 4, we consider the solutions to the continuum limit. When the domain is convex, we prove that MVU recovers an isometry. We also provide examples of non-convex domains where MVU provably fails at recovering an isometry. We also prove that MVU is robust to noise, which Goldberg et al. (2008) show to be problematic for algorithms like LLE, HLLE and LTSA. Some concluding remarks are in Section 5.

## 2 From discrete MVU to continuum MVU

In this section we state and prove a qualitative convergence result for MVU. This result applies with only minimal assumptions and its proof is relatively transparent. What we show is that the (discrete) MVU optimization problem converges to an explicit continuous optimization problem when the sample size increases. The continuous optimization problem is amenable to scrutiny with tools from analysis and geometry, and that will enable us to better understand (in Section 4) when MVU succeeds, and when it fails, at recovering an isometry to a Euclidean domain when it exists.

Let us start by recalling the MVU algorithm (Weinberger and Saul, 2006; Weinberger et al., 2004, 2005). We are provided with data points $x_1, \dots, x_n \in \mathbb{R}^p$. Let $\|\cdot\|$ denote the Euclidean norm. Let $\mathcal{Y}_{n,r}$ be the (random) set defined by

$$\mathcal{Y}_{n,r} = \{ y_1, \dots, y_n \in \mathbb{R}^p : \|y_i - y_j\| \le \|x_i - x_j\| \text{ when } \|x_i - x_j\| \le r \}.$$

Choosing a neighborhood radius $r > 0$, MVU solves the following optimization problem:

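Discrete MVU

$$\text{Maximize } \mathcal{E}(Y) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \|y_i - y_j\|^2, \quad \text{over } Y = (y_1, \dots, y_n)^T \in \mathbb{R}^{n \times p}, \tag{1}$$

$$\text{subject to } Y \in \mathcal{Y}_{n,r}. \tag{2}$$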
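To make the objective (1) and the constraint (2) concrete, here is a minimal NumPy sketch. The function names `energy` and `is_feasible`, the tolerance `tol`, and the three-point toy data are illustrative assumptions; the sketch only evaluates the objective and checks feasibility, it does not solve the optimization problem.

```python
import numpy as np

def energy(Y):
    """Objective (1): average squared distance over ordered pairs i != j."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)  # diagonal entries are zero
    return sq_dists.sum() / (n * (n - 1))

def is_feasible(X, Y, r, tol=1e-12):
    """Constraint (2): distances may not grow on pairs with ||x_i - x_j|| <= r."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    close = dx <= r
    return bool(np.all(dy[close] <= dx[close] + tol))

# Hypothetical toy data: an L-shaped configuration and its "unfolded" version.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
Y_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

print(is_feasible(X_demo, Y_demo, r=1.0), energy(X_demo), energy(Y_demo))
```

In this toy case the unfolded configuration keeps the two neighboring distances unchanged while increasing the distance between the non-neighboring endpoints, which is what raises the energy.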

When the data points are sampled from a distribution $\mu$ with support $M$, our main result in this section is to show that, when $M$ is sufficiently regular and $r = r_n \to 0$ sufficiently slowly, the discrete optimization problem converges to the following continuous optimization problem:

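Continuum MVU

$$\text{Maximize } \mathcal{E}(f) := \int_{M \times M} \|f(x) - f(x')\|^2 \, \mu(dx)\mu(dx'), \quad \text{over } f : M \to \mathbb{R}^p, \tag{3}$$

$$\text{subject to } f \text{ is Lipschitz with } \|f\|_{\rm Lip} \le 1, \tag{4}$$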

where $\|f\|_{\rm Lip}$ denotes the smallest Lipschitz constant of $f$. It is important to realize that the Lipschitz condition is with respect to the intrinsic metric on $M$ (i.e., the metric inherited from the ambient space $\mathbb{R}^p$), defined as follows: for $x, x' \in M$, let

$$\delta_M(x, x') = \inf\{T : \exists \gamma : [0,T] \to M, \text{ 1-Lipschitz, with } \gamma(0) = x \text{ and } \gamma(T) = x'\}. \tag{5}$$

When $M$ is compact, the infimum is attained. In that case, $\delta_M(x, x')$ is the length of the shortest continuous path on $M$ starting at $x$ and ending at $x'$, and $(M, \delta_M)$ is a complete metric space, also called a length space in the context of metric geometry (Burago et al., 2001). Then $f$ is Lipschitz with $\|f\|_{\rm Lip} \le L$ if

$$\|f(x) - f(x')\| \le L \, \delta_M(x, x'), \quad \forall x, x' \in M. \tag{6}$$

For any $L > 0$, denote by $\mathcal{F}_L$ the class of Lipschitz functions $f : M \to \mathbb{R}^p$ satisfying (6).
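As a sanity check on these definitions, consider a toy case that is an assumption of this note rather than an example from the text: $M = [0,1] \subset \mathbb{R}$, $\mu$ the uniform distribution, and $f(x) = x$. Here $\delta_M(x,x') = |x - x'|$ because the interval is convex, so $f$ satisfies (6) with $L = 1$, and

$$\mathcal{E}(f) = \int_0^1 \!\! \int_0^1 (x - x')^2 \, dx \, dx' = 2\Big(\frac{1}{3} - \frac{1}{4}\Big) = \frac{1}{6}.$$

Any other function satisfying (6) with $L = 1$ has $\|f(x) - f(x')\| \le |x - x'|$, so its energy is at most $1/6$; in this toy case the identity therefore attains the supremum in Continuum MVU.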

One of the central conditions is that $M$ is sufficiently regular that the intrinsic metric on $M$ is locally close to the ambient Euclidean metric.

Regularity assumption. There is a non-decreasing function $c : [0, \infty) \to [0, \infty)$ such that $c(r) \to 0$ when $r \to 0$, and such that, for all $x, x' \in M$,

$$\delta_M(x, x') \le \big(1 + c(\|x - x'\|)\big) \|x - x'\|. \tag{7}$$

This assumption is also central to ISOMAP. Bernstein et al. (2000) prove that it holds when $M$ is a compact, smooth and geodesically convex submanifold (e.g., without boundary). In Lemma 4, we extend this to compact, smooth submanifolds with smooth boundary, and to tubular neighborhoods of such sets. The latter allows us to study noisy settings.

Note that we always have

$$\|x - x'\| \le \delta_M(x, x'). \tag{8}$$
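For intuition about (7) and (8), the unit circle in $\mathbb{R}^2$ admits closed forms: two points separated by an angle $\theta \in [0, \pi]$ are at Euclidean (chord) distance $2\sin(\theta/2)$ and intrinsic (arc) distance $\theta$. The sketch below, whose function names are hypothetical and which is not taken from the paper, evaluates the smallest function $c$ that works in (7) for this particular example.

```python
import numpy as np

def chord_and_arc(theta):
    """Unit circle: Euclidean and intrinsic distance for an angle theta in [0, pi]."""
    theta = np.asarray(theta, dtype=float)
    return 2.0 * np.sin(theta / 2.0), theta

def c_circle(s):
    """Smallest c(s) with arc <= (1 + c(s)) * chord, at chord length s in (0, 2]."""
    s = np.asarray(s, dtype=float)
    return 2.0 * np.arcsin(s / 2.0) / s - 1.0

s_grid = np.linspace(1e-3, 2.0, 200)
c_vals = c_circle(s_grid)
print(c_vals[0], c_vals[-1])
```

On this example the relative gap between arc and chord shrinks as the chord length goes to zero, which is the behavior the Regularity assumption asks for.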

Let $\mathcal{S}_1$ denote the set of functions that are solutions of Continuum MVU. We state the following qualitative result that makes minimal assumptions.

###### Theorem 1.

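Let $\mu$ be a (Borel) probability distribution with support $M \subset \mathbb{R}^p$, which is connected, compact and satisfies (7), and assume that $x_1, \dots, x_n$ are sampled independently from $\mu$. Then, for $r_n \to 0$ sufficiently slowly, we have

$$\sup\{\mathcal{E}(Y) : Y \in \mathcal{Y}_{n,r_n}\} \to \sup\{\mathcal{E}(f) : f \in \mathcal{F}_1\}, \tag{9}$$

and for any solution $\hat{Y}_n = (\hat{y}_1, \dots, \hat{y}_n)^T$ of Discrete MVU,

$$\inf_{f \in \mathcal{S}_1} \max_{1 \le i \le n} \|\hat{y}_i - f(x_i)\| \to 0, \tag{10}$$

almost surely as $n \to \infty$.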

Thus Discrete MVU converges to Continuum MVU in the large sample limit, provided $M$ satisfies the crucial regularity condition (7) and other mild assumptions. At the very end of Section 3, we provide explicit quantitative bounds for the convergence results (9) and (10), under some additional (though natural) assumptions. In Section 4, we focus entirely on Continuum MVU, with the goal of better understanding the functions that are solutions to that optimization problem. Because of (10), we know that the output of Discrete MVU converges in a strong sense to one of these functions.

The rest of the section is dedicated to proving Theorem 1. We divide the proof into several parts which we discuss at length, and then assemble to prove the theorem.

### 2.1 Coverings and graph neighborhoods

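For $r > 0$, let $\mathcal{G}_r$ denote the undirected graph with nodes $x_1, \dots, x_n$ and an edge between $x_i$ and $x_j$ if $\|x_i - x_j\| \le r$. This is the $r$-neighborhood graph based on the data. It is essential that $\mathcal{G}_r$ be connected, for otherwise $\sup\{\mathcal{E}(Y) : Y \in \mathcal{Y}_{n,r}\} = \infty$, while $\sup\{\mathcal{E}(f) : f \in \mathcal{F}_1\}$ is finite. The latter comes from the fact that, for any $f \in \mathcal{F}_1$,

$$\mathcal{E}(f) \le \int_{M \times M} \delta_M(x, x')^2 \, \mu(dx)\mu(dx') \le \operatorname{diam}(M)^2,$$

where we used (6) in the first inequality, and $\operatorname{diam}(M)$ is the intrinsic diameter of $M$, i.e.,

$$\operatorname{diam}(M) := \sup_{x, x' \in M} \delta_M(x, x'). \tag{11}$$

Recall that the only assumptions on $M$ made in Theorem 1 are that $M$ is compact, connected, and satisfies (7), and this implies that $\operatorname{diam}(M) < \infty$. Indeed, as a compact subset of $\mathbb{R}^p$, $M$ is bounded, hence $\sup_{x, x' \in M} \|x - x'\| < \infty$. Plugging this into (7) immediately implies that $\operatorname{diam}(M) < \infty$.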

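That said, we ask more of $r_n$ than simply having the neighborhood graph connected. For $\eta > 0$, define

$$\Omega(\eta) = \{\forall x \in M, \ \exists i = 1, \dots, n : \|x - x_i\| \le \eta\}, \tag{12}$$

which is the event that $x_1, \dots, x_n$ forms an $\eta$-covering of $M$.

Connectivity requirement. $r_n \to 0$ in such a way that

$$\sum_{n=1}^{\infty} \mathbb{P}\big(\Omega(\lambda_n r_n)^c\big) < \infty, \quad \text{for some sequence } \lambda_n \to 0. \tag{13}$$

Since $M$ is the support of $\mu$, there is always a sequence $(r_n)$ that satisfies the Connectivity requirement. To see this, for $\eta > 0$, let $z_1, \dots, z_{N_\eta}$ be an $\eta$-packing of $M$ of maximal size $N_\eta$, i.e., a maximal collection of points $z_j \in M$ such that $\|z_j - z_k\| > \eta$ for all $j \neq k$. Recall that a maximal $\eta$-packing is also an $\eta$-covering of $M$ and note that $N_\eta < \infty$ by compactness of $M$. Let $p_\eta = \min_j \mu(B(z_j, \eta))$. Since $M$ is the support of $\mu$, $\mu(B(x, \eta)) > 0$ for any $x \in M$ and any $\eta > 0$, where $B(x, \eta)$ denotes the Euclidean ball centered at $x$ and of radius $\eta$. Hence, $p_\eta > 0$ for any $\eta > 0$. We have

$$\begin{aligned}
\mathbb{P}(\Omega(2\eta)^c) &= \mathbb{P}(\text{there exists } x \in M : \forall i = 1, \dots, n, \ \|x - x_i\| > 2\eta) \\
&\le \mathbb{P}(\text{there is } j \text{ such that } B(z_j, \eta) \text{ is empty of data points}) \\
&\le \sum_{j=1}^{N_\eta} \mathbb{P}(B(z_j, \eta) \text{ is empty of data points}) \\
&\le N_\eta (1 - p_\eta)^n.
\end{aligned}$$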

Let ; the sequence is chosen here for the simplicity of the exposition, but more general sequence can be considered, as will become apparent at the end of the paragraph.

Since for all , . To see this, let . Clearly, for all , , which implies that the set of such that is non-empty. In particular, for all , we have . Now, let be fixed. Since , there exists an integer such that for all , so that for all . Since is arbitrary, this proves that the sequence converges to 0 as tends to infinity.

With such a choice of , we have . Therefore, if we take , it satisfies the Connectivity requirement. In Section 3.2 we derive a quantitative bound on $r_n$ that guarantees (13) under additional assumptions. Note that the sequence in the definition of can be replaced by any summable decreasing sequence.
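The union bound $N_\eta (1 - p_\eta)^n$ derived above decays geometrically in $n$ for fixed $\eta$, which is what makes a summable choice possible. The following sketch evaluates the bound and the smallest sample size that pushes it below a target; the function names and the numerical values of $N_\eta$ and $p_\eta$ are hypothetical and chosen only for illustration.

```python
import math

def covering_failure_bound(N_eta, p_eta, n):
    """Union bound N_eta * (1 - p_eta)^n on the probability of a failed covering."""
    return N_eta * (1.0 - p_eta) ** n

def min_sample_size(N_eta, p_eta, target):
    """Smallest n for which the union bound is at most `target` (0 < target < N_eta)."""
    return math.ceil(math.log(target / N_eta) / math.log(1.0 - p_eta))

# Hypothetical values: 100 balls in the packing, each of mass at least 1%.
n0 = min_sample_size(N_eta=100, p_eta=0.01, target=1e-3)
print(n0, covering_failure_bound(100, 0.01, n0))
```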

The rationale behind the requirement on $r_n$ is the same as in (Bernstein et al., 2000): it allows us to approximate each curve on $M$ with a path in the neighborhood graph of nearly the same length. We utilize this in the following subsection.
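The approximation of curves by graph paths can be illustrated numerically. The sketch below builds the neighborhood graph on evenly spaced points of a half unit circle (a hypothetical example, not one from the paper) and compares the graph shortest-path length between the endpoints with the arc length $\pi$. It uses a plain Floyd-Warshall recursion, which is an implementation choice made here for brevity.

```python
import numpy as np

def neighborhood_graph_distances(X, r):
    """All-pairs shortest-path lengths in the r-neighborhood graph (Floyd-Warshall)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.where(D <= r, D, np.inf)  # keep only edges of Euclidean length <= r
    for k in range(len(X)):
        # Allow paths that pass through node k.
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    return G

# Hypothetical data: 60 evenly spaced points on a half unit circle.
angles = np.linspace(0.0, np.pi, 60)
X_arc = np.column_stack([np.cos(angles), np.sin(angles)])
graph_len = neighborhood_graph_distances(X_arc, r=0.2)[0, -1]
print(graph_len, np.pi)
```

If the radius is smaller than the spacing between consecutive points, the graph is disconnected and the returned distance is infinite, which mirrors the need for the graph to be connected.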

### 2.2 Interpolation

Assuming that the sampling is dense enough that $\Omega(\eta)$ holds, we interpolate a set of vectors $Y \in \mathcal{Y}_{n,r}$ with a Lipschitz function $f$. Formally, we have the following.

###### Lemma 1.

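Assume that $\Omega(\eta)$ holds for some $\eta \le r/4$. Then any vector $Y \in \mathcal{Y}_{n,r}$ is of the form $Y = (f(x_1), \dots, f(x_n))^T$ for some $f \in \mathcal{F}_{1+6\eta/r}$.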

We prove this result. The first step is to show that this is at all possible in the sense that

$$\|y_i - y_j\| \le (1 + 6\eta/r) \, \delta_M(x_i, x_j), \quad \forall i, j. \tag{14}$$

This shows that the map defined on $\{x_1, \dots, x_n\}$ by $x_i \mapsto y_i$ for all $i$, is Lipschitz (for $\delta_M$ and the Euclidean metrics) with constant $1 + 6\eta/r$. We apply a form of Kirszbraun's Extension, (Lang and Schroeder, 1997, Th. B) or (Brudnyi and Brudnyi, 2012, Th. 1.26), to extend this map to the whole of $M$ into a function $f \in \mathcal{F}_{1+6\eta/r}$.
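The extension step can be made explicit in the scalar case. The sketch below is not Kirszbraun's theorem, which handles vector-valued maps; it is the simpler McShane-Whitney formula $f(x) = \min_i \{y_i + L\, d(x, x_i)\}$, valid for real-valued data ($p = 1$) on any metric space, shown here only to illustrate what a Lipschitz extension looks like. The function name, the nodes and the values are hypothetical.

```python
def mcshane_extension(x_nodes, y_nodes, L, dist):
    """Return the L-Lipschitz function f(x) = min_i (y_i + L * dist(x, x_i))."""
    def f(x):
        return min(y + L * dist(x, xi) for xi, y in zip(x_nodes, y_nodes))
    return f

# Hypothetical 1-Lipschitz scalar data on the interval [0, 1].
x_nodes = [0.0, 0.5, 1.0]
y_nodes = [0.0, 0.3, 0.1]
f_ext = mcshane_extension(x_nodes, y_nodes, L=1.0, dist=lambda a, b: abs(a - b))

print([f_ext(x) for x in x_nodes], f_ext(0.75))
```

The formula reproduces the prescribed values whenever the data are themselves $L$-Lipschitz on the nodes, and the minimum of $L$-Lipschitz functions is again $L$-Lipschitz.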

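Therefore, let us turn to proving (14). The arguments are very similar to those in (Bernstein et al., 2000). If $\delta_M(x_i, x_j) \le r$, then, by (8), $\|x_i - x_j\| \le r$, which implies that

$$\|y_i - y_j\| \le \|x_i - x_j\| \le \delta_M(x_i, x_j).$$

Now suppose that $\delta_M(x_i, x_j) > r$. Let $\gamma$ be a path in $M$ connecting $x_i$ to $x_j$ of minimal length $l = \delta_M(x_i, x_j)$. Split $\gamma$ into $N$ arcs of length $l_1 = r/2$ plus one arc of length $l_{N+1} < l_1$, so that

$$\frac{l}{l_1} - 1 \le N \le \frac{l}{l_1}.$$

Denote by $x_i = x'_0, x'_1, \dots, x'_{N+1} = x_j$ the extremities of the arcs along $\gamma$.

For $k = 0, \dots, N+1$, let $x_{t_k}$ be a data point closest to $x'_k$, so that $t_0 = i$ and $t_{N+1} = j$. On $\Omega(\eta)$, $x_{t_k}$ lies within distance $\eta$ of $x'_k$ for all $k$, so that

$$\|x_{t_k} - x_{t_{k-1}}\| \le \delta_M(x_{t_k}, x_{t_{k-1}}) \le \delta_M(x'_k, x'_{k-1}) + 2\eta \le l_1 + 2\eta \le r/2 + 2(r/4) = r.$$

Hence, because $Y \in \mathcal{Y}_{n,r}$,

$$\|y_{t_k} - y_{t_{k-1}}\| \le l_1 + 2\eta.$$

Similarly, for the last arc, recalling that $x_{t_{N+1}} = x_j = x'_{N+1}$, we have $\|x_{t_{N+1}} - x_{t_N}\| \le l_{N+1} + \eta \le r$, and therefore

$$\|y_{t_{N+1}} - y_{t_N}\| \le l_{N+1} + \eta.$$

Consequently, by the triangle inequality,

$$\begin{aligned}
\|y_i - y_j\| &\le N(l_1 + 2\eta) + (l_{N+1} + \eta) \\
&= N l_1 + l_{N+1} + (2N+1)\eta \\
&= l + (2N+1)\eta.
\end{aligned}$$

We have

$$(2N+1)\eta \le \Big(\frac{2l}{l_1} + 1\Big)\eta \le l \, \frac{3\eta}{l_1} = l \, \frac{6\eta}{r},$$

where the second inequality uses $l > r > l_1$, and so (14) holds.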

### 2.3 Bounds on the energy

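We call $\mathcal{E}$ the energy functional. For a function $f : M \to \mathbb{R}^p$, let $Y_n(f) = (f(x_1), \dots, f(x_n))^T$. Assume that $\Omega(\eta)$ holds for some $\eta \le r/4$. Then Lemma 1 implies that any $Y \in \mathcal{Y}_{n,r}$ is equal to $Y_n(f)$ for some $f \in \mathcal{F}_{1+6\eta/r}$. Hence,

$$\sup_{Y \in \mathcal{Y}_{n,r}} \mathcal{E}(Y) \le \sup_{f \in \mathcal{F}_{1+6\eta/r}} \mathcal{E}(Y_n(f)). \tag{15}$$

Recall the function $c$ introduced in (7), and assume that $r$ is small enough that $c(r) < 1$. For $f \in \mathcal{F}_{1-c(r)}$, and for any $i, j$ such that $\|x_i - x_j\| \le r$, we have

$$\|f(x_i) - f(x_j)\| \le (1 - c(r)) \, \delta_M(x_i, x_j) \le (1 - c(r))\big(1 + c(\|x_i - x_j\|)\big)\|x_i - x_j\|.$$

Since the function $c$ is non-decreasing, $c(\|x_i - x_j\|) \le c(r)$, and so

$$\|f(x_i) - f(x_j)\| \le (1 - c(r)^2)\|x_i - x_j\| \le \|x_i - x_j\|.$$

Consequently, $Y_n(f) \in \mathcal{Y}_{n,r}$, implying that

$$\sup_{Y \in \mathcal{Y}_{n,r}} \mathcal{E}(Y) \ge \sup_{f \in \mathcal{F}_{1-c(r)}} \mathcal{E}(Y_n(f)). \tag{16}$$

As a result of (15) and (16), we have

$$\Big|\sup_{Y \in \mathcal{Y}_{n,r}} \mathcal{E}(Y) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le \sup_{1 - c(r) \le L \le 1 + 6\eta/r} \Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(Y_n(f)) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big|. \tag{17}$$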

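We have

$$\Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(Y_n(f)) - \sup_{f \in \mathcal{F}_L} \mathcal{E}(f)\Big| \le \sup_{f \in \mathcal{F}_L} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big|,$$

and applying the triangle inequality, we arrive at

$$\Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(Y_n(f)) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le \sup_{f \in \mathcal{F}_L} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| + \Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(f) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big|.$$

Since $\mathcal{F}_L = L \mathcal{F}_1$ and $\mathcal{E}(Lf) = L^2 \mathcal{E}(f)$, we have

$$\Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(f) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le |L^2 - 1| \sup_{f \in \mathcal{F}_1} \mathcal{E}(f) \le |L^2 - 1| \operatorname{diam}(M)^2,$$

and

$$\sup_{f \in \mathcal{F}_L} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| = L^2 \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big|. \tag{18}$$

Consequently,

$$\Big|\sup_{f \in \mathcal{F}_L} \mathcal{E}(Y_n(f)) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le L^2 \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| + |L^2 - 1| \operatorname{diam}(M)^2.$$

Using this inequality in (17) on the event $\Omega(\eta)$ with $\eta \le r/4$, we have

$$\Big|\sup_{Y \in \mathcal{Y}_{n,r}} \mathcal{E}(Y) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le (1 + 6\eta/r)^2 \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| + \beta(r, \eta)\big(2 + \beta(r, \eta)\big) \operatorname{diam}(M)^2, \tag{19}$$

where $\beta(r, \eta) = \max(c(r), 6\eta/r)$.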

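Finally, we show that $\mathcal{E}$ is continuous (in fact Lipschitz) on $\mathcal{F}_1$ for the supnorm. For any $f$ and $g$ in $\mathcal{F}_1$, and any $x$ and $x'$ in $M$, we have:

$$\begin{aligned}
\big| \|f(x) - f(x')\|^2 - \|g(x) - g(x')\|^2 \big| &\le \|f(x) - f(x') - g(x) + g(x')\| \, \|f(x) - f(x') + g(x) - g(x')\| \\
&\le \big[\|f(x) - g(x)\| + \|f(x') - g(x')\|\big] \times \big[\|f(x) - f(x')\| + \|g(x) - g(x')\|\big] \\
&\le 4 \|f - g\|_\infty \operatorname{diam}(M).
\end{aligned}$$

The first inequality is that of Cauchy-Schwarz. Hence,

$$\big|\mathcal{E}(f) - \mathcal{E}(g)\big| \le 4 \|f - g\|_\infty \operatorname{diam}(M), \tag{20}$$

and

$$\big|\mathcal{E}(Y_n(f)) - \mathcal{E}(Y_n(g))\big| \le 4 \|f - g\|_\infty \operatorname{diam}(M). \tag{21}$$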

### 2.4 More coverings and the Law of Large Numbers

The last step is to show that the supremum of the empirical process (18) converges to zero. For this, we use a packing (covering) to reduce the supremum over $\mathcal{F}_1$ to a maximum over a finite set of functions. We then apply the Law of Large Numbers to each difference in the maximization.

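Fix $x_0 \in M$ and define

$$\mathcal{F}_1^0 = \{f \in \mathcal{F}_1 : f(x_0) = 0\}.$$

Note that $f \in \mathcal{F}_1$ if, and only if, $f - f(x_0) \in \mathcal{F}_1^0$, and by the fact that $\mathcal{E}(f + a) = \mathcal{E}(f)$ for any function or vector $f$ and any constant $a \in \mathbb{R}^p$, we have

$$\sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| = \sup_{f \in \mathcal{F}_1^0} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big|.$$

The reason to use $\mathcal{F}_1^0$ is that it is bounded in supnorm. Indeed, for $f \in \mathcal{F}_1^0$, we have

$$\|f(x)\| = \|f(x) - f(x_0)\| \le \delta_M(x, x_0) \le \operatorname{diam}(M), \quad \forall x \in M.$$

Let $N_\infty(\mathcal{F}_1^0, \varepsilon)$ denote the $\varepsilon$-covering number of $\mathcal{F}_1^0$ for the supremum norm, i.e., the minimal number of balls of radius $\varepsilon$ that are necessary to cover $\mathcal{F}_1^0$, and let $f_1, \dots, f_N$ be an $\varepsilon$-covering of $\mathcal{F}_1^0$ of minimal size $N = N_\infty(\mathcal{F}_1^0, \varepsilon)$. Since $\mathcal{F}_1^0$ is equicontinuous and bounded, it is compact for the topology of the supremum norm by the Arzelà-Ascoli Theorem, so that $N_\infty(\mathcal{F}_1^0, \varepsilon) < \infty$ for any $\varepsilon > 0$.

Fix $f \in \mathcal{F}_1^0$ and let $k$ be such that $\|f - f_k\|_\infty \le \varepsilon$. By (20) and (21), we have

$$\begin{aligned}
|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)| &\le |\mathcal{E}(Y_n(f)) - \mathcal{E}(Y_n(f_k))| + |\mathcal{E}(Y_n(f_k)) - \mathcal{E}(f_k)| + |\mathcal{E}(f_k) - \mathcal{E}(f)| \\
&\le 8 \operatorname{diam}(M) \|f - f_k\|_\infty + |\mathcal{E}(Y_n(f_k)) - \mathcal{E}(f_k)| \\
&\le 8 \operatorname{diam}(M) \, \varepsilon + |\mathcal{E}(Y_n(f_k)) - \mathcal{E}(f_k)|.
\end{aligned}$$

Thus,

$$\sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| \le 8 \operatorname{diam}(M) \, \varepsilon + \max\big\{|\mathcal{E}(Y_n(f_k)) - \mathcal{E}(f_k)| : k = 1, \dots, N_\infty(\mathcal{F}_1^0, \varepsilon)\big\}. \tag{22}$$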

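The Law of Large Numbers (LLN) implies that, for any bounded $f$, $\mathcal{E}(Y_n(f)) \to \mathcal{E}(f)$, almost surely as $n \to \infty$. Indeed, writing $\mathbb{E}$ for the expectation under $\mu$,

$$\begin{aligned}
\mathcal{E}(Y_n(f)) &= \frac{n^2}{n(n-1)} \, \frac{1}{n^2} \sum_{i,j} \|f(x_i) - f(x_j)\|^2 \\
&= \frac{2n}{n-1} \left[ \frac{1}{n} \sum_i \|f(x_i)\|^2 - \Big\| \frac{1}{n} \sum_i f(x_i) \Big\|^2 \right] \\
&\to 2\,\mathbb{E}\|f(x)\|^2 - 2\,\|\mathbb{E} f(x)\|^2 = \mathcal{E}(f), \quad \text{almost surely as } n \to \infty,
\end{aligned}$$

by the LLN applied to each term. Therefore, when $\varepsilon > 0$ is fixed, the second term in (22) tends to zero almost surely, and since $\varepsilon$ is arbitrary, we conclude that

$$\sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| \to 0, \quad \text{in probability, as } n \to \infty. \tag{23}$$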
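The algebraic identity used in the Law of Large Numbers step, which rewrites the pairwise sum in terms of first and second empirical moments, can be checked numerically. The sketch below uses hypothetical names and a hypothetical test case (uniform samples on $[0,1]$ with $f$ the identity, for which the limiting value $2\operatorname{Var}(x) = 1/6$ is elementary); it is an illustration, not part of the proof.

```python
import numpy as np

def energy_pairwise(F):
    """Sample energy: average of ||f(x_i) - f(x_j)||^2 over ordered pairs i != j."""
    n = F.shape[0]
    diff = F[:, None, :] - F[None, :, :]
    return np.sum(diff ** 2) / (n * (n - 1))

def energy_moments(F):
    """Same quantity via 2n/(n-1) * (mean ||f||^2 - ||mean f||^2)."""
    n = F.shape[0]
    mean_sq_norm = np.mean(np.sum(F ** 2, axis=1))
    sq_norm_mean = np.sum(np.mean(F, axis=0) ** 2)
    return 2.0 * n / (n - 1) * (mean_sq_norm - sq_norm_mean)

rng = np.random.default_rng(0)
F_demo = rng.uniform(size=(2000, 1))  # f(x_i) = x_i for uniform x_i on [0, 1]
print(energy_pairwise(F_demo), energy_moments(F_demo))
```

The moment form needs only a single pass over the data, whereas the pairwise form is quadratic in the sample size.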

### 2.5 Large deviations of the sample energy

To show an almost sure convergence in (23), we need to refine the bound on the supremum of the empirical process (18). For this, we apply Hoeffding’s Inequality for U-statistics (Hoeffding, 1963), which is a special case of (de la Peña and Giné, 1999, Thm. 4.1.8).

###### Lemma 2 (Hoeffding’s Inequality for U-statistics).

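Let $\phi : M \times M \to \mathbb{R}$ be a bounded measurable map, and let $\{x_i : i \ge 1\}$ be a sequence of i.i.d. random variables with values in $M$. Assume that $\mathbb{E}[\phi(x_1, x_2)] = 0$ and that $b := \|\phi\|_\infty < \infty$, and let $\sigma^2 = \operatorname{Var}(\phi(x_1, x_2))$. Then, for all $t > 0$,

$$\mathbb{P}\left[\frac{1}{n(n-1)} \sum_{1 \le i \neq j \le n} \phi(x_i, x_j) > t\right] \le \exp\left(-\frac{n t^2}{5\sigma^2 + 3bt}\right).$$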

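Let $f \in \mathcal{F}_1$. To bound the deviations of $\mathcal{E}(Y_n(f))$, we apply this result with $\phi(x, x') = \|f(x) - f(x')\|^2 - \mathcal{E}(f)$. Then,

$$\mathcal{E}(Y_n(f)) - \mathcal{E}(f) = \frac{1}{n(n-1)} \sum_{i \neq j} \phi(x_i, x_j).$$

By construction, $\mathbb{E}[\phi(x_1, x_2)] = 0$. Since $f$ is Lipschitz with constant 1, $\|f(x) - f(x')\|^2 \le \operatorname{diam}(M)^2$ for any $x$ and $x'$ in $M$, and $\mathcal{E}(f) \le \operatorname{diam}(M)^2$. Hence $\|\phi\|_\infty \le \operatorname{diam}(M)^2$, and $\operatorname{Var}(\phi(x_1, x_2)) \le \operatorname{diam}(M)^4$. Applying Lemma 2 (twice, to $\phi$ and to $-\phi$), we deduce that, for any $\varepsilon > 0$,

$$\mathbb{P}\big(|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)| > \varepsilon\big) \le 2 \exp\left(-\frac{n \varepsilon^2}{5 \operatorname{diam}(M)^4 + 3 \operatorname{diam}(M)^2 \varepsilon}\right). \tag{24}$$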
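To get a feel for the tail bound (24), the sketch below evaluates its right-hand side and inverts it for the sample size at which it drops below a given level. The function names and the numerical inputs are hypothetical; the formulas are a direct transcription of (24) for a single fixed function $f$.

```python
import math

def energy_tail_bound(n, eps, diam):
    """Right-hand side of (24): 2 * exp(-n eps^2 / (5 diam^4 + 3 diam^2 eps))."""
    return 2.0 * math.exp(-n * eps ** 2 / (5.0 * diam ** 4 + 3.0 * diam ** 2 * eps))

def sample_size_for(delta, eps, diam):
    """Smallest n for which the bound (24) is at most delta (0 < delta < 2)."""
    scale = (5.0 * diam ** 4 + 3.0 * diam ** 2 * eps) / eps ** 2
    return math.ceil(math.log(2.0 / delta) * scale)

# Hypothetical inputs: unit intrinsic diameter, accuracy 0.1, failure level 5%.
n_needed = sample_size_for(delta=0.05, eps=0.1, diam=1.0)
print(n_needed, energy_tail_bound(n_needed, 0.1, 1.0))
```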

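Using (24) in (22), coupled with the union bound, we get that

$$\mathbb{P}\Big(\sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| > 9 \varepsilon \operatorname{diam}(M)\Big) \le N_\infty(\mathcal{F}_1^0, \varepsilon) \cdot 2 \exp\left(-\frac{n \varepsilon^2}{5 \operatorname{diam}(M)^2 + 3\varepsilon}\right). \tag{25}$$

Clearly, the RHS is summable for every fixed $\varepsilon > 0$, so the convergence in (23) happens in fact with probability one, that is,

$$\sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| \to 0, \quad \text{almost surely, as } n \to \infty. \tag{26}$$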

### 2.6 Convergence in value: proof of (9)

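Assume $r_n$ satisfies the Connectivity requirement, and that $n$ is large enough that $\max(c(r_n), 6\lambda_n) \le 1$. When $\Omega(\lambda_n r_n)$ holds, by (19), we have

$$\Big|\sup_{Y \in \mathcal{Y}_{n,r_n}} \mathcal{E}(Y) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le (1 + 6\lambda_n)^2 \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| + 3 \max(c(r_n), 6\lambda_n) \operatorname{diam}(M)^2,$$

while when $\Omega(\lambda_n r_n)$ does not hold, since the energies are bounded by $\operatorname{diam}(M)^2$, we have

$$\Big|\sup_{Y \in \mathcal{Y}_{n,r_n}} \mathcal{E}(Y) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le 2 \operatorname{diam}(M)^2.$$

Combining these inequalities, we deduce that

$$\begin{aligned}
\Big|\sup_{Y \in \mathcal{Y}_{n,r_n}} \mathcal{E}(Y) - \sup_{f \in \mathcal{F}_1} \mathcal{E}(f)\Big| \le{}& 3 \max(c(r_n), 6\lambda_n) \operatorname{diam}(M)^2 \, \mathbf{1}_{\Omega(\lambda_n r_n)} + 2 \operatorname{diam}(M)^2 \, \mathbf{1}_{\Omega(\lambda_n r_n)^c} \\
&+ (1 + 6\lambda_n)^2 \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big|.
\end{aligned} \tag{27}$$

Almost surely, the sum of the first two terms on the RHS tends to 0 by the fact that $c(r) \to 0$ when $r \to 0$, and by (13), since $r_n$ satisfies the Connectivity requirement. The third term tends to 0 by (23). Hence, (9) is established.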

### 2.7 Convergence in solution: proof of (10)

Assume satisfies the Connectivity requirement, and that is large enough that . Let denote any solution of Discrete MVU. When holds, there is such that . Note that the existence of the interpolating function holds on for each fixed , and that this does not imply the existence of an interpolating sequence . That said, for each in the event , there exists a sequence and an integer such that for all , i.e., the sequence is interpolating a solution of Discrete MVU for all large enough. In addition, when satisfies the Connectivity requirement, then by the Borel-Cantelli lemma. Hence the event holds with probability one.

In fact, without loss of generality, we may assume that . Since is equicontinuous and bounded, it is compact for the topology of the supnorm by the Arzelà-Ascoli Theorem. Hence, any subsequence of admits a subsequence that converges in supnorm. And since increases with and , any accumulation point of is in .

In fact, if we define , then all the accumulation points of are in . Indeed, we have

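$$\mathcal{E}(\hat{f}_n) = \mathcal{E}(\hat{f}_n) - \mathcal{E}(Y_n(\hat{f}_n)) + \mathcal{E}(Y_n(\hat{f}_n)),$$

with

$$\big|\mathcal{E}(\hat{f}_n) - \mathcal{E}(Y_n(\hat{f}_n))\big| \le \sup_{f \in \mathcal{F}_1} \big|\mathcal{E}(Y_n(f)) - \mathcal{E}(f)\big| \to 0,$$

by (23), and

$$\mathcal{E}(Y_n(\hat{f}_n)) = \sup_{Y \in \mathcal{Y}_{n,r_n}} \mathcal{E}(Y) \to \sup_{f \in \mathcal{F}_1} \mathcal{E}(f),$$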

by (9), almost surely as . Hence, if , by continuity of on , we have

 E(f∞)=limkE(^fnk)=sup