Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning

10/22/2018 · by Ery Arias-Castro et al.

One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also for manifold learning. Nonetheless, the robustness properties of most of them are little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for Procrustes analysis plays a key role.




1 Introduction

Multidimensional scaling (MDS) can be defined as the task of embedding an itemset as points in a (typically) Euclidean space based on some dissimilarity information between the items in the set. Since its inception, dating back to the early 1950s if not earlier [41], MDS has been one of the main tasks in the general area of multivariate analysis, a.k.a. unsupervised learning.

One of the main methods for MDS is called classical scaling, which consists in first double-centering the dissimilarity matrix and then performing an eigen-decomposition of the resulting matrix. This is arguably still the most popular variant, even today, decades after its introduction at the dawn of this literature. (For this reason, the method is often referred to simply as MDS, and we will do the same on occasion.) Despite its wide use, its perturbative properties remain little understood. The major contribution on this question dates back to the late 1970s with the work of Sibson [26], who performed a sensitivity analysis resulting in a Taylor expansion of classical scaling to the first nontrivial order. Going beyond Sibson's [26] work, our first contribution is to derive a bona fide perturbation bound for classical scaling (Theorem 2).

Classical scaling amounts to performing an eigen-decomposition of the dissimilarity matrix after double-centering. Only the top $d$ eigenvectors are needed if an embedding in dimension $d$ is desired. Using iterative methods such as the Lanczos algorithm, classical scaling can be implemented with a complexity of $O(d n^2)$, where $n$ is the number of items (and therefore also the dimension of the dissimilarity matrix). In applications, particularly if the intent is visualization, the embedding dimension $d$ tends to be small. Even then, the resulting complexity is quadratic in the number of items to be embedded. There has been some effort in bringing this down to a complexity that is linear in the number of items. The main proposals [10, 35, 8] are discussed by Platt [23], who explains that all these methods use a Nyström approximation. The procedure proposed by de Silva and Tenenbaum [8], which they call landmark MDS (LMDS) and which, according to Platt [23], is the best performing of the three, works by selecting a small number of items, perhaps uniformly at random from the itemset, and embedding them via classical scaling. These items are then used as landmark points to enable the embedding of the remaining items. The second phase consists in performing trilateration, which aims at computing the location of a point based on its distances to known (landmark) points. Note that this task is closely related to, but distinct from, triangulation, which is based on angles instead. If $\ell$ items are chosen as landmarks in the first step (out of $n$ items in total), then the procedure has complexity $O(d \ell n)$. Since $\ell$ can in principle be chosen on the order of $d$, and $d \le \ell$ always, the complexity is effectively $O(d^2 n)$, which is linear in the number of items. A good understanding of the robustness properties of LMDS necessitates a good understanding of the robustness properties not only of classical scaling (used to embed the landmark items), but also of trilateration (used to embed the remaining items).
Our second contribution is a perturbation bound for trilateration (Theorem 3). There are several closely related methods for trilateration; we study the method proposed by de Silva and Tenenbaum [8], which is rather natural. We refer to this method simply as trilateration in the remainder of the paper.

de Silva and Tenenbaum [8] build on the pioneering work of Sibson [26] to derive a sensitivity analysis of classical scaling. They also derive a sensitivity analysis for their trilateration method along similar lines. In the present work, we instead obtain bona fide perturbation bounds: for Procrustes analysis (Section 2), for classical scaling (Section 3), and for the same trilateration method (Section 4). In particular, our perturbation bounds for Procrustes analysis and classical scaling appear to be new, which may be surprising given that these methods have been in wide use for decades. (The main reason for deriving a perturbation bound for Procrustes analysis is its use in deriving a perturbation bound for classical scaling, which was our main interest.) These results are applied in Section 5 to Isomap, Landmark Isomap, and also Maximum Variance Unfolding (MVU). These may be the first performance bounds for any manifold-learning algorithm in the 'isometric embedding' variant of the problem, even as various consistency results have been established for Isomap [42], MVU [2], and a number of other methods [9, 40, 13, 28, 3, 34, 27, 15, 6]. (As discussed in [14], Local Linear Embedding, Laplacian Eigenmaps, Hessian Eigenmaps, and Local Tangent Space Alignment all require some form of normalization, which makes them inconsistent for the problem of isometric embedding.) In Section 6 we discuss the question of optimality in manifold learning as well as the choice of landmarks. The main proofs are gathered in Section 7.

2 A perturbation bound for Procrustes

The orthogonal Procrustes problem is that of aligning two point sets (of the same cardinality) using an orthogonal transformation. In formulas, given two point sets, $x_1, \dots, x_m$ and $y_1, \dots, y_m$ in $\mathbb{R}^d$, the task consists in solving

$$\min_{Q \in \mathcal{O}(d)} \ \sum_{i=1}^m \|y_i - Q x_i\|^2,$$

where $\mathcal{O}(d)$ denotes the orthogonal group of $\mathbb{R}^d$. (Here and elsewhere, when applied to a vector, $\|\cdot\|$ will denote the Euclidean norm.)

In matrix form, the problem can be posed as follows. Given matrices $X$ and $Y$ in $\mathbb{R}^{m \times d}$, whose rows store the two point sets, solve

$$\min_{Q \in \mathcal{O}(d)} \ \|Y - XQ\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm (in the appropriate space of matrices). As stated, the problem is solved by choosing $Q = UV^\top$, where $U$ and $V$ are $d$-by-$d$ orthogonal matrices obtained from a singular value decomposition $X^\top Y = U \Sigma V^\top$, where $\Sigma$ is the diagonal matrix with the singular values on its diagonal [24, Sec 5.6]. Algorithm 1 describes the procedure.

  Input: point sets $x_1, \dots, x_m$ and $y_1, \dots, y_m$ in $\mathbb{R}^d$
  Output: an orthogonal transformation of $\mathbb{R}^d$
  1: store the point sets in $X$ and $Y$
  2: compute $X^\top Y$ and its singular value decomposition $U \Sigma V^\top$
  Return: the matrix $Q = U V^\top$
Algorithm 1 Procrustes (Frobenius norm)
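As a concrete illustration, the SVD-based solution above can be sketched in a few lines of NumPy; the function name and the small rotation demo are ours, not the paper's:

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal Procrustes: find Q minimizing ||Y - X Q||_F over
    orthogonal Q, where corresponding points are stored as rows.

    Classical SVD solution: with X^T Y = U S V^T, take Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

For instance, if $Y$ is an exactly rotated copy of $X$, the returned matrix recovers the rotation.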

In matrix form, the problem can be easily stated with any other matrix norm in place of the Frobenius norm. There is no closed-form solution in general, even for the operator norm (as far as we know), although some computational strategies have been proposed for solving the problem numerically [36]. In what follows, we consider an arbitrary Schatten norm. For a matrix $M$, let $\|M\|_p$ denote the Schatten $p$-norm, where $p \in [1, \infty]$ is assumed fixed:

$$\|M\|_p = \Big( \sum_j \nu_j(M)^p \Big)^{1/p},$$

with $\nu_1(M) \ge \nu_2(M) \ge \cdots$ the singular values of $M$. Note that $\|\cdot\|_2$ coincides with the Frobenius norm. We also define $\|M\|_\infty$ to be the usual operator norm, i.e., the maximum singular value of $M$; on occasion, we will also denote the operator norm by $\|\cdot\|_{\mathrm{op}}$. We denote by $M^\dagger$ the pseudo-inverse of $M$ (see Section 7.1).
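In code, the Schatten $p$-norm is simply the $\ell_p$ norm of the singular values; a minimal sketch (the function name is ours):

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm of M: the l_p norm of its singular values.

    p = 2 recovers the Frobenius norm; p = np.inf recovers the
    operator norm (largest singular value).
    """
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(p):
        return float(s.max())
    return float((s ** p).sum() ** (1.0 / p))
```

Note that the value is nonincreasing in $p$, consistent with the norm comparisons used later in Section 7.1.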

Theorem 1.

Consider two tall matrices and of same size, with having full rank, and set . If , then


The proof is in Section 7.2. Interestingly, to establish the upper bound we use an orthogonal matrix constructed from a singular value decomposition. This holds regardless of $p$, which may be surprising since the solution for the Frobenius norm (corresponding to the case $p = 2$) is based on a singular value decomposition of a different matrix.

Remark 1.

The case where and are orthogonal is particularly simple, at least when or , based on what is already known in the literature. Indeed, from [30, Sec II.4] we find that, in that case,


where is the diagonal matrix made of the principal angles between the subspaces defined by and , while


Using the elementary inequality , valid for , we get


Note that, in this case, , and compare with our (upper) bound.

Remark 2.

Our result applies when is sufficiently small relative to the inverse of the condition number of . In our proof, we obtain a sharper bound in (57) which does not strictly require that, but even then the bound becomes essentially useless if is large. As it turns out, the result as stated in Theorem 1 will be enough for our purposes, but we do conjecture that there is a smooth transition between a bound in and a bound in as increases to infinity (and therefore degenerates to a singular matrix).

3 A perturbation bound for classical scaling

In multidimensional scaling, we are given a matrix storing the dissimilarities between a set of $n$ items (which will remain abstract in this paper); its $(i,j)$ entry gives the level of dissimilarity between items $i$ and $j$. We will assume throughout that the dissimilarities are non-negative. Given a positive integer $d$, we seek a configuration, meaning a set of points $y_1, \dots, y_n$, such that the pairwise distances $\|y_i - y_j\|$ are close to the corresponding dissimilarities. The itemset is thus embedded in the $d$-dimensional Euclidean space $\mathbb{R}^d$. Algorithm 2 describes classical scaling, the most prominent method for solving this problem. In the description, $H = I - \frac{1}{n} J$ is the centering matrix in dimension $n$, where $I$ is the identity matrix and $J$ is the matrix of ones. The method is widely attributed to Torgerson [32].

  Input: dissimilarity matrix $\Delta$, embedding dimension $d$
  Output: set of points $y_1, \dots, y_n$ in $\mathbb{R}^d$
  1: compute the matrix $B = -\frac{1}{2} H \Delta H$
  2: compute $Y \in \mathbb{R}^{n \times d}$ such that $Y Y^\top$ best approximates $B$ among positive semidefinite matrices of rank at most $d$
  Return: the row vectors of $Y$
Algorithm 2 Classical Scaling
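A minimal NumPy sketch of Algorithm 2, assuming the input matrix already stores squared dissimilarities (the function name is ours):

```python
import numpy as np

def classical_scaling(Delta2, d):
    """Classical scaling (Torgerson): double-center the squared
    dissimilarities and keep the top d eigenpairs.

    Delta2: n x n matrix of squared dissimilarities.
    Returns an n x d centered configuration.
    """
    n = Delta2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ Delta2 @ H                  # double centering
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    w, V = w[::-1][:d], V[:, ::-1][:, :d]      # top d eigenpairs
    return V * np.sqrt(np.clip(w, 0.0, None))
```

When the dissimilarities are exactly realizable in dimension $d$, the returned configuration reproduces them exactly (up to a rigid transformation).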

The rationale behind classical scaling is the following identity, relating a configuration stored as the rows of a matrix $Y$ to the corresponding squared distance matrix $D = (\|y_i - y_j\|^2)$:

$$-\tfrac{1}{2} H D H = (HY)(HY)^\top.$$

Consider the situation where the dissimilarity matrix is exactly realizable in dimension $d$, meaning that there is a set of points in $\mathbb{R}^d$ whose pairwise distances match the dissimilarities. It is worth noting that, in that case, the configurations that perfectly embed the dissimilarities in dimension $d$ are rigid transformations of one another. It is well-known that classical scaling provides such a configuration, which happens to be centered at the origin (see Eq. 8).

We perform a perturbation analysis of classical scaling by studying the effect of perturbing the dissimilarities on the embedding that the algorithm returns. This sort of analysis helps quantify the degree of robustness of a method to noise, and is particularly important in applications where the dissimilarities are observed with some degree of inaccuracy, which is the case in the context of manifold learning (Section 5.1). Recall that $\mathcal{O}$ denotes the orthogonal group of matrices, of a size that will be clear from the context.

Theorem 2.

Consider a centered and full-rank configuration with dissimilarity matrix . Let denote another dissimilarity matrix and set . If it holds that , then classical scaling with input dissimilarity matrix and dimension returns a centered configuration satisfying


We note that the bound simplifies after using the fact that $\|H\|_\infty = 1$, which holds since $H$ has one zero eigenvalue and $n - 1$ eigenvalues equal to one.


We have


Noting that and , we simply apply Theorem 1, which we can do since has full rank by assumption, to conclude. ∎

Remark 3.

This perturbation bound is optimal in how it depends on . Indeed, suppose without loss of generality that . (All the Schatten norms are equivalent modulo constants that depend on and .) Consider a configuration as in the statement, and define where . On the one hand, we have [24, Sec 5.6]


On the other hand,


so that the right-hand side in (9) is bounded by , using and . We therefore conclude that the ratio of the left-hand side to the right-hand side in (9) is at least


using the fact that . Therefore, our bound (9) is tight up to a multiplicative factor depending on the condition number of the configuration .

We now translate this result into terms of point sets instead of matrices. For a centered point set $y_1, \dots, y_n$, stored as the rows of a matrix $Y$, define its radius as the largest standard deviation along any direction in space (corresponding to the square root of the top eigenvalue of the covariance matrix). We define its half-width as the smallest standard deviation along any direction in space (corresponding to the square root of the bottom eigenvalue of the covariance matrix), and note that it is strictly positive if and only if the point set spans the whole space. It is well-known that the half-width quantifies the best affine approximation to the point set: the minimum, over all affine hyperplanes, of the average squared distance from the points to the hyperplane equals the squared half-width, where distances are measured via the orthogonal projection onto the hyperplane. We note that the ratio of radius to half-width is the aspect ratio of the point set.

Corollary 1.

Consider a centered point set with radius , and with half-width , and with pairwise dissimilarities . Consider another set of dissimilarities and set . If , then classical scaling with input dissimilarities and dimension returns a point set satisfying


This corollary follows from Theorem 2. We refer to Section 7.4 for its proof.

4 A perturbation bound for trilateration

The problem of trilateration is that of positioning a point, or set of points, based on its (or their) distances to a set of points which, in this context, serve as landmarks. In detail, given a set of landmark points and a set of dissimilarities between each point to be located and the landmarks, the goal is to find locations whose distances to the landmarks are close to the given dissimilarities. Algorithm 3 describes the trilateration method of de Silva and Tenenbaum [8], simultaneously applied to multiple points to be located. The procedure is shown in [8] to be exact when the point set spans the whole space, but we provide a more succinct proof of this in Section A.7.

  Input: centered landmark points $x_1, \dots, x_m$, stored as the rows of $X$; squared dissimilarities between each point to be located and the landmarks
  Output: estimated locations of the points
  1: recenter the squared dissimilarities, subtracting from each entry the mean squared distance between the corresponding landmark and the other landmarks
  2: compute the pseudo-inverse $X^\dagger$ of $X$
  3: apply $-\frac{1}{2} X^\dagger$ to the recentered dissimilarities
  Return: the resulting vectors $\hat y_1, \hat y_2, \dots$
Algorithm 3 Trilateration
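The trilateration step can be sketched as follows; this is our rendition of the de Silva-Tenenbaum update, in which each point is located by applying minus one half of the landmark pseudo-inverse to the recentered squared distances (all names are ours):

```python
import numpy as np

def trilaterate(L, Delta2_L, delta2):
    """Locate points from squared distances to centered landmarks.

    L: m x d centered landmark positions (rows).
    Delta2_L: m x m squared distances among the landmarks.
    delta2: k x m squared distances from each of k points to the landmarks.
    Each point is recovered as -1/2 * pinv(L) @ (delta - delta_bar),
    where delta_bar holds mean squared landmark distances.
    """
    delta_bar = Delta2_L.mean(axis=1)   # mean squared distance per landmark
    Lp = np.linalg.pinv(L)              # d x m pseudo-inverse
    return -0.5 * (delta2 - delta_bar) @ Lp.T
```

With exact squared distances and landmarks spanning the space, the recovery is exact, in line with the exactness result mentioned above.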

We perturb both the dissimilarities and the landmark points, and characterize how this affects the positions returned by trilateration. (In principle, the perturbed point set need not have the same mean as the original point set, but we assume this is the case, for simplicity and because it suffices for our application of this result in Section 5.) For a configuration, define its max-radius as the largest distance from a point of the configuration to the origin, and note that it is at least the radius. We content ourselves with a bound in the Frobenius norm. (All Schatten norms are equivalent here up to a multiplicative constant that depends on $d$, since the matrices that we consider have rank of order $d$.)

Theorem 3.

Consider a centered configuration that spans the whole space, and for a given configuration , let denote the matrix of dissimilarities between and . Let be another centered configuration that spans the whole space, and let be an arbitrary matrix of dissimilarities. Then, trilateration with inputs and returns satisfying

Remark 4.

The right-hand side of (19) can be upper bounded using




as per Lemma 2. Also, a simple application of Mirsky's inequality (43) implies that, when the original configuration spans the whole space, so does its perturbation, provided the perturbation is small enough.

The proof is in Section 7.3. We now derive from this result another one in terms of point sets instead of matrices.

Corollary 2.

Consider a centered point set with radius , max-radius , and half-width . For a point set with radius , set . Also, let denote another centered point set, and let denote another set of dissimilarities. Set and . If , trilateration with inputs and returns satisfying


where is a universal constant.

Corollary 2 follows from Theorem 3 and its proof is given in Section 7.5.

5 Applications to manifold learning

Consider a set of points in a possibly high-dimensional Euclidean space that lie on a smooth Riemannian manifold. Isometric manifold learning (or embedding) is the problem of embedding these points into a lower-dimensional Euclidean space, and to do so while preserving the Riemannian metric as much as possible. There are several variants of the problem under other names, such as nonlinear dimensionality reduction.

Remark 5.

Manifold learning is intimately related to the problem of embedding items with only partial dissimilarity information, which practically speaking means that some of the dissimilarities are missing. We refer to this problem as graph embedding below, although it is known under different names such as graph realization, graph drawing, and sensor localization. This connection is due to the fact that, in manifold learning, the short distances are nearly Euclidean, while the long distances are typically not. In fact, the two methods for manifold learning that we consider below can also be used for graph embedding. The first one, Isomap [31], coincides with MDS-MAP [25] (see also [21]), although the same method was suggested much earlier by Kruskal and Seery [20]; the second one, Maximum Variance Unfolding, was proposed as a method for graph embedding by the same authors [39], and is closely related to other graph embedding methods [5, 18, 29].

5.1 A performance bound for (Landmark) Isomap

Isomap is a well-known method for manifold learning, suggested by Tenenbaum, de Silva, and Langford [31]. Algorithm 4 describes the method. (There, we use the notation to denote the matrix with entries .)

  Input: data points $x_1, \dots, x_n$, embedding dimension $d$, neighborhood radius $r$
  Output: embedding points $\hat y_1, \dots, \hat y_n$
  1: construct the graph on the data points with an edge between two points when they are within distance $r$, weighted by their Euclidean distance
  2: compute the shortest-path distances in that graph
  3: apply classical scaling with these distances and $d$ as inputs, resulting in points $\hat y_1, \dots, \hat y_n$
  Return: the points $\hat y_1, \dots, \hat y_n$
Algorithm 4 Isomap
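The two components can be sketched end to end; this toy implementation (ours, with a dense Floyd-Warshall pass, so only suitable for small point sets) makes the pipeline concrete:

```python
import numpy as np

def isomap(X, d, r):
    """A sketch of Isomap: r-ball graph, shortest paths, classical scaling.

    X: n x D data points. Returns an n x d embedding.
    """
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.where(D <= r, D, np.inf)     # r-ball neighborhood graph
    np.fill_diagonal(G, 0.0)
    for k in range(n):                  # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H         # double-center squared graph distances
    w, V = np.linalg.eigh(B)
    w, V = w[::-1][:d], V[:, ::-1][:, :d]
    return V * np.sqrt(np.clip(w, 0.0, None))
```

On data sampled along a straight segment in a higher-dimensional space, for instance, the graph distances coincide with the distances along the segment and the embedding is exact.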

There are two main components to Isomap: 1) form the $r$-ball neighborhood graph based on the data points and compute the shortest-path distances; 2) pass the obtained distance matrix to classical scaling (together with the desired embedding dimension) to obtain an embedding. The algorithm is known to work well when the underlying manifold is isometric to a convex domain in $\mathbb{R}^d$. Indeed, assuming an infinite sample size, so that the data points comprise all the points of the manifold, the shortest-path distances converge to the geodesic distances on the manifold as $r \to 0$, and thus, in that asymptote (infinite sample size and infinitesimal radius), an isometric embedding into $\mathbb{R}^d$ is possible under the stated condition. We will assume that this condition, namely that the manifold is isometric to a convex subset of $\mathbb{R}^d$, holds.

In an effort to understand the performance of Isomap, Bernstein et al. [4] study how well the shortest-path distances in the -ball neighborhood graph approximate the actual geodesic distances. Before stating their result we need to state a definition.

Definition 1.

The reach of a subset $S$ of a Euclidean space is the supremum over $t \ge 0$ such that any point at distance at most $t$ from $S$ has a unique closest point in $S$. When $S$ is a submanifold, its reach is known to bound its radius of curvature from below [11].

Assume that the manifold has reach at least , and the data points are sufficiently dense in that


where the metric is that of the manifold (induced by the surrounding Euclidean metric). If the radius is sufficiently small, then Bernstein et al. [4] show that


where the graph distance is compared to the geodesic distance between the corresponding points, up to a universal constant. (In fact, Bernstein et al. [4] derive such a bound under the additional condition that the manifold is geodesically convex, although the result can be generalized without much effort [1].)

We are able to improve the upper bound in the restricted setting considered here, where the underlying manifold is assumed to be isometric to a convex domain.

Proposition 1.

In the present situation, there is a universal constant such that, if ,


Thus, if we set , and it happens that , we have


Armed with our perturbation bound for classical scaling, we are able to complete the analysis of Isomap, obtaining the following performance bound.

Corollary 3.

In the present context, let denote a possible (exact and centered) embedding of the data points , and let and denote the max-radius and half-width of the embedded points, respectively. If , then Isomap returns satisfying

Figure 1: Schematic representation of the exact locations, the data points, and the locations returned by Isomap: (a) data points and the $r$-ball neighborhood graph; (b) exact embedding into the low-dimensional space; (c) locations returned by Isomap. Note that the geodesic distance between two data points coincides with the Euclidean distance between their exact locations, since the latter form an exact isometric embedding of the data points. Also, the input dissimilarities are computed as shortest-path distances on the $r$-ball neighborhood graph.

Before we provide the proof, we refer to Figure 1 for a schematic representation of exact locations , data points , returned locations by Isomap , as well as graph distances and geodesic distances .

The proof itself is a simple consequence of Corollary 1. Indeed, with (26) it is straightforward to obtain (with defined in Corollary 1 and and as above),


using the fact that by the assumed upper bound (and the fact that ). In particular, fulfills the conditions of Corollary 1 under the stated bound , so we may conclude by applying that corollary and simplifying. ∎

If the manifold is isometric to a convex domain in $\mathbb{R}^d$, then the radius of the embedded points can be bounded from above by the radius of that domain and, under mild assumptions on the sampling, the half-width of the embedded points can be bounded from below by a constant times the half-width of the domain, in which case both quantities should be regarded as fixed. Similarly, the reach should be considered fixed, so that the bound is optimized by an appropriate choice of the neighborhood radius. If the points are well spread out, for example if they are sampled iid from the uniform distribution on the manifold, then the sample is dense enough that the bound (with the optimal choice of radius) improves accordingly as the sample size grows.

Landmark Isomap

Because of the relatively high computational complexity of Isomap, and also of classical scaling, de Silva and Tenenbaum [8, 7] proposed a Nyström approximation (as explained in [23]). Seen as a method for MDS, it starts by embedding a small number of items, which effectively play the role of landmarks, and then embedding the remaining items by trilateration based on these landmarks. Seen as a method for manifold learning, the items are the points in space, and the dissimilarities are the squared graph distances, which are not provided and need to be computed. Algorithm 5 details the method in this context. The landmarks may be chosen at random from the data points, although other options are available, and we discuss some of them in Section 6.2.

  Input: data points $x_1, \dots, x_n$, embedding dimension $d$, neighborhood radius $r$, number of landmarks $\ell$
  Output: embedding points $\hat y_1, \dots, \hat y_n$
  1: construct the graph on the data points with an edge between two points when they are within distance $r$, weighted by their Euclidean distance
  2: select a landmark subset of size $\ell$ according to one of the methods in Section 6.2
  3: compute the shortest-path distances in that graph between each landmark and every data point
  4: apply classical scaling to the inter-landmark distances and $d$, resulting in $\ell$ (landmark) points in $\mathbb{R}^d$
  5: for each remaining point, apply trilateration based on the embedded landmarks and its distances to them, obtaining $\hat y_i$
  Return: the points $\hat y_1, \dots, \hat y_n$
Algorithm 5 Landmark Isomap
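A toy end-to-end sketch (ours, with illustrative names): Dijkstra from each landmark, classical scaling on the landmarks, then trilateration of every point.

```python
import heapq
import numpy as np

def landmark_isomap(X, d, r, landmarks):
    """A sketch of Landmark Isomap: graph distances from the landmarks
    only, classical scaling on the landmarks, then trilateration.

    X: n x D data; landmarks: indices of the chosen landmark points.
    """
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    nbrs = [np.flatnonzero((D[i] <= r) & (np.arange(n) != i)) for i in range(n)]

    def dijkstra(src):                  # shortest paths in the r-ball graph
        dist = np.full(n, np.inf)
        dist[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > dist[u]:
                continue
            for v in nbrs[u]:
                dv = du + D[u, v]
                if dv < dist[v]:
                    dist[v] = dv
                    heapq.heappush(heap, (dv, v))
        return dist

    G = np.array([dijkstra(i) for i in landmarks])   # len(landmarks) x n
    m = len(landmarks)
    Delta2_L = G[:, landmarks] ** 2                  # squared landmark distances
    H = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * H @ Delta2_L @ H                      # classical scaling step
    w, V = np.linalg.eigh(B)
    w, V = w[::-1][:d], V[:, ::-1][:, :d]
    L = V * np.sqrt(np.clip(w, 0.0, None))           # embedded landmarks
    delta_bar = Delta2_L.mean(axis=1)
    Lp = np.linalg.pinv(L)
    return -0.5 * (G.T ** 2 - delta_bar) @ Lp.T      # trilaterate all points
```

On a straight segment in a higher-dimensional space, three landmarks suffice and the recovered pairwise distances are exact.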

With our work, we are able to provide a performance bound for Landmark Isomap.

Corollary 4.

In the same context as in Corollary 3, if one uses a set of landmarks with embedded half-width , and , then Landmark Isomap returns satisfying


where is a universal constant.

The result is a direct consequence of applying Corollary 3, which allows us to control the accuracy of embedding the landmarks using classical scaling, followed by applying Corollary 2, which allows us to control the accuracy of embedding using trilateration. The proof is given in Section 7.6.

We note that for the set of (embedded) landmarks to have positive half-width, it is necessary that they span the whole space, which compels $\ell \ge d + 1$. In Section 6.2 we show that choosing the landmarks at random performs reasonably well in that, with probability approaching 1 very quickly as $\ell$ increases, their (embedded) half-width is at least half that of the entire (embedded) point set.

5.2 A performance bound for Maximum Variance Unfolding

Maximum Variance Unfolding is another well-known method for manifold learning, proposed by Weinberger and Saul [38, 37]. Algorithm 6 describes the method, which relies on solving a semidefinite relaxation.

  Input: data points , embedding dimension , neighborhood radius
  Output: embedding points
  1: set if , and otherwise
  2: solve the following semidefinite program

  3: center a solution and embed it using principal component analysis
  Return: the embedded point set
Algorithm 6 Maximum Variance Unfolding (MVU)

Although MVU is broadly regarded as more stable than Isomap, Arias-Castro and Pelletier [2] show that it works as intended under the same conditions required by Isomap, namely, that the underlying manifold is geodesically convex. Under these conditions (in fact, under the same conditions as in Corollary 3, where in particular (26) is assumed to hold with a sufficiently small radius), Paprotny and Garcke [22] show that MVU returns an embedding whose dissimilarity matrix satisfies


where , (the correct underlying distances), and for a matrix , . Based on that, and on our work in Section 3, we are able to provide the following performance bound for MVU, which is morally identical to the bound we obtained for Isomap.

Corollary 5.

In the same context as in Corollary 3, if instead , then Maximum Variance Unfolding returns satisfying


As in (28), we have


so that, in combination with (31), we have


In particular, the conditions of Corollary 1 are met under the stated bound on . Therefore, we may apply that corollary to conclude. ∎

6 Discussion

6.1 Optimality considerations

The performance bounds that we derive for Isomap and Maximum Variance Unfolding are the same up to a universal multiplicative constant. This may not be surprising, as the two methods have been known to be closely related since the work of Paprotny and Garcke [22]. Based on our analysis of classical scaling, we believe that the bound for Isomap is sharp up to a multiplicative constant. But one may wonder whether Maximum Variance Unfolding, or an altogether different method, can do strictly better.

This optimality problem can be formalized as follows:

Consider the class of isometries , one-to-one, such that its domain is a convex subset of with max-radius at most and half-width at least , and its range is a submanifold with reach at least . To each such isometry , we associate the uniform distribution on its range , denoted by . We then assume that we are provided with iid samples of size from , for some unknown isometry in that class. If the sample is denoted by , with for some , the goal is to recover up to a rigid transformation, and the performance is measured in average squared error. Then, what is the optimal achievable performance?

Despite some closely related work on manifold estimation, in particular that of Genovese et al. [12] and of Kim and Zhou [19], we believe the problem remains open. Indeed, while in dimension one the two problems are particularly close, in higher dimensions the situation appears more delicate, as it relies on a good understanding of the interpolation of points by isometries.

6.2 Choosing landmarks

In this subsection we discuss the choice of landmarks. We consider the two methods originally proposed by de Silva and Tenenbaum [8]:

  • Random. The landmarks are chosen uniformly at random from the data points.

  • MaxMin. After choosing the first landmark uniformly at random from the data points, each new landmark is iteratively chosen from the data points to maximize the minimum distance to the existing landmarks.

(For both methods, de Silva and Tenenbaum [8] recommend using different initializations.)

The first method is obviously less computationally intensive than the second, but the hope behind the more careful (and more costly) selection of landmarks in MaxMin is that fewer landmarks will need to be selected. In any case, de Silva and Tenenbaum [8] observe that random selection is typically good enough in practice, so we content ourselves with analyzing this method.
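For completeness, the MaxMin rule is easy to sketch (the function name is ours), while Random simply draws indices uniformly:

```python
import numpy as np

def maxmin_landmarks(X, m, rng=None):
    """MaxMin landmark selection: first landmark at random, then each
    new landmark maximizes its minimum distance to those chosen so far."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = [int(rng.integers(n))]
    mind = np.sqrt(((X - X[idx[0]]) ** 2).sum(-1))   # distance to landmark set
    for _ in range(m - 1):
        nxt = int(np.argmax(mind))
        idx.append(nxt)
        mind = np.minimum(mind, np.sqrt(((X - X[nxt]) ** 2).sum(-1)))
    return idx
```

By construction, MaxMin quickly picks up outlying points, which tends to spread the landmarks out.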

In view of our findings (Corollary 2), a good choice of landmarks is one with a large (embedded) half-width, ideally comparable to, or even larger than, that of the entire dataset. In that light, the problem of selecting good landmarks is closely related, if not identical, to the problem of selecting rows of a tall matrix in a way that leads to a well-conditioned submatrix. In particular, several papers have established bounds for various ways of selecting the rows, some of them listed in [16, Tab 2]. Here the situation is a little different in that the dissimilarity matrix is not directly available; rather, rows (corresponding to landmarks) are revealed as they are selected.

The Random method, nonetheless, has been studied in the literature. Rather than fetch existing results, we provide a proof for the sake of completeness, relying, as is standard, on random matrix concentration [33]. We establish a bound for a slightly different variant in which the landmarks are selected with replacement, as this simplifies the analysis. Related work is summarized in [16, Tab 3], although for the special case where the data matrix has orthonormal columns (an example of a paper working in this setting is [17]).

Proposition 2.

Suppose we select landmarks among points in dimension , with half-width and max-radius , according to the Random method, but with replacement. Then with probability at least , the half-width of the selected landmarks is at least .

The proof of Proposition 2 is given in Section A.9. Thus, if enough landmarks are selected, then with high probability the landmark set has half-width at least half that of the full point set. Consequently, if the dataset is relatively well-conditioned, in that its aspect ratio is relatively small, then Random (with replacement) only requires the selection of a few landmarks in order to output a well-conditioned subset (with high probability).
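As a purely illustrative numerical check (the sizes and thresholds below are ours, not the proposition's), one can sample landmark subsets with replacement and compare half-widths:

```python
import numpy as np

def half_width(X):
    """Smallest standard deviation of a point set along any direction."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    return float(np.sqrt(np.linalg.eigvalsh(cov)[0]))

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 3))      # a well-conditioned point set
full = half_width(X)
subs = [half_width(X[rng.integers(0, 2000, 100)])  # 100 landmarks, with replacement
        for _ in range(50)]
frac = np.mean([s >= full / 2 for s in subs])      # how often half-width is retained
```

For this well-conditioned example, essentially every random subset retains at least half of the full half-width.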

7 Proofs

7.1 Preliminaries

We start by stating a number of lemmas pertaining to linear algebra, and end the section with a result for a form of Procrustes analysis, a well-known method for matching two point sets in a Euclidean space.

Schatten norms

For a matrix $M$, we let $\nu_1(M) \ge \nu_2(M) \ge \cdots$ denote its singular values. (All the matrices and vectors we consider are real, unless otherwise specified.) Let $\|\cdot\|_p$ denote the following Schatten quasi-norm,

$$\|M\|_p = \Big( \sum_j \nu_j(M)^p \Big)^{1/p},$$

which is a true norm when $p \ge 1$. When $p = 2$ it corresponds to the Frobenius norm (which will also be denoted by $\|\cdot\|_F$) and when $p = \infty$ it corresponds to the usual operator norm (which will also be denoted by $\|\cdot\|_{\mathrm{op}}$). We mention that each Schatten quasi-norm is unitarily invariant, and satisfies

$$\|AB\|_p \le \|A\|_{\mathrm{op}} \|B\|_p \quad \text{and} \quad \|AB\|_p \le \|A\|_p \|B\|_{\mathrm{op}},$$

for any matrices of compatible sizes, and it is sub-multiplicative if it is a norm ($p \ge 1$). In addition, if $A$ and $B$ are positive semidefinite satisfying $A \preceq B$, where $\preceq$ denotes the Loewner order, then $\|A\|_p \le \|B\|_p$, due to the fact that $\nu_j(A) \le \nu_j(B)$ for all $j$ in that case. Unless otherwise specified, $p$ will be fixed in $[1, \infty]$. Note that, for any fixed matrix $M$, $\|M\|_p \le \|M\|_q$ whenever $p \ge q$, and

$$\|M\|_p \to \|M\|_\infty, \quad \text{as } p \to \infty.$$
Moore-Penrose pseudo-inverse

The Moore-Penrose pseudo-inverse of a matrix is defined as follows [30, Thm III.1]. Let $M$ be an $m$-by-$k$ matrix, where $m \ge k$, with singular value decomposition $M = U \Sigma V^\top$, where $U$ is $m$-by-$m$ orthogonal, $V$ is $k$-by-$k$ orthogonal, and $\Sigma$ is $m$-by-$k$ diagonal with diagonal entries $\nu_1 \ge \cdots \ge \nu_r > 0 = \nu_{r+1} = \cdots = \nu_k$, so that the $\nu_j$, $j \le r$, are the nonzero singular values of $M$ and $M$ has rank $r$. The pseudo-inverse of $M$ is defined as $M^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ is the $k$-by-$m$ diagonal matrix with diagonal entries $\nu_1^{-1}, \dots, \nu_r^{-1}, 0, \dots, 0$. If the matrix $M$ is tall and full rank, then $M^\dagger = (M^\top M)^{-1} M^\top$. In particular, if a matrix is square and non-singular, its pseudo-inverse coincides with its inverse.
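These identities are easy to check numerically; for a tall, full-rank matrix, NumPy's `pinv` agrees with the closed form:

```python
import numpy as np

# For a tall full-rank X, the pseudo-inverse reduces to (X^T X)^{-1} X^T,
# and pinv(X) is a left inverse of X.
rng = np.random.default_rng(5)
X = rng.standard_normal((8, 3))
Xp = np.linalg.pinv(X)
```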

Lemma 1.

Suppose that is a tall matrix with full rank. Then is non-singular, and for any other matrix of compatible size,


This simply comes from the fact that (since is tall and full rank), so that


by (36). ∎

Lemma 2.

Let and be matrices of same size. Then, for ,


A result of Wedin [30, Thm III.3.8] gives


Assuming the matrix has exactly $d$ nonzero singular values, using Mirsky's inequality [30, Thm IV.4.11], namely


we have


By combining Equations (42) and (44), we get


from which the result follows. ∎

Some elementary matrix inequalities

The following lemmas are elementary inequalities involving Schatten norms.

Lemma 3.

For any two matrices and of same size such that or ,


Assume without loss of generality that . In that case, , which is not smaller than or in the Loewner order. Therefore,


applying several of the properties listed above for Schatten (quasi)norms. ∎

Lemma 4.

For any positive semidefinite matrix ,


where denotes the identity matrix.


By the spectral theorem, we may diagonalize, which reduces the situation to one where the matrix is diagonal, since a Schatten norm is unitarily invariant. Assuming $p < \infty$, we have


The case $p = \infty$ can be handled similarly, or by continuity based on (38). ∎

7.2 Proof of Theorem 1

Let be the orthogonal projection onto the range of , which can be expressed as