# Convex Optimization Learning of Faithful Euclidean Distance Representations in Nonlinear Dimensionality Reduction

Classical multidimensional scaling only works well when the noisy distances observed in a high-dimensional space can be faithfully represented by Euclidean distances in a low-dimensional space. Advanced models such as Maximum Variance Unfolding (MVU) and Minimum Volume Embedding (MVE) use Semi-Definite Programming (SDP) to reconstruct such faithful representations. While those SDP models are capable of producing high-quality configurations numerically, they suffer from two major drawbacks. One is that there exist no theoretically guaranteed bounds on the quality of the configuration. The other is that they are slow in computation when the number of data points is beyond moderate size. In this paper, we propose a convex optimization model of Euclidean distance matrices. We establish a non-asymptotic error bound for the random graph model with sub-Gaussian noise, and prove that our model produces a matrix estimator of high accuracy when the order of the uniform sample size is roughly the degree of freedom of a low-rank matrix up to a logarithmic factor. Our results partially explain why MVU and MVE often work well. Moreover, we develop a fast inexact accelerated proximal gradient method. Numerical experiments show that the model can produce configurations of high quality on large data sets that the SDP approach would struggle to cope with.


## 1 Introduction

The chief purpose of this paper is to find a complete set of faithful Euclidean distance representations in a low-dimensional space from a partial set of noisy distances, which are supposedly observed in a higher-dimensional space. The proposed model and method thus belong to the vast field of nonlinear dimensionality reduction. Our model is strongly inspired by several high-profile Semi-Definite Programming (SDP) models, which aim at a similar purpose but suffer from two major drawbacks: (i) the quality of the distances recovered by those SDP models lacks theoretical guarantees, and (ii) their slow computational convergence severely limits their practical applications even when the data points are of moderate size. Our distinctive approach is to use convex optimization of Euclidean Distance Matrices (EDM) to resolve those issues. In particular, we are able to establish theoretical bounds on the error of the obtained Euclidean distances from the true distances under the assumption of uniform sampling, which has been widely used in modelling social networks. Moreover, the resulting optimization problem can be efficiently solved by an accelerated proximal gradient method. In the following, we first use social networks to illustrate how initial distance information is gathered and why uniform sampling is a good model for understanding them. We then briefly discuss several SDP models in nonlinear dimensionality reduction and survey relevant error-bound results from the matrix completion literature. These are covered in the first three subsections below and collectively serve as a solid motivation for the current study. We finally summarize our main contributions, together with the notation used in this paper.

### 1.1 Distances in Social Network and Their Embedding

The study of structural patterns of a social network from the ties (relationships) that connect social actors is one of the most important research topics in social network analysis (Wasserman, 1994). To this end, measurements on the actor-to-actor relationships (kinship, social roles, affect, transfer of information, etc.) are collected or observed by different methods (questionnaires, direct observation, experiments, written records, etc.), and the measurements on the relational information are referred to as the network composition. In other words, without appropriate network measurements, we are not able to study any structural property. The measurement data can usually be presented as a measurement matrix, where the rows and the columns both refer to the studied actors. Each entry of such a matrix indicates the social relationship measurement (e.g., presence/absence or similarity/dissimilarity measure) between the row and column actors. In this paper, we are only concerned with symmetric relationships, i.e., the relationship from actor i to actor j is the same as that from actor j to actor i. Furthermore, there exist standard ways to convert the measured relationships into Euclidean distances; see (Cox and Cox, 2000, Section 1.3.5) and (Borg and Groenen, 2007, Chapter 6).

However, it is important to note that in practice only partial relationship information is observed, which means the measurement matrix is usually incomplete and noisy. The observation process is often assumed to follow a certain random graph model. One simple but widely used model is the Bernoulli random graph model (Solomonoff and Rapoport, 1951; Erdős and Rényi, 1959). Let n labeled vertices be given. The Bernoulli random graph is obtained by connecting each pair of vertices independently with a common probability p (and leaving it unconnected with probability 1 − p), and it reproduces well some principal features of real-world social networks such as the “small-world” effect (Milgram, 1967; de Sola Pool and Kochen, 1979). Other properties, such as the degree distribution, the connectivity and the diameter of the Bernoulli random graph, can be found in, e.g., Bollobás (2001) and Janson et al. (2011). For more details on the Bernoulli as well as other random models, one may refer to the review paper (Newman, 2003) and references therein. In this paper, we mainly focus on the Bernoulli random graph model. Consequently, the observed measurement matrix follows the uniform sampling rule, which will be described in Section 2.3.
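To make the sampling scheme concrete, the Bernoulli random graph can be generated in a few lines. This is a minimal NumPy sketch for illustration only; the function name is ours, not from the paper.

```python
import numpy as np

def bernoulli_random_graph(n, p, seed=None):
    """Connect each of the n*(n-1)/2 vertex pairs independently with
    probability p; return a symmetric 0/1 adjacency matrix (no self-loops)."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1)  # strict upper triangle
    return (upper | upper.T).astype(int)

A = bernoulli_random_graph(100, 0.1, seed=0)
assert (A == A.T).all() and A.diagonal().sum() == 0
```

The expected number of edges is p·n(n−1)/2, so p directly controls how much of the distance matrix is observed under the uniform sampling rule of Section 2.3.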

In order to examine the structural patterns of a social network, the produced images (e.g., embeddings in a 2- or 3-dimensional space for visualization) should preserve the structural patterns as much as possible, as highlighted by Freeman (2005): “the points in a visual image should be located so the observed strengths of the inter-actor ties are preserved.” In other words, the dimensionality reduction algorithm has to ensure that the embedding Euclidean distances between points (nodes) fit the observed distances in a social space as well as possible. Therefore, the problem reduces to whether one can effectively find the best approximation, having a low embedding dimension, to the true social measurement matrix from the observed incomplete and noisy data. Classical Multidimensional Scaling (cMDS) (see Section 2.1) provides one of the most often used embedding methods using distance information. However, cMDS alone is often not adequate to produce a satisfactory embedding, as rightly observed in several high-profile embedding methods in manifold learning.

### 1.2 Embedding Methods in Manifold Learning

The cMDS and its variants have found many applications in data dimension reduction and have been well documented in the monographs (Cox and Cox, 2000; Borg and Groenen, 2007; Pȩkalska and Duin, 2005). When the distance matrix (or dissimilarity measurement matrix) is close to a true Euclidean Distance Matrix (EDM) with the targeted embedding dimension, cMDS often works very well. Otherwise, a large proportion of unexplained variance has to be cut off or it may even yield negative variances, resulting in what is called embedding in a pseudo-Euclidean space and hence creating the problem of unconventional interpretation of the actual embedding (see, e.g., Pȩkalska et al., 2002).

cMDS has recently motivated a number of high-profile numerical methods, which all try to alleviate the issue mentioned above. For example, ISOMAP (Tenenbaum et al., 2000) proposes to use the shortest path distances to approximate the EDM on a low-dimensional manifold. The Maximum Variance Unfolding (MVU) of Weinberger and Saul (2006) aims, through SDP, to maximize the total variance, and the Minimum Volume Embedding (MVE) of Shaw and Jebara (2007) pursues a similar purpose by maximizing the eigen gap of the Gram matrix of the embedding points in a low-dimensional space. The need for such methods comes from the fact that the initial distances either are stochastic in nature (e.g., contain noise) or cannot be measured (e.g., missing values). The idea of MVU has also been used in the refinement step of the celebrated SDP method for sensor network localization problems (Biswas et al., 2006).

It was shown in (Tenenbaum et al., 2000; Bernstein et al., 2000) that ISOMAP enjoys the elegant theory that the shortest path distances (or graph distances) can accurately estimate the true geodesic distances with high probability if the finite points are chosen randomly from a compact and convex submanifold following a Poisson distribution with a high density, and the pairwise distances are obtained by the k-nearest neighbor rule or the unit ball rule (see Section 2.3 for the definitions). However, for MVU and MVE (both of which have enjoyed numerical success), there exists no theoretical guarantee as to how good the obtained Euclidean distances are. At this point, it is important to highlight two observations. (i) The shortest path distance, or the distance obtained by the k-nearest neighbor or the unit-ball rule, is often not suitable for deriving distances in social networks. This point has been emphasized in the recent study on an e-mail social network by Budka et al. (2013). (ii) The MVU and MVE models depend only on the initial distances and not on any particular way of obtaining them; they then rely on SDP to calculate the best-fit distances. From this point of view, they can be applied to social network embedding. This is also pointed out in Budka et al. (2013). Due to space limitations, we are not able to review other leading methods in manifold learning, but refer to (Burges, 2009, Chapter 4) for a guide.

Inspired by their numerical success, our model inherits the good features of both MVU and MVE. Moreover, we are able to derive theoretical results guaranteeing the quality of the obtained Euclidean distances. Our results are of the error-bound type, which has attracted growing attention recently. We review the relevant results below.

### 1.3 Error Bounds in Low-Rank Matrix Completion and Approximation

As mentioned in the preceding section, our research has been strongly influenced by the line of research related to the MVU and MVE models, which have natural geometric interpretations and use SDP as their major tool. Their excellent performance in data reduction calls for theoretical justification, which so far seems nonexistent. Our model also enjoys a similar geometric interpretation, but departs from the two models in that we deal with the EDM directly rather than reformulating it as an SDP. This key departure puts our model in the category of matrix approximation problems, which have recently attracted much attention from the machine learning community and motivated our research.

The most popular approach to recovering a low-rank matrix solution of a linear system is via nuclear norm minimization (Mesbahi, 1998; Fazel, 2002), which can be cast as an SDP. What makes this approach more exciting and important is that it has theoretically guaranteed recoverability (recoverable with high probability). The first such theoretical result was obtained by Recht et al. (2010) by employing the Restricted Isometry Property (RIP). However, for the matrix completion problem the sampling operator does not satisfy the RIP (see, e.g., Candès and Plan, 2010). For the noiseless case, Candès and Recht (2009) proved that a low-rank matrix can be fully recovered with high probability provided that a small number of its noiseless observations are uniformly sampled. See Candès and Tao (2010) for an improved bound and the near-optimal bound on the sample number. By adapting techniques from quantum information theory developed in Gross et al. (2010) to the matrix completion problem, a short and intelligible analysis of the recoverability was recently proposed by Recht (2011).

The matrix completion problem with noisy observations was studied by Candès and Plan (2010). Recently, the noisy case was further studied by several groups of researchers, including Koltchinskii et al. (2011), Negahban and Wainwright (2012) and Klopp (2014), under different settings. In particular, the matrix completion problem with fixed basis coefficients was studied by Miao et al. (2012), who proposed a rank-corrected procedure to generate an estimator using the nuclear semi-norm and established the corresponding non-asymptotic recovery bounds.

Very recently, Javanmard and Montanari (2013) proposed an SDP model for the problem of (sensor network) localization from an incomplete set of noisy Euclidean distances. Using the fact that the squared Euclidean distances can be represented by elements of a positive semidefinite matrix:

 ‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2⟨x_i, x_j⟩ = X_ii + X_jj − 2X_ij,

where x_1, …, x_n are the embedding points and X, defined by X_ij := ⟨x_i, x_j⟩, is the Gram matrix of those embedding points, the SDP model aims to minimize trace(X) (the nuclear norm of the positive semidefinite X). Equivalently, the objective is to minimize the total variance of the embedding points, because it is commonly assumed that the embedding points are already centered. This objective obviously contradicts the main idea of MVU and MVE, which aim to make the total variance as large as possible. It is important to point out that making the variance as large as possible seems to be indispensable for SDP to produce localizations of high quality. This has been numerically demonstrated in Biswas et al. (2006).

The impressive result in Javanmard and Montanari (2013) roughly states that the error bound contains an undesirable term involving the radius used in the unit ball rule, the embedding dimension, the bound on the measurement noise and the number of embedding points. As pointed out by Javanmard and Montanari (2013), the numerical performance suggests that the actual error is much smaller, and hence does not match the derived theoretical bound. This result also shows the tremendous technical difficulties one may have to face in deriving bounds for EDM recovery similar to those for general matrix completion problems, which have no additional constraints other than being low rank.

To summarize, most existing error bounds are derived from nuclear norm minimization. When translated to the Euclidean distance learning problem, minimizing the nuclear norm is equivalent to minimizing the variance of the embedding points, which contradicts the main idea of MVU and MVE of making the variance as large as possible. Hence, the excellent progress in matrix completion/approximation does not straightforwardly imply useful bounds for Euclidean distance learning in a low-dimensional space; indeed, one may face substantial difficulties in such an extension. In this paper, we propose a convex optimization model to learn faithful Euclidean distances in a low-dimensional space. We derive theoretically guaranteed bounds in the spirit of matrix approximation and therefore provide a solid theoretical foundation for using the model. We briefly describe the main contributions below.

### 1.4 Main Contributions

This paper makes two major contributions to the field of nonlinear dimensionality reduction. One is on building a convex optimization model with guaranteed error bounds and the other is on a computational method.

• Building a convex optimization model and its error bounds. Our point of departure from the existing SDP models is to treat the EDM (vs. the positive semidefinite matrix in SDP) as the primary object. The total variance of the desired embedding points in a low-dimensional space can be quantitatively measured through the so-called EDM score. The higher the EDM score, the more variance is explained in the embedding. Therefore, both MVU and MVE can be regarded as EDM-score-driven models. Moreover, MVE, being a nonconvex optimization model, is more aggressive in driving the EDM score up, while MVU, being a convex optimization model, is more computationally appealing. Our model strikes a balance between the two in the sense that it inherits the appealing features of both. It results in a convex optimization model whose objective consists of three parts (see Subsection 3.1 for more details).

Each part of the objective has its own purpose. The first part is a least-squares term that governs the deviation from the observed distances. The second is the nuclear norm minimization term, which, as we have argued before, contradicts the idea of maximizing the total variance. Hence, the second term is corrected by the third term, which involves a set of orthogonal axes approximating the true embedding space. The last term is crucial to our analysis and also accommodates situations where initial (valuable) information about the embedding space is available. To illustrate such a situation, we may simply think of the leading eigenvectors of the distance matrix estimator from ISOMAP as forming a good approximation to the true embedding space. When the three parts are combined, the model drives the EDM score up.

What makes our model more important is that it yields guaranteed error bounds under the uniform sampling rule. More precisely, we show that for the unknown n × n Euclidean distance matrix with embedding dimension r and under mild conditions, the average estimation error is controlled, with high probability, by a constant multiple of nr log(n)/m, where m is the sample size and the constant is independent of n, r and m. It follows from this error bound that our model produces an estimator of high accuracy as long as the sample size is of the order of nr log(n), which is roughly the degree of freedom of a symmetric hollow matrix of rank r up to a logarithmic factor in the matrix size. It is interesting to point out that with a special choice of model parameters, our model reduces to one of the subproblems solved by MVE. Hence, our theoretical result partially explains why MVE often leads to configurations of high quality. To our knowledge, it is the first such theoretical result that sheds light on the MVE model.

• An efficient computational method. Treating the EDM as the primary object not only benefits us in deriving the error-bound results, but also leads to an efficient numerical method. Previously, both the MVU and MVE models have had numerical difficulties when the data points are beyond moderate size; they may even have difficulties with a few hundred points when their corresponding slack models are to be solved. This probably explains why most publications using the two models do not report CPU time. On the contrary, our model allows us to develop a very fast inexact accelerated proximal gradient method (IAPG), even for problems with a few thousand data points. Our method takes full advantage of recent advances in IAPG, saving us from developing the corresponding convergence results. We are also able to develop theoretically optimal estimates of the model parameters, which gives a good indication of how we should set the parameter values in our implementation. Numerical results on both social networks and benchmark test problems in manifold learning show that our method can quickly produce embeddings of high quality.

### 1.5 Organization and Notation

The paper is organized as follows. Section 2 provides necessary background with the purpose of casting the MVU and MVE models as EDM-score-driven models. This viewpoint will greatly benefit us in understanding our model, which is described in Section 3 with more detailed interpretation. We report our error-bound results in Section 4, where only the main results are listed, with all the technical proofs moved to the Appendix. Section 5 contains an inexact accelerated proximal gradient method as well as the theoretically optimal estimates of the model parameters. We report our extensive numerical experiments in Section 6 and conclude the paper in Section 7.

Notation. Let S^n be the real vector space of n × n real symmetric matrices equipped with the trace inner product ⟨A, B⟩ := trace(AB) for A, B ∈ S^n and its induced Frobenius norm ‖·‖. Denote by S^n_+ the cone of symmetric positive semidefinite matrices. We use I to represent the n × n identity matrix and 1 to represent the vector of all ones. Let e_i, i = 1, …, n, be the vector with the i-th entry being one and the others zero. For a given X ∈ S^n, we let diag(X) denote the vector formed from the diagonal of X.

Below are some other notations to be used in this paper:

• For any X ∈ S^n, we denote by X_ij the (i, j)-th entry of X. We use O^n to denote the set of all n × n orthogonal matrices.

• For any X ∈ R^{m×n}, we use x_j to represent the j-th column of X, j = 1, …, n. Let α ⊆ {1, …, n} be an index set. We use X_α to denote the sub-matrix of X obtained by removing all the columns of X not in α.

• Let α and β be two index sets. For any X ∈ S^n, we use X_{αβ} to denote the sub-matrix of X obtained by removing all the rows of X not in α and all the columns of X not in β.

• We use “∘” to denote the Hadamard product of matrices, i.e., for any two matrices X and Y in R^{m×n}, the (i, j)-th entry of Z := X ∘ Y is Z_ij = X_ij Y_ij.

• For any X ∈ R^{m×n}, let ‖X‖₂ be the spectral norm of X, i.e., the largest singular value of X, and ‖X‖_* be the nuclear norm of X, i.e., the sum of the singular values of X. The infinity norm of X is denoted by ‖X‖_∞.

## 2 Background

This section contains three short parts. We first give a brief review of cMDS, only summarizing some of the key results that we are going to use. We then describe the MVU and MVE models, which are closely related to ours. Finally, we explain three most commonly used distance-sampling rules.

### 2.1 cMDS

cMDS has been well documented in Cox and Cox (2000); Borg and Groenen (2007). In particular, Section 3 of Pȩkalska et al. (2002) explains when it works. Below we only summarize its key results for future use. A matrix D ∈ S^n is called a Euclidean distance matrix (EDM) if there exist points p_1, …, p_n in R^r such that D_ij = ‖p_i − p_j‖² for i, j = 1, …, n, where R^r is called the embedding space and r is the embedding dimension when it is the smallest such r.

An alternative definition of EDM that does not involve any embedding points can be described as follows. Let S^n_h be the hollow subspace of S^n, i.e.,

 S^n_h := {X ∈ S^n ∣ diag(X) = 0}.

Define the almost positive semidefinite cone by

 K^n_+ := {A ∈ S^n ∣ x^T A x ≥ 0 for all x ∈ 1^⊥}, (1)

where 1^⊥ := {x ∈ R^n ∣ x^T 1 = 0}. It is well known (Schoenberg, 1935; Young and Householder, 1938) that D is an EDM if and only if

 −D ∈ S^n_h ∩ K^n_+.

Moreover, the embedding dimension r is determined by the rank of the doubly centered matrix JDJ, i.e.,

 r = rank(JDJ) and J := I − 11^T/n,

where J is known as the centering matrix.

Since −JDJ/2 is positive semidefinite, its spectral decomposition can be written as

 −(1/2) JDJ = P diag(λ_1, …, λ_n) P^T,

where P ∈ O^n is orthogonal and λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n are the eigenvalues arranged in nonincreasing order; positive semidefiniteness gives λ_i ≥ 0 for all i. Let P_1 denote the submatrix consisting of the first r columns (eigenvectors) of P. One set of embedding points is then given by

 (p_1, …, p_n)^T = P_1 diag(√λ_1, …, √λ_r). (2)

cMDS is built upon the above result. Suppose a pre-distance matrix D (i.e., D ∈ S^n_h and D ≥ 0) is known. It computes the embedding points by (2). Empirical evidence has shown that if the first r eigenvalues are positive and the absolute values of the remaining eigenvalues (they may be negative as D may not be a true EDM) are small, then cMDS often works well. Otherwise, it may produce misleading embedding points. For example, there are examples showing that ISOMAP might cut off too many eigenvalues, hence failing to produce a satisfactory embedding (see, e.g., the Teapots data example in Weinberger and Saul (2006)). Both the MVU and MVE models aim to avoid such a situation.
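The cMDS recipe above is short enough to state in code. The following NumPy sketch (ours, for illustration) doubly centers a squared-distance matrix, takes the top r eigenpairs of −JDJ/2, and forms the embedding (2); for a true EDM it reproduces the original distances exactly.

```python
import numpy as np

def cmds(D, r):
    """Classical MDS: double-center the squared-distance matrix D, take the
    top r eigenpairs of -J*D*J/2, and form the embedding points as in (2)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    lam, P = np.linalg.eigh(-0.5 * J @ D @ J)    # ascending eigenvalues
    lam, P = lam[::-1], P[:, ::-1]               # reorder: nonincreasing
    return P[:, :r] * np.sqrt(np.maximum(lam[:r], 0.0))

# Round trip: points -> squared EDM -> cMDS -> same squared EDM.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Y = cmds(D, 3)
D2 = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)
assert np.allclose(D, D2, atol=1e-8)
```

The recovered points Y agree with X only up to rotation, reflection and translation, but the pairwise distances are preserved.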

The EDM score has been widely used to interpret the percentage of the total variance explained by the embedding from the leading eigenvalues. The EDM score of the leading k eigenvalues is defined by

 EDMscore(k) := Σ_{i=1}^k λ_i / Σ_{i=1}^n λ_i, k = 1, 2, …, n.

It is only well defined when D is a true EDM. The justification for using EDM scores is deeply rooted in the classic work of Gower (1966), who showed that cMDS is a method of principal component analysis, but working with EDMs instead of correlation matrices.
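In code, the EDM scores are just normalized cumulative eigenvalue sums of −JDJ/2 (a small NumPy illustration of the definition above; the helper name is ours):

```python
import numpy as np

def edm_scores(D):
    """EDMscore(k) = (sum of the k leading eigenvalues of -J*D*J/2) over the
    total eigenvalue sum, for k = 1, ..., n."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    lam = np.sort(np.linalg.eigvalsh(-0.5 * J @ D @ J))[::-1]  # nonincreasing
    return np.cumsum(lam) / lam.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))                 # true embedding dimension 2
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)
s = edm_scores(D)
assert abs(s[1] - 1.0) < 1e-8                    # k = 2 already explains all variance
```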

The centering matrix J plays an important role in our analysis. It is the orthogonal projection onto the subspace 1^⊥ and hence J² = J. Moreover, we have the following. Let S^n_c be the geometric center subspace in S^n:

 S^n_c := {Y ∈ S^n ∣ Y1 = 0}. (3)

Let P_{S^n_c} denote the orthogonal projection onto S^n_c. Then we have P_{S^n_c}(X) = JXJ. That is, the doubly centered matrix JXJ, when viewed as a linear transformation of X, is the orthogonal projection of X onto S^n_c. Therefore, we have

 ⟨JXJ, X − JXJ⟩ = 0. (4)

It is also easy to verify the following result.

For any X ∈ S^n_h, we have

 X − JXJ = (1/2)(diag(−JXJ)1^T + 1 diag(−JXJ)^T).
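These projection facts are easy to confirm numerically. The following sketch checks J² = J, the orthogonality relation (4), and the displayed identity for a random hollow symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
J = np.eye(n) - np.ones((n, n)) / n        # centering matrix

# J is an orthogonal projection: J @ J == J.
assert np.allclose(J @ J, J)

X = rng.standard_normal((n, n))
X = X + X.T
np.fill_diagonal(X, 0.0)                   # hollow symmetric X

JXJ = J @ X @ J                            # projection of X onto S^n_c
assert np.allclose(JXJ.sum(axis=1), 0.0)   # JXJ @ 1 = 0

# Orthogonality (4): <JXJ, X - JXJ> = 0.
assert abs(np.sum(JXJ * (X - JXJ))) < 1e-9

# Identity: X - JXJ = (diag(-JXJ) 1^T + 1 diag(-JXJ)^T) / 2,
# i.e. entrywise (d_i + d_j) / 2 with d = diag(-JXJ).
d = np.diag(-JXJ)
assert np.allclose(X - JXJ, 0.5 * (d[:, None] + d[None, :]))
```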

### 2.2 MVU and MVE Models

The input of the MVU and MVE models is a set of partially observed distances

 {d²_ij : (i, j) ∈ Ω₀}, where Ω₀ ⊆ Ω := {(i, j) : 1 ≤ i < j ≤ n}.

Let p_1, …, p_n denote the desired embedding points in R^r. They should have the following properties. The pairwise distances should be faithful to the observed ones. That is,

 ‖p_i − p_j‖² ≈ d²_ij ∀ (i, j) ∈ Ω₀ (5)

and those points should be geometrically centered in order to remove the translational degree of freedom from the embedding:

 Σ_{i=1}^n p_i = 0. (6)

Let K := (⟨p_i, p_j⟩) be the Gram matrix of the embedding points. Then the conditions in (5) and (6) translate to

 K_ii − 2K_ij + K_jj ≈ d²_ij ∀ (i, j) ∈ Ω₀

and

 ⟨11^T, K⟩ = 0.

To encourage dimension reduction, MVU argues that the variance, which equals trace(K), should be maximized. In summary, the slack model (or the least-squares penalty model) of MVU takes the following form:

 max_K ⟨I, K⟩ − C Σ_{(i,j)∈Ω₀} (K_ii − 2K_ij + K_jj − d²_ij)²
 s.t. ⟨11^T, K⟩ = 0 and K ⪰ 0, (7)

where C > 0 is the penalty parameter that balances the trade-off between maximizing the variance and preserving the observed distances. See also Weinberger et al. (2007); Sun et al. (2006) for more variants of this problem.

The resulting EDM from the optimal solution K of (7) is defined by

 D_ij = K_ii − 2K_ij + K_jj,

and it satisfies −D ∈ S^n_h ∩ K^n_+. Empirical evidence shows that the EDM scores of the first few leading eigenvalues of −JDJ/2 are often large enough to explain a high percentage of the total variance.
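The map from a Gram matrix K to the corresponding squared EDM is the elementwise rule D_ij = K_ii − 2K_ij + K_jj, which vectorizes neatly (a small NumPy illustration; the helper name is ours):

```python
import numpy as np

def gram_to_edm(K):
    """D_ij = K_ii - 2*K_ij + K_jj: the squared EDM induced by a Gram matrix."""
    k = np.diag(K)
    return k[:, None] - 2.0 * K + k[None, :]

rng = np.random.default_rng(0)
P = rng.standard_normal((6, 2))
P -= P.mean(axis=0)                      # centering constraint (6)
K = P @ P.T                              # Gram matrix; <11^T, K> = 0
D = gram_to_edm(K)
assert np.allclose(D, ((P[:, None] - P[None, :]) ** 2).sum(-1))
assert abs(K.sum()) < 1e-9               # <11^T, K> = 0 after centering
```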

MVE seeks to improve the EDM scores in a more aggressive way. Suppose the targeted embedding dimension is r. MVE tries to maximize the gap between the leading r eigenvalues of K and the remaining eigenvalues. This gives rise to

 max Σ_{i=1}^r λ_i(K) − Σ_{i=r+1}^n λ_i(K)
 s.t. K_ii − 2K_ij + K_jj ≈ d²_ij ∀ (i, j) ∈ Ω₀,
  ⟨11^T, K⟩ = 0 and K ⪰ 0.

There are a few standard ways of dealing with the approximate constraints above. We are interested in the MVE slack model:

 max Σ_{i=1}^r λ_i(K) − Σ_{i=r+1}^n λ_i(K) − C Σ_{(i,j)∈Ω₀} (K_ii − 2K_ij + K_jj − d²_ij)²
 s.t. ⟨11^T, K⟩ = 0 and K ⪰ 0, (8)

where C > 0 is a penalty parameter. The MVE model (8) often yields higher EDM scores than the MVU model (7). However, (7) is an SDP problem, while (8) is nonconvex and can be solved by a sequential SDP method (see Shaw and Jebara, 2007).

### 2.3 Distance Sampling Rules

In this part, we describe how the observed distances indexed by Ω₀ are selected in practice. We assume that those distances are sampled from the unknown true Euclidean distances d̄_ij in the following fashion:

 d_ij = d̄_ij + η ξ_ij, (i, j) ∈ Ω₀, (9)

where the ξ_ij are i.i.d. noise variables with zero mean and unit variance, and η > 0 is a noise magnitude control factor. We note that in (9) it is the true Euclidean distance d̄_ij (rather than its squared quantity) that is being sampled. There are three commonly used rules to select Ω₀.

• Uniform sampling rule. The elements of Ω₀ are independently and identically sampled from Ω with a common probability p.

• k-nearest neighbors (k-NN) rule. (i, j) ∈ Ω₀ if and only if d_ij belongs to the k smallest distances in {d_il : l = 1, …, n, l ≠ i}.

• Unit ball rule. For a given radius ε > 0, (i, j) ∈ Ω₀ if and only if d_ij ≤ ε.

The k-NN and the unit ball rules are often used in low-dimensional manifold learning in order to preserve the local structure of the embedding points, while the uniform sampling rule is often employed in other dimensionality reduction tasks, including embedding a social network in a low-dimensional space.
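Under the uniform sampling rule, the observation model (9) can be simulated directly. The sketch below (ours; standard-normal noise is used as one convenient choice of ξ with zero mean and unit variance) draws Ω₀ by independent Bernoulli trials over all pairs and perturbs the true distances:

```python
import numpy as np

def sample_distances(Dbar_half, p, eta, seed=None):
    """Uniform sampling rule for model (9): each pair (i, j), i < j, enters
    Omega_0 independently with probability p; the observation is the true
    distance plus eta times unit-variance noise."""
    rng = np.random.default_rng(seed)
    n = Dbar_half.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    keep = rng.random(iu.size) < p                 # Bernoulli mask over Omega
    i, j = iu[keep], ju[keep]
    y = Dbar_half[i, j] + eta * rng.standard_normal(i.size)
    return i, j, y

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
Dbar_half = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))  # true distances
i, j, y = sample_distances(Dbar_half, p=0.5, eta=0.1, seed=1)
assert (i < j).all() and y.size == i.size
```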

## 3 A Convex Optimization Model for Distance Learning

Both MVU and MVE are trusted distance learning models in the following sense: they both produce a Euclidean distance matrix that is faithful to the observed distances, and they both encourage high EDM scores from the first few leading eigenvalues. However, it remains a difficult (theoretical) task to quantify how good the resulting embedding is. In this part, we propose a new learning model, which inherits the good properties of MVU and MVE. Moreover, we are able to quantify its quality by deriving error bounds on the resulting solutions under the uniform sampling rule. Below, we first describe our model, followed by a detailed interpretation.

### 3.1 Model Description

In order to facilitate the description of our model and to set the platform for our subsequent analysis, we write the sampling model (9) as an observation model. Define two matrices D̄ and D̄^(1/2) respectively by

 D̄ := (d̄²_ij) and D̄^(1/2) := (d̄_ij).

A sampled basis matrix X has the following form:

 X := (1/2)(e_i e_j^T + e_j e_i^T) for some (i, j) ∈ Ω.

For each (i, j) ∈ Ω₀, there exists a corresponding sampled basis matrix. We number them as X_1, …, X_m.

Define the corresponding observation operator O : S^n → R^m by

 O(A) := (⟨X_1, A⟩, …, ⟨X_m, A⟩)^T, A ∈ S^n. (10)

That is, O(A) samples all the elements of A specified by Ω₀. Let O* : R^m → S^n be its adjoint, i.e.,

 O*(z) = Σ_{l=1}^m z_l X_l, z ∈ R^m.

Thus, the sampling model (9) can be rewritten in the following compact form:

 y = O(D̄^(1/2)) + η ξ, (11)

where y := (y_1, …, y_m)^T is the observation vector and ξ := (ξ_1, …, ξ_m)^T is the noise vector.
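The operator O of (10) and its adjoint O* have a simple concrete form once Ω₀ is stored as index arrays: O reads the sampled entries of a symmetric matrix, and O* scatters a vector back onto the basis matrices X_l. A minimal NumPy sketch (names are ours), including a check of the adjoint identity ⟨O*(z), A⟩ = ⟨z, O(A)⟩:

```python
import numpy as np

def O(A, idx):
    """Observation operator (10): for symmetric A, <X_l, A> = A[i_l, j_l]
    with X_l = (e_i e_j^T + e_j e_i^T) / 2."""
    i, j = idx
    return A[i, j]

def O_adj(z, idx, n):
    """Adjoint O*(z) = sum_l z_l X_l (assumes the pairs in idx are distinct)."""
    i, j = idx
    M = np.zeros((n, n))
    M[i, j] = 0.5 * z
    M[j, i] = 0.5 * z
    return M

rng = np.random.default_rng(0)
n = 7
A = rng.standard_normal((n, n)); A = A + A.T          # symmetric test matrix
iu, ju = np.triu_indices(n, k=1)
sel = rng.random(iu.size) < 0.4                        # a random Omega_0
idx = (iu[sel], ju[sel])
z = rng.standard_normal(idx[0].size)
# Adjoint identity: <O*(z), A> = <z, O(A)>.
assert np.isclose(np.sum(O_adj(z, idx, n) * A), z @ O(A, idx))
```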

Since −JD̄J is positive semidefinite, we may assume that it has the following singular value decomposition (SVD):

 −J¯¯¯¯¯DJ=¯¯¯¯PDiag(¯¯¯λ)¯¯¯¯PT, (12)

where P̄ ∈ O^n is an orthogonal matrix and λ̄ := (λ̄_1, …, λ̄_n)^T is the vector of the eigenvalues of −JD̄J arranged in nonincreasing order, i.e., λ̄_1 ≥ λ̄_2 ≥ ⋯ ≥ λ̄_n ≥ 0.

Suppose that D̃ is a given initial estimator of the unknown matrix D̄, and that it has the following singular value decomposition

 −JD̃J = P̃ Diag(λ̃) P̃^T,

where P̃ ∈ O^n. In this paper, we always assume the embedding dimension r ≥ 1. Thus, for any given orthogonal matrix P ∈ O^n, we write P = [P_1, P_2] with P_1 ∈ R^{n×r} and P_2 ∈ R^{n×(n−r)}. For the given parameters ρ_1 > 0 and ρ_2 > 0, we consider the following convex optimization problem:

 min_{D∈S^n} (1/(2m)) ‖y∘y − O(D)‖² + ρ_1(⟨I, −JDJ⟩ − ρ_2⟨P̃_1P̃_1^T, −JDJ⟩)
 s.t. D ∈ S^n_h, −D ∈ K^n_+. (13)

This problem has an EDM as its variable, in contrast to MVU, MVE and other learning models (e.g., Javanmard and Montanari, 2013), which all use SDPs. The use of EDMs greatly benefits us in deriving the error bounds in the next section. Our model (13) tries to accomplish three tasks, as we explain below.
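To fix ideas, the objective of (13) can be evaluated directly from its ingredients. The sketch below (ours; the constraint set is not enforced, and P1 is any n × r matrix with orthonormal columns standing in for P̃_1) computes the data-fit term and the corrected trace term:

```python
import numpy as np

def objective13(D, y, idx, rho1, rho2, P1):
    """Objective of model (13): (1/(2m)) * ||y o y - O(D)||^2
       + rho1 * (<I, -JDJ> - rho2 * <P1 P1^T, -JDJ>)."""
    n, m = D.shape[0], y.size
    J = np.eye(n) - np.ones((n, n)) / n
    B = -J @ D @ J                                # -JDJ
    i, j = idx
    fit = np.sum((y ** 2 - D[i, j]) ** 2) / (2 * m)
    return fit + rho1 * (np.trace(B) - rho2 * np.trace(P1.T @ B @ P1))

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 3)); X -= X.mean(axis=0)   # centered points
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)            # true squared EDM
iu, ju = np.triu_indices(12, k=1)
y = np.sqrt(D[iu, ju])                                  # noiseless distances
P1 = np.linalg.qr(X)[0]                                 # orthonormal columns
# Perfect fit: with rho1 = 0 the objective vanishes at the true EDM.
assert abs(objective13(D, y, (iu, ju), 0.0, 2.0, P1)) < 1e-9
```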

### 3.2 Model Interpretation

The three tasks that model (13) tries to accomplish correspond to the three terms in its objective function. The first (quadratic) term is, up to the factor 1/(2m), nothing but

 Σ_{(i,j)∈Ω₀} (d²_ij − D_ij)²

corresponding to the quadratic terms in the slack models (7) and (8). Minimizing this term is essentially to find an EDM that minimizes the error rising from the sampling model (11).

The second term is actually the nuclear norm of $-JDJ$. Recall that in cMDS, the embedding points in (2) come from the spectral decomposition of $-JDJ$. Minimizing this term amounts to finding the smallest embedding dimension. However, as argued in both the MVU and MVE models, minimizing the nuclear norm works against the principal idea of maximizing variance. To alleviate this conflict, we need the third term $-\rho_1\rho_2\langle\tilde{P}_1\tilde{P}_1^T, -JDJ\rangle$.

In order to motivate the third term, let us consider an extreme case. Suppose the initial EDM $\tilde{D}$ is close enough to $D$ in the sense that the leading eigenspace of $-J\tilde{D}J$ (spanned by the columns of $\tilde{P}_1$) and the leading eigenspace of $-JDJ$ coincide. Then,

$$\big\langle \tilde{P}_1\tilde{P}_1^T,\ -JDJ \big\rangle = \sum_{i=1}^{r} \lambda_i,$$

where $\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues of $-JDJ$.

Hence, minimizing the third term is essentially maximizing the sum of the $r$ leading eigenvalues of $-JDJ$. Over the optimization process, the third term is likely to push the quantity

$$t := \sum_{i=1}^{r} \lambda_i$$

up, while the second term (the nuclear norm) forces the sum of the remaining eigenvalues

$$s := \sum_{i=r+1}^{n} \lambda_i$$

down. The consequence is that the EDM score

$$\mathrm{EDMscore}(r) = f(t, s) := \frac{t}{t+s}$$

gets higher. This is because

$$f(t_2, s_2) > f(t_1, s_1) \quad \forall\ t_2 > t_1 \ \text{and}\ s_2 < s_1.$$

Therefore, the EDM scores can be controlled by controlling the penalty parameters $\rho_1$ and $\rho_2$. The above heuristic observation is in agreement with our extensive numerical experiments.
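The eigenvalue identity and the monotonicity of $f$ above can be checked numerically; the following numpy sketch (our illustration, not from the paper) uses a random PSD matrix in place of $-JDJ$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3

# A PSD matrix M standing in for -JDJ.
A = rng.standard_normal((n, n))
M = A @ A.T
lam, P = np.linalg.eigh(M)            # eigenvalues in ascending order
P1 = P[:, -r:]                        # leading r eigenvectors

# When P1 spans the leading eigenspace, <P1 P1^T, M> equals the sum of
# the r leading eigenvalues, i.e. the quantity t above.
t = lam[-r:].sum()
s = lam[:-r].sum()
assert np.isclose(np.sum((P1 @ P1.T) * M), t)

# EDMscore(r) = t / (t + s) rises as t goes up and s goes down.
def f(t, s):
    return t / (t + s)

assert f(t + 1.0, s / 2) > f(t, s)
```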

Model (13) also covers the MVE model as a special case. Let $\rho_2 = 2$ and let $\tilde{D}$ be one of the iterates in the MVE SDP subproblems. The combined term

$$\big\langle I, -JDJ \big\rangle - 2\big\langle \tilde{P}_1\tilde{P}_1^T,\ -JDJ \big\rangle$$

is just the objective function in the MVE SDP subproblem. In other words, MVE keeps updating $\tilde{D}$ by solving the SDP subproblems. Therefore, our error-bound results also help justify why MVE often leads to higher EDM scores.

Before we go on to derive our promised error-bound results, we summarize the key points of our model (13). It is EDM based rather than SDP based, as in most existing research; the use of EDMs enables us to establish the error-bound results in the next section. It inherits the nice properties of the MVU and MVE models. We will also show that this model can be solved efficiently.

## 4 Error Bounds Under Uniform Sampling Rule

Suppose that $X_1, \ldots, X_m$ are independent and identically distributed (i.i.d.) random matrices over $\Omega$ with the common probability $1/|\Omega|$ (this assumption can be replaced by any positive sampling probabilities, at the cost of more complicated notation), i.e., for any $(i,j) \in \Omega$,

$$\mathbb{P}\Big(X_l = \tfrac{1}{2}\big(e_i e_j^T + e_j e_i^T\big)\Big) = \frac{1}{|\Omega|}, \quad l = 1, \ldots, m.$$

Thus, for any $A \in \mathcal{S}^n_h$, we have

$$\mathbb{E}\big(\langle A, X\rangle^2\big) = \frac{1}{2|\Omega|}\|A\|^2. \qquad (14)$$
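Identity (14) can be verified by direct enumeration when $\Omega$ consists of all off-diagonal pairs; a small numpy sketch (our illustration) follows:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6

omega = [(i, j) for i in range(n) for j in range(i + 1, n)]
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)   # hollow, so the diagonal never enters

# E(<A, X>^2) with X uniform over the basis matrices indexed by Omega:
# <A, X_{ij}> = A_ij, so the expectation is sum_{i<j} A_ij^2 / |Omega|.
lhs = sum(A[i, j] ** 2 for (i, j) in omega) / len(omega)
rhs = np.linalg.norm(A) ** 2 / (2 * len(omega))   # identity (14)
assert np.isclose(lhs, rhs)
```

The factor $2$ appears because the squared Frobenius norm of a hollow symmetric matrix counts each off-diagonal pair twice.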

Moreover, we assume that the i.i.d. noise variables $\xi_l$ in (9) have a bounded fourth moment, i.e., there exists a constant $\gamma > 0$ such that $\mathbb{E}(\xi_l^4) \le \gamma$ for $l = 1, \ldots, m$.

Let $\bar{D}$ be the unknown true EDM. Suppose that the positive semidefinite matrix $-J\bar{D}J$ has the singular value decomposition (12) with $\mathrm{rank}(-J\bar{D}J) = r$ and $\bar{P} = [\bar{P}_1,\ \bar{P}_2]$. We define the generalized geometric center subspace in $\mathcal{S}^n$ by (compare to (3))

$$T := \big\{\, Y \in \mathcal{S}^n \ \big|\ Y\bar{P}_1 = 0 \,\big\}.$$

Let $T^{\perp}$ be its orthogonal subspace. The orthogonal projections onto the two subspaces can hence be calculated respectively by

$$\mathcal{P}_T(A) := \bar{P}_2\bar{P}_2^T A\, \bar{P}_2\bar{P}_2^T \quad \text{and} \quad \mathcal{P}_{T^{\perp}}(A) := \bar{P}_1\bar{P}_1^T A + A\,\bar{P}_1\bar{P}_1^T - \bar{P}_1\bar{P}_1^T A\, \bar{P}_1\bar{P}_1^T.$$

It is clear that we have the following orthogonal decomposition:

$$A = \mathcal{P}_T(A) + \mathcal{P}_{T^{\perp}}(A) \quad \text{and} \quad \big\langle \mathcal{P}_T(A), \mathcal{P}_{T^{\perp}}(B) \big\rangle = 0 \quad \forall\, A, B \in \mathcal{S}^n. \qquad (15)$$

Moreover, we know from the definition of $\mathcal{P}_{T^{\perp}}$ that for any $A \in \mathcal{S}^n$,

$$\mathcal{P}_{T^{\perp}}(A) = \bar{P}_1\bar{P}_1^T A + \bar{P}_2\bar{P}_2^T A\, \bar{P}_1\bar{P}_1^T,$$

which implies that $\mathrm{rank}\big(\mathcal{P}_{T^{\perp}}(A)\big) \le 2r$. This yields, for any $A \in \mathcal{S}^n$,

$$\big\|\mathcal{P}_{T^{\perp}}(A)\big\|_* \le \sqrt{2r}\,\|A\|. \qquad (16)$$
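The decomposition (15), the rank bound, and inequality (16) are easy to confirm numerically. A numpy sketch (our illustration, with a randomly chosen orthogonal $\bar P$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 12, 3

# Orthogonal P = [P1 P2] via QR of a random matrix.
P, _ = np.linalg.qr(rng.standard_normal((n, n)))
P1, P2 = P[:, :r], P[:, r:]

def proj_T(A):
    """P_T(A) = P2 P2^T A P2 P2^T."""
    return P2 @ P2.T @ A @ P2 @ P2.T

def proj_Tperp(A):
    """P_{T^perp}(A) = P1 P1^T A + A P1 P1^T - P1 P1^T A P1 P1^T."""
    Q = P1 @ P1.T
    return Q @ A + A @ Q - Q @ A @ Q

A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = rng.standard_normal((n, n)); B = (B + B.T) / 2

# (15): orthogonal decomposition.
assert np.allclose(A, proj_T(A) + proj_Tperp(A))
assert np.isclose(np.sum(proj_T(A) * proj_Tperp(B)), 0)

# rank(P_{T^perp}(A)) <= 2r and the nuclear-norm bound (16).
nuc = np.abs(np.linalg.eigvalsh(proj_Tperp(A))).sum()
assert np.linalg.matrix_rank(proj_Tperp(A)) <= 2 * r
assert nuc <= np.sqrt(2 * r) * np.linalg.norm(A) + 1e-9
```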

For given $\rho_2 \ge 0$, define

$$\alpha(\rho_2) := \frac{1}{\sqrt{2r}}\big\|\bar{P}_1\bar{P}_1^T - \rho_2\tilde{P}_1\tilde{P}_1^T\big\|. \qquad (17)$$

Let $\zeta \in \mathbb{R}^m$ be the random vector defined by

$$\zeta = 2\,\mathcal{O}\big(\bar{D}^{(1/2)}\big) \circ \xi + \eta\,(\xi \circ \xi). \qquad (18)$$

The non-commutative Bernstein inequality provides probability bounds on the difference, measured in the spectral norm, between the sum of independent random matrices and its mean (see, e.g., Recht, 2011; Tropp, 2012; Gross, 2011). The following Bernstein inequality is taken from (Negahban and Wainwright, 2012, Lemma 7), where the independent random matrices are either bounded in the spectral norm or bounded in the Orlicz norm of random variables, i.e.,

$$\|x\|_{\psi_1} := \inf\big\{\, t > 0 \ \big|\ \mathbb{E}\exp(|x|/t) \le e \,\big\}.$$

Let $Z_1, \ldots, Z_m \in \mathcal{S}^n$ be independent random symmetric matrices with mean zero. Suppose that there exists $M > 0$ such that either $\|Z_l\|_2 \le M$ for all $l$, or $\big\|\|Z_l\|_2\big\|_{\psi_1} \le M$. Denote $\sigma^2 := \big\|\frac{1}{m}\sum_{l=1}^m \mathbb{E}(Z_l^2)\big\|_2$. Then, we have for any $t > 0$,

$$\mathbb{P}\Big(\Big\|\frac{1}{m}\sum_{l=1}^m Z_l\Big\|_2 \ge t\Big) \le 2n\,\max\Big\{\exp\Big(-\frac{mt^2}{4\sigma^2}\Big),\ \exp\Big(-\frac{mt}{2M}\Big)\Big\}.$$
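The $O\big(\sigma\sqrt{\log(2n)/m}\big)$ scaling predicted by this inequality can be observed in simulation; the following numpy sketch (our illustration, with an ad hoc sampling scheme) averages $m$ mean-zero symmetric matrices with unit spectral norm:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 5000

# Z_l places a random sign at a random off-diagonal position (i, j) and
# its mirror (j, i): mean zero, ||Z_l||_2 = 1.
Zsum = np.zeros((n, n))
for _ in range(m):
    i = rng.integers(0, n - 1)
    j = rng.integers(i + 1, n)
    eps = rng.choice([-1.0, 1.0])
    Zsum[i, j] += eps
    Zsum[j, i] += eps

avg_norm = np.linalg.norm(Zsum / m, 2)

# Bernstein predicts ||(1/m) sum Z_l||_2 = O(sigma * sqrt(log(2n) / m));
# here sigma^2 = ||E(Z^2)||_2 is of order 1/n, hence the scale below.
predicted_scale = np.sqrt(np.log(2 * n) / (n * m))
assert avg_norm < 10 * predicted_scale
```

The constant 10 in the check is a loose safety margin, not part of the inequality.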

Now we are ready to study the error bounds of the model (13). Denote the optimal solution of (13) by $D^*$. The following result represents the first major step towards our ultimate bound. It contains two bounds. The first bound (19) is on the norm-squared distance between $D^*$ and $\bar{D}$ under the observation operator $\mathcal{O}$. The second bound (20) is on the nuclear norm of $D^* - \bar{D}$. Both bounds are in terms of the Frobenius norm of $D^* - \bar{D}$.

Let $\zeta$ be the random vector defined in (18) and let $\kappa > 1$ be given. Suppose that $\rho_1 \ge \kappa\big\|\frac{1}{m}\mathcal{O}^*(\zeta)\big\|_2$, where $\mathcal{O}^*$ is the adjoint operator of $\mathcal{O}$. Then, we have

$$\frac{1}{2m}\big\|\mathcal{O}(D^* - \bar{D})\big\|^2 \le \big(\alpha(\rho_2) + 2\kappa\big)\rho_1\sqrt{2r}\,\|D^* - \bar{D}\| \qquad (19)$$

and

$$\|D^* - \bar{D}\|_* \le \frac{\kappa}{\kappa - 1}\big(\alpha(\rho_2) + 2\big)\sqrt{2r}\,\|D^* - \bar{D}\|. \qquad (20)$$

The second major technical result below shows that the sampling operator $\mathcal{O}$ satisfies the following restricted strong convexity (Negahban and Wainwright, 2012) in the set $\mathcal{C}(\tau)$ for any $\tau > 0$, where

$$\mathcal{C}(\tau) := \Big\{\, A \in \mathcal{S}^n_h \ \Big|\ \|A\|_\infty = \frac{1}{\sqrt{2}},\ \|A\|_* \le \sqrt{\tau}\,\|A\|,\ \mathbb{E}\big(\langle A, X\rangle^2\big) \ge \sqrt{\frac{256\log(2n)}{m\log 2}} \,\Big\}.$$

Let $\tau > 0$ be given. Suppose that $m \ge C_1 n\log(2n)$, where $C_1 > 0$ is a constant. Then, there exists a constant $C > 0$ such that for any $A \in \mathcal{C}(\tau)$, the following inequality holds with high probability:

$$\frac{1}{m}\|\mathcal{O}(A)\|^2 \ge \frac{1}{2}\mathbb{E}\big(\langle A, X\rangle^2\big) - \frac{256\,C^2\,\tau\,|\Omega|\log(2n)}{nm}.$$

Next, combining Proposition 4 and Lemma 4 leads to the following result.

Assume that there exists a constant $b > 0$ such that $\|\bar{D}\|_\infty \le b$. Let $\kappa > 1$ be given. Suppose that $\rho_1 \ge \kappa\big\|\frac{1}{m}\mathcal{O}^*(\zeta)\big\|_2$ and $\rho_2 \ge 0$. Furthermore, assume that $m \ge C_1 rn\log(2n)$ for some constant $C_1 > 0$. Then, with high probability, the estimation error $\|D^* - \bar{D}\|^2$ satisfies a bound involving $\rho_1$, $\rho_2$ and $\alpha(\rho_2)$.

This bound depends on the model parameters $\rho_1$ and $\rho_2$. In order to establish an explicit error bound, we need to estimate $\big\|\frac{1}{m}\mathcal{O}^*(\zeta)\big\|_2$ ($\alpha(\rho_2)$ will be estimated later), which depends on the i.i.d. random variables $\zeta_l$, $l = 1, \ldots, m$, given by (18). To this end, from now on we always assume that the i.i.d. random noises $\xi_l$, $l = 1, \ldots, m$, in the sampling model (11) satisfy the following sub-Gaussian tail condition.

###### Assumption 1

There exist positive constants $K_1$ and $K_2$ such that for all $t > 0$,

$$\mathbb{P}\big(|\xi_l| \ge t\big) \le K_1 \exp\big(-t^2/K_2\big).$$

By applying the Bernstein inequality (Lemma 4), we have

Let $\zeta$ be the random vector defined in (18). Assume that the noise magnitude control factor $\eta$ is bounded, and suppose that there exists $\omega > 0$ such that $\big\|\bar{D}^{(1/2)}\big\|_\infty \le \omega$. Then, there exists a constant $C_3 > 0$ such that, with high probability,

$$\Big\|\frac{1}{m}\mathcal{O}^*(\zeta)\Big\|_2 \le C_3\,\omega\sqrt{\frac{\log(2n)}{nm}}. \qquad (21)$$

This result suggests that $\rho_1$ can take the particular value

$$\rho_1 = \kappa\,\eta\,\omega\,C_3\sqrt{\frac{\log(2n)}{mn}}, \qquad (22)$$

where $\kappa > 1$. Our final step is to combine the two propositions above to get the following error bound.

Suppose that there exists a constant $b > 0$ such that $\|\bar{D}\|_\infty \le b$, and that the noise magnitude control factor $\eta$ is bounded. Assume the sample size satisfies $m \ge C_1 rn\log(2n)$ for some constant $C_1 > 0$. For any given $\kappa > 1$, let $\rho_1$ be given by (22) and $\rho_2 \ge 0$. Then, there exists a constant $C_4 > 0$ such that, with high probability,

$$\frac{\|D^* - \bar{D}\|^2}{|\Omega|} \le C_4\bigg(\big(\kappa\,\alpha(\rho_2) + 2\big)^2\eta^2\omega^2 + \frac{\kappa^2}{(\kappa-1)^2}\big(\alpha(\rho_2) + 2\big)^2 b^2\bigg)\frac{r\,|\Omega|\log(2n)}{nm}. \qquad (23)$$

The only remaining unknown parameter in (23) is $\alpha(\rho_2)$, through which $\rho_2$ enters the bound. It follows from (17) that

$$\big(\alpha(\rho_2)\big)^2 = \frac{1}{2r}\Big(\big\|\bar{P}_1\bar{P}_1^T\big\|^2 - 2\rho_2\big\langle \bar{P}_1\bar{P}_1^T, \tilde{P}_1\tilde{P}_1^T \big\rangle + \rho_2^2\big\|\tilde{P}_1\tilde{P}_1^T\big\|^2\Big). \qquad (24)$$

Since $\|\bar{P}_1\bar{P}_1^T\|^2 = \|\tilde{P}_1\tilde{P}_1^T\|^2 = r$ and $\langle \bar{P}_1\bar{P}_1^T, \tilde{P}_1\tilde{P}_1^T \rangle \ge 0$, we can bound $\alpha(\rho_2)$ by

$$\big(\alpha(\rho_2)\big)^2 \le \frac{1}{2}\big(1 + \rho_2^2\big).$$

This bound seems to suggest that $\rho_2 = 0$ (corresponding to nuclear norm minimization) would lead to a lower bound than other choices. In fact, there are better choices. The best choice of $\rho_2$ is the one that minimizes the right-hand side of the bound; it is given by (25) in Subsection 5.1, where we will show that it is a better choice than both $\rho_2 = 0$ and $\rho_2 = 2$ (see Proposition 5.1).

The major message from Theorem 4 is as follows. If the true Euclidean distance matrix is bounded and the noises are small (less than the true distances), then, in order to control the estimation error, we only need samples of the order $rn\log(2n)$, since $|\Omega| \le n(n-1)/2$. Note that $r$ is usually small ($r = 2$ or $3$). Therefore, the sample size is much smaller than $n(n-1)/2$, the total number of the off-diagonal entries. Moreover, since the degree of freedom of an $n \times n$ symmetric hollow matrix with rank $r$ is $nr - r(r+1)/2$ (we know from Lemma 2.1 that the rank of the true EDM satisfies $\mathrm{rank}(-J\bar{D}J) = r$), the sample size is close to the degree of freedom, up to a logarithmic factor, if the matrix size $n$ is large enough.
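The arithmetic behind this message is simple to reproduce; a short numpy sketch (our illustration, with $n$ and $r$ chosen arbitrarily):

```python
import numpy as np

n, r = 1000, 3

dof     = n * r - r * (r + 1) // 2   # degrees of freedom of a rank-r
                                     # symmetric hollow matrix
total   = n * (n - 1) // 2           # all off-diagonal pairs
m_order = int(r * n * np.log(n))     # sample size suggested by the bound

assert dof < m_order < total
print(dof, m_order, total)  # 2994 20723 499500
```

With $n = 1000$ and $r = 3$, the suggested sample size is about 4% of the total number of off-diagonal entries, and only a logarithmic factor above the degrees of freedom.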

## 5 Model Parameter Estimation and the Algorithm

In general, the choice of model parameters can be tailored to a particular application. A very useful property of our model (13) is that we can derive a theoretical estimate that serves as a guideline for the choice of the model parameters in our implementation. In particular, we set $\rho_1$ by (22) and prove that $\rho_2^*$ in (25) is a better choice than both $\rho_2 = 0$ (corresponding to nuclear norm minimization) and $\rho_2 = 2$ (the MVE model). The first part of this section studies the optimal choice of $\rho_2$, and the second part proposes an inexact accelerated proximal gradient method (IAPG) that is particularly suitable for our model.

### 5.1 Optimal Estimate of ρ2

It is easy to see from the inequality (23) that, in order to reduce the estimation error, the best choice of $\rho_2$ is the minimizer of $\alpha(\rho_2)$. We obtain from (24) that $(\alpha(\rho_2))^2$ is a strictly convex quadratic function of $\rho_2$, and its minimizer is

$$\rho_2^* = \frac{1}{r}\big\langle \bar{P}_1\bar{P}_1^T, \tilde{P}_1\tilde{P}_1^T \big\rangle = 1 + \frac{1}{r}\big\langle \bar{P}_1\bar{P}_1^T, \tilde{P}_1\tilde{P}_1^T - \bar{P}_1\bar{P}_1^T \big\rangle. \qquad (25)$$
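Formula (25) can be sanity-checked numerically: the following numpy sketch (our illustration; the perturbation used to build $\tilde P_1$ is an arbitrary assumption) verifies that $\rho_2^*$ minimizes (24), and in particular beats $\rho_2 = 0$ and $\rho_2 = 2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 10, 3

# P1 plays the role of \bar P_1; Pt1 is a perturbed copy playing \tilde P_1.
Pbar, _ = np.linalg.qr(rng.standard_normal((n, n)))
P1 = Pbar[:, :r]
Pt1, _ = np.linalg.qr(P1 + 0.1 * rng.standard_normal((n, r)))

def alpha_sq(rho2):
    """(alpha(rho2))^2 computed directly from (17)."""
    M = P1 @ P1.T - rho2 * (Pt1 @ Pt1.T)
    return np.linalg.norm(M) ** 2 / (2 * r)

rho2_star = np.sum((P1 @ P1.T) * (Pt1 @ Pt1.T)) / r   # formula (25)

# rho2_star minimizes the quadratic (24); in particular it beats
# rho2 = 0 (nuclear norm minimization) and rho2 = 2 (the MVE choice).
assert alpha_sq(rho2_star) <= alpha_sq(0.0) + 1e-12
assert alpha_sq(rho2_star) <= alpha_sq(2.0) + 1e-12
```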

The key technique that we are going to use to estimate $\rho_2^*$ is the Löwner operator. We express both $\bar{P}_1\bar{P}_1^T$ and $\tilde{P}_1\tilde{P}_1^T$ as values of this operator. We then show that the Löwner operator admits a first-order approximation, which will indicate the magnitude of $\rho_2^*$. This technique is extensively used by Miao et al. (2012). We briefly describe it below.

Denote $\delta := \|\bar{D} - \tilde{D}\|$. Assume that $\delta < \bar{\lambda}_r/2$. Define the scalar function $\phi: \mathbb{R} \to \mathbb{R}$ by

$$\phi(x) = \begin{cases} 1 & \text{if } x \ge \bar{\lambda}_r - \delta, \\[2pt] \dfrac{x - \delta}{\bar{\lambda}_r - 2\delta} & \text{if } \delta \le x \le \bar{\lambda}_r - \delta, \\[2pt] 0 & \text{if } x \le \delta. \end{cases} \qquad (26)$$

Let $\Phi$ be the corresponding Löwner operator with respect to $\phi$, i.e.,

$$\Phi(A) = P\,\mathrm{Diag}\big(\phi(\lambda_1(A)), \ldots, \phi(\lambda_n(A))\big)P^T, \quad A \in \mathcal{S}^n, \qquad (27)$$

where $P$ comes from the eigenvalue decomposition $A = P\,\mathrm{Diag}\big(\lambda_1(A), \ldots, \lambda_n(A)\big)P^T$. Immediately we have $\Phi(-J\bar{D}J) = \bar{P}_1\bar{P}_1^T$. We show this is also true for $-J\tilde{D}J$.

It follows from the perturbation result of Weyl for eigenvalues of symmetric matrices (Bhatia, 1997, p. 63) that

$$\big|\bar{\lambda}_i - \tilde{\lambda}_i\big| \le \big\|J(\bar{D} - \tilde{D})J\big\| \le \big\|\bar{D} - \tilde{D}\big\|, \quad i = 1, \ldots, n.$$

We must have

$$\tilde{\lambda}_i \ge \bar{\lambda}_r - \delta \ \ \text{for } i = 1, \ldots, r \qquad \text{and} \qquad \tilde{\lambda}_i \le \delta \ \ \text{for } i = r+1, \ldots, n.$$

We therefore have $\Phi(-J\tilde{D}J) = \tilde{P}_1\tilde{P}_1^T$.
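The argument above can be replayed numerically. The following numpy sketch (our illustration, with an arbitrary rank-$r$ PSD matrix in place of $-J\bar{D}J$) implements $\phi$ and $\Phi$ from (26) and (27) and checks that $\Phi$ returns the leading eigenprojection, and continues to do so under a perturbation smaller than $\delta$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 8, 3

# Rank-r PSD matrix B standing in for -J D J.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
lam_true = np.array([5.0, 4.0, 3.0])
B = U @ np.diag(lam_true) @ U.T
lam_r = lam_true.min()
delta = lam_r / 4            # the construction requires delta < lam_r / 2

def phi(x):
    """Scalar function (26)."""
    if x >= lam_r - delta:
        return 1.0
    if x <= delta:
        return 0.0
    return (x - delta) / (lam_r - 2 * delta)

def Phi(A):
    """Loewner operator (27): apply phi to the eigenvalues of A."""
    lam, P = np.linalg.eigh(A)
    return P @ np.diag([phi(x) for x in lam]) @ P.T

# Phi(B) recovers the projection onto the leading eigenspace, P1 P1^T.
assert np.allclose(Phi(B), U @ U.T)

# It stays a rank-r projection under a perturbation with norm below delta.
E = rng.standard_normal((n, n)); E = (E + E.T) / 2
E *= 0.5 * delta / np.linalg.norm(E)
assert np.linalg.matrix_rank(Phi(B + E)) == r
```

The second check mirrors the Weyl-perturbation step: eigenvalues move by at most $\|E\| < \delta$, so they stay in the flat regions of $\phi$ and are mapped exactly to 0 or 1.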

As a matter of fact, the scalar function $\phi$ defined by (26) is twice continuously differentiable (actually, analytic) on $(-\infty, \delta) \cup (\bar{\lambda}_r - \delta, +\infty)$, which contains all the eigenvalues of $-J\bar{D}J$. Therefore, we know from (Bhatia, 1997, Exercise V.3.9) that $\Phi$ is twice continuously differentiable near $-J\bar{D}J$ (actually, analytic near $-J\bar{D}J$). Therefore, under the condition that