Computing low-rank approximations of matrices is a classic computational problem, with a remarkable number of applications in science and engineering. Given an matrix , and a parameter , the goal is to compute a rank- matrix that minimizes the approximation loss
. Such an approximation can be found by computing the singular value decomposition of. However, since the input matrix is often very large, faster approximate algorithms for computing low-rank approximations have been studied extensively (see the surveys [Mah11, Woo14] and references therein). In particular, it is known for that for several interesting special classes of matrices, one can find an approximate low-rank solution using only a sublinear amount of time or samples from the input matrix. This includes algorithms for incoherent matrices [CR09], positive semidefinite matrices [MW17] and distance matrices [BW18]. Note that sub-linear time or sampling bounds are not achievable for general matrices, as a single large entry in a matrix can significantly influence the output, and finding such an entry could take time.
In this paper we focus on computing low-rank approximations of distance matrices, i.e., matrices whose entries are induced by distances between points in some metric space. Formally:
Definition (distance matrix).
A matrix is called a distance matrix if there is an associated metric space with and , such that for every .
Distance matrices occur in many applications, such as learning image manifolds [WS06], image understanding [TDSL00], protein structure analysis [HS93], and more. The recent survey [DPRV15] provides a comprehensive list. Common software packages such as Julia, MATLAB or R include operations specifically design to produce or process such matrices.
Motivated by these applications, a recent paper by Bakshi and Woodruff [BW18] introduced the first sub-linear time approximation algorithm for distance matrices. Their algorithm computes a rank- approximation of a distance matrix in time , where is an error parameter and is an arbitrarily small constant. Specifically, it outputs matrices and , that satisfy an additive approximation guarantee of the form:
where is the optimal low-rank approximation of . The result of [BW18] raises the question whether even faster algorithms for low-rank approximations of distance matrices are possible. This is the problem we address in this paper.
1.1 Our results
In this paper we present an algorithm for low-rank approximation of distance matrices that is both simpler and more efficient than the prior work. Specifically, we show:
Theorem 1.1 (upper bound).
There is a randomized algorithm that given a distance matrix , reads entries of , runs in time , and computes matrices that with probability
that with probabilitysatisfy
We complement the sample complexity of our algorithm with a matching lower bound on the number of entries of the input matrix that must be read by any algorithm.
Theorem 1.2 (lower bound).
Let and be such that . Any randomized and possibly adaptive algorithm that given a distance matrix , computes that satisfy , must read at least entries of in expectation. The lower bound holds even for symmetric distance matrices.
We include an empirical evaluation of our algorithm on synthetic and real data. The results validate that our approach attains good approximation with faster running time than existing methods.
1.2 Our techniques
On a high level, our algorithm follows the approach of [BW18]. The main idea is to use the result of Frieze, Kannan and Vempala [FKV04], which shows how to compute a solution satisfying Equation 1 in time, assuming the ability to sample a row (or a column) of the matrix with the probability at least proportional to its squared-norm.111Formally, the probability of selecting a given row should be .
Thus the main challenge is to estimate row and column norms in sub-linear time. Although this cannot be done for general matrices, distance matrices have additional structure (imposed by the triangle inequality), which makes the problem easier. Specifically, estimating column norms in distance matrices corresponds to computing, for all, the sum of distances (squared) from all points in to . [BW18] gives a sampling algorithm that computes an -approximation of those sums in sub-linear time. Since the approximation is pretty rough, they need to sample many more columns than the original algorithm of [FKV04], and then they apply the algorithm recursively to the sampled columns. The recursive nature of the algorithm makes the procedure and its analysis quite complex.
To avoid this issue, one could design an algorithm that computes a constant-factor approximation to row and column norms. In the symmetric case , this problem has been already studied in [Ind99]. Specifically, the latter paper developed an approximate comparator, which enables determining whether one row norm is approximately greater than another. Using standard sorting algorithms, one can approximately sort the rows by their norms using roughly comparisons, where each comparison involves sampling roughly entries of . Together with fully computing the norms of landmark rows, this approximate sorting yields sufficiently approximations of all row norms. Unfortunately, this approach does not immediately generalize to the asymmetric case, and exceeds the optimal number of samples by a few logarithmic factors.
Our solution is to estimate row and column norms up to a constant factor with a one-sided guarantee. Specifically, we construct estimations that (a) do not underestimate the true values and (b) the total sum of the estimations is comparable to the sum of the true values. This is sufficient to support a reduction to [FKV04], while making the estimation procedure much simpler. In the symmetric case, our procedure samples a random point , and then estimates the sum of the distances from to as a sum of the distance from to and the average distance from to . A simple application of the the triangle inequality shows this estimation provides the desired guarantee. We note that this idea was inspired by the construction of core-sets from [Che09], although the technical development is quite different (and much simpler).
After executing the algorithm of [FKV04], we still need one more step to compute the solution (as their method reports but not ). This amounts to a regression problem that can be solved by standard techniques. To obtain a tight sampling bound that avoids any logarithmic factor in , we use a recent solver of Chen and Price [CP17].
First let us note that an lower bound can be easily obtained. It is not hard to see that any matrix with entries in
is an (asymmetric) distance matrix, since the triangle inequality is satisfied trivially. If we choose a uniformly random matrix from, then any algorithm that computes a matrix satisfying Equation 1 with , must match a fraction of the entries exactly, yielding the lower bound.
Our lower bound is considerably more involved, and uses tools from communication complexity and random matrix theory. For simplicity, let us describe our techniques in the case . Consider the problem of reporting the majority bit of a random binary string of length . This requires reading of the input bits. If we stack together instances into an random binary matrix , then reporting the majority bit for a large fraction of rows requires reading input bits. This is our target lower bound.
The reduction proceeds by first shifting the values in from to , so that it becomes an (asymmetric) distance matrix. A naïve rank- approximation would be to replace each entry with , yielding a total squared Frobenius error of . However, the optimal rank- approximation is (essentially) to replace each row by its true mean value instead of
. By anti-concentration of the binomial distribution, in most rows the majority bit appearstimes more often than the minority bit. A simple calculation shows this leads to a constant additive advantage per row, and advantage over the whole matrix, of the optimal rank- approximation over the naïve one. Since , any algorithm that satisfies Equation 1 must attain a similar advantage.
By spectral properties of random matrices, the optimal rank- approximation of is essentially unique. In particular, the largest singular value of is much larger than the second-largest one. For technical reasons we need to sharpen this separation even further. We accomplish this by augmenting the matrix with an extra row with very large values, which corresponds to augmenting the metric space with an extra very far point. As a result, any algorithm that satisfies Equation 1 must approximately recover the mean values for a large fraction of the rows. This allows us to solve the majority problem, by reporting whether each row mean in the rank- approximation matrix is smaller or larger than . This yields the desired lower bound for asymmetric distance matrices. The result for symmetric distance matrices, and for general values of , builds on similar techniques.
Consider a distance matrix induced by two point sets, and , as defined in Section 1. If and for every ,222Throughout we use to denote for an integer . then we call a symmetric distance matrix. Otherwise we call it a bipartite distance matrix. Throughout, denotes the optimal rank- approximation of .
As mentioned earlier, our algorithm uses two sub-linear time algorithms as subroutines. They are formalized in the following two theorems. The first reduces low-rank approximation to sampling proportionally to row (or column) norms. We use to denote the th row of .
Theorem 2.1 ([Fkv04]).
Let be any matrix. Let be a sample of rows according to a probability distribution
rows according to a probability distributionthat satisfies for every . Then, in time we can compute from a matrix , that with probability satisfies
The second result approximately solves a regression problem while reading only a small number of columns of the input matrix.
Theorem 2.2 ([Cp17]).
There is a randomized algorithm that given matrices and , reads only columns of , runs in time , and returns that with probability satisfies
Since our sampling procedure evaluates the sum of squared distances (rather than just distances), we need the following approximate version of the triangle inequality.
For every in a metric space , .
By the triangle inequality, . By the inequality of means, . ∎
In this section we prove Theorem 1.1. The algorithm is stated in Algorithm 1. The main step in the analysis is to provide guarantees for the sampling probabilities computed in Steps 1 and 2 of the algorithm. They are specified by the following theorem.
There is a randomized algorithm that given a distance matrix , runs in time , reads entries of , and outputs sampling probabilities , that with probability satisfy for every .
Let be the metric space associated with . Let and be the pointsets associated with its rows and its columns, respectively. Choose a uniformly random and a uniformly random . For every , the output sampling probabilities are given by
All ’s can be computed in time and by reading entries of , since they only involve distances between to and between to . For every ,
On the other hand, in expectation over and we have , and , and . Thus,
By Markov’s inequality, with probability . Normalizing the ’s by their sum yields the theorem. ∎
We remark that if is a symmetric distance matrix, i.e., , the sampling probabilities can be simplified to choosing a single uniformly at random, and letting . The proof is similar to the above.
Proof of Theorem 1.1.
Consider Algorithm 1. By Theorem 3.1, the probabilities computed in Steps 1–2 are suitable for invoking Theorem 2.1. This ensures that the matrix computed in Steps 3–4 satisfies Equation 2. Theorem 2.2 guarantees that the matrix computed in Step 5 satisfies Equation 3. Putting these together, we have
and we can scale by a constant. This proves Equation 1. For the query complexity bound, observe that Theorem 2.1 reads rows and Theorem 2.2 reads columns of the matrix, yielding a total of queries. Finally, the running time is the sum of runnings times of Theorems 2.2, 2.1 and 3.1. ∎
4 Lower Bound
For a clearer presentation, in this section we prove the lower bound in the special case , for distance matrices that can be asymmetric (called bipartite in Definition 1). This case encompasses the main ideas. The full proof of Theorem 1.2 appears in the appendix. For concreteness, let us formally state the special case that will be proven in this section.
Let be such that and . Any randomized algorithm that given a distance matrix , computes that with probability satisfy , must read at least entries of in expectation.
4.1 Hard Distribution over Distance Matrices
By Yao’s principle, it suffices to construct a distribution over distance matrices, and prove the sampling lower bound for any deterministic algorithm that operates on inputs from that distribution. We begin by defining a suitable distribution over distance matrices and proving some useful properties.
In the majority problem, the goal is to compute the majority bit of an input bitstring. We will show the hardness of low-rank approximation via reduction from solving multiple random instances of the majority problem. The sample-complexity hardness of this problem is well-known, and is stated in the following lemma. The proof is included in the appendix.
Let be integers. Any deterministic algorithm that gets a uniformly random matrix as input, and outputs such that for every , , must read in expectation at least entries of .
Given and , let be constants that will be chosen later. ( will be sufficiently small and sufficiently large.) Let , and assume w.l.o.g. this is an integer by letting be sufficiently smaller. Note that in Lemma 4.1, we can symbolically replace the majority alphabet with any alphabet of size , and here we will use . Let be a uniformly random matrix. Let be its rows. We call each of its rows an instance (of the majority problem). Thus is an instance of the random multi-instance majority problem from Lemma 4.1 (with ). We begin by establishing some of its probabilistic properties.
Our goal is to solve via reduction to rank- approximation of distance matrices. To obtain a distribution over distance matrices, we first take and randomly permute its rows to obtain a matrix . The random permutation is denoted by .333The random permutation is for a technical reason and does not change the distribution. Specifically, it is to prevent the algorithm in Lemma 4.1 from focusing on a few fixed instances , , and never attempt the rest. Then, we add an additional th row to , whose entries are all equal . The matrix with the added row is denoted by .
First we show that is indeed a (bipartite) distance matrix.
Every supported is a distance matrix.
Consider a symbolic pointset where , such that corresponds to the rows of , and to the columns of . Our goal is to define a metric on such that for every and . We need to set the rest of the distances such that is indeed a metric – that is, such that satisfies the triangle inequality. For every we set . For every we set . Finally we need to set the distances from . By construction of we already have for every . We set all the remaining distances, for every , to also be .
We need to verify that for all distinct triplets , . Indeed, all distances are in . If then the inequality holds for any setting of and . Otherwise , hence necessarily either or , and in both cases as needed. ∎
By anti-concentration of the binomial distribution, it is known that in a random length- bistring, the majority bit is likely appear times more than the other bit.
Let . Let be a uniformly random majority instance. Then, for , the majority element of appears in it at least times with probability at least .
We call an instance in typical if its majority element appears in it at least times, where is the constant from Lemma 4.1. Otherwise, we call the instance atypical.
Let denote the event there are at least typical instances. By Markov’s inequality, .
We will also require some facts from random matrix theory about the spectrum of . Let denote the all-
’s vector in. Let denote the orthogonal projection on the subspace spanned by . The proofs of the following two lemmas are given in the appendix.
Suppose . With probability , .
In the next lemma and throughout, denotes the spectral norm of a matrix .
Let . Let be a rank- matrix such that . Then with probability , .
Bounds on low-rank approximation
Finally we give upper bounds on the approximation error allowed by Equation 1. For every instance , let denote its mean.
Let be the matrix in which each row equals times the mean of the corresponding row of . Then , where the first inequality is since has rank (each of its rows is a multiple of ). Note that the sum ranges only up to and not , since in the th row all entries are equal (to ) and thus it contributes to . This bounds the first summand in the lemma. To bound the second summand, note that each entry in the first rows of is at most , thus contributing in total to . The final row contributes . Recalling that , we have . ∎
For every , its mean minimizes the sum of squared differences from a single value, namely . In particular, . Furthermore, since , we have . Hence , and the corollary follows from Lemma 4.1. ∎
4.2 Invoking the Algorithm
Suppose we have a deterministic algorithm that given , returns , where and , such that
Let denote the restriction of to the first entries. By scaling (i.e., multiplying by a constant and by its reciprocal), we can assume w.l.o.g. that . Since the th row of equals , we have
This facts yields the following two lemmas, whose full proofs appear in the appendix.
We have , where is a constant that can be made arbitrarily small by choosing sufficiently large.
4.3 Solving Majority
Let denote the random instances in . Recall that we assigned them to rows of by a uniformly random permutation , that is, the th row of equals .
We use to solve the majority problem as follows. For each , if then we output that the majority is , and otherwise we output that the majority is . We say that solves the instance if the output is correct. Due to being random, the probability that solves any instance is identical. Denote this probability by . We need to show that .
Assume by contradiction that . By Markov’s inequality, with probability at least we have at least unsolved instances. Since by there are only atypical instances, we have at least unsolved typical instances. Denote by the set of unsolved typical instances. Consider such instance . Suppose its majority element is . Then, since it is typical,
On the other hand, since is unsolved then . Hence by Lemma 4.2, . Therefore, noting that , we have
On the other hand, by Equation 7,
Together, . Let us upper-bound both terms. For the first term, we simply use . For the second term, note that . Together with Lemma 4.2, . Thus by Lemma 4.1, . By Equation 5, the latter is . Plugging both upper bounds,
Since is fixed, choosing sufficiently small and sufficiently large leads to a contradiction.
Thus , meaning the reduction solves each instance in the majority problem with probability at least . Accounting for the conditioning on and , the determined low-rank approximation algorithm from Section 4.2 solves a random instance of Lemma 4.1 with probability at least (the constants can be scaled without changing the lower bound). Hence, it requires reading at least bits from the matrix, which proves Theorem 4.1.
In this section, we evaluate the empirical performance of Algorithm 1 compared to the existing methods in the literature: conventional SVD, the algorithm of [BW18] (BW), and the input-sparsity time algorithm of Clarkson and Woodruff [CW17] (IS). For SVD we use numpy’s linear algebra package.444https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.linalg.html. This performs full SVD. The iterative SVD algorithms built into MATLAB and Python yielded errors larger by a few orders of magnitude than the reported methods, so they are not included. The experimental setup is analogous to that in [BW18]. Specifically, we consider two datasets:
Synthetic clustering dataset. This data set is generated using the scikit-learn package. We generate points with features and partition the points into clusters. As observed in our experiments, the dataset is expected to have a good rank- approximation.
MNIST dataset. The dataset contains handwritten characters, and each is considered a point. We subsample points.
For each dataset we construct a symmetric distance matrix . We use four distances : Manhattan (), Euclidean (), Chebyshev () and Canberra555The Canberra distance between vectors is defined as . (). Figures 1 and 2 show the approximation error for each distance on each dataset, for varying values of the rank . Note that SVD achieves the optimal approximation error. Table 1 lists the running times for . Figure 3 shows the running time of our algorithm for MNIST subsampled to varying sizes, for .
|Metric||SVD||IS||BW||This Work||SVD||IS||BW||This Work|
P. Indyk, A. Vakilian and T. Wagner were supported by funds from the MIT-IBM Watson AI Lab, NSF, and Simons Foundation. D. Woodruff was supported partly by the National Science Foundation under Grant No. CCF-1815840, and this work was done partly while he was visiting the Simons Institute for the Theory of Computing. The authors would also like to thank Ainesh Bakshi for implementing the algorithm in this paper and producing our empirical results.
- [BGPW16] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein, Information lower bounds via self-reducibility, Theory of Computing Systems 59 (2016), no. 2, 377–396.
- [BR14] Mark Braverman and Anup Rao, Information equals amortized communication, IEEE Transactions on Information Theory 60 (2014), no. 10, 6058–6069.
- [BW18] Ainesh Bakshi and David Woodruff, Sublinear time low-rank approximation of distance matrices, Advances in Neural Information Processing Systems, 2018, pp. 3786–3796.
On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications, SIAM Journal on Computing 39 (2009), no. 3, 923–947.
- [CP17] Xue Chen and Eric Price, Condition number-free query and active learning of linear families, arXiv preprint arXiv:1711.10051 (2017).
- [CR09] Emmanuel J Candès and Benjamin Recht, Exact matrix completion via convex optimization, Foundations of Computational mathematics 9 (2009), no. 6, 717.
- [CW17] Kenneth L Clarkson and David P Woodruff, Low-rank approximation and regression in input sparsity time, Journal of the ACM (JACM) 63 (2017), no. 6, 54, (first appeared in STOC’13).
- [DPRV15] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli, Euclidean distance matrices: essential theory, algorithms, and applications, IEEE Signal Processing Magazine 32 (2015), no. 6, 12–30.
- [FKV04] Alan Frieze, Ravi Kannan, and Santosh Vempala, Fast monte-carlo algorithms for finding low-rank approximations, Journal of the ACM (JACM) 51 (2004), no. 6, 1025–1041.
Ohad N Feldheim and Sasha Sodin,
A universality result for the smallest eigenvalues of certain sample covariance matrices, Geometric And Functional Analysis 20 (2010), no. 1, 88–123.
- [HS93] Liisa Holm and Chris Sander, Protein structure comparison by alignment of distance matrices, Journal of molecular biology 233 (1993), no. 1, 123–138.
- [HW03] Alan J Hoffman and Helmut W Wielandt, The variation of the spectrum of a normal matrix, Selected Papers Of Alan J Hoffman: With Commentary, World Scientific, 2003, pp. 118–120.
- [Ind99] Piotr Indyk, A sublinear time approximation scheme for clustering in metric spaces, Foundations of Computer Science, 1999. 40th Annual Symposium on, IEEE, 1999, pp. 154–159.
Michael W Mahoney, Randomized algorithms for matrices and data
, Foundations and Trends® in Machine Learning3 (2011), no. 2, 123–224.
- [MW17] Cameron Musco and David P Woodruff, Sublinear time low-rank approximation of positive semidefinite matrices, Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, IEEE, 2017, pp. 672–683.
- [RV10] Mark Rudelson and Roman Vershynin, Non-asymptotic theory of random matrices: extreme singular values, Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, World Scientific, 2010, pp. 1576–1602.
- [Tao10] Terence Tao, 254a, notes 3a: Eigenvalues and sums of hermitian matrices, https://terrytao.wordpress.com/2010/01/12/254a-notes-3a-eigenvalues-and-sums-of-hermitian-matrices, 2010.
- [TDSL00] Joshua B Tenenbaum, Vin De Silva, and John C Langford, A global geometric framework for nonlinear dimensionality reduction, science 290 (2000), no. 5500, 2319–2323.
- [Tho76] RC Thompson, The behavior of eigenvalues and singular values under perturbations of restricted rank, Linear Algebra and its Applications 13 (1976), no. 1-2, 69–78.
- [Woo14] David P Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science 10 (2014), no. 1–2, 1–157.
Kilian Q Weinberger and Lawrence K Saul, Unsupervised learning of image
manifolds by semidefinite programming
, International journal of computer vision70 (2006), no. 1, 77–90.
Appendix A Deferred Proofs from Section 4
a.1 Preliminary Lemmas
The following lemmas are known and we include their proofs for completeness.
Lemma (hardness of majority, Lemma 4.1 restated).
Let be integers. Any deterministic algorithm that gets a uniformly random matrix as input, and outputs such that for every , , must read in expectation at least entries of .
The reduction is from the distributional Gap-Hamming communication problem, which is defined as follows. Alice has a bit string in