Inference tasks encountered in natural language processing, graph inference and manifold learning employ the singular value decomposition (SVD) as a first step to reduce dimensionality while retaininguseful
structure in the input. Such spectral embeddings go under various guises: Principle Component Analysis (PCA), Latent Semantic Indexing (natural language processing), Kernel Principal Component Analysis, commute time and diffusion embeddings of graphs, to name a few. In this paper, we present acompressive approach for accomplishing SVD-based dimensionality reduction, or embedding, without actually performing the computationally expensive SVD step.
The setting is as follows. The input is represented in matrix form. This matrix could represent the adjacency matrix or the Laplacian of a graph, the probability transition matrix of a random walker on the graph, a bag-of-words representation of documents, the action of a kernel on a set ofpoints (kernel PCA) such as
where denotes the indicator function or matrices derived from -nearest-neighbor graphs constructed from . We wish to compute a transformation of the rows of this matrix which succinctly captures the global structure of via euclidean distances (or similarity metrics derived from the -norm, such as normalized correlations). A common approach is to compute a partial SVD of , , and to use it to embed the rows of into a -dimensional space using the rows of , for some function . The embedding of the variable corresponding to the -th row of the matrix is the -th row of . For example, corresponds to Principal Component Analysis (PCA): the -dimensional rows of are projections of the -dimensional rows of along the first principal components, . Other important choices include used to cut graphs  and for commute time embedding of graphs . Inference tasks such as (unsupervised) clustering and (supervised) classification are performed using -based pairwise similarity metrics on the embedded coordinates (rows of ) instead of the ambient data (rows of ).
Beyond the obvious benefit of dimensionality reduction from to , embeddings derived from the leading partial-SVD can often be interpreted as denoising, since the “noise” in matrices arising from real-world data manifests itself via the smaller singular vectors of (e.g., see , which analyzes graph adjacency matrices). This is often cited as a motivation for choosing PCA over “isotropic” dimensionality reduction techniques such as random embeddings, which, under the setting of the Johnson-Lindenstrauss (JL) lemma, can also preserve structure.
The number of singular vectors needed to capture the structure of an matrix grows with its size, and two bottlenecks emerge as we scale: (a) The computational effort required to extract a large number of singular vectors using conventional iterative methods such as Lanczos or simultaneous iteration or approximate algorithms like Nystrom ,  and Randomized SVD  for computation of partial SVD becomes prohibitive (scaling as , where is the number of non-zeros in ) (b) the resulting -dimensional embedding becomes unwieldy for use in subsequent inference steps.
Approach and Contributions: In this paper, we tackle these scalability bottlenecks by focusing on what embeddings are actually used for: computing -based pairwise
similarity metrics typically used for supervised or unsupervised learning. For example, K-means clustering uses pairwise Euclidean distances, and SVM-based classification uses pairwise inner products. We therefore ask the following question: “Is it possible to compute an embedding which captures the pairwise euclidean distances between the rows of the spectral embedding, while sidestepping the computationally expensive partial SVD?” We answer this question in the affirmative by presenting a compressive algorithm which directly computes a low-dimensional embedding.
There are two key insights that drive our algorithm:
By approximating by a low-order () polynomial, we can compute the embedding iteratively using matrix-vector products of the form or .
The iterations can be computed compressively: by virtue of the celebrated JL lemma, the embedding geometry is approximately captured by a small number of randomly picked starting vectors.
The number of passes over , and time complexity of the algorithm are , and respectively. These are all independent of the number of singular vectors whose effect we wish to capture via the embedding. This is in stark contrast to embedding directly based on the partial SVD. Our algorithm lends itself to parallel implementation as a sequence of matrix-vector products interlaced with vector additions, run in parallel across randomly chosen starting vectors. This approach significantly reduces both computational complexity and embedding dimensionality relative to partial SVD. A freely downloadable Python implementation of the proposed algorithm that exploits this inherent parallelism can be found in .
2 Related work
As discussed in Section 3.1, the concept of compressive measurements forms a key ingredient in our algorithm, and is based on the JL lemma . The latter, which provides probabilistic guarantees on approximate preservation of the Euclidean geometry for a finite collection of points under random projections, forms the basis for many other applications, such as compressive sensing .
We now mention a few techniques for exact and approximate SVD computation, before discussing algorithms that sidestep the SVD as we do. The time complexity of the full SVD of an matrix is (for ). Partial SVDs are computed using iterative methods for eigen decompositions of symmetric matrices derived from such as and . The complexity of standard iterative eigensolvers such as simultaneous iteration and the Lanczos method scales as , where denotes the number of non-zeros of .
The leading singular value, vector triplets minimize the matrix reconstruction error under a rank constraint: they are a solution to the optimization problem , where denotes the Frobenius norm. Approximate SVD algorithms strive to reduce this error while also placing constraints on the computational budget and/or the number of passes over . A commonly employed approximate eigendecomposition algorithm is the Nystrom method ,  based on random sampling of columns of , which has time complexity . A number of variants of the Nystrom method for kernel matrices like (1) have been proposed in the literature. These aim to improve accuracy using preprocessing steps such as -means clustering  or random projection trees . Methods to reduce the complexity of the Nystrom algorithm to ,  enable Nystrom sketches that see more columns of . The complexity of all of these grow as . Other randomized algorithms, involving iterative computations, include the Randomized SVD . Since all of these algorithms set out to recover
-leading eigenvectors (exact or otherwise), their complexity scales as.
We now turn to algorithms that sidestep SVD computation. In , , vertices of a graph are embedded based on diffusion of probability mass in random walks on the graph, using the power iteration run independently on random starting vectors, and stopping “prior to convergence.” While this approach is specialized to probability transition matrices (unlike our general framework) and does not provide explicit control on the nature of the embedding as we do, a feature in common with the present paper is that the time complexity of the algorithm and the dimensionality of the resulting embedding are independent of the number of eigenvectors captured by it. A parallel implementation of this algorithm was considered in ; similar parallelization directly applies to our algorithm. Another specific application that falls within our general framework is the commute time embedding on a graph, based on the normalized adjacency matrix and weighing function , . Approximate commute time embeddings have been computed using Spielman-Teng solvers ,  and the JL lemma in . The complexity of the latter algorithm and the dimensionality of the resulting embedding are comparable to ours, but the method is specially designed for the normalized adjacency matrix and the weighing function . Our more general framework would, for example, provide the flexibility of suppressing small eigenvectors from contributing to the embedding (e.g, by setting ).
Thus, while randomized projections are extensively used in the embedding literature, to the best of our knowledge, the present paper is the first to develop a general compressive framework for spectral embeddings derived from the SVD. It is interesting to note that methods similar to ours have been used in a different context, to estimate theempirical distribution
of eigenvalues of a large hermitian matrix, . These methods use a polynomial approximation of indicator functions and random projections to compute an approximate histogram of the number of eigenvectors across different bands of the spectrum: .
We first present the algorithm for a symmetric matrix . Later, in Section 3.5, we show how to handle a general matrix by considering a related symmetric matrix. Let denote the eigenvalues of sorted in descending order and their corresponding unit-norm eigenvectors (chosen to be orthogonal in case of repeated eigenvalues). For any function , we denote by the symmetric matrix . We now develop an algorithm to compute a dimensional embedding which approximately captures pairwise euclidean distances between the rows of the embedding .
Rotations are inconsequential: We first observe that rotation of basis does not alter -based similarity metrics. Since satisfies , pairwise distances between the rows of are equal to corresponding pairwise distances between the rows of . We use this observation to compute embeddings of the rows of rather than those of .
3.1 Compressive embedding
Suppose now that we know . This constitutes an -dimensional embedding, and similarity queries between two “vertices” (we refer to the variables corresponding to rows of as vertices, as we would for matrices derived from graphs) requires operations. However, we can reduce this time to by using the JL lemma, which informs us that pairwise distances can be approximately captured by compressive projection onto dimensions.
Specifically, for , let denote an matrix with i.i.d. entries drawn uniformly at random from . According to the JL lemma, pairwise distances between the rows of approximate pairwise distances between the rows of with high probability. In particular, the following statement holds with probability at least : , for any two rows of .
The key take-aways are that (a) we can reduce the embedding dimension to , since we are only interested in pairwise similarity measures, and (b) We do not need to compute . We only need to compute . We now discuss how to accomplish the latter efficiently.
3.2 Polynomial approximation of embedding
Direct computation of from the eigenvectors and eigenvalues of , as would suggest, is expensive (). However, we now observe that computation of is easy when is a polynomial. In this case, for some , so that can be computed as a sequence of matrix-vector products interlaced with vector additions run in parallel for each of the columns of . Therefore, they only require flops. Our strategy is to approximate by , where is an -th order polynomial approximation of . We defer the details of computing a “good” polynomial approximation to Section 3.4. For now, we assume that one such approximation is available and give bounds on the loss in fidelity as a result of this approximation.
3.3 Performance guarantees
The spectral norm of the “error matrix” satisfies , where the spectral norm of a matrix , denoted by refers to the induced -norm. For symmetric matrices, , where are the eigenvalues of . Letting denote the unit vector along the -th coordinate of , the distance between the -th rows of can be written as
Similarly, we have that . Thus pairwise distances between the rows of approximate those between the rows of . However, the distortion term is additive and must be controlled by carefully choosing , as discussed in Section 4.
Applying the JL lemma  to the rows of , we have that when with i.i.d. entries drawn uniformly at random from , the embedding captures pairwise distances between the rows of up to a multiplicative distortion of with high probability:
Let denote an -th order polynomial such that: and an matrix with entries drawn independently and uniformly at random from , where is an integer satisfying . Let denote the mapping from the -th row of to the -th row of . The following statement is true with probability at least :
for any two rows of . Furthermore, there exists an algorithm to compute each of the columns of in flops independent of its other columns which makes passes over ( is the number of non-zeros in ).
3.4 Choosing the polynomial approximation
We restrict attention to matrices which satisfy , which implies that . We observe that we can trivially center and scale the spectrum of any matrix to satisfy this assumption when we have the following bounds: and via the rescaling and centering operation given by: and by modifying to .
In order to compute a polynomial approximation of , we need to define the notion of “good” approximation. We showed in Section 3.3 that the errors introduced by the polynomial approximation can be summarized by furnishing a bound on the spectral norm of the error matrix : Since , what matters is how well we approximate the function at the eigenvalues of . Indeed, if we know the eigenvalues, we can minimize by minimizing . This is not a particularly useful approach, since computing the eigenvalues is expensive. However, we can use our prior knowledge of the domain from which the matrix comes from to penalize deviations from differently for different values of . For example, if we know the distribution of the eigenvalues of , we can minimize the average error
. In our examples, for the sake of concreteness, we assume that the eigenvalues are uniformly distributed overand give a procedure to compute an -th order polynomial approximation of that minimizes .
A numerically stable procedure to generate finite order polynomial approximations of a function over with the objective of minimizing is via Legendre polynomials . They satisfy the recursion and are orthogonal: . Therefore we set where . We give a method in Algorithm 1 that uses the Legendre recursion to compute using matrix-vector products and vector additions. The coefficients are used to compute by adding weighted versions of .
3.5 Embedding general matrices
We complete the algorithm description by generalizing to any matrix (not necessarily symmetric) such that . The approach is to utilize Algorithm 1 to compute an approximate -dimensional embedding of the symmetric matrix . Let be an SVD of (). Consider the following spectral mapping of the rows of to the rows of and the columns of to the rows of . It can be shown that the unit-norm orthogonal eigenvectors of take the form and , , and their corresponding eigenvalues are and respectively. The remaining eigenvalues of are equal to . Therefore, we call FastEmbedEIG with and is an matrix (entries drawn independently and uniformly at random from ). Let and denote the first and last rows of . From Theorem 1, we know that, with overwhelming probability, pairwise distances between any two rows of approximates those between corresponding rows of . Similarly, pairwise distances between any two rows of approximates those between corresponding rows of .
4 Implementation considerations
We now briefly go over implementation considerations before presenting numerical results in Section 5.
Spectral norm estimates In order to ensure that the eigenvalues of are within as we have assumed, we scale the matrix by its spectral norm (). To this end, we obtain a tight lower bound (and a good approximation) on the spectral norm using power iteration ( iterates on randomly chosen starting vectors), and then scale this up by a small factor () for our estimate (typically an upper bound) for .
Polynomial approximation order : The error in approximating by , as measured by is a non-increasing function of the polynomial order . Reduction in often corresponds to a reduction in that appears as a bound on distortion in Theorem 1. “Smooth” functions generally admit a lower order approximation for the same target error , and hence yield considerable savings in algorithm complexity, which scales linearly with .
Polynomial approximation method: The rate at which decreases as we increase depends on the function used to compute (by minimizing ). The choice yields the Legendre recursion used in Algorithm 1, whereas corresponds to the Chebyshev recursion, which is known to result in fast convergence. We defer to future work a detailed study of the impact of alternative choices for on .
Denoising by cascading In large-scale problems, it may be necessary to drive the contribution from certain singular vectors to zero. In many settings, singular vectors with smaller singular values correspond to noise. The number of such singular values can scale as fast as . Therefore, when we place nulls (zeros) in , it is desirable to ensure that these nulls are pronounced after we approximate by . We do this by computing , where is an -th order approximation of . The small values in the polynomial approximation of which correspond to (nulls which we have set) get amplified when we pass them through the non-linearity.
5 Numerical results
While the proposed approach is particularly useful for large problems in which exact eigendecomposition is computationally infeasible, for the purpose of comparison, our results are restricted to smaller settings where the exact solution can be computed. We compute the exact partial eigendecomposition using the ARPACK library (called from MATLAB). For a given choice of weighing function , the associated embedding is compared with the compressive embedding returned by Algorithm 1. The latter was implemented in Python using the Scipy’s sparse matrix-multiplication routines and is available for download from .
We consider two real world undirected graphs in  for our evaluation, and compute embeddings for the normalized adjacency matrix (, where is a diagonal matrix with row sums of the adjacency matrix ; the eigenvalues of lie in ) for graphs. We study the accuracy of embeddings by comparing pairwise normalized correlations between -th rows of given by with those predicted by the approximate embedding ( is short-hand for the -th row of ).
DBLP collaboration network  is an undirected graph on vertices with edges. We compute the leading eigenvectors of the normalized adjacency matrix . The smallest of the five hundred eigenvalues is , so we set and in Algorithm 1 and compare the resulting embedding with . We demonstrate the dependence of the quality of the embedding returned by the proposed algorithm on two parameters: (i) number of random starting vectors , which gives the dimensionality of the embedding and (ii) the boosting/cascading parameter using this dataset.
Dependence on the number of random projections : In Figure (0(a)), ranges from to and plot the 1-st, 5-th, 25-th, 50-th, 75-th, 95-th and 99-th percentile values of the deviation between the compressive normalized correlation (from the rows of ) and the corresponding exact normalized correlation (rows of ). The deviation decreases with increasing , corresponding to -norm concentration (JL lemma), but this payoff saturates for large values of as polynomial approximation errors start to dominate. From the -th and -th percentile curves, we see that a significant fraction (90%) of pairwise normalized correlations in lie within of their corresponding values in when . For Figure (0(a)), we use matrix-vector products for each randomly picked starting vector and set cascading parameter for the algorithm in Section 4.
Dependence on cascading parameter : In Section 4 we described how cascading can help suppress the contribution to the embedding of the eigenvectors whose eigenvalues lie in regions where we have set . We illustrate the importance of this boosting procedure by comparing the quality of the embedding for and (keeping the other parameters of the algorithm in Section 4 fixed: matrix-vector products for each of randomly picked starting vectors). We report the results in Figure (0(b)) where we plot percentile values of compressive normalized correlation (from the rows of ) for different values of the exact normalized correlation (rows of ). For , the polynomial approximation of does not suppress small eigenvectors. As a result, we notice a deviation (bias) of the 50-percentile curve (green) from the ideal dotted line drawn (Figure 0(b) left). This disappears for (Figure 0(b) right).
The running time for our algorithm on a standard workstation was about two orders of magnitude smaller than partial SVD using off-the-shelf sparse eigensolvers (e.g., the dimensional embedding of the leading eigenvectors of the DBLP graph took 1 minute whereas their exact computation took minutes). A more detailed comparison of running times is beyond the scope of this paper, but it is clear that the promised gains in computational complexity are realized in practice.
Application to graph clustering for the Amazon co-purchasing network  : This is an undirected graph on vertices with edges. We illustrate the potential downstream benefits of our algorithm by applying -means clustering on embeddings (exact and compressive) of this network. For the purpose of our comparisons, we compute the first 500 eigenvectors for explicitly using an exact eigensolver, and use an -dimensional compressive embedding which captures the effect of these, with , where is the 500th eigenvalue. We compare this against the usual spectral embedding using the first eigenvectors of : . We keep the dimension fixed at in the comparison because -means complexity scales linearly with it, and quickly becomes the bottleneck. Indeed, our ability to embed a large number of eigenvectors directly into a low dimensional space () has the added benefit of dimensionality reduction within the subspace of interest (in this case the largest eigenvectors).
We consider instances of -means clustering with throughout, reporting the median of a commonly used graph clustering score, modularity  (larger values translate to better clustering solutions). The median modularity for clustering based on our embedding is . This is significantly better than that for , which yields median modularity of . In addition, the computational cost for is one-fifth that for (1.5 minutes versus 10 minutes). When we replace the exact eigenvector embedding with approximate eigendecomposition using Randomized SVD  (parameters: power iterates and excess dimensionality ), the time taken reduces from 10 minutes to seconds, but this comes at the expense of inference quality: median modularity drops to . On the other hand, the median modularity increases to when we consider exact partial SVD embedding with eigenvectors. This indicates that our compressive embedding yields better clustering quality because it is able to concisely capture more eigenvectors( in this example, compared to and with conventional partial SVD). It is worth pointing out that, even for known eigenvectors, the number of dominant eigenvectors that yields the best inference performance is often unknown a priori, and is treated as a hyper-parameter. For compressive spectral embedding , an elegant approach for implicitly optimizing over is to use the embedding function , with as a hyper-parameter.
We have shown that random projections and polynomial expansions provide a powerful approach for spectral embedding of large matrices: for an matrix , our algorithm computes an -dimensional compressive embedding that provably approximates pairwise distances between points in the desired spectral embedding. Numerical results for several real-world data sets show that our method provides good approximations for embeddings based on partial SVD, while incurring much lower complexity. Moreover, our method can also approximate spectral embeddings which depend on the entire SVD, since its complexity does not depend on the number of dominant vectors whose effect we wish to model. A glimpse of this potential is provided by the example of -means based clustering for estimating sparse-cuts of the Amazon graph, where our method yields much better performance (using graph metrics) than a partial SVD with significantly higher complexity. This motivates further investigation into applications of this approach for improving downstream inference tasks in a variety of large-scale problems.
This work is supported in part by DARPA GRAPHS (BAA-12-01) and by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal
component analysis,” in
Artificial Neural Networks — ICANN’97, ser. Lecture Notes in Computer Science, W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds. Springer Berlin Heidelberg, 1997, pp. 583–588.
-  S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in Neural Information Processing Systems, 1999.
S. White and P. Smyth, “A spectral clustering approach to finding communities in graph.” inSDM, vol. 5. SIAM, 2005.
-  F. Göbel and A. A. Jagers, “Random walks on graphs,” Stochastic Processes and their Applications, 1974.
-  R. R. Nadakuditi and M. E. J. Newman, “Graph spectra and the detectability of community structure in networks,” Physical Review Letters, 2012.
-  C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.
P. Drineas and M. W. Mahoney, “On the Nyström Method for Approximating
a Gram Matrix for Improved Kernel-Based Learning,”
Journal on Machine Learning Resources, 2005.
-  N. Halko, P. G. Martinsson, and J. A. Tropp, “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions,” SIAM Rev., 2011.
-  “Python implementation of FastEmbed.” [Online]. Available: https://bitbucket.org/dineshkr/fastembed/src/NIPS2015
-  D. Achlioptas, “Database-friendly random projections,” in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS ’01, 2001.
-  E. Candes and M. Wakin, “An introduction to compressive sampling,” Signal Processing Magazine, IEEE, March 2008.
-  L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM, 1997.
-  S. F. McCormick and T. Noe, “Simultaneous iteration for the matrix eigenvalue problem,” Linear Algebra and its Applications, vol. 16, no. 1, pp. 43–56, 1977.
-  K. Zhang, I. W. Tsang, and J. T. Kwok, “Improved Nyström Low-rank Approximation and Error Analysis,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. ACM, 2008.
-  D. Yan, L. Huang, and M. I. Jordan, “Fast Approximate Spectral Clustering,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’09. ACM, 2009.
-  M. Li, J. T. Kwok, and B.-L. Lu, “Making Large-Scale Nyström Approximation Possible.” in ICML, 2010.
-  S. Kumar, M. Mohri, and A. Talwalkar, “Ensemble Nyström method,” in Advances in Neural Information Processing Systems, 2009.
-  F. Lin and W. W. Cohen, “Power iteration clustering,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
F. Lin, “Scalable methods for graph-based unsupervised and semi-supervised learning,” Ph.D. dissertation, Carnegie Mellon University, 2012.
-  W. Yan, U. Brahmakshatriya, Y. Xue, M. Gilder, and B. Wise, “PIC: Parallel power iteration clustering for big data,” Journal of Parallel and Distributed Computing, 2013.
-  L. Lovász, “Random walks on graphs: A survey,” Combinatorics, Paul erdos is eighty, vol. 2, no. 1, pp. 1–46, 1993.
D. A. Spielman and S.-H. Teng, “Nearly-linear time algorithms for graph
partitioning, graph sparsification, and solving linear systems,” in
Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, ser. STOC ’04. New York, NY, USA: ACM, 2004.
-  D. Spielman and S. Teng, “Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems,” SIAM Journal on Matrix Analysis and Applications, vol. 35, Jan. 2014.
-  D. Spielman and N. Srivastava, “Graph sparsification by effective resistances,” SIAM Journal on Computing, 2011.
-  R. N. Silver, H. Roeder, A. F. Voter, and J. D. Kress, “Kernel polynomial approximations for densities of states and spectral functions,” Journal of Computational Physics, vol. 124, no. 1, pp. 115–130, Mar. 1996.
-  E. Di Napoli, E. Polizzi, and Y. Saad, “Efficient estimation of eigenvalue counts in an interval,” arXiv:1308.4275 [cs], Aug. 2013.
-  J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in 2012 IEEE 12th International Conference on Data Mining (ICDM), Dec. 2012.
-  S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, Feb. 2010.