Singular Value Decomposition (SVD) is used to obtain basis vectors in a variety of data-driven and reduced order modelling techniques. Techniques like proper orthogonal decomposition (POD) use samples of the solution of the system at various time points to obtain a dense data matrix. This is followed by an SVD of the matrix to obtain the basis vectors. A similar procedure is used in frequency weighted methods for model order reduction (MOR), which use samples of the solution in the frequency domain to obtain the data matrix. In methods like balanced truncation, the SVD of the product of the Cholesky factors of the controllability and observability Gramian is required. Similarly, parameterised MOR techniques based on sampling the solution also require an SVD to determine the basis. In all these cases, the matrices are large, dense and inherently low rank matrices, which are not always “tall and skinny” or “short and fat”. Besides this, dimensionality reduction is a key step in many data-driven algorithms such as feature detection, latent semantic indexing, collaborative filtering etc., which are used in the evolving data-driven design-automation algorithms.
For an , matrix, the complexity of computing the SVD is . Since this is super-linear in the size of the data, it becomes computationally expensive for large data sets. However, if we have a low rank matrix, we would need only basis vectors, where . One way of computing the rank approximation is to compute the SVD of the full matrix and retain only the largest singular values and vectors. It can be shown that this is the best rank approximation with respect to any unitarily invariant norm. The cost of computing this approximation using an SVD followed by truncation turns out to be expensive, especially if the matrices are nearer square matrices. Moreover, in this “Bigdata” era, it is entirely possible that the data set resides on physically different servers and bandwidth and memory constraints on each machine make it impossible to transfer all the data to a single machine to do the analysis. Ideally, what is required is a truly distributed algorithm, where all the computation is done in-situ with minimal data transfer and the results of the computation could also reside on several machines.
What we would like therefore, is a decomposable algorithm, which can run simultaneously on several machines. This has been attempted in [2, 1, 10]. These algorithms assume a “tall and skinny” data matrix, which is a good assumption for many MOR problems. They break up the matrix into blocks, each block containing a small subset of the rows. If the matrix has columns, they show that an approximate PCA/SVD can be computed with communication cost. A drawback of the method proposed in  is that it requires the reconstruction of the low rank approximation of each block of data, followed by accumulation of all these matrices and an SVD of a matrix. In 
, they perform a hierarchical QR decomposition, while propagating only thematrix. In this case, the cost computing and propagating the turns out to be significant. In , they suggest doing an SVD of each block, followed by stacking the truncated of each block on top of each other and then doing a global SVD. Doing the global SVD can still be quite expensive. They also do a randomised SVD of each block to reduce the cost of the block SVDs.
Methods for incremental SVDs have been proposed in [5, 6, 7, 8, 3]. These algorithms incrementally compute the SVD of a matrix when new row/columns are added to the matrix. They use a combination of QR and the SVD of a smaller matrix to get the new SVD. The algorithms proposed are essentially sequential algorithms and are directed towards online re-computation of the SVD, as and when new rows and columns are added.
In this paper, we propose a block based SVD algorithm combining the advantages of the methods proposed in [2, 1, 5, 7]. It is suitable for low rank matrices of arbitrary size. As in , the SVD of the data matrix is computed using approximate SVDs of the smaller blocks. However, unlike , we do not need to reconstruct and accumulate a large number of approximate covariance matrices. In , they do a tree-based merge of the various matrices obtained from a QR decomposition of the blocks. Instead, we merge the approximate SVDs of each block to get the SVD of each “super-block”. We adapt the incremental SVD method in  to find the SVD of the merged blocks, using the SVDs of the individual blocks. Each merge is followed by a truncation, where singular values and vectors that are less than a fraction of the largest singular of the block are discarded. We also use a tree based algorithm, where the (approximate) truncated SVD of the full matrix is computed using repeated merge-and-truncate (MAT) operations on the blocks. Independently, a tree-based algorithm using approximate SVDs has also been proposed recently in  to find the SVD of “short and fat” matrices. However, the basic proof as well details of our algorithm are quite different. Unlike [2, 1, 9], our algorithm is not limited to tall and skinny/short and fat matrices and it is possible to partition the matrix into blocks, both row-wise and column-wise. We also propose a iterative method for reducing the error in the approximation.
2 Incremental and Block-based Algorithms
2.1 Algorithms for distributed data-sets
. It is targeted towards performing a principal component analysis (PCA) of massive distributed data-sets which reside on several machines and computation is not possible on a single server. The aim is to minimise communication costs. The assumption here is that thematrix , is tall and skinny with . Each block of data ()contains a subset of the rows of the matrix. The steps involved are
Perform an approximate PCA of locally in each machine. Let . This is an matrix.
The matrices transferred to the central server and summed i.e., .
Perform a PCA of to get an approximate truncated SVD.
The algorithm proposed in , performs a QR decomposition of each block instead of an approximate SVD. The resultant “R” matrices from each block are combined in a tree-based structure to obtain the “R” matrix corresponding to the full matrix . The SVD of the final “R” gives the correct of .
2.2 Subspace tracking algorithm
FAST, proposed by , is an incremental SVD algorithm targeted to subspace tracking. Let be an matrix. If a new column is added to and is removed, the goal is to find the new SVD incrementally, rather than by re-doing the entire computation. The idea behind this and similar algorithms is to find the component of that is orthogonal to the subspace . This is done by subtracting out the projection of onto to obtain the orthogonal component as follows.
is an matrix and its singular value decomposition is inexpensive. If , the SVD of can be written as
where and and .
2.3 Online incremental Algorithm
A generalisation of this algorithm is proposed in [6, 7]. Given , the author uses a similar technique to find the SVD of . Here and can have an arbitrary number of columns and rows respectively. The entries in and reflect additions/changes to . Let and be the QR decomposition of the component of orthogonal to and the component of orthogonal to . This implies
Substituting this in , we get
As in the previous case, the new SVD can be computed using the SVD of the smaller matrix .
Often only a low rank approximation to the subspace is required. is then an matrix, where . In general, if additional columns are added, the complexity of the computation is + + corresponding to computation of the orthogonal component, QR decomposition of the orthogonal component and SVD of .
3 Proposed implementation
We first start with a simple proof for computing the SVD using either the row and column-wise split, instead of using the covariance matrix based proof in [2, 9, 10]. This proof also extends in a straightforward manner to having simultaneous row and column splits that are needed for large square matrices.
Assume the matrix is “tall and skinny” and is split into two sets of rows and . Let and . The SVD of can be written as
If , we get
is the product of two orthogonal matrices and hence is also an orthogonal matrix.
Assuming and are and matrices, with , and have dimensions and . Therefore the combined (weight) matrix is an matrix. By splitting it up and doing three SVDs, we end up actually increasing the computational complexity! However, since only a low rank approximation to the matrix is required, we do the merge after truncating the individual SVDs so that only singular values up to are retained, where is the largest singular value of the block. The resultant merged SVD is also truncated using the same criteria. Hence , and , indicating a rank , and approximation. The algorithm is therefore a merge-and-truncate (MAT) algorithm, rather than a simple merge. This results in a reduction in the computational complexity as will be seen in the next section.
When the matrix is split into several blocks, the formation of the matrix and the MAT can be done pairwise in a tree based algorithm as indicated in Figure 1. Note that the “U” (left singular-vectors) matrix is not required for the merge and it need not be propagated. This is useful since the size of this matrix increases with each merge.
This algorithm is efficient, if the aim is to obtain the singular values and right singular vectors. However, once all the merges are over, we need to compute , which are the actual POD or basis vectors. Clearly propagation and multiplication of the matrices is not an option. Another way to do it is to compute the left singular vectors using the final as . However, the vectors obtained need not be orthogonal, since is approximate due to the truncation at various levels. Instead, we propose the following. We know that is an orthogonal matrix (assuming a rank approximation of ). Let be the projection of onto the dimensional subspace of the row space. If , then where , and . This is the exact SVD of the projection of onto the dimensional subspace spanned by and an approximation to the best rank- truncated SVD of .
If the matrix is “short and fat”, it is more efficient to split it column-wise as . In this case, merging can be done as follows.
Once again, the SVDs of the two blocks are truncated before the merge and the final SVD is also truncated to get a low rank approximation. Also in this case, the right singular vectors are not required for the merge and hence are not propagated through the tree structure. As in the previous case, after obtaining the final approximate , is projected on to the space spanned by and an SVD is performed i.e. . Therefore where , and
However, if / is large in the first/second case, the SVDs involved in merging the blocks itself take a significant amount of computational effort. This can be made more efficient if the orthogonal complement is merged, which can be done using a combination of QR decomposition and an SVD of a smaller matrix using a method similar to . If and , we can find the component of orthogonal to as . If ,
If , we get
where , and . This is advantageous since we now have to perform an SVD of a much smaller matrix. Although it requires an additional QR decomposition, the total number of operations required is lower for QR than for SVD [11, 12].
Finally, if the matrix is split both row and column-wise, it is easy to see that the blocks can be merged row-wise and column-wise. Let the matrix be partitioned as . A column-wise merge of and is followed by a computation of the corresponding matrices. Finally, the two blocks of rows can be merged using of each block. This is illustrated in Figure 2. However, this is not the most ideal case. Obviously, an interleaved row and column MAT is more efficient.
The steps involved in the algorithm are indicated in Algorithm 1. It is assumed that the input matrix is partitioned into blocks as
Once this is complete, we can follow it up with an iteration to improve accuracy. Algorithm 2 details the steps, assuming column partitioning. It is essentially equivalent to a power iteration. In practice, we have observed that only those singular values close to the cut-off value require correction. In all the cases we have seen, two iterations proved to be sufficient.
Let the matrix be an “tall and skinny” matrix, partitioned row-wise into blocks, each containing rows, where . The first step is the SVD of each block, which has complexity . Since there are blocks, this step is . Instead of having a truncation based on the magnitude of the singular values, assume that each SVD is once again truncated to get a rank matrix. Therefore, at each level tree the merge cost includes finding the orthogonal complement (), QR decomposition () and SVD of a matrix (. Therefore, this step is also . Assume that the MAT operations are done in a binary tree based structure with levels. The total computational cost, counting for all levels of the binary tree, is therefore . Since , it is possible get an improvement in the run time even without any parallelization. Since the MAT operations can be run independently of each other, there is significant scope for further improvement in the run-time, when run in parallel.
We have carried out a number of experiments on matrices with varying sizes. All the matrices were dense, but inherently low-rank. One set of matrices were obtained using the MNA benchmarks , the aim being to obtain a reduced order model using a frequency weighted method. The others were from a fluid dynamics data set for which a proper orthogonal decomposition is to be performed. The code was written in Python and the SVD routine in LAPACK (available in Scipy), along with the openBLAS library, was used for algorithm. The code was run on a core i7 (4 core, 8 threads) machine running Linux.
Since the methods proposed in the paper are only useful when the matrices are inherently low rank, the examples that we looked at were in the context of reduced order modelling of interconnects and proper orthogonal decomposition of a CFD dataset.
One of the steps in frequency weighted balanced truncation techniques is a low rank SVD of a data matrix containing samples of real and imaginary parts of the solutions to the system at various frequencies [13, 14]
. The resultant matrix is a dense matrix and often cannot be classified as “tall and skinny” or “fat and short”. We obtained this matrix for the benchmark MNA5 and computed a low rank SVD of the matrix using the various techniques. Table 1 contains a comparison of the run-times. The times reported are the average of ten runs. The number of columns is varied by changing the number of frequency points at which the solution is obtained. The number of rows for all cases is 10,913 (size of the state matrices). The matrix was split column-wise with eighteen columns in each block. For these runs, singular values below were discarded. Both the MAT algorithms resulted in the same final rank. The rank keeps increasing with the number of frequency domain solutions. This is because for this benchmark, the singular values decay relatively slowly in this range. Clearly doing a QR to find the orthogonal complement and computing SVD of a smaller matrix (MAT-QR+SVD) is more efficient. As the matrix approaches a square matrix, it gives nearly an order of magnitude speed-up in comparison to doing a full SVD followed by truncation and about 25% speedup in comparison to a simple merge using SVD (MAT-SVD). Computing the SVD using QR decompositions followed by a merge of the matrices  performs the worst, poorer than even the SVD of the full matrix. However, it is to be noted that in this case, insignificant singular are discarded only at the end after obtaining all the singular values. The algorithm itself was proposed to reduce communication costs rather than computational costs.
Figure 3 shows the variation in the run-times as a function of the number of rows/columns in each block for four different matrices. Over this range, it is seen to be relatively insensitive the number of rows/columns in a block. However, the row-wise split is definitely better than the column-wise split for the matrix (containing density data from a CFD dataset). This is expected for a tall and skinny matrix.
Next, we tried a combined row and column split for the matrix. The results for the speedup and error are contained in Tables 2 and 3. The number of rows in each block are indicated in blue and the number of columns in green. The speedup is the ratio of the run time of our algorithm and the run time of doing a full SVD and then truncating. It is seen that the speedup increases as the number of rows in each block is increased to some extent. But once the row size is much larger than the column size, the speedup drops again. Independent of the block size, we always get a speedup, but there seems to be some optimal block sizes which result in about speedup. It is not clear how to obtain the optimal split.
The error is computed as , where is the matrix computed using our algorithm (MAT-QR+SVD) and is the best rank approximation obtained by truncating the SVD of the full matrix. The percentage error is also small (of the order of ), except when the block is tall and extremely skinny (). This is possibly because there are only 8/16 singular values in each block and truncation removes information from a much larger fraction of the rows. With two iterations of the iterative improvement algorithm (Algorithm 2), the percentage error drops by an order of magnitude. For two iterations, the cost of the iterative improvement is about 10% of the cost of the MAT. Note that for a significant range of block sizes, both the error and speedup have only a mild dependence on the size.
In this paper, we have presented a block-based merge-and-truncate algorithm to compute a low rank SVD of a matrix. It is suitable for use in reduced order modelling where the matrices are inherently low rank, but can be quite large and dense. Unlike previous algorithms, it allows for a simultaneous row and column split. We get significant speedup over doing a full SVD followed by truncation of small singular values. The percentage error is is less than 1%. The algorithm is very easy to parallelise and can give considerable improvement in the run-time.
Some more work is needed to get error bounds as well as optimal block sizes for various matrices. We have used the SVD routine in LAPACK to obtain the SVD of each block. However, it is possible to use any of the randomised algorithms to obtain an approximate SVD of each block.
-  Z.bai, R.Chan and F.Luk, “Principal Component Analysis for distributed data Sets with Updating”, J. Cao, W. Nejdl, and M. Xu (Eds.): Advanced Parallel processing Technologies 2005, LNCS 3756, pp.471-483, 2005.
-  Y. M. Qu, G. Ostrouchov, N. Samatova, and A. Geist, Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets, Proceedings to the Second SIAM International Conference on Data Mining, April 2002.
-  C.G.Baker, K.A.Gallivan and P.Van Doreen,”Low-rank incremental methods for computing dominant singular subspaces”, Linear Algebra and its Applications, 436, pp 2866-2888, 2012.
-  A.Menon and Charles Elkan, “Fast Algorithms for approximating singular value decomposition”, ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 2, Article 13, Publication date: February 2011.
-  D. W. Tufts, E. C. Real, and J. Cooley, “Fast approximate subspace tracking (FAST),” in Proceedings, IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, 1997.
-  A. Levy, M. Lindenbaum, Sequential Karhunen-Loeve basis extraction and its application to images, IEEE Trans. Image Process. vol. 9, no.8, pp 1371-1374, 2000.
-  M. Brand, “Fast online SVD revisions for lightweight recommender systems,” in Proceedings, SIAM International Conference on Data Mining, 2003, pp. 37-46.
-  M.Brand, “Fast low-rank modifications of the thin singular value decomposition”, Linear Algebra and its Applications 415, pp 20-30, 2006.
-  M.A.Iwen and B.W.Ong, “A distributed and incremental SVD algorithm for agglomorative data analysis on large networks”, arXiv:1601.07010 [math.NA]
-  Y. Liang, M. Balcan, V. Kanchanapally and D. Woodruff, “Improved Distributed Principal Component Analysis”, Advances in Neural Information Processing Systems 27, pp 3113-3121, 2014
-  T.F.Chan, “An Improved Algorithm for Computing the Singular Value Decomposition. ACM Trans. Math. Softw. vol. 8, no. 1, pp 84-88, Mar. 1982.
-  G.Golub and Van Loan, “Matrix Computations”, Third edition, Hindustan Book Agency.
-  J.R.Phillips and L.M.Silveira, “ Poor man’s TBR: A simple model reduction scheme IEEE Trans. Computer-Aided Design, vol. 24, no.1, pp 43-55, Jan. 2005.
-  V.Vasudevan and M.Ramakrishna, "An Efficient Algorithm for Frequency-Weighted Balanced Truncation of VLSI Interconnects in descriptor form", Proc. Des. Automation Conf., 2015
-  Available online at http://www.icm.tu-bs.de/NICONET/benchmodred.html