We consider the problem of finding a small number,
, of extreme singular values and corresponding left and right singular vectors of a large sparse matrix(). These singular values and vectors satisfy,
where the singular values are labeled in ascending order, . The matrices and have orthonormal columns. is called a singular triplet of
. The singular triplets corresponding to largest (smallest) singular values are referred as largest (smallest) singular triplets. The Singular Value Decomposition (SVD) can always be computed.
The SVD is one of the most widely used computational kernels in various scientific and engineering areas. The computation of a few of the largest singular triplets plays a critical role in machine learning for computing a low-rank matrix approximation[43, 46, 8] and the nuclear norm , in data mining for latent semantic indexing 
, in statistics for principal component analysis
, and in signal processing and pattern recognition as an important filtering tool. Calculating a few smallest singular triplets plays an important role in linear algebra applications such as computing pseudospectra, and determining the range, null space, and rank of a matrix[14, 42], in machine learning for the total least squares problems , and as a general deflation mechanism, e.g., for computing the trace of the matrix inverse .
For large scale problems, memory demands and computational complexity necessitate the use of iterative methods. Over the last two decades, iterative methods for the SVD have been developed [12, 36, 26, 9, 11, 3, 2, 23, 24, 20, 45] to effectively compute a few singular triplets under limited memory. In particular, the computation of the smallest singular triplets presents challenges both to the speed of convergence and to the accuracy of iterative methods, even for problems with moderate conditioning. Many recent research efforts attempt to address this challenge with new or modified iterative methods [3, 2, 23, 24, 20]. These methods still lack the necessary robustness and display irregular or slow convergence in cases of clustered spectra. Most importantly, they are only available in MATLAB research implementations that cannot solve large-scale problems. In our previous work , we presented a hybrid, two-stage method that achieves both efficiency and accuracy for both largest and smallest singular values under limited memory. In addition, the method can take full advantage of preconditioning to significantly accelerate the computation, which is the key towards large-scale SVD computation for certain real-world problems. Our previous results showed substantial improvements in efficiency and robustness over other methods.
With all this algorithmic activity it is surprising that there is a lack of good quality software for computing the partial SVD, especially with preconditioning. SVDPACK  and PROPACK  can efficiently compute largest singular triplets but they are either single-threaded or only support shared-memory parallel computing. In addition, they are slow and unreliable for computing the smallest singular triplets and cannot directly take advantage of preconditioning. SLEPc offers a more robust implementation of the thick restarted Lanczos bidiagonalization method (LBD)  which can be used in both distributed and shared memory environments. SLEPc also offers some limited functionality for computing the partial SVD through its eigensolvers . However, neither approach is tuned for accurate computation of smallest singular triplets, and only the eigensolver approach can use preconditioning which limits the desired accuracy or the practical efficiency of the package. There is a clear need for high quality, high performance SVD software that addresses the above shortcomings of current software and that provides a flexible interface suitable for both black-box and advanced usage.
In this work we address this need by developing a high quality SVD library, PRIMME_SVDS, based on the state-of-the-art package PRIMME (PReconditioned Iterative MultiMethod Eigensolver ). We highlight three main contributions:
We present the modifications and functionality extensions of PRIMME that are required to implement our two stage PHSVDS method  so that it achieves full accuracy and high-performance, for both smallest and largest singular values and with or without preconditioning.
We provide intuitive user interfaces in C, MATLAB, Python, and R for both ordinary and advanced users to fully exploit the power of PRIMME_SVDS. We also discuss distributed and shared memory interfaces.
We demonstrate with numerical experiments on large scale matrices that PRIMME_SVDS is more efficient and significantly more robust than the two most widely used software packages, PROPACK and SLEPc, even without a preconditioner. We also demonstrate its good parallel scalability.
Since this is a software paper, the presentation of our method is given in a more abstract way that relates to the implementation. Further theoretical and algorithmic details can be found in .
2 Related work
The partial SVD problem can be solved as a Hermitian eigenvalue problem or directly using bidiagonalization methods. We introduce these approaches and discuss the resulting state-of-the-art SVD methods. Next, we review the state-of-practice software, both eigensolver and dedicated SVD software.
2.1 Approaches and Methods for Computing SVD Problems
The first approach computes eigenpairs of the normal equations matrix (or if ). The eigenpairs of correspond to the squares of the singular values and the right singular vectors of , . If , the corresponding left singular vectors are obtained as . Although extreme Hermitian eigenvalue problems can be solved very efficiently, can be ill-conditioned. Assume is a computed singular triplet corresponding to the exact triplet with residual satisfying . Then the relative errors satisfy the relations ,
where is the closest singular value to , is a constant that depends on the dimensions of , and is the machine precision. Since all singular values (except the largest ) will lose accuracy, this approach is typically followed by a second stage of iterative refinement. The singular triplets can be refined one by one [36, 11, 7] or collectively as an initial subspace for an iterative eigensolver . This is especially needed when seeking smallest singular values.
The second approach seeks eigenpairs of the augmented matrix , where . If is a basis for the orthogonal complement subspace of , where , and define the orthonormal matrix
This approach can compute all singular values accurately, i.e., with residual norm close to or relative accuracy . The main disadvantage of this approach is that when seeking the smallest singular values, the eigenvalue problem is a maximally interior one and thus Krylov iterative methods converge slowly and unreliably. Let be the Krylov subspace of dimension of matrix starting with initial vector . It is easy to see that , where is the vector 111We use MATLAB notation for column concatenation, .. Therefore, an unrestarted Krylov method on with the proper initial vector and extraction method converges twice slower than that on . In practice, restarted methods on can converge much faster than on , partly because the Rayleigh-Ritz projection for interior eigenvalues is less efficient due to presence of spurious Ritz values . Harmonic projection [33, 34] and refined projection  have been proposed to overcome this difficulty. However, near-optimal restarted methods on still converge significantly faster .
The Lanczos bidiagonalization (LBD) method [14, 12] addresses the SVD problem directly, with many variants proposed in the last ten years [28, 23, 26, 2, 3, 4, 24]. It is accepted as an accurate and more efficient method for seeking extreme singular triplets since it works on directly which avoids the numerical problems of squaring. Starting with initial unit vectors and , after steps the LBD method produces
where is the residual vector at the step , is the -th orthocanonical vector, and and are orthonormal bases of the Krylov subspaces and respectively. Although the latter Krylov space is the same as the one from Lanczos on , when LBD is restarted its singular triplet approximations often exhibit slow, irregular convergence when the smallest singular values are clustered. To address this problem, harmonic projection [26, 2], refined projection , and their combinations  have been applied to LBD. However, LBD can neither use the locally optimal restarting technique (which is often denoted as +k) , nor make full use of preconditioning. Both techniques become crucial due to the difficulty of the problem even for medium matrix sizes.
The JDSVD method  extends the Jacobi-Davidson method for singular value problems by exploiting the two search spaces of the LBD and the special structure of the augmented matrix . Let and be the bases of the left and right search spaces. Similarly to LBD, JDSVD computes a singular triplet of the projected matrix (note that is not bidiagonal) and yields as an approximate singular triplet. JDSVD expands with the orthogonalized corrections and for and , which are obtained by solving the correction equation
where . The JDSVD method can take advantage of preconditioning when solving (8) and, since it does not rely on being Krylov, can use more effective restarting techniques. Because of the two sided projection, it also overcomes the numerical problems of working with . However, an optimized eigensolver on will still be faster until the numerical issues arise.
The SVDIFP method  extends the EIGIFP inner-outer method . Given an approximate right singular vector and its singular value at the -th step of the outer method, it builds a Krylov space , where is a preconditioner for . To avoid the numerical problems of projecting on , SVDIFP computes the smallest singular values of , by using a two sided projection similarly to LBD. Because the method focuses only on the right singular vector, the left singular vectors can be quite inaccurate.
2.2 Current SVD Software
Given the availability of several SVD algorithms, it is surprising that there is a lack of corresponding good quality software, especially with preconditioning. Table 1 summarizes the state-of-the-art dedicated SVD software packages. For each package, we show the methods it implements, its programming language, its parallel computing and preconditioning capabilities, whether it can obtain a fully accurate approximation efficiently, and the main interfaces to these libraries. The top four are high performance libraries, while the rest are MATLAB research codes.
SVDPACK  and PROPACK  implement variants of Lanczos or LBD methods. In addition, PROPACK implements an implicitly restarted LBD method. Both methods work well for computing a few largest, well separated singular values, which in most cases is an easy problem. The computation of smallest singular triplets is not supported in SVDPACK and it is not efficient in PROPACK which only implements the Rayleigh-Ritz projection . In addition, neither library can use preconditioning or support message passing parallelism, although PROPACK does support shared memory multithreading. These are severe limitations for large-scale problems that need to run on supercomputers and that often converge too slowly without preconditioning.
SLEPc offers an LBD method with thick restarting , which has similar algorithmic limitations to PROPACK for computing smallest singular values. In addition, this particular SVD solver cannot be directly used with preconditioning. However, the SLEPc LBD has an efficient parallel implementation.
Despite accuracy limitations, eigenvalue iterative methods based on are widely used for computing the largest eigenpairs where the loss of accuracy is limited (see (2)) and even for low accuracy computations of the smallest singular values. For example, two popular packages in machine learning, scikit-learn222http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html and Spark’s library MLib333https://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html, use a wrapper for the popular package ARPACK (implicit restarting Arnoldi method) . Other solvers for standard Hermitian eigenvalue problems can also be used. Table 2 lists the most widely used eigensolver libraries with high-performance computing implementations.
Our hybrid PHSVDS method can leverage these eigensolver libraries to solve the partial SVD problem in full accuracy. However, to optimize for efficiency and robustness several modifications and code additions are required that were not all available in previous eigensolvers. For example, Anasazi features a robust, high performance computing implementation but does not provide the near-optimal eigenmethods that are critical for fast convergence. SLEPc relies on PETSc for basic linear algebra kernels with support for various high-performance standards for shared and distributed memory machines and GPU. It also provides the appropriate preconditioned eigensolvers . However, these are not tuned to deal with the high accuracy requirements of the first stage of PHSVDS or the need for refined projection methods for the highly interior problem in the second stage. PRIMME is designed to take advantage of all special properties of the Hermitian eigenvalue problem and therefore is a natural candidate that only required minor extensions, as described later in this paper. The goal is to produce a high quality, general purpose SVD package that can solve large-scale problems with high accuracy, using parallelism and preconditioning, and with as close to a black-box interface as possible.
We mention that in some applications (for example in data mining) the required accuracy is so low or the rank of the matrix is so small that the use of the power method  or the block power method (see Randomized PCA in ) is sufficient. These methods cannot be considered as general purpose SVD software and thus they are beyond the scope of this paper.
|HPC library||Method||Lang||Parallel||Precon.||Fast Full Acc.||Main Bindings/Ports|
|PRIMME||PHSVDS||C||M S||Y||Y||Fort, Mat, Py, R|
|SLEPc||TRLBD/KS||C||M S G||N||N||Fort, Mat, Py|
|SLEPc||JD/GD+k||C||M S G||Y||N||Fort, Mat, Py|
|–||IRLBA444The IRBLA method also has a Python port available at https://github.com/bwlewis/irlbpy and a port to R at https://github.com/bwlewis/irlba. Neither port currently supports the computation of smallest singular values. The R port can be used on parallel environments.||Mat||S||N||Y||Py, R|
|Anasazi||KS/GD/LOBPCG||C++||M S G||Y||Py|
|(P)ARPACK||Arnoldi||Fort||M S||N||Mat, Py, R, Jul|
|PRIMME||JD(QMR)/GD+k/LOBPCG||C||M S||Y||Fort, Mat, Py, R|
|SLEPc||KS/JD/GD+k||C||M S G||Y||Fort, Mat, Py|
3 PRIMME_SVDS: A High-Performance Preconditioned SVD Software in PRIMME
Our goal is to provide a high quality, state-of-the-art SVD package that enables practitioners to solve a variety of large, sparse singular value problems with unprecedented efficiency, robustness, and accuracy. In this section we firstly describe our preconditioned two-stage meta-method proposed in  for effectively and accurately computing both largest and smallest singular values. Then we illustrate in detail the parallel implementation of PRIMME_SVDS as well as various characteristics of our high-performance software.
3.1 Method for Efficient and Accurate Computation
The PHSVDS approach  relies on eigensolvers that work on the equivalent eigenvalue formulations and , switching from one to the other to obtain the best performance. The PHSVDS method starts on because without preconditioning the convergence in terms of iterations is much faster than that on , as commented in Section 2.1. Furthermore, the cost per iteration (computation of residual vectors, orthogonalization, etc.) on is up to two times cheaper than on (because of dimension versus ). We refer to the computations on as the first stage of the method. If further accuracy is required, the method switches to a second stage where the eigensolver is reconfigured to work on , but with initial guesses from the first stage. We do not consider eigensolvers such as Lanczos that accept only one initial guess. Such methods would have to work for each needed singular triplet independently which is usually not as efficient as using all the good quality initial guesses from the first stage in one search space.
The singular value solver is configured to stop when the residual norm of the required singular triplets is less than the requested tolerance , or
Let be the eigenpair approximation from the eigensolver on . Considering that will be set as , as and as , the above is translated into a convergence criterion for the eigensolver on as
The eigensolver returns when all requested triplets satisfy the convergence criterion. However, the eigensolver may reach its maximum achievable accuracy before the residual norm reduces below the above convergence tolerance. Setting this limit properly is a critical point in the robustness and efficiency of the method. If the tolerance is set below this limit the eigensolver may stagnate. If the limit is overestimated, then the number of iterations of the second stage will increase, making the whole solver more expensive. Selecting this limit depends on the numerical properties of the eigensolver, so we discuss it in Section 3.2.
In the second stage, the vectors are set as initial guesses of the eigensolver, as follows from (5). The convergence criterion directly checks (9) with set as and and set as the normalized subvectors and . Because this computation requires extra reduction operations, it is only checked after an eigenvalue residual condition is satisfied,
The above is derived by assuming that
, where Fortran notation is used for vector subranges. If the eigenvectorcorresponds to a singular triplet, then this assumption is satisfied near convergence. However, it is possible that a large discrepancy between the two norms exists, , even when (10) is satisfied. This is the case when has large components in the -dimensional null space of , and therefore it does not correspond to a singular triplet. Checking our second level criterion (9) avoids this problem. We have observed this situation when computing the smallest singular values in problems with condition number larger than .
If the smallest singular values are wanted, the eigensolver on must compute interior eigenvalues. Then, it is important that we specify where the eigensolver should look for them. First, we want only the non-negative eigenvalues of . Second, if and we ask for small eigenvalues that are very close to zero, the eigensolver will keep trying to converge on the unwanted null space. Therefore, we should only try to find eigenvalues on the right of some positive number. Third, because this number should be a lower bound to the singular value we seek, we can use the perturbation bounds of the approximations of the first stage. Specifically, we know that , where . Because the augmented approach cannot distinguish eigenvalues smaller than , we configure the eigensolver to find the smallest eigenvalue that is greater than
. This heuristic is also used in.
For interior eigenproblems, alternatives to the Rayleigh-Ritz extraction are recommended, such as the harmonic or the refined variants. As described in the next section, we employ a variant of the refined extraction. For each eigenvalue, the shift for the refined extraction is the corresponding singular value lower bound. In PRIMME, these lower bounds are also used as shifts in the Jacobi-Davidson correction equation. If the shifts are close enough to the exact eigenvalues, they accelerate the convergence of Jacobi-Davidson.
Algorithm 1 shows the specific functionality needed for the two stages of PHSVDS.
3.2 Descriptions of changes in PRIMME
To support PRIMME_SVDS, we have implemented many enhancements to PRIMME, including a user defined convergence criterion, improved numerical quality of converged eigenvectors, improved robustness to achieve convergence near machine precision, a simplified refined projection method, a different locking scheme for interior eigenvalues, a new scheme for initializing the search space, and finally a new two-stage meta-method interface.
To achieve the required accuracy at the first stage, we have to adjust the convergence tolerance at every step based on the value of the current eigenvalue. This was not possible in the original PRIMME implementation in which was set as a user input parameter. In the new version, we have added a function pointer into PRIMME main data structure to allow the user to provide their own convergence test function. Our top level SVD interface provides line 2 of Algorithm 1 as the default convergence test function.
When PRIMME uses soft locking (i.e., no locking), before exiting it now performs an additional Rayleigh-Ritz on the converged Ritz vectors to adjust the angles of the desired Ritz vectors. The resulting Ritz vectors have improved quality that has proved helpful in the second stage . This is mentioned in Line 4 of the algorithm.
As an iterative method based on matrix-vector multiplications by , the maximum accuracy that the eigensolver can obtain should be close to . We observed that in slow converging cases PRIMME eigensolvers may stagnate when the residual norm is still 10 or 100 times above that limit. A brief analysis reveals that the major propagation of error occurs when restarting the search space and the auxiliary matrix as and respectively 555As in most Davidson-type methods, PRIMME stores to allow the computation of the residual without an additional matrix-vector operation. Note also that in this section refers to the search space, not the exact singular vectors.. Despite , the operation occurs a number of times equal to the number of restarts, and thus the expected accumulated error increases by a factor of . This factor is more problematic for where the accumulated error becomes , thus preventing the residual to converge to full accuracy. This was also confirmed experimentally. Our solution was to reset both matrices, by fully reorthogonalizing and computing directly, when , where restarts is the number of restarts since last resetting. This change has returned the stagnation level to less than facilitating a very accurate solution at the first stage.
To address the interior eigenproblem of the second stage, we have implemented a refined extraction procedure in PRIMME. The refined procedure computes an eigenvector approximation in the span of that minimizes the norm of the residual . In general, should be as close as possible to the eigenvalue so most implementations set it as the Ritz or harmonic Ritz value from the current search space at every iteration [22, 21, 33]. The minimization requires the QR factorization of the tall skinny matrix , for a step cost of flops and global reductions per iteration, where is the number of columns of . This, however, would be too expensive. In our case, the from the first stage are very good eigenvalue approximations (if in (2)) so there is little gain to updating the shift at every iteration. This leads to a simplified and much more efficient implementation of the refined procedure. With constant , the cost of updating the QR factorization at every iteration is and requires only a constant number of synchronizations, the same as the cost of orthogonalization. Also, and can be restarted without communications; if is restarted as , then we compute the QR factorization and restart as and as . The factorization of involves a matrix of small dimension and can be replicated in every process.
In the second stage we force the PRIMME eigensolver to use locking. The earlier version of PRIMME locked converged eigenvectors only at restart, allowing them to improve for a few more steps in the basis. The new version of PRIMME changes this for interior problems only. When an eigenvector converges we force a restart and lock it out. This improves robustness with the Rayleigh-Ritz method since converged interior eigenvalues may become unconverged causing the method to misconverge. The refined extraction avoids this problem but it still benefits from the new change. When an eigenvector is locked, the QR factorization for the refined extraction is recomputed with the new target shift.
In PRIMME the search space is initialized as a block Krylov subspace starting from any available initial guesses or random vectors. We have extended the library’s setup options to allow for finer user control on the initialization step. Among other options, the user can now deactivate the Krylov subspace. We have found this to be helpful because of the good quality initial guesses in the second stage.
3.3 High performance characteristics of PRIMME_SVDS
Our library extension inherits the design philosophy of PRIMME with respect to performance. This is summarized below and its effects on performance in Table 3.
The user must provide as function pointers the matrix-vector product and, optionally, the preconditioner application.
PRIMME’s implementation works for both parallel and sequential runs. It follows the SPMD parallelization model, so if the user has distributed the matrix by rows onto processes, each process in PRIMME will hold the corresponding local rows of the vectors. Small objects are replicated across processes. If the code is used in parallel, in addition to parallel implementations of the matrix-vector and preconditioning operators, the user must also provide a subroutine for the global sum reduction.
PRIMME relies on third-party BLAS and LAPACK libraries to achieve single and multi-threaded performance of operations with dense matrices and vectors within each SPMD process.
The required workspace is allocated internally or may be provided by the user as a block of memory.
|Operations||Kernels or Libs||Cost per Iteration||Scalability|
|Dense algebra: MV, MM,||BLAS (e.g., MKL, ESSL,||O((m+n)*g)||Good|
|Inner Prods, Scale||OpenBlas, ACML)|
|Sparse algebra: SpMV,||User defined (eg, PETSc,||O(1) calls||Application|
|SpMM, Preconditioner||Trilinos, HYPRE, librsb)||dependent|
|Global reduction||User defined (e.g.,||O(1) calls of size O(g)||Machine|
Some libraries, such as SLEPc and Anasazi, use an object oriented abstraction for matrices and vectors, thus externalizing the control over the actual memory and the operations on it. SLEPc is based on the structures of PETSc and Anasazi defines its own structures with templates. The goal is to facilitate optimizations on different memory hierarchies and heterogeneous architectures such as accelerators. However, this design may increase overhead and induce unnecessary memory copies to conform with the given abstraction.
PRIMME handles the vectors directly as a block of memory, which may allow for fewer memory copies when the matrix-vector product and preconditioning operators receive and return the corresponding local part of the vectors, but also in other places in the code. Moreover, this memory layout allows us to group our most computationally intensive operations (such as the computation of the residuals and restarting of the basis) in special kernels that minimize the memory accesses and thus display better locality for cache performance. These new kernels are a new addition to the PRIMME library. Then, all numerical operations are performed through calls to optimized BLAS and LAPACK which are compatible with this memory model. PRIMME tries to use the highest level BLAS when this is beneficial, e.g., the use of level 3 BLAS when the block size is greater than one. The only disadvantage of this approach is that calls to accelerator (GPU) enhanced BLAS/LAPACK libraries have to transfer their arguments during every call. Future releases will allow the code to work directly on accelerator memory communicating to CPU only the small sequential tasks.
Following the SPMD (single program, multiple data) model for parallel programming, only the largest data structures in PRIMME are distributed; specifically the eigenvectors to be returned, the vectors in the search space , the auxiliary vectors , and, when refined extraction is used, the array
that holds the orthogonal matrix of the QR factorization of. The cost of small matrix operations (such as solving the small projected eigenvalue problem) is negligible and the operation is duplicated across all processes. Long vector updates are performed locally with no communication. Inner products involve a global sum reduction which is the only communication primitive required and is provided by the user. To reduce latency, PRIMME blocks as many reductions together as possible without compromising numerical stability.
Most, but not all, applications use PRIMME with the MPI framework that nowadays can be used effectively even on shared memory machines. This avoids the need to store the matrix or the preconditioner on each core. Similarly, pure shared memory parallelism or optimized sequential execution is obtained by simply linking to the appropriate libraries.
The user-provided matrix-vector and preconditioning operators must be able to perform operations with both the matrix and its transpose and implement the desired parallel distribution in each case. Note that the parallel behavior of the two stages might be very different for rectangular matrices where . The PRIMME_SVDS interface also allows the user to pass functions for performing matrix vector and preconditioning operations with and directly. This is useful as there are many optimizations that a user can perform to optimize these operations (especially on , see ). This feature is not available in other software.
3.4 Interfaces of PRIMME_SVDS
PRIMME_SVDS is part of PRIMME’s distribution with a native C interface. We provide several driver programs with calling examples as part of the documentation, ranging from a simple sequential version with a basic matrix-vector multiplication, to a fully parallel and preconditioned version with such functionality provided by the PETSc library. In addition, we offer interfaces to Matlab, Python, and R that can be used easily by both ordinary and advanced users to integrate with domain specific codes or simply experiment in an interactive environment. These interfaces expose the full functionality of PRIMME_SVDS which, depending on the supporting libraries, can include parallelism and preconditioning.
Next, we describe the most important functionality and features of PRIMME_SVDS using the C interface. Other interfaces are wrappers that call the C interface. The problem parameters and the specific method configuration are set in a C structure called primme_svds_params. The most important parameters are the following.
m and n: the number of rows and columns of the problem matrix.
matrixMatvec(x, ldx, y, ldy, blockSize, transpose, primme_svds):function pointer to the matrix-vector product with . The result of the product is stored in . If transpose is zero, the function should compute , otherwise . and are matrices with blockSize columns and leading dimensions ldx and ldy respectively. primme_svds is included to provide access to all primme_svds_params parameters.
matrix: (optional) pointer to user data for matrixMatvec.
numSVals: the number of desired singular triplets to find.
target: select which singular values to find: the smallest, the largest or the closest to some value (not discussed in this paper).
eps: the desired accuracy for wanted singular triplets, see (9).
To run in parallel SPMD mode, the matrix-vector operator must be parallel, every process should have the same values for eps, numProcs, and numSVals, and the following parameters must be set.
numProcs: the number of MPI processes (must be greater than 1).
procID: the rank of the local process (e.g., the MPI rank).
mLocal and nLocal: the number of rows and columns local to this process. In the parallel matrixMatvec, and address the corresponding local parts, with having nLocal rows and having mLocal rows.
globalSumDouble: function pointer to the global sum reduction.
These parallel environment parameters are also required in other frameworks such as PETSc (see Sec 3.1 in PETSc user manual ) and Tpetra (see class CsrMatrix ). In shared memory environments, the user may choose to run a single process and link to a threaded BLAS library such as OpenBLAS .
The following optional parameters may be used to accelerate convergence.
applyPreconditioner(x, ldx, y, ldy, blockSize, mode, primme_svds): function pointer to the preconditioning; the function applies the preconditioner to and stores it into . mode indicates which operator is the preconditioner for: (primme_svds_op_AtA), (primme_svds_op_AAt) or (primme_svds_op_augmented).
preconditioner: pointer to user data for applyPreconditioner.
maxBasisSize: the maximum number of columns in the search space basis.
minRestartSize: the minimum number of columns in the search space basis for restarting.
maxBlockSize: the maximum number of approximate eigenpairs to be corrected at every iteration. Larger block size may be helpful for multiple or highly clustered singular values and usually improves cache and communication performance.
primme: PRIMME parameter structure for first stage.
primmeStage2: PRIMME parameter structure for second stage.
The default maximum basis dimension (maxBasisSize) for the eigensolvers is 15 and the dimension after restarting (minRestartSize) is 6 if finding less than 10 largest singular values. Otherwise restarting parameters are set to 35 and 14 respectively. The default block size (maxBlockSize) is 1.
All primme_svds_params parameters can be modified by the user. Furthermore the user can tune individual eigensolver parameters in primme and primmeStage2 for each stage respectively. Currently, the default method for the first stage is the DYNAMIC method of PRIMME which switches dynamically between GD+k and JDQMR attempting to minimize time. The second stage defaults to the JDQMR method. Users can change the default eigensolver methods by calling,
method_stage1 and method_stage2 can be any PRIMME preset method. With this function the user can also change the PHSVDS to a different SVD method, for example to perform only a single stage with the normal equations (primme_svds_normalequations) or the augmented approach (primme_svds_augmented). Future versions may include other methods such as LBD and JDSVD.
Other advanced features include the following.
initSize: the number of singular vectors provided as initial guesses.
targetShifts: contains the values closest to which we should find singular values. Only accessed if target has been set to find interior singular values.
numTargetShifts: the number of values in targetShifts.
numOrthoConst: number of singular vectors provided as external orthogonalization constraint vectors (see explanation below).
printLevel: specifies level for printing out information (0–5).
outputFile: the output file descriptor.
stats: the performance report of this run.
After specifying the required and optional fields in the structure, we can call the main function:
The argument svals and rnorms are arrays at least of size numSvals to store the computed singular values and the residual norms, computed as in (9). Both arrays are filled by all processes. The argument svecs is a dense matrix at least of dimension . If numOrthoConst is greater than zero, the code will find left (right) singular vectors that are orthogonal to a set of numOrthoConst left (right) constraint vectors. These constraint vectors are provided along with any initial guesses in svecs. On input, svecs holds first the numOrthoConst left constraint vectors followed by the initSize initial left singular vector guesses, each vector of size mLocal. Following these, it holds the same number of right constraint vectors and right initial guesses, each vector of size nLocal. On output, svecs holds the left constraint vectors, the converged left singular vectors, and their right counterparts in this order. initSize is updated with the number of converged singular triplets.
Figure 1 illustrates the most basic form of this interface with an example in C that computes the four smallest singular values of a rectangular matrix. Note that most parameters have appropriate defaults that do not have to be set (e.g., numOrthoConst is zero, maxBlockSize is 1, etc.).
Finally, we briefly discuss the MATLAB interface which is a MEX wrapper of PRIMME_SVDS. It is similar to MATLAB’s svds, allowing it to be called by non-expert users but also by experts that can adjust over 30 parameters.
Like svds, users only need to provide the matrix while PRIMME_SVDS sets a list of expert defaults underneath. Users can tackle more advanced tasks incrementally by specifying more parameters. For example, they can pass their own matrix-vector and preconditioning operations, or they can simply take advantage of MATLAB’s built-in matrix times block-of-vectors operators and preconditioners. Interfaces to the other scripting languages are developed similarly.
4 Numerical Experiments
We report numerical results in order to demonstrate the diverse functionalities of PRIMME_SVDS and to assess its performance relative to other software over a wide range of problems. We seek a few smallest and a few largest singular triplets of matrices of large size ( –
), with and without preconditioning, and on both distributed and shared memory parallel computers. We are not aware of other published results for computing smallest singular triplets of such large size and challenging spectrum. The matrices come from various applications, including least-squares problems, DNA electrophoresis, linear programming, and graph clustering, and many of the matrices are rectangular. The condition number of the non-singular matrices is bounded by. Therefore, for seeking smallest singular values, a tolerance of can be achieved by the first stage and a tolerance of requires also the second stage. Basic information on the matrices is listed in Table 4. The first stage of PHSVDS in PRIMME can use GD+k or JDQMR, but the second is set to JDQMR. Therefore we include comparisons with both variants. In practice, the first stage in PRIMME can use the DYNAMIC method which, with minimal overhead, identifies the best of the two variants.
Our computations are carried out on the NERSC’s Cray Edison supercomputer and on the SciClone cluster at the College of William and Mary. On Edison, each compute node has two Intel “Ivy Bridge” processors at 2.4 GHz for a total of 12 cores and 64 GB of memory, interconnected with high-speed Cray Aries in Dragonfly topology. On SciClone, we use 36 Dell PowerEdge R415 servers with AMD Opteron processors at 3.1 GHz for a total of 12 cores and 32 GB of memory running the Red Hat Enterprise Linux operating system, interconnected by an FDR InfiniBand communication network. The machine precision is in all environments.
4.1 Comparison with SLEPc LBD on a distributed memory system
We compare the two PHSVDS variants against thick-restart LBD in SLEPc 3.6.0 using 48 MPI processes on 4 nodes (i.e., 48 cores) of SciClone. We compute the 5 largest and the 5 smallest singular values with tolerances and in (9). For the largest singular values, all solvers use a maximum basis size of 15. This is the default for PRIMME_SVDS, but for SLEPc LBD the default 10 obtained worse performance in general so it was not used. For the smallest singular values and when tolerance is needed, PRIMME_SVDS uses the default basis size of 35 during the second stage. For the smallest singular values, LBD was run with a basis size of 35 for all tolerances because the basis size of 15 was far slower. All solvers start with a random initial vector. The results are summarized in Table 5 and also displayed as time ratios between the methods in Figure 2.
|the 5 largest singular values with tolerance|
|the 5 largest singular values with tolerance|
|the 5 smallest singular values with tolerance|
|the 5 smallest singular values with tolerance|
Computing the five largest singular values is a relatively easy problem, as expected from Table 4, and all packages perform similarly well (within less than a factor of two of each other). It is harder problems that tend to magnify the difference between methods. For example, PRIMME_SVDS has a clear advantage when looking for the largest singular values of cont1_l (10 and 6 times faster than LBD for large and low tolerance, respectively). These differences are far more apparent when looking for the poorly separated smallest singular values. There, the PRIMME_SVDS variants significantly outperform the LBD as shown both in the table and as time ratios of more than 10 in Figure 2.
We also note the increased robustness of PRIMME_SVDS over LBD. For cont1_l, both PRIMME_SVDS variants identify five singular values smaller than the requested tolerance of , and PHSVDS(JDQMR) is the only solver that converges to the smallest singular value to tolerance within the allowed time. In contrast, LBD fails to converge to a single triplet for both tolerances. The atmosmodl matrix has extremely small gap ratios in the lower part of its singular spectrum and a double (multiple) singular value. When using tolerance, PRIMME_SVDS finds only one of the two multiple singular values, but LBD misses in addition two other singular values. For tolerance , LBD still misses the required multiplicity while PRIMME_SVDS variants have no problem identifying it.
Preconditioning is not often considered for SVD problems because of the difficulty of preconditioning the matrices or . Besides cases where users can exploit a special structure of their problem, sometimes generic preconditioners can be used effectively. We demonstrate this on the above difficult matrices using some black box preconditioners without taking into account the structure of the matrices. The matrices cage15 and atmosmodl are preconditioned with , where is a preconditioner of A built with HYPRE BoomerAMG . For the rest, the preconditioner is based on block Jacobi on with block size limited to 600. The preconditioners are not specifically tuned (we use HYPRE BoomerAMG with the default setup in PETSc and we have not exhaustively tested the impact of the block size in the Jacobi preconditioner). Nevertheless, we obtain substantial speedups over the best time without preconditioning (in one case by a factor of 50), as shown in Table 5 and better depicted in Figure 2. PRIMME_SVDS provides full flexibility so that users can provide preconditioning for and or directly for and .
Figure 3 corroborates that the performance advantages of PRIMME_SVDS extend to the case when we seek a large number of singular triplets. In the left graph, PHSVDS(JDQMR) is consistently faster than SLEPc LBD seeking up to 120 singular triplets from the largest part of the spectrum: similar times for cage15 and atmostmodl, 6 times faster for cont1_l and 3 times faster for the rest. To study the asymptotic scalability with the number of singular triplets, the figures include two black dotted lines, one for the linear slope and one for the quadratic. As more singular triplets converge, the time for both solvers is dominated by the cost of orthogonalization against the locked singular vectors, approaching the quadratic slope. The use of a preconditioner makes each iteration more expensive but reduces the number of outer iterations, thus delaying the dominance of orthogonalization. This is observed in the right graph of Figure 3.
4.2 Comparison with PROPACK on a shared memory system
We compare the performance of PRIMME_SVDS with PROPACK on a shared memory system. Both solvers are provided with the multithreaded sparse matrix-vector operation in librsb  and linked with a threaded version of BLAS, the AMD Core Math Library (ACML). We use a single node of SciClone with a total of 12 threads.
shows the results in terms of matrix-vector products and time. For PRIMME_SVDS the results are qualitatively similar to the previous numerical experiments, demonstrating a clear advantage in robustness and performance. PROPACK, even with fine-tuned settings, has trouble converging to the largest singular values of the most difficult case and to the smallest singular values of almost all cases. And when it converges it is significantly slower than PRIMME_SVDS. It is also slower than the mathematically equivalent LBD method in SLEPc, probably because of the use of partial instead of full reorthogonalization.
|the 5 largest singular values with tolerance|
|the 5 smallest singular values with tolerance|
4.3 Strong and Weak Scalability
We investigate the parallel performance of various methods in PRIMME_SVDS and in SLEPc on a distributed memory system. All methods use the parallel sparse matrix-vector operation in PETSc.
In the first comparison, we report speedups in time of various methods over SLEPc’s LBD method as we vary the number of processes from one to 100. The other methods are the GD+k method in the first stage of PRIMME_SVDS, and GD+k, Jacobi-Davidson (JD), and Krylov-Schur (KS) as implemented in SLEPc, all operating on . The runs were made on SciClone and the results are shown in Figure 4. While PHSVDS clearly outperforms the rest of the solvers in terms of time, what is more relevant is that the ratio for PHSVDS keeps almost constant with the number of processes. This is an indicator that PRIMME_SVDS has similar parallel scalability with LBD, and better scalability than other SLEPc solvers (including the similar GD+k). Notice that the matrix of the left plot, delaunay_n24, is extremely sparse with only 3 elements per row, so this scalability study reflects better the parallel efficiency of the solver and much less of the PETSc matvec.
To further study the parallel scalability of the code, we again use the delaunay_n24 sparse matrix and also the two somewhat denser matrices, cage15 and relat9. The relat9 is rectangular with a small dimension of about half a million which is used as a stress test for strong scalability. We also test the weak parallel scalability of the code using a series of 3D Laplacian matrices, making one of its dimensions proportional to the number of processes; each process maintains 8,000 rows when the number of the MPI processes increases from 64 to 1000. The plots in Figure 5 show the scalability performance of PRIMME_SVDS on Edison when seeking 10 extreme singular triplets with and without preconditioning.
In Figure 4(a), PRIMME_SVDS can achieve near-ideal speedup until 256 processes on relat9, despite the small size. With 512 processes, the speedup starts to level off as each process has only about 1,000 rows of a very sparse matrix. In Figure 4(b), we use the HYPRE BoomerAMG multigrid preconditioner so the parallel efficiency is dominated by this library function. Still, the speedup is good up to 512 processes implying that good preconditioners should be used when available. Figure 4(c) illustrates the same good scalability performance when seeking largest singular triplets without preconditioning. In Figure 4(d) the code demonstrates good performance under weak scaling of the 3D Laplacian on a cubic grid, where each dimension is and the number of processes takes the values of all perfect cubes from 64 to 1000.
4.4 Approximating the condition number
A particular strength of our software is in seeking one extreme singular value. This has many important applications such as in computing pseudospectra, the matrix norm, or the more difficult problem of approximating the condition number. A challenge arises, however, when these quantities are needed only in very low accuracy. Any Davidson or Krylov type method may miss the extreme eigenpair (i.e., misconverge to an interior one) if low accuracy is needed and the spectrum is clustered close to the target eigenvalue. A common approach is to use a large enough block size, but there is a performance penalty especially if the rest of the eigenvalues in the block are not needed. For the matrices we test, a block size of one and stopping the eigensolver when the residual norm is ten times smaller than the approximate eigenvalue suffices to obtain the smallest singular value with a relative error of 10%.
To the best of our knowledge, the most recent high-performance package for approximating the condition number is based on solving a linear least-squares problem with a known solution using LSQR . Its implementation in libskylark666http://xdata-skylark.github.io/libskylark/ first approximates the largest singular value with 300 iterations of power method on . For the smallest, the authors proposed several stopping criteria but for all cases we test their method stopped when the relative forward error of the solution vector is less than . The authors refer to this as small-error stopping criterion, which they claim may fail with a small probability. It is unclear what the corresponding probability of failure of our criterion is.
Table 7 lists the aggregate matrix-vector products and time for computing one largest and one smallest singular value on a single SciClone node (12 cores) by PRIMME_SVDS using the librsb matrix-vector product and libskylark using the Elemental shared memory sparse matrix-vector product. In general, LSQR requires fewer matrix-vector products than PRIMME_SVDS without preconditioning, and achieves a smaller relative error of the condition number by at least a factor of ten. This is because of the underlying unrestarted LBD method. However, PRIMME_SVDS is much faster in time for a variety of reasons. First, the matrix-vector product of librsb was on average two times faster than the Elemental one. Second, and more importantly, for tall skinny rectangular matrices PRIMME_SVDS works on which is of much smaller dimension than the full size problem in LSQR. Finally, libskylark uses 300 power iterations to find the largest singular value which is often unnecessary. In addition, our software is far more flexible as it can compute the condition number with any required accuracy, not only at the tested 10%, and it can use preconditioning. Table 7 shows the enormous reduction in time resulting from using the preconditioners of the previous sections.
function in libskylark seeking the largest and the smallest singular triplet to estimate the condition number. PRIMME_SVDS stops at relative tolerance. With preconditioning it uses GD+k for the first stage, and without it uses JDQMRETol.
5 Conclusion and Future Work
PRIMME_SVDS is a high-performance package for the computation of a small number of singular triplets of large-scale matrices. Currently it includes an implementation of the state-of-the-art PHSVDS method, which expertly combines the solution of two equivalent eigenvalue formulations to obtain both performance and accuracy. Previously, the strategy has been compared favorably with other state-of-the-art methods, and in this paper we showed that the PRIMME_SVDS implementation improves over the best current SVD packages of SLEPc and PROPACK.
We have discussed the critical aspects of the implementation of the solver that affect robustness and performance, such as the extraction method and the heuristic to switch between the normal equations and the augmented approach. Furthermore we have described the interface, highlighting the considerations for parallel execution, and illustrated the usage of the package with a few examples. Finally, we include numerical experiments with problems with large condition numbers and packed spectrum that complement previously published experiments to show the clear advantages of PRIMME_SVDS.
The software package is released as part of PRIMME version 2.1 which is freely available at https://github.com/primme/primme. Future extensions include a JDSVD method which can also be tuned for use in the second stage of PHSVDS.
The authors are indebted to the two referees for their meticulous reading. They would also like to thank the Scientific Data Management Group for generously help on running experiments on Edison. This work is supported by NSF under grants No. CCF 1218349 and ACI SI2-SSE 1440700, by DOE under a grant No. DE-FC02-12ER41890 and partially supported by the Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work was performed in part using computational facilities at the College of William and Mary which were provided with the assistance of the National Science Foundation, the Virginia Port Authority, Sun Microsystems, and Virginia’s Commonwealth Technology Research Fund.
-  H. Avron, A. Druinsky, and S. Toledo, Spectral Condition-Number Estimation of Large Sparse Matrices, ArXiv e-prints, (2013).
-  James Baglama and Lothar Reichel, Augmented implicitly restarted Lanczos bidiagonalization methods, SIAM J. Sci. Comput., 27 (2005), pp. 19–42.
-  , Restarted block Lanczos bidiagonalization methods, Numerical Algorithms, 43 (2006), pp. 251–272.
-  , An implicitly restarted block Lanczos bidiagonalization method using Leja shifts, BIT Numerical Mathematics, 53 (2013), pp. 285–310.
-  Christopher G Baker and Michael A Heroux, Tpetra, and the use of generic programming in scientific computing, Scientific Programming, 20 (2012), pp. 115–128.
-  S Balay, S Abhyankar, M Adams, J Brown, P Brune, K Buschelman, V Eijkhout, W Gropp, D Kaushik, M Knepley, et al., PETSc users manual revision 3.5, tech. report, Technical report, Argonne National Laboratory (ANL), 2014.
-  Michael W. Berry, Large-scale sparse singular value computations, Int. J. Supercomputer. Appl., 6 (1992), pp. 13–49.
Jie Chen, Lingfei Wu, Kartik Audhkhasi, Brian Kingsbury, and Bhuvana
Efficient one-vs-one kernel ridge regression for speech recognition, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2454–2458.
-  Jane Cullum, Ralph A. Willoughby, and Mark Lake, A Lanczos algorithm for computing singular values and vectors of large matrices, SIAM J. Sci. Stat. Comput., 4 (1983), pp. 197–215.
-  Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman, Indexing by latent semantic analysis, Journal of the American society for information science, 41 (1990), p. 391.
-  Jack J. Dongarra, Improving the accuracy of computed singular values, SIAM J. Sci. Stat. Comput., 4 (1983), pp. 712–719.
-  Gene Golub and William Kahan, Calculating the singular values and pseudo-inverse of a matrix, Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis, 2 (1965), pp. 205–224.
-  Gene H Golub and Charles F Van Loan, An analysis of the total least squares problem, SIAM Journal on Numerical Analysis, 17 (1980), pp. 883–893.
-  Gene H. Golub and Charles F. Van Loan, Matrix Computations (3rd Ed.), Johns Hopkins University Press, Baltimore, MD, USA, 1996.
-  Gene H. Golub and Qiang Ye, An inverse free preconditioned Krylov subspace method for symmetric generalized eigenvalue problems, SIAM J. Sci. Comput., 24 (2002), pp. 312–334.
-  N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, ArXiv e-prints, (2009).
-  Van Emden Henson and Ulrike Meier Yang, BoomerAMG: A parallel algebraic multigrid solver and preconditioner, Applied Numerical Mathematics, 41 (2002), pp. 155 – 177.
-  Vicente Hernandez and Jose E. Roman, A robust and efficient parallel SVD solver based on restarted Lanczos bidiagonalization, Electronic Transactions on Numerical Analysis, 31 (2008), pp. 68–85.
-  Vicente Hernandez, Jose E. Roman, and Vicente Vidal, SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems, ACM Trans. Math. Softw., 31 (2005), pp. 351–362.
-  Michiel E. Hochstenbach, A Jacobi–Davidson type SVD method, SIAM J. Sci. Comput., 23 (2001), pp. 606–628.
-  , Harmonic and refined extraction methods for the singular value problem, with applications in least squares problems, BIT Numerical Mathematics, 44 (2004), pp. 721–754.
-  Zhongxiao Jia, Refined iterative algorithms based on Arnoldi’s process for large unsymmetric eigenproblems, Linear algebra and its applications, 259 (1997), pp. 1–23.
-  Zhongxiao Jia and Datian Niu, An implicitly restarted refined bidiagonalization Lanczos method for computing a partial singular value decomposition, SIAM J. Matrix Anal. Appl., 25 (2003), pp. 246–265.
-  , A refined harmonic Lanczos bidiagonalization method and an implicitly restarted algorithm for computing the smallest singular triplets of large matrices, SIAM J. Sci. Comput., 32 (2010), pp. 714–744.
-  Ian Jolliffe, Principal component analysis, Wiley Online Library, 2002.
-  Effrosyni Kokiopoulou, C. Bekas, and E. Gallopoulos, Computing smallest singular triplets with implicitly restarted Lanczos bidiagonalization, Applied numerical mathematics, 49 (2004), pp. 39–61.
-  Rasmus Munk Larsen, Lanczos bidiagonalization with partial reorthogonalization, DAIMI Report Series, 27 (1998).
-  , Combining implicit restarts and partial reorthogonalization in Lanczos bidiagonalization, SCCM, Stanford University, (2001).
-  Richard B. Lehoucq, Danny C. Sorensen, and Chao Yang, ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods, (1998).
-  Qiao Liang and Qiang Ye, Computing singular values of large matrices with an inverse-free preconditioned Krylov subspace method, Electronic Transactions on Numerical Analysis, 42 (2014), pp. 197–221.
-  Frank Lin and William W Cohen, Power iteration clustering, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, 2010, pp. 655–662.
-  Michele Martone, Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the recursive sparse blocks format, Parallel Computing, 40 (2014), pp. 251–270.
-  Ronald B. Morgan, Computing interior eigenvalues of large matrices, Linear Algebra and its Applications, 154 (1991), pp. 289–309.
-  Ronald B. Morgan and Min Zeng, Harmonic projection methods for large non-symmetric eigenvalue problems, Numerical Linear Algebra with Applications, 5 (1998), pp. 33–55.
-  Beresford N. Parlett, The Symmetric Eigenvalue Problem, Prentice-Hall, 1980.
-  B. Philippe and M. Sadkane, Computation of the fundamental singular subspace of a large matrix, Linear algebra and its applications, 257 (1997), pp. 77–104.
-  Benjamin Recht, Maryam Fazel, and Pablo A Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM review, 52 (2010), pp. 471–501.
-  A. Stathopoulos, A case for a biorthogonal Jacobi-Davidson method: restarting and correction equation, SIAM J. Matrix Anal. Appl., 24 (2002), pp. 238–259.
-  Andreas Stathopoulos, Nearly optimal preconditioned methods for Hermitian eigenproblems under limited memory. part i: Seeking one eigenvalue, SIAM J. Sci. Comput., 29 (2007), pp. 481–514.
-  Andreas Stathopoulos and James R. McCombs, PRIMME: Preconditioned iterative multimethod eigensolver: Methods and software description, ACM Trans. Math. Softw., 37 (2010), pp. 21:1–21:30.
-  Gilbert W. Stewart, Matrix Algorithms Volume 2: Eigensystems, Society for Industrial and Applied Mathematics, 2001.
-  Lloyd N. Trefethen, Computation of Pseudospectra, vol. 8, Cambridge University Press, 1999.
-  Christopher Williams and Matthias Seeger, Using the Nyström method to speed up kernel machines, in Proceedings of the 14th Annual Conference on Neural Information Processing Systems, no. EPFL-CONF-161322, 2001, pp. 682–688.
Lingfei Wu, Jesse Laeuchli, Vassilis Kalantzis, Andreas Stathopoulos, and
Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse, Journal of Computational Physics, 326 (2016), pp. 828–844.
-  Lingfei Wu and Andreas Stathopoulos, A preconditioned hybrid svd method for accurately computing singular triplets of large matrices, SIAM Journal on Scientific Computing, 37 (2015), pp. S365–S388.
-  Lingfei Wu, Ian E.H. Yen, Jie Chen, and Rui Yan, Revisiting random binning feature: Fast convergence and strong parallelizability, in 22st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016, pp. 1265–1274.
-  Zhang Xianyi, Wang Qian, and Zhang Yunquan, Model-driven level 3 BLAS performance optimization on loongson 3a processor, in Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, IEEE, 2012, pp. 684–691.