A New High Performance and Scalable SVD algorithm on Distributed Memory Systems

by   Shengguo Li, et al.

This paper introduces a high performance implementation of Zolo-SVD algorithm on distributed memory systems, which is based on the polar decomposition (PD) algorithm via the Zolotarev's function (Zolo-PD), originally proposed by Nakatsukasa and Freund [SIAM Review, 2016]. Our implementation highly relies on the routines of ScaLAPACK and therefore it is portable. Compared with the other PD algorithms such as the QR-based dynamically weighted Halley method (QDWH-PD), Zolo-PD is naturally parallelizable and has better scalability though performs more floating-point operations. When using many processes, Zolo-PD is usually 1.20 times faster than QDWH-PD algorithm, and Zolo-SVD can be about two times faster than the ScaLAPACK routine PDGESVD. These numerical experiments are performed on Tianhe-2 supercomputer, one of the fastest supercomputers in the world, and the tested matrices include some sparse matrices from particular applications and some randomly generated dense matrices with different dimensions. Our QDWH-SVD and Zolo-SVD implementations are freely available at https://github.com/shengguolsg/Zolo-SVD.



There are no comments yet.


page 1

page 2

page 3

page 4


Computing low-rank approximations of large-scale matrices with the Tensor Network randomized SVD

We propose a new algorithm for the computation of a singular value decom...

High-Performance Partial Spectrum Computation for Symmetric eigenvalue problems and the SVD

Current dense symmetric eigenvalue (EIG) and singular value decompositio...

High-performance sampling of generic Determinantal Point Processes

Determinantal Point Processes (DPPs) were introduced by Macchi as a mode...

GPU-Accelerated Forward-Backward algorithm with Application to Lattice-Free MMI

We propose to express the forward-backward algorithm in terms of operati...

Semiparametric Tensor Factor Analysis by Iteratively Projected SVD

This paper introduces a general framework of Semiparametric TEnsor FActo...

A Kogbetliantz-type algorithm for the hyperbolic SVD

In this paper a two-sided, parallel Kogbetliantz-type algorithm for the ...

Programming at Exascale: Challenges and Innovations

Supercomputers become faster as hardware and software technologies conti...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Computing the SVD of a matrix is an important problem in scientific computing and many applications. For example, SVD has been used for information retrieval LSI

, principal component analysis (PCA) in Statistics 

hotelling-pca , and signal processing moore-pca . How to compute it in an efficient and scalable way has gathered much attention.

State-of-the-art SVD solvers are based on the bidiagonal reduction (BRD) strategy, consisting of the following three stages. First, a general matrix is reduced to an upper bidiagonal form by a sequence of (two-sided) orthogonal transformations Demmel-book , and this step is called bidiagonal reduction. Second, the bidiagonal SVD problem is solved by any standard method such as DC Gu-singular , QR DK or MRRR MRRR-bidiagonal

. Finally, the singular vectors are computed by accumulating the orthogonal transformations from the bidiagonal reduction, which is called

back transformation. The reduction step is the most time-consuming phase, and takes 75%–99% of the total time on homogeneous multicore architecture Ltaif-TOMS . Some recent works have been focused on accelerating the bidiagonal reduction phase, see references New-SVD ; Ltaif-TOMS ; Faverge-BD . To be efficient, much effort is required to develop an efficient and scalable algorithm. Every detail has to be considered including synchronization of multi-threads, vectorization, cache issues, and so on.

There are four main types of bidiagonalization methods. One is the classical one-stage approach, used in LAPACK and ScaLAPACK, which directly reduces a matrix to its bidiagonal form through sequences of orthogonal transformations (Householder or Givens) from two sides. Another is the two-stage approach, proposed in GL99 , which first reduces a matrix to its banded form and then a banded matrix is bidiagonalized. Another approach is obtained by means of the Lanczos algorithm, see Golub-Kahan and Simon-Zha-bidiagonal . The fourth approach is the so called one-sided bidiagonalization, first proposed by Ralha in Ralha-LAA , and stabilized by Barlow, Bosner, and Drmač Barlow-OneSided and a block parallel version is proposed in Barlow-pOnesided .

In this work we exploit a different approach instead of accelerating the bidiagonal reduction phase. We try out some new algorithms firstly proposed in the numerical linear algebra area. Higham and Papadimitriou HP-polarSVD2 introduce a new framework for computing the SVD which is based on the polar decomposition and combines with the eigendecomposition algorithms. Note that any rectangular matrix has a polar decomposition (PD)


where is a (tall) matrix with orthogonal columns and is Hermitian positive semidefinite Higham-book2 . The SVD of can be obtained by further computing the eigendecomposition of , .i.e, .

One advantage of this framework is that it can be accelerated by many efficient PD algorithms and some well-developed scalable eigensolvers (such as ELPA elpa-library ) without implementing the complicated bidiagonalization codes. Therefore, its implementation is relatively simpler. While, its main drawback is that it requires much more floating point operations than the bidiagonal reduction approach, see NH-qdwheig ; Nakatsukasa-simrev for details. This framework is well-known in the numerical linear algebra area, but there are little or no results on its performance on supercomputers compared with existing parallel SVD algorithms, for example, the algorithms in ScaLAPACK Scalapack ; scalapack-book , the most famous parallel numerical linear package.

The SVD problem is reduced to an eigenvalue problem via the polar decomposition, and therefore the recent well-developed scalable eigenvalue packages are usable. The remaining problem is how to compute the polar decomposition in a scalable and efficient way. There exist many distinguish algorithms such as the scaled Newton (SN) method 

Higham-book2 , the QR-based dynamically weighted Halley (QDWH) method NBG-Halley , and the more recently proposed algorithm based on Zolotarev’s function Nakatsukasa-simrev .

The QDWH-SVD algorithm which is used to compute the SVD in NH-qdwheig , has been implemented on multicore architecture enhanced with multiple GPUs qdwh-multicore . The results there show that QDWH-SVD can outperform the standard methods. Sukkari, Ltaief, and Keyes qdwh-dist further represent a comprehensive performance of QDWH-SVD on a large-scale distributed-memory platform, based on the numerical library ScaLAPACK Scalapack . With QDWH-PD as a preprocessing step and using the eigensolver ELPA elpa-library for computing the eigendecomposition, the distributed parallel QDWH-SVD algorithm qdwh-dist achieves up to five-fold and two-fold than the ScaLAPACK routine PDGESVD on ill and well-conditioned matrices, respectively. Note that the convergence rate of QDWH relates to the condition number of matrices. For well-conditioned matrices it requires few iterations NH-qdwheig ; NBG-Halley to compute the PD. While, the condition number is usually irrelevant to the eigenvalue problems which concern more about whether the eigenvalues are relatively well separated.

In Nakatsukasa-simrev , Nakatsukasa and Freund proposed a variant of QDWH-PD algorithm with higher order of convergence for the polar decomposition by using Zolotarev’s function, and call it Zolo-PD. It is shown that the convergence order of Zolo-PD can be 17, and it usually requires one or two iterations (while QDWH requires less than 6 iterations). As in Nakatsukasa-simrev , we name the SVD algorithm based on Zolo-PD as Zolo-SVD. Up to now, we have not seen any results of Zolo-SVD on high performance computers.

In this paper we mainly exploit this Zolo-SVD algorithm, and introduce a high performance Zolo-PD implementation on distributed-memory platform based on the state-of-art numerical library ScaLAPACK. We discuss some ways to further improve it. We further compare the performance of QDWH-PD and Zolo-PD. It turns out that Zolo-PD requires more floating point operations, while it has better scalability than QDWH, since Zolo-PD decomposes MPI processes into some relatively independent groups thus is more loosely coupled. Zolo-PD can be much faster than QDWH-PD when using many (MPI) processes. Combining with ELPA, we show that Zolo-SVD can be much faster than ScaLAPACK routine PDGESVD and QDWH-SVD. We use some sparse matrices from University of Florida sparse matrix collection TimDavis-Matrix , and some randomly constructed matrices to test the Zolo-SVD algorithm. Our QDWH-SVD and Zolo-SVD so?ftware library are freely available at https://github.com/shengguolsg/Zolo-SVD.

2 Qdwh-Pd and Zolo-PD

The polar decomposition is an important problem in numerical linear algebra area, and the well-known PD algorithms include the scaled Newton (SN) method Higham-book2 , QDWH-PD NBG-Halley , and Zolo-PD Nakatsukasa-simrev , etc.

2.1 Qdwh-Pd

QDWH is a QR-based dynamically weighted Halley iterative method for computing the polar decomposition NBG-Halley . QDWH computes the polar factor as the limit of the sequence defined by



is an upper bound of the maximum singular value of

, . In QDWH the parameters and are chosen dynamically to speed up the convergence. If are fixed, it gives the Halley iteration, which is cubically convergent Higham-book2 .

The iteration (2) requires explicit matrix inversion and thus it may have potential numerical stability issue. It is shown in NBG-Halley that (2) is mathematically equivalent to a QR-based implementation, which is inverse-free. The practical QDWH iteration is : ,


where The main cost lies in computing the QR factorization of an matrix and a matrix multiplication, both of which can be done in a communication-optimal manner BDHS11 ; Hoemmen-sisc .

Iteration (2) is also mathematically equivalent to the following form,


where chol denotes the Cholesky factorization of . The starting point is that when the absolute value of is small, matrix

is probably well-conditioned and computing the Cholesky factorization is cheaper than QR factorization. As suggested in 

NH-qdwheig , we switch from (3) to (4) as long as . In general, the QR iteration (3) is usually required only once or twice. Our implementation of the QDWH algorithm is based on ScaLAPACK, and the main procedure is quite similar to that one in qdwh-dist .

2.2 Zolo-PD

The QDWH parameters , and in (2) are computed as the solution of the rational max-min optimization problem in NBG-Halley ,


subject to the constraint on . The parameters in the QDWH iteration can also be obtained by finding the best type- rational approximation to the sign function in the infinity norm, see Nakatsukasa-simrev .

Nakatsukasa and Freund Nakatsukasa-simrev extend to higher order rational polynomials for general . The obtained optimal rational function is called the type Zolotarev function corresponding to , and the solution is given by


Here, the constant is uniquely determined by the condition

and the coefficients are given by


where and are the Jacobi elliptic functions (see, e.g., (Akhiezer-book1, , Ch. 5)). Here and . It is more convenient to use the scaled Zolotarev function


where .

Combining the iterative process used in QDWH and the scaled Zolotarev functions, we arrive at an algorithm that computes the polar factorization by composing Zolotarev functions, called Zolo-PD Nakatsukasa-simrev , shown in Algorithm 2.2. One question remains: how to choose the order of rational functions ? The convergence order is , and is related to the condition number of matrix . For ill-conditioned matrices, should be large for fast convergence. A table is given in Nakatsukasa-simrev to guide the choice of . For completeness, the results are shown in Table 1, which shows the smallest (i.e., the number of iterations) for which the following condition is satisfied

for varying values of and . It is suggested in Nakatsukasa-simrev to choose such that it requires at most two iterations. This strategy may require much more computational resources in the distributed parallel environment. We will further discuss it below in section 4.

1.001 1.01 1.1 1.2 1.5 2 10
2 2 2 3 3 3 4 4 4 5 5 6
1 2 2 2 2 2 3 3 3 3 4 4
1 1 2 2 2 2 2 2 3 3 3 3
1 1 1 2 2 2 2 2 2 3 3 3
1 1 1 1 2 2 2 2 2 2 3 3
1 1 1 1 1 2 2 2 2 2 2 3
1 1 1 1 1 1 2 2 2 2 2 3
1 1 1 1 1 1 2 2 2 2 2 2
Table 1: Required number of iterations for varying and

The following two theorems show that equation (8) can be written in partial fraction form, which endow Zolo-PD with good parallelism. Each QR factorization in (12) can be done simultaneously in parallel, and therefore the computation time is reduced significantly. For example, when , we can divide all processes into groups, and each group computes one QR factorization in (12), instead of using all processes to compute QR factorizations sequentially. Compared with QDWH-PD, another advantage of Zolo-PD is that it requires much fewer iterations, about two-thirds fewer, see Table 1 and NBG-Halley , since QDWH requires six iterations for ill-conditioned matrices.

Theorem 2.1 (Nakatsukasa-simrev ).

The function as in (8) can be expressed as



Theorem 2.2 (Nakatsukasa-simrev ).

For the function as in (8), and a matrix with SVD , the matrix is equal to


Moreover, can be computed in an inverse-free manner as


[Zolo-PD for the polar decomposition] Let be an upper bound of of and a lower bound of of .

  1. Compute and ;

  2. Compute and let ;

  3. Choose based on from Table 1. If then and skip to (d).

  4. Compute and :

    • Compute and as defined in (7) and (10);

    • Compute as in (12);

    • Update and recompute and as in step (a);

    • Compute by and

      Verify that holds. If not, return to Step 1 with .

  5. and .

2.3 Zolo-SVD

A framework for computing the SVD via the polar decomposition and the eigendecomposition has been proposed in HP-polarSVD ; NH-polar . It assumes that the polar decomposition of is and the symmetric eigendecomposition of is , and the SVD of is obtained from .

Many algorithms have been proposed for this approach, and their differences lie in how to compute the polar decomposition and the symmetric eigendecomposition. In HP-polarSVD

, it suggests using a method based on Padé iteration for the polar decomposition and any standard method for the symmetric eigendecomposition. In 

NH-polar , it computes the polar decomposition by the QDWH algorithm and the symmetric eigendecomposition by QDWH-EIG which is a spectral divide-and-conquer algorithm based on the polar decomposition, see NH-polar for details. In qdwh-multicore , it is shown that QDWH-EIG is not as efficient as the symmetric eigendecomposition routines in MAGMA on GPUs. The results in qdwh-dist also show that QDWH-EIG would be slower than ELPA for ill- and well-conditioned matrices on distributed memory systems. Therefore, we also use ELPA to compute the eigendecomposition of in our implementation. The SVD algorithm in Nakatsukasa-simrev is based on Zolo-PD which is implemented in Matlab and there are no results about its performance on supercomputers.

The framework for computing SVD used in this paper is summarized in Algorithm 2.3. We use Zolo-PD to compute the polar decomposition and use ELPA Elpa to compute the symmetric eigendecomposition on distributed memory parallel computers.

One of the main differences between ELPA and the DC algorithm in ScaLAPACK is that ELPA uses two-stage approach for tridiagonalization. Compared with ScaLAPACK, ELPA has better scalability and can be up to two times faster on Tianhe-2 supercomputer, see elpa-library for results on other supercomputers. Unfortunately, there are few packages that have efficiently implemented the two-stage bidiagonal reduction (BRD) algorithm. We guess that two-stage BRD extended to distributed environment systems can achieve similar speedups as the tridiagonal reduction (TRD), which will be our future work. For multicore architectures, PLASMA has provided a high-performance BRD, which achieves up to 30-fold speedup LLD-bidiagonal on a 16-core Intel Xeon machine against the LAPACK implementation in Intel MKL version 10.2.

In this work, we are concerned with computing the SVD of (nearly) square matrices. For highly rectangular matrices, many efficient algorithms are based on randomized techniques Martinsson-Rev10 . Another approach is to compute the SVD incrementally, such as the SVD-updating algorithm in Zha-updating . An hierarchically incremental SVD approach is proposed in HI-SVD , which is similar in spirit to the CAQR factorization Hoemmen-sisc . It divides the matrix into many chunks along the column dimension, . The SVD of each can be computed independently by using different process groups and then merged hierarchically together. Similar to Zolo-PD, it is naturally parallelizable.

[Zolo-SVD] Input: A general matrix with .
Output: the SVD .

  1. Compute the polar decomposition via Zolo-PD.

  2. Compute the symmetric eigendecomposition via ELPA.

  3. Compute .

3 Implementation details

The serial Zolo-PD algorithm requires more flops than QDWH, see Nakatsukasa-simrev , but it is naturally parallelizable. We can use more MPI processes and implement Zolo-PD in parallel. We use process groups to compute QR factorizations or Cholesky factorizations simultaneouly. Therefore, the operations of each iteration cost by each process group is about the same as QDWH. By comparing with the number of iterations required, the parallel version of Zolo-PD is supposed to be about three times faster than QDWH, which is also confirmed by the results in section 4.

In this section, we introduce our implementation details based on the routines in ScaLAPACK, and introduce how to exploit the sparse structure of the matrix in (12) when computing its QR factorization. A structured QR algorithm is proposed in the following subsection by modifying the routines in ScaLAPACK. It can be up to times faster than the QR algorithm in ScaLAPACK.

3.1 Fast structured QR algorithms

In Algorithm 2.2, we need to compute the QR factorization of a tall matrix , where is a general matrix, and

is an identity matrix, sparse. In this subsection, we investigate a fast QR algorithms for matrix

, which has been mentioned in (NH-qdwheig, , Appendix A.1). The main idea is to take advantage of the sparsity of the bottom part of matrix , i.e., the identity matrix . While, the classical QR will ignore all the zeros in matrix and treat it as a dense matrix. By exploiting this special sparse structure, each Householder reflector in the structured QR algorithm has at most nonzero elements instead of in the classical one. Therefore, it could save a lot of floating point operations by exploiting this sparse structure. We introduce how to exploit the sparse structure of block column by block column below. Different from the method suggested in NH-qdwheig , our implementation can be seen as a block version of it, and every NB columns are transformed together. Therefore, each Householder reflector in our implementation has at most NB nonzero elements.

In this work, we implement it by modifying the routines in ScaLAPACK, PDGEQRF and PDORGQR. The routine PDGEQRF computes a QR factorization of a general matrix by computing the QR factorization of the first NB columns and then the next NB columns, and so on, where NB is the block size. Its process is similar to that shown in Figure 1 for the proposed structured QR algorithm which is denoted by MPDGEQRF, for simplicity. The difference between MPDGEQRF and PDGEQRF is that the row dimensions of the panels in MPDGEQRF are at most +NB, where NB. Note that we use “row dimension” to denote the number of rows of a matrix. While, the row dimensions of panels in PDGEQRF could be much larger than MPDGEQRF. For example, the row dimension of the first panel of PDGEQRF is . The ScaLAPACK routine PDORGQR

can be similarly modified to generate the orthogonal matrix

computed by MPDGEQRF, and the structured version is denoted by MPDORGQR.

Figure 1: The process of structed QR factorization

To compare this structured QR factorization with ScaLAPACK routines, we use two random matrices with different dimensions which are and , respectively. The experiments are done on Tianhe-2 super computer, located in Guangzhou, China. Each compute node has two Intel Xeon E5 2692-v2 CPUs and 24 cores in total. For the smaller matrix, we use and processes, respectively. For the larger one, we use and processes to test these four routines MPDGEQRF, MPDORGQR, PDGEQRF and PDORGQR. The execution times are shown in Table 2, from which we can see that the speedups over ScaLAPACK are from to .

No.Proc Mat.
256 0.91 0.69 1.32 0.39 0.27 1.43
512 0.89 0.66 1.34 0.30 0.20 1.51
1024 0.76 0.64 1.23 0.21 0.16 1.26
2048 0.87 0.75 1.18 0.15 0.13 1.21
256 3.36 2.46 1.36 1.98 1.37 1.45
512 2.91 2.14 1.36 1.45 0.97 1.49
1024 2.08 1.60 1.30 0.88 0.61 1.46
2048 2.00 1.57 1.28 0.69 0.47 1.47
Table 2: Comparison of the structured QR algorithms with ScaLAPACK

3.2 Implementation based on ScaLAPACK

Our implementation highly depends on the ScaLAPACK and BLACS routines, and therefore it is portable. We use BLACS routines to split the communicators and perform the communications among all these processes. The algorithm proposed here works for both sparse and dense matrices. Note that for sparse matrices our algorithm does not exploit their sparse properties when computing its SVD, as done in ScaLAPACK, since Algorithm 3.2 can be seen as a direct method for computing SVD.

[Distributed Parallel Zolo-SVD] Input: A sparse or dense matrix . Assume is distributed among all processes.
Output: The SVD of .

  1. Compute , and , involving all the processes;

  2. Compute and let , involving all the processes;

  3. Choose based on or fix or .
    Divide all the processes into subgroups, and redistribute the matrix to each subgroup.

  4. While not convergent,

    • Compute and as defined in (7) and (10);

    • Compute ;

    • The processes in subgroup compute the QR or Cholesky factorization



    • The subgroups together compute

    • Verify that holds. If not, and update ;

  5. End while

  6. One subgroup redistributes to all the processes;

  7. Compute the eigendecomposition of , and the singular vectors of .

We assume the whole matrix is distributed among all the processes, and use the BLACS routine PDGEMR2D to redistribute it to the process in each subcommunicator. We find the time cost in the data redistribution is very small. For a sparse matrix , we can also let each process have one copy of when its storage is not large, stored in its sparse form. This can avoid one time of communication. For simplicity, we further assume an upper bound of and a lower bound of are known, denoted by and

, respectively. They can be estimated in particular applications or computed very efficiently by using some sparse direct solvers such as SuiteSparse 

TimDavis-book , and the computation times are included in section 4, see Table 3. The third and fourth columns of Table 3 show the time of estimating and in second, which are be very small.

There are three MPI process grids in our implementation. Assume there are processes in total and the processes are organized into a 2D grid, and . The BLACS context associated with all the processes is named as ALL_CONTXT. For simpility, we assume is dividable by , . We use the BLACS routine BLACS_GRIDMAP to divide all the processes into groups. This is the second process grid, and its BLACS context is named as TOP_CONTXT. Each group contains the processes in the same row of the process grid TOP_CONTXT. The communications in the Zolo-PD algorithm are among the processes in the same column of the process grid TOP_CONTXT. The processes in the same row of TOP_CONTXT are further organized into a 2D grid which are used to compute each individual QR or Cholesky factorizations in (12). This is the third process grid and its BLACS context is named as SEP_CONTXT. The matrix is redistributed from the processes in ALL_CONTXT to the processes in SEP_CONTXT by using the BLACS routine PDGEMR2D. The parallel Zolo-SVD algorithm is summarized in Algorithm 3.2, which can be easily changed to an executable ScaLAPACK routine.

The main computation of Zolo-PD lies in the QR factorizations and the Cholesky factorizations. Each iteration step requires a summation of terms, and the computation of each term is independent, which is computed by a individual subgroup of processes. For example, it requires to compute in the QR stage

Each process subgroup computes the QR factorization of matrix ,

In the Cholesky factorization stage, each process subgroup computes the Cholesky factor of matrix . The work loads are well balanced. To compute the summation, communications are required. In our implementation, the summation is done by using the BLACS routine DGSUM2D. To compute the summation, we only need to add the corresponding data in the same process column of process grid TOP_CONTXT. This is because the data distribution of processes in every group is the same.

We test the performance of Zolo-PD by using two ways to choose the parameter . Firstly, , the number of subcommunicators, is obtained from Table 1 based on the condition number of . This strategy works well for well-conditioned matrices. However, for ill-conditioned matrices it requires more computational resources, for example, . Another strategy is to choose a small but use more iterations. In our implementation, we choose or . From the results in Table 1, the number of iterations would increase by one or two. This means the convergence rate of Zolo-PD is reduced, but its convergence rate is still of order or .

4 Numerical Results

Our experiments are performed on Tianhe-2 supercomputer located in Guangzhou, China, which is one of the fastest supercomputers in the world, having a peak performance of 54.9 petaflops in theory and 33.86 petaflops in Linpack benchmark. It has a total of 16,000 compute nodes. Each compute node is equipped with two Intel E5-2692 CPUs, 64GB of DDR3 main memory, and The interconnect network topology is an opto-electronic hybrid, hierarchical fat tree. For compilation we used Intel fortran compiler (ifort) and the optimization flag -O3 -mAVX, and linked the codes to Intel MKL (composer_xe_2015.1.133). As suggested in qdwh-dist , we only investigate only pure MPI implementation, and set NB = 64 for the two-dimensional block cyclic data distribution (BCDD). Each compute node uses 24 MPI processes in principle.

Example 1 We first use three sparse matrices with medium size from real applications, which are obtained from the University of Florida sparse matrix collection TimDavis-Matrix . The names of these matrices are illustrated in Table 3. The QDWH-PD and Zolo-PD algorithms require to estimate a lower bound of the smallest singular value and an upper bound of the largest singular value for these matrices. We find that the computations for and are quite fast, nearly negligible. See the third and fourth columns of Table 3, and the experiments are performed on a Laptop with Intel i7 CPU and 16GB memory using Matlab 2010b.

Matrix N Cond
nemeth03 9,506 1.03e-02 7.19e-02 1.29e+00 2
fv1 9,604 3.35e-02 2.56e-01 1.40e+01 3
linverse 11,999 3.01e-03 0.25e-01 9.06e+03 4
Table 3: Summary of basic matrix characteristics and times of computing the lower and upper bounds

The number of iterations of Zolo-PD depends on . In this example, we choose from the values in Table 1. When is larger, Zolo-PD has higher convergence rate. Table 5 shows the number of iterations cost by Zolo-PD. From it we can see that when Zolo-PD requires two fewer iterations than QDWH-PD, which explains the speedups of Zolo-PD over QDWH-PD in some sense when implemented in parallel.

Matrix Method No. of Processes
nemeth03 PDGESVD 61.45 15.97 15.14 14.85 15.57
Zolo-SVD 24.73 15.11 10.91 11.83 10.01
Speedup 2.48 1.06 1.39 1.26 1.56
fv1 PDGESVD 95.60 20.18 19.06 18.11 19.03
Zolo-SVD 30.65 17.58 12.82 12.65 11.54
Speedup 3.12 1.15 1.49 1.43 1.65
linverse PDGESVD 144.37 29.87 26.43 25.01 32.38
Zolo-SVD 50.29 28.64 19.62 18.64 15.83
Speedup 2.87 1.04 1.35 1.34 2.05
Table 4: Time comparisons of PDGESVD with Zolo-SVD

The comparison results of Zolo-SVD and PDGESVD are shown in Table 4. It shows that Zolo-SVD can be 3.12x faster than PDGESVD for matrix fv1 when using processes. Note that the number of processes used by PDGESVD and Zolo-SVD are shown in the first rows of Table 4. The comparisons of QDWH-PD and Zolo-PD are shown in Table 6. It turns out that Zolo-PD is about 1.20x times faster than QDWH-PD when using many processes.

Matrix QDWH
nemeth03 4 3 3 3
fv1 5 4 3 3
linverse 5 4 3 3
Table 5: The number of iterations required by QDWH-PD and Zolo-PD

Example 2 The main drawback of Zolo-PD is that it requires too much floating-point operations for ill-conditioned matrices when requiring at most TWO iterations. One approach to fix this problem is to choose a small . In this example we let be and let QDWH-PD and Zolo-PD use the same number of processes, and the results are shown in Table 6. The first rows of Table 6 show the number of processes used. From it, we can get that Zolo-PD is usually faster than QDWH-PD when using the same number of processes. Because Zolo-PD decomposes all MPI processes into relatively independent groups, it is more loosely coupled and therefore more scalable than QDWH-PD. Zolo-PD becomes faster than QDWH-PD when using many processes.

Matrix Method No. of Processes
nemeth03 QDWH-PD 16.51 12.02 8.14 7.36 5.55
Zolo-PD 16.56 8.55 6.42 4.79 4.44
Speedup 1.00 1.41 1.27 1.70 1.25
fv1 QDWH-PD 16.16 12.75 8.62 7.23 5.97
Zolo-PD 22.26 10.87 7.40 6.02 5.82
Speedup 0.73 1.17 1.16 1.20 1.03
linverse QDWH-PD 37.63 20.25 12.97 10.15 9.35
Zolo-PD 37.40 19.07 12.43 8.75 7.92
Speedup 1.01 1.06 1.04 1.16 1.18
Table 6: Times of QDWH-PD and Zolo-PD () when using the same number of processes.

Compared with QDWH-PD, Zolo-PD further requires communications among different subcommunicators. We profile the Zolo-PD algorithm and Table 7 shows the times cost by each stage of Zolo-PD, where the rows Combin. show the communication time cost by DGSUM2D, which computes the summation of in Algorithm 2.3. The rows FormX2 show the maximum time of computing of all subgroups. The results were obtained for the matrix fv1 when let , and Zolo-PD took three iterations, the first one used QR factorization and the other two used Cholesky factorization. From it we can see that most time lies in computing the QR and Cholesky factorizations and the communication times between different communicators are negligible.

No. of Proc.
QR 3.42 3.03 2.07
Combin. 6.64e-02 7.21e-03 6.26e-02
FormX2 2.82e-02 1.98e-02 1.47e-02
Chol. 1.95 1.38 0.89
Combin. 1.74e-02 1.84e-02 3.80e-02
FormX2 1.96e-02 1.38e-02 1.26e-02
Chol. 1.89 1.22 0.75
Combin. 1.34e-02 1.20e-02 1.03e-03
FormX2 1.87e-02 1.33e-02 1.38e-02
Table 7: Profiling the computational stages of Zolo-PD for matrix fv1. Zolo-PD requires three iterations and the times are in second.

Example 3 We use some larger, more ill-conditioned matrices to further test Zolo-SVD and compare it with PDGESVD. The properties of these matrices are shown in Table 8, where rand1 and rand2 are two matrices with random Gaussian entries. These matrices include symmetric ones and nonsymmetric ones, sparse and dense ones. Some matrices are very ill-conditioned and their condition numbers are in the order of .

Matrix Cond
bcsstk18 11,948 80,519 3.46e+11
c-47 15,343 113,372 3.16e+08
c-49 21,132 89,087 6.02e+08
cvxbqp1 50,000 199,984 1.09e+11
rand1 10,000 dense 3.97e+07
rand2 30,000 dense 1.24e+07
Table 8: Summary of basic matrix characteristics

In this example we let PDGESVD and Zolo-SVD use the same number of processes and let , and the numerical results are shown in Table 9, where the first row contains the number of processes used and the other rows contains the speedups of QDWH-SVD and Zolo-SVD over PDGESVD. It turns out that Zolo-SVD is always faster than PDGESVD for these matrices when using many MPI processes, for example using processes. It is interesting to see that Zolo-SVD is also faster than PDGESVD when using processes.

About how to choose , it depends on the computational resources you have and the condition number of matrix, refer to Table 1. The number of iterations cost by Zolo-PD when choosing different are illustrated in Table 10, which are consistent with the results estimated in Table 1. It only decreases by one iteration when is increased from to , and therefore or is probably a good choice. Another reason is that taking too large can lead to numerical instability Nakatsukasa-simrev .

Matrix Method No. of Processes
bcsstk18 QDWH-SVD 1.61 0.59 0.76 0.83 1.02
Zolo-SVD 1.45 0.67 0.85 1.19 1.29
c-47 QDWH-SVD 1.26 0.67 0.98 1.40 1.16
Zolo-SVD 2.33 0.92 1.07 1.51 1.29
c-49 QDWH-SVD 2.01 2.52 1.05 0.97 1.00
Zolo-SVD 1.28 3.02 1.08 1.54 1.73
cvxbqp1 QDWH-SVD 1.20 1.12 1.26 1.03 1.07
Zolo-SVD 1.37 1.36 1.17 1.51 1.28
rand1 QDWH-SVD 2.22 1.09 1.11 1.03 1.15
Zolo-SVD 3.21 1.10 1.24 1.70 1.79
rand2 QDWH-SVD 2.17 2.84 1.14 1.02 1.73
Zolo-SVD 2.20 3.44 1.21 1.60 1.61
Table 9: The speedups of QDWH-SVD and Zolo-SVD () over PDGESVD

Table 9 also shows the performance of QDWH-SVD, which are in the lines denoted by QDWH-SVD, the speedup compared with PDGESVD. We can see that Zolo-SVD can compete with QDWH-SVD, and is usually faster than both QDWH-SVD and PDGESVD for these matrices. This behavior was verified for processes going from 256 up to 4096.

4.1 Numerical Accuracy

The backward error of the overall SVD is computed as


where , are orthogonal and is diagonal and its diagonals are the singular values, denotes the Frobenius norm of matrix and denotes the 2-norm of . We test the orthogonality of the computed singular vectors by measuring

where and are the left and right singular vectors respectively, and is the dimension of matrix ,

bcsstk18 4 4 3 3
c-47 4 4 3 3
c-49 4 4 3 3
linverse 4 3 3 3
nemeth03 3 3 3 3
fv1 4 3 3 3
Table 10: The number of iterations cost by Zolo-PD when choosing different

We compare these three methods, PDGESVD, QDWH-SVD and Zolo-SVD, and the numerical results are shown in Figure 2. The matrices tested are bcsstk18, c-47, c-49, fv1, linverse, nemeth03, cvxbqp1, rand1, and rand2, which are numbered from to in the subfigures, respectively. The results illustrate that the algorithm is numerically stable, and the computed left and right singular vectors are highly orthogonal. The relative residuals, defined in (13), of the computed SVD by Zolo-SVD and QDWD-SVD are as accurate as those computed by PDGESVD for these matrices, and the results are shown in Figure 2(a). Zolo-SVD is comparable to QDWH-SVD, and their accuracy are in the same order. The orthogonality of the computed singular vectors by these three methods are always in machine precision, less than -. The results for the right and left singular vectors are similar, and only the results for left singular vectors are included in Figure 2(b).

(a) Res
(b) OrthL
Figure 2: The residual of the computed SVD, and the orthogonality of the computed singular vectors

5 Conclusions

A new distributed parallel SVD algorithm, called Zolo-SVD, is implemented in this work, which is based on the Zolo-PD Nakatsukasa-simrev algorithm and the symmetric eigenvalue decomposition algorithm. The main advantage of this algorithm is that it is highly scalable. When using more computational resources, numerical results show that Zolo-SVD can be three times faster than PDGESVD. When using the same number of processes, Zolo-SVD can also be faster than PDGESVD, and for some matrices it can be more than two times faster. Compared with QDWH-PD, the drawback of Zolo-PD is that it requires much more floating point operations. To be faster, it must be implemented in parallel. We use many sparse matrices from the University of Florida sparse matrix collection and some random dense matrices to conduct the numerical experiments. Our implementations of QDWH-SVD, Zolo-SVD and structured QR algorithms are freely available at https://github.com/shengguolsg/Zolo-SVD.


This work is partially supported by National Natural Science Foundation of China (No. 11401580, 91530324, 91430218 and 61402495).


  • (1) N. I. Akhiezer. Elements of the Theory of Elliptic Functions. AMS, Providence, RI, 1990.
  • (2) T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Computing, 37(12):783–794, 2011.
  • (3) G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl., 21(2):562–580, 2011.
  • (4) J. L. Barlow, N. Bosner, and Z. Drmač. A new stable bidiagonal reduction algorithm. Linear Algebra Appl., 397:35–84, 2005.
  • (5) J. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
  • (6) N. Bosner and J. L. Barlow. Block and parallel versions of one-sided bidiagonalization. SIAM J. Matrix Anal. Appl., 29(3):927–953, 2007.
  • (7) J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. Scalapack: A portable linear algebra library for distributed memory computers-design issues and performance. Computer Physics Communications, 97:1–15, 1996.
  • (8) T. Davis. Direct methods for sparse linear systems. SIAM, Philadelphia, 2006.
  • (9) T. Davis and Y. Hu. The Univeristy of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011.
  • (10) S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Soc. Inf. Sci., 41:391–407, 1990.
  • (11) J. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.
  • (12) J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput., 34:A206–A239, 2012.
  • (13) J. W. Demmel and W. M. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Comput., 11:873–912, 1990.
  • (14) B. Großer and B. Lang. Efficient parallel reduction to bidiagonal form. Parallel computing, 25:969–986, 1999.
  • (15) M. Faverge, J. Langou, Y. Robert, and J. Dongarra. Bidiagonalization with parallel tiled algorithms, 2016. arXiv: 1611.06892v1.
  • (16) G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal., 2:205–224, 1965.
  • (17) M. Gu and S. C. Eisenstat. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl., 16(1):79–92, 1995.
  • (18) A. Haidar, P. Luszczek, J. Kurzak, and J. Dongarra. An improved parallel singular value algorithm and its implementation for multicore hardware. In William Gropp, editor, Proceedings of SC13, pages 90:1–90:14, Denver, CO, USA, 2013.
  • (19) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
  • (20) N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2008.
  • (21) N. J. Higham and P. Papadimitriou.

    Parallel singular value decomposition via the polar decomposition.

    Technical Report Numerical Analysis Report 239, Manchester Centre for Computational Mathematics, Manchester, England, 1993.
  • (22) N. J. Higham and P. Papadimitriou. A new parallel algorithm for computing the singular value decomposition. In J. G. Lewis, editor, The Fifth SIAM Conference on Applied Linear Algebra, pages 80–84, Philadelphia, 1994. SIAM.
  • (23) H. Hotelling. Simplified calculation of principal components. Psychometrica, 1:27–35, 1935.
  • (24) M. A. Iwen and B. W. Ong. A distributed and incremental svd algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl., 37(4):1699–1718, 2016.
  • (25) Hatem Ltaief, Piotr Luszczek, and Jack Dongarra. High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures. ACM Trans. Mathematical Software, 39(3):16, 2013.
  • (26) H. Ltaif, P. Luszczek, and J. Dongarra. High performance bidiagonal reduction using tile algorithm on homogeneous multicore architectures. ACM Transaction on Mathematical Software, 39(3):16:1–22, 2013.
  • (27) A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26:1–15, 2014.
  • (28) B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Trans. Autom. Control, 1:AC–26, 1981.
  • (29) Y. Nakatsukasa, Z. Bai, and F. Gygi. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl., 31:2700–2720, 2010.
  • (30) Y. Nakatsukasa and R. W. Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Review, xx(x):xx–xxx, 2016.
  • (31) Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput., 35(3):A1325–A1349, 2013.
  • (32) Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM J. Sci. Comput., 35(3):A1325–A1349, 2013.
  • (33) R. Ralha. One-sided reduction to bidiagonal form. Linear Algebra Appl., 358:219–238, 2003.
  • (34) H. D. Simon and H. Zha. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Comput., 21:2257–2274, 2000.
  • (35) D. E. Sukkari, H. Ltaief, and D. E. Keyes. A high performance QDWH-SVD solver using hardware accelerators. Technical report, KAUST Repository, 2015.
  • (36) D. E. Sukkari, H. Ltaief, and D. E. Keyes. High performance polar decomposition on distributed memory systems. In P. F. Dutot and D. Trystram, editors, Euro-Par 2016, LNCS, volume 9833, pages 605–616, Switzerland, 2016. Springer.
  • (37) P. R. Willems, B. Lang, and C. Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl., 28(4):907–926, 2006.
  • (38) H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM J. Sci. Comput., 21:782–791, 1999.