1 Introduction
Computing the SVD of a matrix is an important problem in scientific computing and in many applications. For example, the SVD has been used for information retrieval LSI, principal component analysis (PCA) in statistics hotellingpca, and signal processing moorepca. How to compute it in an efficient and scalable way has received much attention. State-of-the-art SVD solvers are based on the bidiagonal reduction (BRD) strategy, which consists of the following three stages. First, a general matrix is reduced to upper bidiagonal form by a sequence of (two-sided) orthogonal transformations Demmelbook; this step is called bidiagonal reduction. Second, the bidiagonal SVD problem is solved by a standard method such as divide-and-conquer (DC) Gusingular, QR DK, or MRRR MRRRbidiagonal. Finally, the singular vectors are computed by accumulating the orthogonal transformations from the bidiagonal reduction, which is called back transformation. The reduction step is the most time-consuming phase and takes 75%–99% of the total time on homogeneous multicore architectures LtaifTOMS. Some recent works have focused on accelerating the bidiagonal reduction phase; see NewSVD; LtaifTOMS; FavergeBD. To be efficient, much effort is required to develop a scalable algorithm: every detail has to be considered, including synchronization of multiple threads, vectorization, cache behavior, and so on.

There are four main types of bidiagonalization methods. The first is the classical one-stage approach, used in LAPACK and ScaLAPACK, which directly reduces a matrix to bidiagonal form through sequences of orthogonal transformations (Householder or Givens) applied from both sides. The second is the two-stage approach, proposed in GL99, which first reduces a matrix to banded form and then bidiagonalizes the banded matrix. The third approach is based on the Lanczos algorithm; see GolubKahan and SimonZhabidiagonal. The fourth is the so-called one-sided bidiagonalization, first proposed by Ralha in RalhaLAA and stabilized by Barlow, Bosner, and Drmač BarlowOneSided; a block parallel version is proposed in BarlowpOnesided.
In this work we explore a different approach instead of accelerating the bidiagonal reduction phase, and we try out some newer algorithms first proposed in the numerical linear algebra community. Higham and Papadimitriou HPpolarSVD2 introduced a framework for computing the SVD that is based on the polar decomposition combined with an eigendecomposition algorithm. Note that any rectangular matrix $A \in \mathbb{C}^{m\times n}$ ($m \ge n$) has a polar decomposition (PD)
(1) $A = U_p H$,
where $U_p \in \mathbb{C}^{m\times n}$ is a (tall) matrix with orthonormal columns and $H \in \mathbb{C}^{n\times n}$ is Hermitian positive semidefinite Highambook2. The SVD of $A$ can be obtained by further computing the eigendecomposition of $H$, i.e., $H = V\Sigma V^*$, so that $A = U_p V \Sigma V^* = (U_p V)\Sigma V^* = U\Sigma V^*$.
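To make the framework concrete, the following minimal sketch (NumPy/SciPy, for a real matrix; scipy.linalg.polar stands in for any PD algorithm, such as the QDWH and ZoloPD methods discussed below, and the function name is ours) computes the SVD from a polar decomposition followed by a symmetric eigendecomposition:

```python
import numpy as np
from scipy.linalg import polar, eigh

def svd_via_polar(A):
    """Sketch of the SVD-via-polar-decomposition framework in (1).

    A = Up @ H (polar), H = V diag(s) V^T (symmetric eig), hence
    A = (Up @ V) diag(s) V^T = U diag(s) V^T.
    """
    Up, H = polar(A)            # polar decomposition; replaceable by QDWH/ZoloPD
    s, V = eigh(H)              # eigenvalues of H are the singular values of A
    s, V = s[::-1], V[:, ::-1]  # reorder to the usual descending SVD convention
    U = Up @ V
    return U, s, V
```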
One advantage of this framework is that it can be accelerated by many efficient PD algorithms and by well-developed scalable eigensolvers (such as ELPA elpalibrary) without implementing complicated bidiagonalization codes; therefore, its implementation is relatively simple. Its main drawback is that it requires many more floating-point operations than the bidiagonal reduction approach; see NHqdwheig; Nakatsukasasimrev for details. This framework is well known in the numerical linear algebra community, but there are few or no results on its performance on supercomputers compared with existing parallel SVD algorithms, for example, the algorithms in ScaLAPACK Scalapack; scalapackbook, the most widely used parallel numerical linear algebra package.
The SVD problem is thus reduced to an eigenvalue problem via the polar decomposition, so the recent well-developed scalable eigenvalue packages become usable. The remaining problem is how to compute the polar decomposition in a scalable and efficient way. There exist several well-established algorithms, such as the scaled Newton (SN) method Highambook2, the QR-based dynamically weighted Halley (QDWH) method NBGHalley, and the more recently proposed algorithm based on Zolotarev's functions Nakatsukasasimrev. The QDWHSVD algorithm, which is used to compute the SVD in NHqdwheig, has been implemented on multicore architectures enhanced with multiple GPUs qdwhmulticore. The results there show that QDWHSVD can outperform the standard methods. Sukkari, Ltaief, and Keyes qdwhdist further present a comprehensive performance study of QDWHSVD on a large-scale distributed-memory platform, based on the numerical library ScaLAPACK Scalapack. With QDWHPD as a preprocessing step and the eigensolver ELPA elpalibrary for the eigendecomposition, the distributed parallel QDWHSVD algorithm qdwhdist achieves up to fivefold and twofold speedups over the ScaLAPACK routine PDGESVD on ill- and well-conditioned matrices, respectively. Note that the convergence rate of QDWH depends on the condition number of the matrix: for well-conditioned matrices it requires only a few iterations NHqdwheig; NBGHalley to compute the PD. In contrast, the condition number is usually irrelevant to eigenvalue problems, which depend more on whether the eigenvalues are relatively well separated.
In Nakatsukasasimrev, Nakatsukasa and Freund proposed a variant of the QDWHPD algorithm with a higher order of convergence for the polar decomposition, obtained by using Zolotarev's functions, and called it ZoloPD. The convergence order of ZoloPD can be as high as 17, and it usually requires only one or two iterations (whereas QDWH requires at most six). As in Nakatsukasasimrev, we call the SVD algorithm based on ZoloPD the ZoloSVD algorithm. So far, we have not seen any results for ZoloSVD on high performance computers.
In this paper we mainly exploit this ZoloSVD algorithm: we introduce a high performance ZoloPD implementation on distributed-memory platforms based on the state-of-the-art numerical library ScaLAPACK, and we discuss some ways to further improve it. We also compare the performance of QDWHPD and ZoloPD. It turns out that ZoloPD requires more floating-point operations, but it has better scalability than QDWH, since ZoloPD decomposes the MPI processes into several relatively independent groups and is thus more loosely coupled. ZoloPD can be much faster than QDWHPD when using many (MPI) processes. Combined with ELPA, we show that ZoloSVD can be much faster than the ScaLAPACK routine PDGESVD and than QDWHSVD. We use some sparse matrices from the University of Florida sparse matrix collection TimDavisMatrix and some randomly constructed matrices to test the ZoloSVD algorithm. Our QDWHSVD and ZoloSVD software library is freely available at https://github.com/shengguolsg/ZoloSVD.
2 QDWHPD and ZoloPD
The polar decomposition is an important problem in numerical linear algebra, and well-known PD algorithms include the scaled Newton (SN) method Highambook2, QDWHPD NBGHalley, and ZoloPD Nakatsukasasimrev, among others.
2.1 QDWHPD
QDWH is a QR-based dynamically weighted Halley iterative method for computing the polar decomposition NBGHalley. QDWH computes the polar factor $U_p$ as the limit of the sequence $\{X_k\}$ defined by
(2) $X_{k+1} = X_k\,(a_k I + b_k X_k^* X_k)(I + c_k X_k^* X_k)^{-1}, \qquad X_0 = A/\alpha,$
where $\alpha$ is an upper bound of the maximum singular value of $A$, $\alpha \ge \sigma_{\max}(A)$. In QDWH the parameters $a_k$, $b_k$, and $c_k$ are chosen dynamically to speed up the convergence. If $(a_k, b_k, c_k) = (3, 1, 3)$ are fixed, (2) becomes the Halley iteration, which is cubically convergent Highambook2. The iteration (2) requires explicit matrix inversion and thus may have potential numerical stability issues. It is shown in NBGHalley that (2) is mathematically equivalent to a QR-based implementation, which is inverse-free. The practical QDWH iteration is $X_0 = A/\alpha$ and
(3) $\begin{bmatrix} \sqrt{c_k}\, X_k \\ I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R, \qquad X_{k+1} = \frac{b_k}{c_k}\, X_k + \frac{1}{\sqrt{c_k}}\Bigl(a_k - \frac{b_k}{c_k}\Bigr) Q_1 Q_2^*,$
where $Q_1 \in \mathbb{C}^{m\times n}$ and $Q_2 \in \mathbb{C}^{n\times n}$. The main cost lies in computing the QR factorization of an $(m+n)\times n$ matrix and a matrix multiplication, both of which can be done in a communication-optimal manner BDHS11; Hoemmensisc.
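As an illustration, the following dense serial sketch (NumPy, for a real full-column-rank matrix; not the ScaLAPACK implementation discussed later, and the function name and stopping test are ours) runs the QR-based iteration (3) with the dynamically weighted parameters of NBGHalley:

```python
import numpy as np

def qdwh_polar(A, alpha, beta, maxit=20):
    """Sketch of the QR-based QDWH iteration (3) for a real matrix A of full column rank.

    alpha: upper bound on sigma_max(A); beta: lower bound on sigma_min(A).
    Returns (Up, H) with A ~= Up @ H, Up having orthonormal columns and H symmetric PSD.
    """
    m, n = A.shape
    X = A / alpha
    ell = beta / alpha                          # lower bound on sigma_min(X)
    tol = np.finfo(float).eps ** (1.0 / 3.0)    # stopping test matching cubic convergence
    for _ in range(maxit):
        # dynamically weighted Halley parameters (formulas from NBGHalley)
        d = (4.0 * (1.0 - ell**2) / ell**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - ell**2) / (ell**2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # inverse-free update: [sqrt(c) X; I] = [Q1; Q2] R
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m, :], Q[m:, :]
        Xnew = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        ell = ell * (a + b * ell**2) / (1.0 + c * ell**2)
        converged = np.linalg.norm(Xnew - X, 'fro') <= tol * np.linalg.norm(Xnew, 'fro')
        X = Xnew
        if converged:
            break
    Up = X
    H = Up.T @ A
    return Up, (H + H.T) / 2.0                  # symmetrize H
```

In practice one switches from the QR-based update to the cheaper Cholesky-based variant (4) below once the iterate is well conditioned; the sketch keeps only the QR form for brevity.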
Iteration (2) is also mathematically equivalent to the following form,
(4) $Z_k = I + c_k X_k^* X_k, \qquad W_k = \mathrm{chol}(Z_k), \qquad X_{k+1} = \frac{b_k}{c_k}\, X_k + \Bigl(a_k - \frac{b_k}{c_k}\Bigr)\bigl(X_k W_k^{-1}\bigr) W_k^{-*},$
where $\mathrm{chol}(Z_k)$ denotes the Cholesky factor of $Z_k$. The starting point is that when $c_k$ is small, the matrix $Z_k$ is probably well conditioned, and computing its Cholesky factorization is cheaper than a QR factorization. As suggested in NHqdwheig, we switch from (3) to (4) as soon as $c_k$ becomes small enough that $Z_k$ is well conditioned. In general, the QR iteration (3) is usually required only once or twice. Our implementation of the QDWH algorithm is based on ScaLAPACK, and the main procedure is quite similar to the one in qdwhdist.

2.2 ZoloPD
The QDWH parameters $a_k$, $b_k$, and $c_k$ in (2) are computed as the solution of the rational max-min optimization problem in NBGHalley,
(5) $\max_{a,b,c}\ \min_{\ell_k \le x \le 1}\ \frac{x(a + b x^2)}{1 + c x^2},$
subject to the constraint $0 < \frac{x(a + b x^2)}{1 + c x^2} \le 1$ on $[\ell_k, 1]$, where $\ell_k$ is a lower bound of the smallest singular value of $X_k$. The parameters of the QDWH iteration can also be obtained by finding the best type $(3,2)$ rational approximation to the sign function in the infinity norm; see Nakatsukasasimrev.
Nakatsukasa and Freund Nakatsukasasimrev extend this to higher order rational functions of type $(2r+1, 2r)$ for general $r \ge 1$. The resulting optimal rational function is called the type $(2r+1, 2r)$ Zolotarev function corresponding to $\ell$, and it is given by
(6) $Z_{2r+1}(x;\ell) = M\, x \prod_{j=1}^{r} \frac{x^2 + c_{2j}}{x^2 + c_{2j-1}}.$
Here, the constant $M$ is uniquely determined by the equioscillation condition of $1 - Z_{2r+1}(x;\ell)$ on $[\ell, 1]$, and the coefficients $c_i$ are given by
(7) $c_i = \ell^2\, \frac{\mathrm{sn}^2\bigl(\tfrac{iK'}{2r+1};\, \ell'\bigr)}{\mathrm{cn}^2\bigl(\tfrac{iK'}{2r+1};\, \ell'\bigr)}, \qquad i = 1, 2, \ldots, 2r,$
where $\mathrm{sn}$ and $\mathrm{cn}$ are the Jacobi elliptic functions (see, e.g., (Akhiezerbook1, Ch. 5)). Here $\ell' = \sqrt{1 - \ell^2}$ and $K'$ is the complete elliptic integral of the first kind for the complementary modulus $\ell'$. It is more convenient to use the scaled Zolotarev function
(8) $\hat Z_{2r+1}(x;\ell) = \hat M\, x \prod_{j=1}^{r} \frac{x^2 + c_{2j}}{x^2 + c_{2j-1}},$
where $\hat M = \prod_{j=1}^{r} \frac{1 + c_{2j-1}}{1 + c_{2j}}$, so that $\hat Z_{2r+1}(1;\ell) = 1$.
Combining the iterative process used in QDWH with the scaled Zolotarev functions, we arrive at an algorithm that computes the polar factor by composing Zolotarev functions, called ZoloPD Nakatsukasasimrev and shown in Algorithm 2.2. One question remains: how to choose the order $r$ of the rational functions? The convergence order is $2r+1$, and the appropriate $r$ is related to the condition number of the matrix $A$: for ill-conditioned matrices, $r$ should be large to obtain fast convergence. A table is given in Nakatsukasasimrev to guide the choice of $r$. For completeness, the results are shown in Table 1, which lists the smallest number of iterations for which the composed Zolotarev approximation to the sign function reaches double-precision accuracy, for varying values of $r$ and of the condition number $\kappa_2(A)$. It is suggested in Nakatsukasasimrev to choose $r$ such that at most two iterations are required. This strategy may, however, require much more computational resources in a distributed parallel environment. We will discuss it further in Section 4.
$\kappa_2(A)$  1.001  1.01  1.1  1.2  1.5  2  10  $10^2$  $10^3$  $10^5$  $10^{10}$  $10^{16}$
$r = 1$  2  2  2  3  3  3  4  4  4  5  5  6
$r = 2$  1  2  2  2  2  2  3  3  3  3  4  4
$r = 3$  1  1  2  2  2  2  2  2  3  3  3  3
$r = 4$  1  1  1  2  2  2  2  2  2  3  3  3
$r = 5$  1  1  1  1  2  2  2  2  2  2  3  3
$r = 6$  1  1  1  1  1  2  2  2  2  2  2  3
$r = 7$  1  1  1  1  1  1  2  2  2  2  2  3
$r = 8$  1  1  1  1  1  1  2  2  2  2  2  2
The following two theorems show that the function (8) can be written in partial fraction form, which endows ZoloPD with good parallelism: each of the $r$ QR factorizations in (12) can be done simultaneously, and therefore the computation time is reduced significantly. For example, we can divide all processes into $r$ groups, and each group computes one QR factorization in (12), instead of using all processes to compute the $r$ QR factorizations sequentially. Compared with QDWHPD, another advantage of ZoloPD is that it requires much fewer iterations, about two-thirds fewer, see Table 1 and NBGHalley, since QDWH requires up to six iterations for ill-conditioned matrices.
Theorem 2.1 (Nakatsukasasimrev). The scaled Zolotarev function (8) can be written in the partial fraction form
(9) $\hat Z_{2r+1}(x;\ell) = \hat M\, x \Bigl(1 + \sum_{j=1}^{r} \frac{a_j}{x^2 + c_{2j-1}}\Bigr),$
where
(10) $a_j = -\prod_{k=1}^{r}\bigl(c_{2j-1} - c_{2k}\bigr) \Big/ \prod_{\substack{k=1 \\ k \neq j}}^{r}\bigl(c_{2j-1} - c_{2k-1}\bigr), \qquad j = 1, \ldots, r.$
Theorem 2.2 (Nakatsukasasimrev).
For the function $f(x) = \hat Z_{2r+1}(x;\ell)$ as in (8), and a matrix $X \in \mathbb{C}^{m\times n}$ with SVD $X = U\Sigma V^*$, the matrix $f(X) := U f(\Sigma) V^*$ is equal to
(11) $f(X) = \hat M \Bigl( X + \sum_{j=1}^{r} a_j\, X \bigl(X^* X + c_{2j-1} I\bigr)^{-1} \Bigr).$
Moreover, $f(X)$ can be computed in an inverse-free manner as
(12) $\begin{bmatrix} X \\ \sqrt{c_{2j-1}}\, I \end{bmatrix} = \begin{bmatrix} Q_{1j} \\ Q_{2j} \end{bmatrix} R_j, \qquad f(X) = \hat M \Bigl( X + \sum_{j=1}^{r} \frac{a_j}{\sqrt{c_{2j-1}}}\, Q_{1j} Q_{2j}^* \Bigr).$
[ZoloPD for the polar decomposition] Let $\alpha$ be an upper bound of $\sigma_{\max}(A)$ and $\beta$ a lower bound of $\sigma_{\min}(A)$.
(a) Compute $X_0 = A/\alpha$ and $\ell_0 = \beta/\alpha$;
(b) Choose $r$ based on $\ell_0$ from Table 1. If $\ell_0$ is already close to $1$, set $U_p = X_0$ and skip to (d);
(c) Iterate $X_{k+1} = f(X_k)$ via the inverse-free formula (12) (or its Cholesky-based variant once $X_k$ is well conditioned), with the coefficients $c_i$ and $a_j$ computed from $\ell_k$, and update $\ell_{k+1} = \hat Z_{2r+1}(\ell_k;\ell_k)$, until $\ell_k$ is sufficiently close to $1$; set $U_p = X_k$;
(d) Compute $H = U_p^* A$ and symmetrize $H \leftarrow (H + H^*)/2$.
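A serial sketch (NumPy/SciPy, for a real matrix; a simplification of Algorithm 2.2 rather than our distributed code, with function names of our own choosing) shows how the coefficients (7) can be evaluated with scipy.special and how one inverse-free update (12) is applied; the residue formula for the $a_j$ follows from the partial fraction expansion in Theorem 2.1:

```python
import numpy as np
from scipy.special import ellipj, ellipkm1

def zolo_coeffs(ell, r):
    """Coefficients c_1..c_{2r} of (7), the scaling Mhat of (8), and the residues a_1..a_r."""
    Kp = ellipkm1(ell**2)                       # K(1 - ell^2), robust even for small ell
    c = np.empty(2 * r)
    for i in range(1, 2 * r + 1):
        sn, cn, _, _ = ellipj(i * Kp / (2 * r + 1), 1.0 - ell**2)
        c[i - 1] = ell**2 * sn**2 / cn**2
    Mhat = np.prod((1.0 + c[0::2]) / (1.0 + c[1::2]))   # makes Zhat(1; ell) = 1
    a = np.empty(r)
    for j in range(r):                          # residue of prod_k (t + c_2k)/(t + c_{2k-1}) at t = -c_{2j-1}
        t = -c[2 * j]
        a[j] = np.prod(t + c[1::2]) / np.prod(np.delete(t + c[0::2], j))
    return c, a, Mhat

def zolo_pd_step(X, ell, r):
    """One ZoloPD update X <- f(X) for real X, computed inverse-free via the r QR factorizations in (12)."""
    m, n = X.shape
    c, a, Mhat = zolo_coeffs(ell, r)
    Xnew = X.copy()
    for j in range(r):
        cj = c[2 * j]                           # c_{2j-1}
        Q, _ = np.linalg.qr(np.vstack([X, np.sqrt(cj) * np.eye(n)]))
        Xnew += (a[j] / np.sqrt(cj)) * (Q[:m, :] @ Q[m:, :].T)
    Xnew *= Mhat
    ell_new = Mhat * ell * np.prod((ell**2 + c[1::2]) / (ell**2 + c[0::2]))  # Zhat(ell; ell)
    return Xnew, ell_new
```

Repeating zolo_pd_step (with the coefficients recomputed from the updated $\ell$) until $\ell$ is close to $1$ yields $U_p$, after which $H = U_p^* A$ is formed and symmetrized as in step (d). In our parallel implementation the $r$ QR factorizations of the loop are assigned to different process groups, as described in Section 3.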
2.3 ZoloSVD
A framework for computing the SVD via the polar decomposition and the eigendecomposition has been proposed in HPpolarSVD; NHpolar. Assume the polar decomposition of $A$ is $A = U_p H$ and the symmetric eigendecomposition of $H$ is $H = V\Sigma V^*$; then the SVD of $A$ is obtained from $A = U_p V \Sigma V^* = (U_p V)\Sigma V^*$.
Many algorithms have been proposed within this framework; they differ in how the polar decomposition and the symmetric eigendecomposition are computed. In HPpolarSVD, a method based on the Padé iteration is suggested for the polar decomposition, together with any standard method for the symmetric eigendecomposition. In NHpolar, the polar decomposition is computed by the QDWH algorithm and the symmetric eigendecomposition by QDWHEIG, a spectral divide-and-conquer algorithm based on the polar decomposition; see NHpolar for details. In qdwhmulticore, it is shown that QDWHEIG is not as efficient as the symmetric eigendecomposition routines in MAGMA on GPUs. The results in qdwhdist also show that QDWHEIG is slower than ELPA for ill- and well-conditioned matrices on distributed-memory systems. Therefore, we also use ELPA to compute the eigendecomposition of $H$ in our implementation. The SVD algorithm in Nakatsukasasimrev is based on ZoloPD, but it is implemented there in Matlab and there are no results about its performance on supercomputers. The framework for computing the SVD used in this paper is summarized in Algorithm 2.3. We use ZoloPD to compute the polar decomposition and ELPA Elpa to compute the symmetric eigendecomposition on distributed-memory parallel computers.
One of the main differences between ELPA and the DC algorithm in ScaLAPACK is that ELPA uses a two-stage approach for the tridiagonalization. Compared with ScaLAPACK, ELPA has better scalability and can be up to two times faster on the Tianhe-2 supercomputer; see elpalibrary for results on other supercomputers. Unfortunately, few packages have efficiently implemented the two-stage bidiagonal reduction (BRD) algorithm. We expect that a two-stage BRD extended to distributed-memory systems could achieve speedups similar to those of the two-stage tridiagonal reduction (TRD); this will be our future work. For multicore architectures, PLASMA provides a high-performance BRD, which achieves up to a 30-fold speedup LLDbidiagonal on a 16-core Intel Xeon machine against the LAPACK implementation in Intel MKL version 10.2.
In this work, we are concerned with computing the SVD of (nearly) square matrices. For highly rectangular matrices, many efficient algorithms are based on randomized techniques MartinssonRev10. Another approach is to compute the SVD incrementally, such as the SVD-updating algorithm in Zhaupdating. A hierarchically incremental SVD approach is proposed in HISVD, which is similar in spirit to the CAQR factorization Hoemmensisc: it divides the matrix into many chunks along the column dimension, $A = [A_1, A_2, \ldots, A_p]$, the SVD of each $A_i$ is computed independently by a different process group, and the results are then merged hierarchically (a two-block merge is sketched below). Similar to ZoloPD, this approach is naturally parallelizable.
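To illustrate the merging step of such an incremental scheme, the following two-block example (our own NumPy illustration, not the HISVD code) combines the SVDs of two column blocks into the SVD of the concatenated matrix; applying it pairwise in a tree gives the hierarchical variant:

```python
import numpy as np

def merge_svd(U1, s1, V1, U2, s2, V2):
    """Merge the SVDs A1 = U1 diag(s1) V1^T and A2 = U2 diag(s2) V2^T into the SVD of [A1, A2]."""
    # [A1, A2] = [U1 diag(s1), U2 diag(s2)] @ blkdiag(V1, V2)^T
    B = np.hstack([U1 * s1, U2 * s2])          # broadcasting scales the columns of U1 and U2
    W, s, Yt = np.linalg.svd(B, full_matrices=False)
    n1, k1 = V1.shape
    n2, k2 = V2.shape
    Vblk = np.zeros((n1 + n2, k1 + k2))
    Vblk[:n1, :k1] = V1
    Vblk[n1:, k1:] = V2
    return W, s, Vblk @ Yt.T                   # left vectors, singular values, right vectors

# Usage: split A column-wise, factor the blocks (possibly on different process groups), then merge.
A = np.random.randn(200, 60)
U1, s1, V1t = np.linalg.svd(A[:, :30], full_matrices=False)
U2, s2, V2t = np.linalg.svd(A[:, 30:], full_matrices=False)
U, s, V = merge_svd(U1, s1, V1t.T, U2, s2, V2t.T)
assert np.allclose(U @ np.diag(s) @ V.T, A)
```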
[ZoloSVD]
Input: a general matrix $A \in \mathbb{R}^{m\times n}$ with $m \ge n$.
Output: the SVD $A = U\Sigma V^T$.
(1) Compute the polar decomposition $A = U_p H$ via ZoloPD.
(2) Compute the symmetric eigendecomposition $H = V\Sigma V^T$ via ELPA.
(3) Compute $U = U_p V$.
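A serial end-to-end sketch of Algorithm 2.3 (for a real, full-rank matrix) can be obtained by reusing zolo_pd_step from the Section 2.2 sketch and replacing ELPA by numpy.linalg.eigh; this is only a toy analogue of the distributed implementation described in Section 3:

```python
import numpy as np
# zolo_pd_step is the function from the Section 2.2 sketch

def zolo_svd(A, alpha, beta, r=3, tol=1e-14):
    """Toy ZoloSVD: ZoloPD for the polar factor, then a symmetric eigendecomposition."""
    X, ell = A / alpha, beta / alpha
    while 1.0 - ell > tol:                  # iterate until the singular values of X are ~1
        X, ell = zolo_pd_step(X, ell, r)
    Up = X
    H = Up.T @ A
    H = (H + H.T) / 2.0                     # symmetrize
    s, V = np.linalg.eigh(H)                # in the real code this is done by ELPA
    s, V = s[::-1], V[:, ::-1]              # descending order
    return Up @ V, s, V
```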
3 Implementation details
The serial ZoloPD algorithm requires more flops than QDWH, see Nakatsukasasimrev, but it is naturally parallelizable, so we can use more MPI processes and implement ZoloPD in parallel. We use $r$ process groups to compute the $r$ QR factorizations or Cholesky factorizations simultaneously; therefore, the cost of each iteration per process group is about the same as for QDWH. Comparing the numbers of iterations required, the parallel version of ZoloPD is expected to be about three times faster than QDWH, which is also confirmed by the results in Section 4.
In this section, we introduce our implementation based on the routines in ScaLAPACK, and we show how to exploit the sparse structure of the stacked matrix in (12) when computing its QR factorization. A structured QR algorithm is proposed in the following subsection by modifying the routines in ScaLAPACK; it can be up to about 1.5 times faster than the QR algorithm in ScaLAPACK (see Table 2).
3.1 Fast structured QR algorithms
In Algorithm 2.2, we need to compute the QR factorization of the tall matrix $\begin{bmatrix} X \\ \sqrt{c_{2j-1}}\, I \end{bmatrix}$, where $X$ is a general $m\times n$ matrix and $I$ is an $n\times n$ identity matrix, which is sparse. In this subsection, we investigate a fast QR algorithm for this matrix, which has been mentioned in (NHqdwheig, Appendix A.1). The main idea is to take advantage of the sparsity of the bottom part of the matrix, i.e., the identity block, whereas a classical QR factorization ignores all the zeros and treats the matrix as dense. By exploiting this special sparse structure, each Householder reflector in the structured QR algorithm has at most $n + \mathrm{NB}$ nonzero elements instead of up to $m + n$ in the classical one, which saves a large number of floating-point operations. We exploit the sparse structure block column by block column. Different from the method suggested in NHqdwheig, our implementation can be seen as a blocked version of it: every NB columns are transformed together, and each Householder reflector in our implementation therefore has at most $n + \mathrm{NB}$ nonzero elements.

In this work, we implement this idea by modifying the ScaLAPACK routines PDGEQRF and PDORGQR. The routine PDGEQRF computes a QR factorization of a general matrix by factorizing the first NB columns, then the next NB columns, and so on, where NB is the block size. Its process is similar to that shown in Figure 1 for the proposed structured QR algorithm, which is denoted by MPDGEQRF for simplicity. The difference between MPDGEQRF and PDGEQRF is that the row dimension of each panel in MPDGEQRF is at most $n + \mathrm{NB}$, where $\mathrm{NB} \ll n$. (We use "row dimension" to denote the number of rows of a matrix.) In contrast, the row dimensions of the panels in PDGEQRF can be much larger; for example, the row dimension of the first panel of PDGEQRF is $m + n$. The ScaLAPACK routine PDORGQR can be similarly modified to generate the orthogonal matrix computed by MPDGEQRF; the structured version is denoted by MPDORGQR.

To compare this structured QR factorization with the ScaLAPACK routines, we use two random matrices of different dimensions. The experiments are done on the Tianhe-2 supercomputer, located in Guangzhou, China; each compute node has two Intel Xeon E5-2692 v2 CPUs with 24 cores in total. We use 256, 512, 1024, and 2048 processes to test the four routines MPDGEQRF, MPDORGQR, PDGEQRF, and PDORGQR. The execution times are shown in Table 2, from which we can see that the speedups over ScaLAPACK range from about 1.2 to 1.5.
No. Proc.  PDGEQRF  MPDGEQRF  Speedup  PDORGQR  MPDORGQR  Speedup
Matrix 1 (smaller)
256  0.91  0.69  1.32  0.39  0.27  1.43
512  0.89  0.66  1.34  0.30  0.20  1.51
1024  0.76  0.64  1.23  0.21  0.16  1.26
2048  0.87  0.75  1.18  0.15  0.13  1.21
Matrix 2 (larger)
256  3.36  2.46  1.36  1.98  1.37  1.45
512  2.91  2.14  1.36  1.45  0.97  1.49
1024  2.08  1.60  1.30  0.88  0.61  1.46
2048  2.00  1.57  1.28  0.69  0.47  1.47
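The structure described in Section 3.1 can be made concrete with a small dense sketch (our own NumPy illustration of the unblocked case NB = 1, not the MPDGEQRF code): at step $j$ the Householder vector for column $j$ of $[A; I]$ has nonzeros only in rows $j$ to $n$ of the top block and rows $1$ to $j$ of the identity block, so each reflector has at most $n+1$ nonzeros and the trailing update skips the untouched zero rows.

```python
import numpy as np

def structured_qr_stack(A):
    """Householder QR of B = [A; I] (A is n x n, real) exploiting the identity block.

    Step j's Householder vector is supported on top rows j..n-1 and bottom rows 0..j only,
    so it has at most n+1 nonzeros (n+NB in the blocked version) and the update below
    touches only those rows. Returns the n x n factor R and the reflectors.
    """
    n = A.shape[0]
    B = np.vstack([A.astype(float), np.eye(n)])
    reflectors = []
    for j in range(n):
        rows = np.r_[j:n, n:n + j + 1]          # nonzero pattern of column j
        x = B[rows, j]
        alpha = -np.copysign(np.linalg.norm(x), x[0] if x[0] != 0 else 1.0)
        v = x.copy()
        v[0] -= alpha
        tau = 2.0 / (v @ v)
        cols = np.arange(j, n)
        B[np.ix_(rows, cols)] -= tau * np.outer(v, v @ B[np.ix_(rows, cols)])
        reflectors.append((rows, v, tau))
    return np.triu(B[:n, :n]), reflectors

# Quick check against a dense QR of the stacked matrix (R agrees up to row signs).
A = np.random.randn(6, 6)
R1, _ = structured_qr_stack(A)
R2 = np.linalg.qr(np.vstack([A, np.eye(6)]), mode='r')
assert np.allclose(np.abs(R1), np.abs(R2))
```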
3.2 Implementation based on ScaLAPACK
Our implementation depends heavily on ScaLAPACK and BLACS routines and is therefore portable. We use BLACS routines to split the communicators and to perform the communications among all the processes. The algorithm proposed here works for both sparse and dense matrices. Note that for sparse matrices our algorithm does not exploit their sparsity when computing the SVD, just as ScaLAPACK does not, since Algorithm 3.2 can be seen as a direct method for computing the SVD.
[Distributed Parallel ZoloSVD]
Input: a sparse or dense matrix $A$, distributed among all the processes.
Output: the SVD of $A$.
(1) Compute $\alpha$ and $\beta$, involving all the processes;
(2) Compute $X_0 = A/\alpha$ and let $\ell_0 = \beta/\alpha$, involving all the processes;
(3) Choose $r$ based on $\ell_0$, or fix $r = 2$ or $r = 3$. Divide all the processes into $r$ subgroups, and redistribute the matrix to each subgroup.
(4) While not convergent: each subgroup computes its QR (or Cholesky) factorization and forms its term of (12); the $r$ terms are summed across the subgroups to obtain the next iterate, and $\ell$ is updated. End while.
(5) One subgroup redistributes $U_p$ to all the processes;
(6) Compute the eigendecomposition of $H = U_p^* A$, and then the singular vectors of $A$.
We assume the whole matrix is distributed among all the processes and use the BLACS routine PDGEMR2D to redistribute it to the processes in each subcommunicator. We find that the time spent in this data redistribution is very small. For a sparse matrix $A$, we can also let each process keep one copy of $A$, stored in sparse form, when its storage is not large; this avoids one round of communication. For simplicity, we further assume that an upper bound of $\sigma_{\max}(A)$ and a lower bound of $\sigma_{\min}(A)$ are known, denoted by $\alpha$ and $\beta$, respectively. They can be estimated in particular applications or computed very efficiently by using sparse direct solvers such as SuiteSparse TimDavisbook; the computation times are included in Section 4, see Table 3. The third and fourth columns of Table 3 show the times for estimating $\alpha$ and $\beta$ in seconds, which are very small.

There are three MPI process grids in our implementation. Assume there are $P$ processes in total, organized into a 2D grid. The BLACS context associated with all the processes is named ALL_CONTXT. For simplicity, we assume $P$ is divisible by $r$. We use the BLACS routine BLACS_GRIDMAP to divide all the processes into $r$ groups. This is the second process grid, and its BLACS context is named TOP_CONTXT. Each group contains the processes in the same row of the process grid TOP_CONTXT. The communications in the ZoloPD algorithm are among the processes in the same column of the process grid TOP_CONTXT. The processes in the same row of TOP_CONTXT are further organized into a 2D grid, which is used to compute each individual QR or Cholesky factorization in (12). This is the third process grid, and its BLACS context is named SEP_CONTXT. The matrix is redistributed from the processes in ALL_CONTXT to the processes in SEP_CONTXT by using the BLACS routine PDGEMR2D. The parallel ZoloSVD algorithm is summarized in Algorithm 3.2, which can easily be turned into an executable ScaLAPACK-style routine.
The main computation of ZoloPD lies in the QR factorizations and the Cholesky factorizations. Each iteration requires a summation of $r$ terms, and the computation of each term is independent and is carried out by an individual subgroup of processes. In the QR stage, it requires computing the terms $\frac{a_j}{\sqrt{c_{2j-1}}} Q_{1j} Q_{2j}^*$ of (12) for $j = 1, \ldots, r$: each process subgroup computes the QR factorization of the matrix $\begin{bmatrix} X_k \\ \sqrt{c_{2j-1}}\, I \end{bmatrix}$. In the Cholesky factorization stage, each process subgroup computes the Cholesky factor of the matrix $X_k^* X_k + c_{2j-1} I$. The work loads are well balanced. To compute the summation, communications are required; in our implementation, the summation is done by using the BLACS routine DGSUM2D. We only need to add the corresponding data in the same process column of the process grid TOP_CONTXT, because the data distribution is the same for every group.
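The communicator-splitting pattern can be illustrated with a small mpi4py sketch (a conceptual analogue of the BLACS contexts above, not our actual ScaLAPACK code; the group count and block size are placeholders):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
r = 3                                        # number of terms/subgroups (placeholder)
assert P % r == 0, "for simplicity assume P is divisible by r"

group_id = rank // (P // r)                  # which term of (12) this process works on
sub_comm = comm.Split(color=group_id, key=rank)       # analogue of SEP_CONTXT

# ... each subgroup computes its local block of (a_j / sqrt(c_{2j-1})) Q_{1j} Q_{2j}^T here ...
local_block = np.zeros((4, 4))               # placeholder for the locally owned block

# Processes holding the same block in different subgroups sum their contributions;
# this plays the role of DGSUM2D over a process column of TOP_CONTXT.
peer_comm = comm.Split(color=sub_comm.Get_rank(), key=group_id)
summed_block = np.empty_like(local_block)
peer_comm.Allreduce(local_block, summed_block, op=MPI.SUM)
```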
We test the performance of ZoloPD with two ways of choosing the parameter $r$. In the first, $r$, the number of subcommunicators, is obtained from Table 1 based on the condition number of $A$. This strategy works well for well-conditioned matrices; however, for ill-conditioned matrices it requires more computational resources, for example $r = 8$ subgroups. The second strategy is to choose a small $r$ but allow more iterations. In our implementation, we choose $r = 2$ or $r = 3$. From the results in Table 1, the number of iterations then increases by one or two. This means the convergence rate of ZoloPD is reduced, but it is still of order $5$ or $7$.
4 Numerical Results
Our experiments are performed on the Tianhe-2 supercomputer located in Guangzhou, China, which is one of the fastest supercomputers in the world, with a theoretical peak performance of 54.9 petaflops and 33.86 petaflops on the Linpack benchmark. It has a total of 16,000 compute nodes. Each compute node is equipped with two Intel Xeon E5-2692 CPUs and 64 GB of DDR3 main memory, and the interconnect network topology is an optoelectronic hybrid, hierarchical fat tree. For compilation we used the Intel Fortran compiler (ifort) with the optimization flags -O3 -mAVX, and linked the codes to Intel MKL (composer_xe_2015.1.133). As suggested in qdwhdist, we investigate only the pure MPI implementation and set NB = 64 for the two-dimensional block cyclic data distribution (BCDD). Each compute node runs 24 MPI processes in principle.
Example 1. We first use three medium-size sparse matrices from real applications, obtained from the University of Florida sparse matrix collection TimDavisMatrix. The names of these matrices are listed in Table 3. The QDWHPD and ZoloPD algorithms require estimating a lower bound $\beta$ of the smallest singular value and an upper bound $\alpha$ of the largest singular value of these matrices. We find that the computations of $\alpha$ and $\beta$ are quite fast, nearly negligible; see the third and fourth columns of Table 3. These estimates were computed on a laptop with an Intel i7 CPU and 16 GB memory using Matlab 2010b.
Matrix  N  Est. $\alpha$ (s)  Est. $\beta$ (s)  Cond  $r$
nemeth03  9,506  1.03e-02  7.19e-02  1.29e+00  2
fv1  9,604  3.35e-02  2.56e-01  1.40e+01  3
linverse  11,999  3.01e-03  0.25e-01  9.06e+03  4
The number of iterations of ZoloPD depends on $r$: when $r$ is larger, ZoloPD has a higher convergence rate. In this example, we choose $r$ from the values in Table 1. Table 5 shows the numbers of iterations taken by ZoloPD. From it we can see that for the larger values of $r$ ZoloPD requires two fewer iterations than QDWHPD, which partly explains the speedups of ZoloPD over QDWHPD when implemented in parallel.
Matrix  Method  No. of Processes  

nemeth03  PDGESVD  61.45  15.97  15.14  14.85  15.57 
ZoloSVD  24.73  15.11  10.91  11.83  10.01  
Speedup  2.48  1.06  1.39  1.26  1.56  
fv1  PDGESVD  95.60  20.18  19.06  18.11  19.03 
ZoloSVD  30.65  17.58  12.82  12.65  11.54  
Speedup  3.12  1.15  1.49  1.43  1.65  
linverse  PDGESVD  144.37  29.87  26.43  25.01  32.38 
ZoloSVD  50.29  28.64  19.62  18.64  15.83  
Speedup  2.87  1.04  1.35  1.34  2.05 
The comparison results for ZoloSVD and PDGESVD are shown in Table 4. They show that ZoloSVD can be 3.12x faster than PDGESVD for matrix fv1 when using the smallest number of processes tested. Note that the numbers of processes used by PDGESVD and ZoloSVD are shown in the first row of Table 4. The comparisons of QDWHPD and ZoloPD are shown in Table 6. It turns out that ZoloPD is about 1.20x faster than QDWHPD when using many processes.
Matrix  QDWH  ZoloPD ($r=2$)  ZoloPD ($r=3$)  ZoloPD ($r=4$)
nemeth03  4  3  3  3
fv1  5  4  3  3
linverse  5  4  3  3
Example 2. The main drawback of ZoloPD is that it requires many more floating-point operations for ill-conditioned matrices if at most two iterations are allowed. One approach to this problem is to choose a small $r$. In this example we fix $r$ to a small value and let QDWHPD and ZoloPD use the same number of processes; the results are shown in Table 6, whose first row gives the numbers of processes used. From it, we see that ZoloPD is usually faster than QDWHPD when using the same number of processes. Because ZoloPD decomposes all MPI processes into relatively independent groups, it is more loosely coupled and therefore more scalable than QDWHPD, and it becomes faster than QDWHPD as the number of processes grows.
Matrix  Method  No. of Processes  

nemeth03  QDWHPD  16.51  12.02  8.14  7.36  5.55 
ZoloPD  16.56  8.55  6.42  4.79  4.44  
Speedup  1.00  1.41  1.27  1.70  1.25  
fv1  QDWHPD  16.16  12.75  8.62  7.23  5.97 
ZoloPD  22.26  10.87  7.40  6.02  5.82  
Speedup  0.73  1.17  1.16  1.20  1.03  
linverse  QDWHPD  37.63  20.25  12.97  10.15  9.35 
ZoloPD  37.40  19.07  12.43  8.75  7.92  
Speedup  1.01  1.06  1.04  1.16  1.18 
Compared with QDWHPD, ZoloPD additionally requires communications among the different subcommunicators. We profiled the ZoloPD algorithm, and Table 7 shows the time taken by each stage, where the rows labeled Combin. show the communication time spent in DGSUM2D, which computes the summation of the $r$ terms in (12), and the rows labeled FormX2 show the maximum time over all subgroups for forming the next iterate. The results were obtained for the matrix fv1 with $r = 3$; ZoloPD took three iterations, the first using the QR factorization and the other two using the Cholesky factorization. We can see that most of the time is spent in computing the QR and Cholesky factorizations, and the communication times between different subcommunicators are negligible.
No. of Proc.
Iter. 1: QR  3.42  3.03  2.07
Combin.  6.64e-02  7.21e-03  6.26e-02
FormX2  2.82e-02  1.98e-02  1.47e-02
Iter. 2: Chol.  1.95  1.38  0.89
Combin.  1.74e-02  1.84e-02  3.80e-02
FormX2  1.96e-02  1.38e-02  1.26e-02
Iter. 3: Chol.  1.89  1.22  0.75
Combin.  1.34e-02  1.20e-02  1.03e-03
FormX2  1.87e-02  1.33e-02  1.38e-02
Example 3. We use some larger, more ill-conditioned matrices to further test ZoloSVD and compare it with PDGESVD. The properties of these matrices are shown in Table 8, where rand1 and rand2 are two matrices with random Gaussian entries. These matrices include symmetric and nonsymmetric ones, sparse and dense ones. Some of them are very ill conditioned, with condition numbers on the order of $10^{11}$.
Matrix  N  nnz  Cond
bcsstk18  11,948  80,519  3.46e+11
c47  15,343  113,372  3.16e+08
c49  21,132  89,087  6.02e+08
cvxbqp1  50,000  199,984  1.09e+11
rand1  10,000  dense  3.97e+07
rand2  30,000  dense  1.24e+07
In this example we let PDGESVD and ZoloSVD use the same number of processes and fix a small $r$; the numerical results are shown in Table 9, where the first row contains the numbers of processes used and the other rows contain the speedups of QDWHSVD and ZoloSVD over PDGESVD. It turns out that ZoloSVD is always faster than PDGESVD for these matrices when using many MPI processes. It is interesting to see that ZoloSVD is also faster than PDGESVD even when using relatively few processes.
The choice of $r$ depends on the available computational resources and on the condition number of the matrix; see Table 1. The numbers of iterations taken by ZoloPD for different choices of $r$ are shown in Table 10, and they are consistent with the estimates in Table 1. The number of iterations decreases only by one when $r$ is increased from $2$ to $5$, and therefore $r = 2$ or $r = 3$ is probably a good choice. Another reason is that taking $r$ too large can lead to numerical instability Nakatsukasasimrev.
Matrix  Method  No. of Processes  

bcsstk18  QDWHSVD  1.61  0.59  0.76  0.83  1.02 
ZoloSVD  1.45  0.67  0.85  1.19  1.29  
c47  QDWHSVD  1.26  0.67  0.98  1.40  1.16 
ZoloSVD  2.33  0.92  1.07  1.51  1.29  
c49  QDWHSVD  2.01  2.52  1.05  0.97  1.00 
ZoloSVD  1.28  3.02  1.08  1.54  1.73  
cvxbqp1  QDWHSVD  1.20  1.12  1.26  1.03  1.07 
ZoloSVD  1.37  1.36  1.17  1.51  1.28  
rand1  QDWHSVD  2.22  1.09  1.11  1.03  1.15 
ZoloSVD  3.21  1.10  1.24  1.70  1.79  
rand2  QDWHSVD  2.17  2.84  1.14  1.02  1.73 
ZoloSVD  2.20  3.44  1.21  1.60  1.61 
Table 9 also shows the performance of QDWHSVD, in the rows labeled QDWHSVD, again reported as speedups over PDGESVD. We can see that ZoloSVD competes with QDWHSVD and is usually faster than both QDWHSVD and PDGESVD for these matrices. This behavior was verified for process counts going from 256 up to 4096.
4.1 Numerical Accuracy
The backward error of the overall SVD is computed as
(13) $\mathrm{err} = \frac{\|A - U\Sigma V^T\|_F}{\|A\|_2},$
where $U$ and $V$ are orthogonal, $\Sigma$ is diagonal with the computed singular values on its diagonal, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $\|\cdot\|_2$ denotes the 2-norm. We test the orthogonality of the computed singular vectors by measuring $\|I - U^T U\|_F$ and $\|I - V^T V\|_F$, scaled by the matrix dimension $n$, where $U$ and $V$ contain the left and right singular vectors, respectively.
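For reference, the following sketch shows how these accuracy measures can be evaluated (in NumPy; the exact scalings are our assumption of the formulas above):

```python
import numpy as np

def svd_accuracy(A, U, s, V):
    """Backward error (13) and orthogonality measures for a computed SVD A ~= U diag(s) V^T."""
    n = A.shape[1]
    backward = np.linalg.norm(A - U @ np.diag(s) @ V.T, 'fro') / np.linalg.norm(A, 2)
    orth_left = np.linalg.norm(np.eye(U.shape[1]) - U.T @ U, 'fro') / n
    orth_right = np.linalg.norm(np.eye(V.shape[1]) - V.T @ V, 'fro') / n
    return backward, orth_left, orth_right
```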
Matrix  $r = 2$  $r = 3$  $r = 4$  $r = 5$
bcsstk18  4  4  3  3
c47  4  4  3  3
c49  4  4  3  3
linverse  4  3  3  3
nemeth03  3  3  3  3
fv1  4  3  3  3
We compare the three methods PDGESVD, QDWHSVD, and ZoloSVD; the numerical results are shown in Figure 2. The matrices tested are bcsstk18, c47, c49, fv1, linverse, nemeth03, cvxbqp1, rand1, and rand2, numbered from 1 to 9 in the subfigures, respectively. The results illustrate that the algorithm is numerically stable and that the computed left and right singular vectors are highly orthogonal. The relative residuals, defined in (13), of the SVD computed by ZoloSVD and QDWHSVD are as accurate as those computed by PDGESVD for these matrices; the results are shown in Figure 2(a). ZoloSVD is comparable to QDWHSVD, and their accuracies are of the same order. The orthogonality of the singular vectors computed by these three methods is always at the level of machine precision. The results for the right and left singular vectors are similar, and only the results for the left singular vectors are included in Figure 2(b).
5 Conclusions
A new distributed parallel SVD algorithm, called ZoloSVD, is implemented in this work; it is based on the ZoloPD algorithm Nakatsukasasimrev and a symmetric eigenvalue decomposition algorithm. The main advantage of this algorithm is that it is highly scalable. When using more computational resources, numerical results show that ZoloSVD can be three times faster than PDGESVD. When using the same number of processes, ZoloSVD can also be faster than PDGESVD, and for some matrices it is more than two times faster. Compared with QDWHPD, the drawback of ZoloPD is that it requires many more floating-point operations; to be faster, it must be implemented in parallel. We used many sparse matrices from the University of Florida sparse matrix collection and some random dense matrices to conduct the numerical experiments. Our implementations of QDWHSVD, ZoloSVD, and the structured QR algorithms are freely available at https://github.com/shengguolsg/ZoloSVD.
Acknowledgement
This work is partially supported by the National Natural Science Foundation of China (Nos. 11401580, 91530324, 91430218, and 61402495).
References
 (1) N. I. Akhiezer. Elements of the Theory of Elliptic Functions. AMS, Providence, RI, 1990.
 (2) T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Computing, 37(12):783–794, 2011.
 (3) G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl., 21(2):562–580, 2011.
 (4) J. L. Barlow, N. Bosner, and Z. Drmač. A new stable bidiagonal reduction algorithm. Linear Algebra Appl., 397:35–84, 2005.
 (5) J. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
 (6) N. Bosner and J. L. Barlow. Block and parallel versions of onesided bidiagonalization. SIAM J. Matrix Anal. Appl., 29(3):927–953, 2007.
 (7) J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance. Computer Physics Communications, 97:1–15, 1996.
 (8) T. Davis. Direct methods for sparse linear systems. SIAM, Philadelphia, 2006.
 (9) T. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011.
 (10) S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Soc. Inf. Sci., 41:391–407, 1990.
 (11) J. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.
 (12) J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communicationoptimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput., 34:A206–A239, 2012.
 (13) J. W. Demmel and W. M. Kahan. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Comput., 11:873–912, 1990.
 (14) B. Großer and B. Lang. Efficient parallel reduction to bidiagonal form. Parallel computing, 25:969–986, 1999.
 (15) M. Faverge, J. Langou, Y. Robert, and J. Dongarra. Bidiagonalization with parallel tiled algorithms, 2016. arXiv: 1611.06892v1.
 (16) G. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM J. Numer. Anal., 2:205–224, 1965.
 (17) M. Gu and S. C. Eisenstat. A divideandconquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl., 16(1):79–92, 1995.
 (18) A. Haidar, P. Luszczek, J. Kurzak, and J. Dongarra. An improved parallel singular value algorithm and its implementation for multicore hardware. In William Gropp, editor, Proceedings of SC13, pages 90:1–90:14, Denver, CO, USA, 2013.
 (19) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
 (20) N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2008.

 (21) N. J. Higham and P. Papadimitriou. Parallel singular value decomposition via the polar decomposition. Technical Report, Numerical Analysis Report No. 239, Manchester Centre for Computational Mathematics, Manchester, England, 1993.
 (22) N. J. Higham and P. Papadimitriou. A new parallel algorithm for computing the singular value decomposition. In J. G. Lewis, editor, Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, pages 80–84, Philadelphia, 1994. SIAM.
 (23) H. Hotelling. Simplified calculation of principal components. Psychometrica, 1:27–35, 1935.
 (24) M. A. Iwen and B. W. Ong. A distributed and incremental svd algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl., 37(4):1699–1718, 2016.
 (25) Hatem Ltaief, Piotr Luszczek, and Jack Dongarra. Highperformance bidiagonal reduction using tile algorithms on homogeneous multicore architectures. ACM Trans. Mathematical Software, 39(3):16, 2013.
 (26) H. Ltaif, P. Luszczek, and J. Dongarra. High performance bidiagonal reduction using tile algorithm on homogeneous multicore architectures. ACM Transaction on Mathematical Software, 39(3):16:1–22, 2013.
 (27) A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26:1–15, 2014.
 (28) B. C. Moore. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Trans. Autom. Control, 1:AC–26, 1981.
 (29) Y. Nakatsukasa, Z. Bai, and F. Gygi. Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl., 31:2700–2720, 2010.
 (30) Y. Nakatsukasa and R. W. Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev’s functions. SIAM Review, xx(x):xx–xxx, 2016.
 (31) Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput., 35(3):A1325–A1349, 2013.
 (32) Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM J. Sci. Comput., 35(3):A1325–A1349, 2013.
 (33) R. Ralha. Onesided reduction to bidiagonal form. Linear Algebra Appl., 358:219–238, 2003.
 (34) H. D. Simon and H. Zha. Lowrank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Comput., 21:2257–2274, 2000.
 (35) D. E. Sukkari, H. Ltaief, and D. E. Keyes. A high performance QDWHSVD solver using hardware accelerators. Technical report, KAUST Repository, 2015.
 (36) D. E. Sukkari, H. Ltaief, and D. E. Keyes. High performance polar decomposition on distributed memory systems. In P. F. Dutot and D. Trystram, editors, EuroPar 2016, LNCS, volume 9833, pages 605–616, Switzerland, 2016. Springer.
 (37) P. R. Willems, B. Lang, and C. Vömel. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl., 28(4):907–926, 2006.
 (38) H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM J. Sci. Comput., 21:782–791, 1999.