An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures

12/22/2016
by   Shengguo Li, et al.

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and hierarchically semiseparable (HSS) matrices, an important class of rank-structured matrices. Most of the time of the DC algorithm is spent computing the eigenvectors via matrix-matrix multiplications (MMM). In our parallel hybrid DC (PHDC) algorithm, the MMM is accelerated by HSS matrix techniques when the intermediate matrix is large. All HSS operations are performed via the STRUMPACK package. PHDC has been tested on many different matrices. Compared with the DC implementation in MKL, PHDC can be faster for matrices with few deflations when using hundreds of processes, but the gains decrease as the number of processes increases. Comparisons of PHDC with ELPA (the Eigenvalue soLvers for Petascale Applications library) are similar. PHDC is usually slower than MKL and ELPA when using 300 or more processes on the Tianhe-2 supercomputer.


1 Introduction

Symmetric tridiagonal eigenvalue problems are usually solved by the divide and conquer (DC) algorithm, both on shared memory multicore platforms and on parallel distributed memory machines. The DC algorithm is fast and stable, and has been well studied in numerous references Cuppen81 ; BNS-Rankone ; Ipsen-TDC ; Gu-eigenvalue ; Gates-Arbenz ; Tisseur-DC . It is now the default method in LAPACK anderson1999lapack and ScaLAPACK Scalapack when the eigenvectors of a symmetric tridiagonal matrix are required.

Recently, the authors of LSG-NLAA used hierarchically semiseparable (HSS) matrices ChandrasekaranGu04 to accelerate the tridiagonal DC algorithm in LAPACK, and obtained about 6x speedups over the LAPACK implementation for some large matrices on a shared memory multicore platform. The bidiagonal and banded DC algorithms for the SVD problem have been accelerated similarly Shengguo-SIMAX2 ; Liao-camwa . The main point is that some intermediate eigenvector matrices are rank-structured ChandrasekaranGu04 ; Hackbusch1999 : HSS matrices are used to approximate them, and fast HSS algorithms are then used to update the eigenvectors. HSS is an important type of rank-structured matrix; others include H-matrices Hackbusch1999 , H²-matrices Hackbusch-Sauter2000 , quasiseparable matrices Eidelman-Gohberg1999 , and sequentially semiseparable (SSS) matrices CG-SSS-report ; Gu-DSSS . In this paper, we extend the techniques of Shengguo-SIMAX2 ; LSG-NLAA to the distributed memory environment, aiming to accelerate the tridiagonal DC algorithm in ScaLAPACK Scalapack . To integrate HSS algorithms into ScaLAPACK routines, an efficient distributed HSS construction routine and an HSS matrix multiplication routine are required. In our experiments, we use the routines in the STRUMPACK (STRUctured Matrices PACKage) package Strumpack , which is designed for computations with both sparse and dense structured matrices. The current STRUMPACK has two main components: a dense matrix computation package, and a sparse direct solver and preconditioner. In this work we only use its dense matrix part.¹ It is written in C++ using OpenMP and MPI parallelism, uses HSS matrices, and implements a parallel HSS construction algorithm with randomized sampling Martinsson-randhss10 ; Martinsson-Rev10 .
Note that some routines are available for sequential HSS algorithms Xia-random ; SSS-QR , or for parallel HSS algorithms on shared memory platforms, such as HSSPACK Liao-camwa .² But STRUMPACK is the only available package for distributed parallel HSS algorithms. More details about it and about HSS matrices are given in section 2.

¹ The current version is STRUMPACK-Dense-1.1.1, available at http://portal.nersc.gov/project/sparse/strumpack/.
² Some Fortran and Matlab codes are available at Jianlin Xia's homepage, http://www.math.purdue.edu/~xiaj/, and HSSPACK is available on GitHub.

The ScaLAPACK routine implements the rank-one update form of Cuppen's DC algorithm Cuppen81 . We briefly introduce the main process. Assume that T ∈ R^{n×n} is a symmetric tridiagonal matrix, partitioned as

T = [ T_1, β e_k e_1^T ; β e_1 e_k^T, T_2 ],   (1)

where T_1 ∈ R^{k×k}, T_2 ∈ R^{(n−k)×(n−k)}, β is the coupling off-diagonal entry, and e_j denotes the j-th unit vector. Cuppen introduced the decomposition

T = [ T̂_1, 0 ; 0, T̂_2 ] + β u u^T,   (2)

where T̂_1 and T̂_2 are T_1 and T_2 with their trailing and leading diagonal entries modified by β, and u ∈ R^n has ones at the k-th and (k+1)-th entries and zeros elsewhere. Let T̂_i = Q_i D_i Q_i^T (i = 1, 2) be eigendecompositions; then (1) can be written as

T = Q (D + β z z^T) Q^T,   (3)

where Q = diag(Q_1, Q_2), D = diag(D_1, D_2), and z = Q^T u. The problem is reduced to computing the spectral decomposition of the diagonal plus rank-one matrix

M = D + β z z^T = U Λ U^T.   (4)

By Theorem 2.1 in Cuppen81 , the eigenvalues {λ_i} of M are the roots of the secular equation

f(λ) = 1 + β ∑_{k=1}^{n} z_k² / (d_k − λ) = 0,

where z_k and d_k are the k-th component of z and the k-th diagonal entry of D, respectively, and the eigenvector corresponding to λ_i is given by (D − λ_i I)^{-1} z, suitably normalized. The eigenvectors computed naively this way may lose orthogonality. To ensure orthogonality, Sorensen and Tang Sorensen-sina91 proposed using extended precision. The implementation in ScaLAPACK, however, uses the Löwner theorem approach instead of the extended precision approach Sorensen-sina91 . The extended precision approach was used by Gates and Arbenz Gates-Arbenz in their implementation.
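The secular-equation characterization above is easy to check numerically. The following NumPy sketch (an illustration with made-up data, not part of the ScaLAPACK implementation) forms a small diagonal-plus-rank-one matrix, verifies that the eigenvalues returned by numpy.linalg.eigh annihilate the secular function, and checks that each eigenvector is parallel to (D − λ_i I)^{-1} z:

```python
import numpy as np

# Illustrative check of the secular-equation step, with made-up data.
rng = np.random.default_rng(0)
n = 8
d = np.sort(rng.uniform(0.0, 1.0, n))       # distinct diagonal entries of D
z = rng.uniform(0.5, 1.0, n)                # bounded-below entries avoid deflation
z /= np.linalg.norm(z)                      # normalized rank-one vector
beta = 0.5

M = np.diag(d) + beta * np.outer(z, z)      # M = D + beta * z z^T
lam, U = np.linalg.eigh(M)                  # reference eigendecomposition

def secular(x):
    """f(x) = 1 + beta * sum_k z_k^2 / (d_k - x)."""
    return 1.0 + beta * np.sum(z**2 / (d - x))

# The computed eigenvalues are (numerically) roots of the secular equation.
residual = max(abs(secular(l)) for l in lam)

# Each eigenvector is parallel to (D - lam_i I)^{-1} z (match up to sign).
err = 0.0
for i, l in enumerate(lam):
    u = z / (d - l)
    u /= np.linalg.norm(u)
    err = max(err, min(np.linalg.norm(u - U[:, i]), np.linalg.norm(u + U[:, i])))
```

Here `secular`, `residual`, and `err` are names chosen for this sketch only; the production codes solve the secular equation by specialized root finders rather than via a dense eigendecomposition.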

Remark 1. The extended precision approach is "embarrassingly" parallel, with each eigenvalue and eigenvector computed without communication, but it is not portable to some platforms. The Löwner approach requires information about all the eigenvalues, which requires a broadcast. However, the length of the communication message is only O(n), which is trivial compared with the communication of the eigenvectors.

The excellent performance of the DC algorithm is partially due to deflation BNS-Rankone ; Cuppen81 , which happens in two cases. If an entry z_i of z is negligible or zero, the corresponding pair (d_i, e_i) is already an eigenpair of M. Similarly, if two eigenvalues in D are identical, then one of the two corresponding entries of z can be transformed to zero by applying a plane rotation. All the deflated eigenvalues are permuted to the back of D by a permutation matrix, and so are the corresponding eigenvectors. Then, after deflation, (3) reduces to

T = Q G^T P^T [ D̄ + β z̄ z̄^T, 0 ; 0, Λ_d ] P G Q^T,   (5)

where G is the product of all rotations, P is a permutation matrix, and Λ_d contains the deflated eigenvalues.
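The second deflation case can be seen directly: when two diagonal entries of D coincide, a single Givens rotation zeroes one entry of z while leaving D diagonal. A minimal NumPy sketch with made-up numbers (the indices and values are chosen only for illustration):

```python
import numpy as np

d = np.array([0.3, 0.7, 0.7, 0.9])   # d[1] == d[2]: a repeated eigenvalue in D
z = np.array([0.5, 0.4, 0.3, 0.2])
i, j = 1, 2                          # rotate in the (i, j) plane

r = np.hypot(z[i], z[j])
c, s = z[i] / r, z[j] / r
G = np.eye(4)
G[i, i], G[i, j], G[j, i], G[j, j] = c, s, -s, c

z2 = G @ z                           # z2[j] becomes zero: component deflated
D2 = G @ np.diag(d) @ G.T            # stays diagonal because d[i] == d[j]
offdiag = np.max(np.abs(D2 - np.diag(np.diag(D2))))
```

After the rotation, (d[j], e_j) is an exact eigenpair of the rotated matrix and can be set aside, which is precisely what the deflation step exploits.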

According to (4), let D̄ + β z̄ z̄^T = Ū Λ̄ Ū^T; the eigenvector matrix of T is then computed as

Z = Q G^T P^T [ Ū, 0 ; 0, I ].   (6)

To improve efficiency, Gu Gu-thesis suggested a permutation strategy for reorganizing the data structure of the orthogonal matrices, which has been used in ScaLAPACK. The matrix in square brackets is permuted so that the first and third block columns contain the eigenvectors that have not been affected by deflation, the fourth block column contains the deflated eigenvectors, and the second block column contains the remaining columns. Then, the computation of Z can be done by two parallel matrix-matrix products (calling the PBLAS routine PDGEMM) involving parts of Q_1, Q_2 and the permuted matrix. Another factor that contributes to the excellent performance of DC is that most operations can take advantage of highly optimized matrix-matrix products.

When there are few deflations, the size of the matrix Ū in (6) is large, and most of the time of DC is spent on the matrix-matrix multiplication in (6), which is confirmed by the results of Example 2 in section 2.2. Furthermore, it is well known that the eigenvector matrix of M defined in (4) is a Cauchy-like matrix with the off-diagonal low-rank property; see Gu-eigenvalue ; LSG-NLAA . Therefore, we simply use an HSS matrix to approximate Ū and use the HSS matrix multiplication routine in STRUMPACK to compute the eigenvector matrix in (6). Since HSS matrix multiplication requires far fewer floating point operations than the plain matrix-matrix multiplication PDGEMM, this approach makes the DC algorithm in ScaLAPACK much faster. More details are given in section 2.2, and numerical results are shown in section 3.
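The off-diagonal low-rank property can be observed numerically. The sketch below (illustrative only, with a random diagonal and a random normalized z) forms M = D + β z zᵀ, computes its eigenvector matrix with NumPy, and measures the numerical rank of one off-diagonal block:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
d = np.sort(rng.uniform(0.0, 1.0, n))        # diagonal of D
z = rng.uniform(0.5, 1.0, n)                 # avoid near-deflated components
z /= np.linalg.norm(z)
M = np.diag(d) + 0.5 * np.outer(z, z)        # diagonal plus rank-one

_, U = np.linalg.eigh(M)                     # eigenvector matrix (Cauchy-like)
B = U[: n // 2, n // 2:]                     # one off-diagonal block
s = np.linalg.svd(B, compute_uv=False)
numrank = int(np.sum(s > 1e-10 * s[0]))      # numerical rank of the block
```

The singular values of B decay rapidly, so `numrank` is far smaller than the block size n/2; this is exactly the structure the HSS approximation of Ū exploits.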

2 HSS matrices and STRUMPACK

The HSS matrix is an important type of rank-structured matrix. A matrix is called rank-structured if the ranks of all its off-diagonal blocks are relatively small compared to the size of the matrix. Other rank-structured matrices include H-matrices Hackbusch1999 ; Hackbusch2000 , H²-matrices Hackbusch-Sauter2000 ; Hackbusch-Borm2002 , quasiseparable matrices Eidelman-Gohberg1999 ; Vandebril-book1 , and sequentially semiseparable (SSS) matrices Chandrasekaran03 ; ChandrasekaranGu05 . We mostly follow the notation used in Martinsson-randhss10 and Xia-Fast09 ; LSG-NLAA to introduce HSS.

Both HSS representations and algorithms rely on a tree T. For simplicity, we assume it is a binary tree, called the HSS tree; the generalization is straightforward. Assume that I = {1, 2, …, n}, where n is the dimension of the matrix A. Each node τ of T is associated with a contiguous subset I_τ of I, satisfying the following conditions:

  • I_{ν1} ∪ I_{ν2} = I_τ and I_{ν1} ∩ I_{ν2} = ∅, for a parent node τ with left child ν1 and right child ν2;

  • ∪_{τ ∈ L} I_τ = I, where L denotes the set of all leaf nodes;

  • I_{root(T)} = I, where root(T) denotes the root of T.

We assume T is postordered; that is, the ordering of a nonleaf node τ satisfies ν1 < ν2 < τ, where ν1 is its left child and ν2 is its right child. Figure 1(b) shows an HSS tree with three levels, with I_τ associated with each node τ.

Figure 1: A three-level postordered HSS tree and the HSS blocks of node 2 ((a) the matrix; (b) the HSS tree)

A block row or column excluding the diagonal block is called an HSS block row or column, denoted by

H_τ^r = A(I_τ, I \ I_τ),   H_τ^c = A(I \ I_τ, I_τ),

associated with node τ. We simply call them HSS blocks. For an HSS matrix, all HSS blocks are assumed to be numerically low-rank. Figure 1(a) shows the HSS blocks corresponding to node 2. The maximum (numerical) rank over all the HSS blocks is called the HSS rank.

Each node τ of T is associated with generators D_τ, U_τ, V_τ, B_τ, which are matrices; for a nonleaf node τ with children ν1 and ν2, U_τ and V_τ are represented recursively through translation matrices R and W,

U_τ = [ U_{ν1} R_{ν1} ; U_{ν2} R_{ν2} ],   V_τ = [ V_{ν1} W_{ν1} ; V_{ν2} W_{ν2} ].   (7)

For a leaf node τ, the generators U_τ, V_τ, and D_τ = A(I_τ, I_τ) are stored explicitly. A (block 2×2) HSS matrix can be written as

A = [ D_{ν1}, U_{ν1} B_{ν1} V_{ν2}^T ; U_{ν2} B_{ν2} V_{ν1}^T, D_{ν2} ],   (8)

and Figure 1(b) shows its corresponding postordered HSS tree. The HSS representation (7) is equivalent to the representations in Strumpack ; Hss-ulv ; ChandrasekaranGu04 ; Xia-HSS-Chol .
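To make the block 2×2 form (8) concrete, the following NumPy sketch (with randomly generated generators, purely illustrative) builds a one-level HSS-form matrix and applies it to a vector through the generators; each off-diagonal block is applied in O(mr) operations instead of the O(m²) a dense block would cost:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 200, 5                                    # block size and off-diagonal rank
D1, D2 = rng.standard_normal((m, m)), rng.standard_normal((m, m))
U1, U2 = rng.standard_normal((m, r)), rng.standard_normal((m, r))
V1, V2 = rng.standard_normal((m, r)), rng.standard_normal((m, r))
B1, B2 = rng.standard_normal((r, r)), rng.standard_normal((r, r))

# One-level (block 2x2) HSS form:  A = [[D1, U1 B1 V2^T], [U2 B2 V1^T, D2]]
A = np.block([[D1, U1 @ B1 @ V2.T], [U2 @ B2 @ V1.T, D2]])

def hss_matvec(x):
    """Apply A through its generators, never forming the off-diagonal blocks."""
    x1, x2 = x[:m], x[m:]
    y1 = D1 @ x1 + U1 @ (B1 @ (V2.T @ x2))
    y2 = D2 @ x2 + U2 @ (B2 @ (V1.T @ x1))
    return np.concatenate([y1, y2])

x = rng.standard_normal(2 * m)
err = np.linalg.norm(hss_matvec(x) - A @ x) / np.linalg.norm(A @ x)
```

In a full HSS representation the diagonal blocks D1 and D2 are themselves recursively compressed, which is where the asymptotic savings come from; this sketch shows only the bottom level.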

To take advantage of fast HSS algorithms, we first need to construct an HSS representation. There exist many HSS construction algorithms, such as those based on the SVD Hss-ulv or on rank-revealing QR (RRQR) Xia-Fast09 . Most of these algorithms cost O(rn²) flops, where n is the dimension of the matrix and r is its HSS rank. In Martinsson-randhss10 , a randomized HSS construction algorithm (RandHSS) is proposed, which combines random sampling with the interpolative decomposition (ID); see Cheng-Random ; Martinsson-Rev10 ; Martinsson-PNAS07 . The cost of RandHSS can be reduced to O(r²n) flops if there exists a fast matrix-vector multiplication algorithm of O(n) flops. STRUMPACK also uses this algorithm to construct HSS matrices. For completeness, RandHSS is restated in Algorithm 2.

Algorithm 2 (Randomized HSS construction, RandHSS). Given a matrix A and integers r and p, generate two Gaussian random matrices Ω and Ψ with r + p columns each, and compute the sample matrices S = A Ω and T = Aᵀ Ψ.

do, for each level from the bottom of the HSS tree to the top

for each node τ at the current level

if τ is a leaf node,

  1. extract the diagonal block D_τ = A(I_τ, I_τ);

  2. compute the local samples of the HSS block row and column of τ by subtracting the contribution of D_τ from S and T;

  3. compute the IDs of the local samples, which yield the generators U_τ and V_τ;

  4. compress the local samples and the random matrices for use at the parent level;

else

  1. store the generators R and W obtained from the IDs of the children of τ;

  2. merge the compressed samples of the two children;

  3. compute the IDs of the merged samples;

  4. update the compressed samples and the random matrices for the parent level;

end if

end for

end do

For the root node τ with children ν1 and ν2, store B_{ν1} and B_{ν2}.

The parameter r in Algorithm 2 is an estimate of the HSS rank of A, which is chosen adaptively in STRUMPACK Strumpack , and p is the oversampling parameter, usually a small constant; see Martinsson-Rev10 . After the matrix is represented in its HSS form, there exist fast algorithms for multiplying it with a vector in O(rn) flops Lyons-thesis ; Chandrasekaran03 . Therefore, multiplying an HSS matrix with another n × n matrix costs only O(rn²) flops, whereas the general matrix-matrix multiplication algorithm PDGEMM costs O(n³) flops.

2.1 Strumpack

STRUMPACK is an abbreviation of STRUctured Matrices PACKage, designed for computations with sparse and dense structured matrices. It is based on parallel HSS algorithms using randomization Strumpack ; Strumpack-sparse .

Compared with ScaLAPACK, STRUMPACK requires more memory: besides the original matrix, it also stores the random vectors, the samples, and the generators of the HSS matrix. The memory overhead increases as the HSS rank increases; therefore, STRUMPACK is suitable for matrices with low off-diagonal ranks. For our eigenvalue problems, we are fortunate that the HSS rank of the intermediate eigenvector matrices appearing in the DC algorithm is not large, usually less than 100. Throughout the experiments in section 3, we choose the HSS compression threshold small enough to keep the orthogonality of the computed eigenvectors. One advantage of STRUMPACK is that it requires far fewer operations than the classical algorithms, by exploiting the off-diagonal low-rank property.

Another property of STRUMPACK, shared with other packages that exploit low-rank structures (H-matrices Hlib and Block Low-Rank representations), is its irregular computational pattern: it deals with irregular and imbalanced task flows and manipulates a collection of small matrices instead of one large one. HSS algorithms have a lower asymptotic complexity, but the flop rate with HSS is often lower than with traditional matrix-matrix multiplications (BLAS3 kernels). We therefore expect HSS algorithms to perform well for large problems, and we only use HSS algorithms when the problem size is large enough, just as in Shengguo-SIMAX2 ; LSG-NLAA .

The HSS construction algorithm in STRUMPACK uses randomized sampling combined with the interpolative decomposition (ID) Martinsson-PNAS07 ; Cheng-Random , first proposed in Martinsson-randhss10 . For this randomized algorithm, the HSS rank needs to be estimated in advance, which is difficult to do accurately. An adaptive sampling mechanism is implemented in STRUMPACK; its basic idea Strumpack is "to start with a low number of random vectors, and whenever the rank found during the interpolative decomposition is too large, the number of random vectors is increased". To control the communication cost, we use slightly more sample vectors and choose a relatively large value of inc_rand_HSS, a parameter for constructing HSS matrices in STRUMPACK.

In this work, we mainly use two STRUMPACK driver routines: the parallel HSS construction routine and the parallel HSS matrix multiplication routine. We use these two routines to replace the general matrix-matrix multiplication routine PDGEMM, in the hope of achieving good speedups for large matrices. We test the efficiency and scalability of STRUMPACK using matrices that first appeared in the test routines of STRUMPACK Strumpack .

Example 1. We use two Toeplitz matrices, which have also been used in Strumpack . The first one is diagonally dominant and yields a very low HSS rank (about 2). The second one is a kinetic energy matrix from quantum chemistry Strumpack , defined in terms of a discretization parameter; it has a slightly larger HSS rank.

                 Number of processes (increasing from left to right)
PDGEMM          185.33   53.58   13.42   8.02   3.79   1.89
Mat1  Const       6.18    2.39    1.21   1.24   1.29   2.89
      Multi      16.64   11.65    3.72   2.46   1.78   1.75
      Speedup     8.12    3.82    2.72   2.17   1.23   0.41
Mat2  Const       4.57    1.80    0.96   1.20   1.23   2.68
      Multi      15.42   11.10    3.46   2.29   1.71   1.58
      Speedup     9.27    4.15    3.04   2.30   1.29   0.44
Table 1: The comparison of HSS matrix construction (Const) and multiplication (Multi) with PDGEMM

From the results in Table 1, the HSS matrix multiplication implemented in STRUMPACK can be more than 1.2x faster than PDGEMM when using around 256 processes, and the speedups are even better with fewer processes. However, the HSS construction and matrix multiplication algorithms are not as scalable as PDGEMM, and they become slower than PDGEMM when using more processes, for example more than 676. This suggests using no more than 256 processes when combining STRUMPACK with ScaLAPACK. The execution time depends strongly on several factors, such as the parameters NB, block_HSS, etc. The results in Table 1 were obtained with fixed choices of these parameters; we did not try to find the optimal ones.

2.2 Combining DC with HSS

In this section we give more details on combining HSS matrix techniques with ScaLAPACK. As mentioned before, the central idea is to replace PDGEMM by the HSS matrix multiplication algorithm. The eigenvectors are updated in the ScaLAPACK routine PDLAED1, and we therefore modify it to call STRUMPACK routines instead of PDGEMM.

Note that after applying permutations to the matrix in square brackets in (6), the matrix Ū must also be permuted accordingly. From the results in Gu-eigenvalue ; LSG-NLAA ; Shengguo-SIMAX2 , we know that Ū is a Cauchy-like matrix with the off-diagonal low-rank property and modest numerical rank. When combining with HSS, we do not use Gu's permutation idea, since the permutation may destroy the off-diagonal low-rank structure of Ū in (6). We therefore modify the ScaLAPACK routine PDLAED2 so that HSS techniques are used only when the size of the deflated matrix in (5) is large enough; otherwise Gu's idea is used. The threshold size depends on the architecture of the particular parallel computer and may differ from computer to computer.

Most of the work of DC is spent on the matrix-matrix multiplications at the top two levels. The top level alone takes nearly 50% of the total time when using 256 processes or fewer; see the results in Table 2 and also the results in Elpa . Therefore, we can expect at most a 2x speedup from replacing PDGEMM with parallel HSS matrix multiplications.

In this subsection, we first use an example to show the off-diagonal low-rank property of the eigenvector matrix of M in (4). We take a diagonal matrix D and a normalized random vector z in (4). The singular values of an off-diagonal block of the resulting eigenvector matrix are illustrated in Figure 2(b): they decay very quickly.

Figure 2: The off-diagonal low-rank property ((a) the matrix; (b) singular values of an off-diagonal block)

Example 2. We use a symmetric tridiagonal matrix to show the percentage of time spent on the top-level matrix-matrix multiplications. The diagonal entries of the tridiagonal matrix are all two and its off-diagonal entries are all one.

              Number of processes (increasing from left to right)
Top level     59.80   44.13   30.29   16.92    9.23    4.77   1.09   0.53
Total        108.04   77.06   52.78   30.04   17.70   10.00   6.20   6.22
Percent (%)   55.35   57.2    57.39   56.32   52.15   47.70  17.58   8.52
Table 2: The percentage of time spent on the top-level matrix-matrix multiplications

The results in Table 2 were obtained using the optimization flags "-O2 -C -qopenmp -mavx" and linking with the multi-threaded Intel MKL. From these results, we can see that the top-level matrix-matrix multiplications can take half of the total time of PDSTEDC in some cases. Since PDGEMM in MKL has very good scalability, the percentage taken by the top-level matrix multiplications decreases as the number of processes increases. This example also implies that it is better not to use more than 256 processes in our numerical experiments.

3 Numerical results

All the results are obtained on the Tianhe-2 supercomputer Liao-TH2 ; Liao-HPCG , located in Guangzhou, China. It employs an accelerator-based architecture; each compute node is equipped with two Intel Xeon E5-2692 CPUs and three Intel Xeon Phi accelerators based on the many-integrated-core (MIC) architecture. In our experiments we only use the CPU cores.

Example 3. We use some 'difficult' matrices A880 for the DC algorithm, for which few or no eigenvalues are deflated. Examples include the Clement-type, Hermite-type and Toeplitz-type matrices, which are defined as follows.

The Clement-type matrix A880 is a symmetric tridiagonal matrix with zero diagonal whose off-diagonal entries are

β_i = sqrt(i(n − i)), i = 1, …, n − 1.   (9)

The Hermite-type matrix A880 is the Jacobi matrix associated with the Hermite polynomials, a symmetric tridiagonal matrix with zero diagonal whose off-diagonal entries follow the Hermite three-term recurrence,

β_i = sqrt(i/2), i = 1, …, n − 1.   (10)

The Toeplitz-type matrix A880 is a symmetric tridiagonal Toeplitz matrix, with constant diagonal entries and constant off-diagonal entries.   (11)

For the strong scaling results, we use HSS techniques only when the size of the secular equation exceeds a threshold. The results for strong scaling are shown in Figure 3(a). The speedups of PHDC over ScaLAPACK are reported in Table 3 and shown in Figure 3(b). We can see that PHDC is up to about 2.4x faster than PDSTEDC in MKL when using 120 processes or fewer.

            Number of processes (increasing from left to right)
Clement    2.45   1.91   1.58   1.42   1.14
Hermite    1.92   1.43   1.30   1.22   1.00
Toeplitz   2.30   1.92   1.58   1.43   1.26
Table 3: The strong scaling of PHDC compared with Intel MKL
Figure 3: The results for the difficult matrices ((a) the strong scaling of PHDC; (b) the speedups of PHDC over ScaLAPACK)

The orthogonality of the eigenvectors computed by PHDC is of the same order as that of the eigenvectors computed by ScaLAPACK, as shown in Table 4. The loss of orthogonality of a computed eigenvector matrix Q is measured by max|QᵀQ − I|, where max|·| denotes the maximum absolute value of the entries.

            Number of processes (increasing from left to right)
Clement    -   -   -   -   -
Hermite    -   -   -   -   -
Toeplitz   -   -   -   -   -
Table 4: The orthogonality of the eigenvectors computed by PHDC
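The orthogonality measure max|QᵀQ − I| used in Table 4 can be computed in a few lines; the following NumPy sketch applies it to the (1, 2, 1) tridiagonal matrix of Example 2 (a small illustration, not the distributed code):

```python
import numpy as np

n = 100
# Tridiagonal matrix of Example 2: diagonal entries 2, off-diagonal entries 1.
T = (2.0 * np.eye(n)
     + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))

_, Q = np.linalg.eigh(T)                     # computed eigenvector matrix
loss = np.max(np.abs(Q.T @ Q - np.eye(n)))   # max |Q^T Q - I|
```

For a well-behaved eigensolver this quantity is a small multiple of machine precision.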

Example 4. In Tygert-SHT2 , Tygert shows that the spherical harmonic transform (SHT) can be accelerated using the tridiagonal DC algorithm. The associated symmetric tridiagonal matrix is defined in (12) of Tygert-SHT2 ; we refer the reader there for its entries. The execution times of PHDC and PDSTEDC for this matrix are reported in Table 5.

            Number of processes (increasing from left to right)
PDSTEDC    475.00   139.09   39.66   25.82   16.87
PHDC       224.97    88.58   28.53   20.12   14.71
Speedup      2.11     1.57    1.39    1.28    1.15
Table 5: The execution time of PHDC and PDSTEDC for SHT

3.1 Comparison with other implementations

Unlike ScaLAPACK, the ELPA routines Elpa ; elpa-library do not rely on BLACS; all communication between processors is handled by direct calls to an MPI library. ELPA stands for Eigenvalue soLvers for Petascale Applications. For its communications, ELPA relies on two separate sets of MPI communicators: row communicators and column communicators (connecting the processors that handle the same rows or the same columns, respectively). For the tridiagonal eigensolver, ELPA implements its own matrix-matrix multiplications and does not use the PBLAS routine PDGEMM. It is known that ELPA has better scalability and is faster than MKL Elpa ; elpa-library .

Example 5. We use the same matrices as in Example 3 to test ELPA and compare it with the newly proposed PHDC algorithm. The running times of ELPA are shown in Table 6, and Figure 4 shows the speedups of PHDC over ELPA. PHDC is faster than ELPA when using fewer processes, since it requires fewer floating point operations. However, its scalability is worse than ELPA's, and PHDC becomes slower than ELPA when using more than 200 processes.

            Number of processes (increasing from left to right)
Clement    487.63   137.43   40.36   25.61   12.49
Hermite    363.94   117.61   34.58   21.18   10.89
Toeplitz   509.32   141.57   39.55   25.62   12.67
Table 6: The execution time of ELPA for different matrices
Figure 4: The speedups of PHDC over ELPA for the difficult matrices

4 Conclusions

By combining ScaLAPACK with STRUMPACK, we propose a hybrid tridiagonal DC algorithm (PHDC) for symmetric eigenvalue problems, which can be faster than the classical DC algorithm implemented in ScaLAPACK when using up to about 200 processes. The central idea is to replace PDGEMM by the HSS matrix multiplication algorithm, since HSS matrix algorithms require fewer flops than PDGEMM. Numerical results show that the scalability of the HSS matrix algorithms is not as good as that of PDGEMM, and the proposed PHDC algorithm becomes slower than the classical DC algorithm when using more processes.

Acknowledgement

The authors would like to acknowledge many helpful discussions with Xiangke Liao, Sherry Li and Shuliang Lin. This work is partially supported by National Natural Science Foundation of China (Nos. 11401580, 91530324, 91430218 and 61402495).

References

  • (1) E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
  • (2) T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Computing, 37(12):783–794, 2011.
  • (3) S. Börm and L. Grasedyck. HLib: a library for H- and H²-matrices, 1999.
  • (4) J. R. Bunch, C. P. Nielsen, and D. C. Sorensen. Rank one modification of the symmetric eigenproblem. Numer. Math., 31:31–48, 1978.
  • (5) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A. J. van der Veen, and D. White. Fast stable solvers for sequentially semi-separable linear systems of equations and least squares problems. Technical report, University of California, Berkeley, CA, 2003.
  • (6) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A. J. van der Veen, and D. White. Some fast algorithms for sequentially semiseparable representation. SIAM J. Matrix Anal. Appl., 27:341–364, 2005.
  • (7) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, and A. J. van der Veen. Fast stable solvers for sequentially semi-separable linear systems of equations. Technical report, UC, Berkeley, CA, 2003.
  • (8) S. Chandrasekaran, M. Gu, and T. Pals. Fast and stable algorithms for hierarchically semi-separable representations. Technical report, University of California, Berkeley, CA, 2004.
  • (9) S. Chandrasekaran, M. Gu, and T. Pals. A fast ULV decomposition solver for hierarchical semiseparable representations. SIAM J. Matrix Anal. Appl., 28:603–622, 2006.
  • (10) S. Chandrasekaran, M. Gu, J. Xia, and J. Zhu. A fast QR algorithm for companion matrices. In I. A. Ball et al., editors, Recent Advances in Matrix and Operator Theory, pages 111–143. Birkhäuser, Basel, 2008.
  • (11) H. Cheng, Z. Gimbutas, P. Martinsson, and V. Rokhlin. On the compression of low rank matrices. SIAM J. Sci. Comput., 26(4):1389–1404, 2005.
  • (12) J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. Scalapack: A portable linear algebra library for distributed memory computers-design issues and performance. Computer Physics Communications, 97:1–15, 1996.
  • (13) J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer. Math., 36:177–195, 1981.
  • (14) Y. Eidelman and I. Gohberg. On a new class of structured matrices. Integral Equations and Operator Theory, 34:293–324, 1999.
  • (15) K. Gates and P. Arbenz. Parallel divide and conquer algorithms for the symmetric tridiagonal eigenproblem. Technical report, Institute for Scientific Computing, ETH Zurich, Zurich, Switzerland, 1994.
  • (16) P. Ghysels, X. Li, and F. Rouet S. Williams. An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling, 2014. Submitted to SIAM J. Sci. Comput.
  • (17) M. Gu. Studies in Numerical Linear Algebra. PhD thesis, Yale University, New Haven, CT, 1993.
  • (18) M. Gu and S. C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl., 16:172–191, 1995.
  • (19) M. Gu, X. S. Li, and P. S. Vassilevski. Direction-preserving and schur-monotonic semiseparable approximations of symmetric positive definite matrices. SIAM J. Matrix Anal. Appl., 31:2650–2664, 2010.
  • (20) W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62:89–108, 1999.
  • (21) W. Hackbusch and S. Börm. Data-sparse approximation by adaptive H²-matrices. Computing, 69:1–35, 2002.
  • (22) W. Hackbusch, B. Khoromskij, and S. Sauter. On H²-matrices. In H. Bungartz, R. H. W. Hoppe, and C. Zenger, editors, Lectures on Applied Mathematics, pages 9–29, Berlin, 2000. Springer.
  • (23) W. Hackbusch and B. N. Khoromskij. A sparse matrix arithmetic based on H-matrices. Part II: Application to multi-dimensional problems. Computing, 64:21–47, 2000.
  • (24) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
  • (25) I. Ipsen and E. Jessup. Solving the symmetric tridiagonal eigenvalue problem on the hypercube. SIAM J. Sci. Statist. Comput., 11:203–229, 1990.
  • (26) S. Li, M. Gu, L. Cheng, X. Chi, and M. Sun. An accelerated divide-and-conquer algorithm for the bidiagonal SVD problem. SIAM J. Matrix Anal. Appl., 35(3):1038–1057, 2014.
  • (27) S. Li, X. Liao, J. Liu, and H. Jiang. New fast divide-and-conquer algorithm for the symmetric tridiagonal eigenvalue problem. Numer. Linear Algebra Appl., 23:656–673, 2016.
  • (28) X. Liao, S. Li, L. Cheng, and M. Gu. An improved divide-and-conquer algorithm for the banded matrices with narrow bandwidths. Comput. Math. Appl., 71:1933–1943, 2016.
  • (29) X. Liao, L. Xiao, C. Yang, and Y. Lu. Milkyway-2 supercomputer: System and application. Frontiers of Computer Science, 8(3):345–356, 2014.
  • (30) E. Liberty, F. Woolfe, P. G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. PNAS, 104(51):20167–20172, 2007.
  • (31) Y. Liu, C. Yang, F. Liu, X. Zhang, Y. Lu, Y. Du, C. Yang, M. Xie, and X. Liao. 623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores. International Journal of High Performance Computing Applications, 30(1):39–54, 2016.
  • (32) W. Lyons. Fast algorithms with applications to PDEs. PhD thesis, University of California, Santa Barbara, 2005.
  • (33) A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26:1–15, 2014.
  • (34) O. A. Marques, C. Voemel, J. W. Demmel, and B. N. Parlett. Algorithm 880: A testing infrastructure for symmetric tridiagonal eigensolvers. ACM Trans. Math. Softw., 35:8:1–13, 2008.
  • (35) P. G. Martinsson. A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl., 32:1251–1274, 2011.
  • (36) F. Rouet, X. Li, P. Ghysels, and A. Napov. A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Transactions on Mathematical Software, 42(4):27:1–27:35, 2016.
  • (37) D. C. Sorensen and P. T. P. Tang. On the orthogonality of eigenvectors computed by divide-and-conquer techniques. SIAM J. Numer. Anal., 28:1752–1775, 1991.
  • (38) F. Tisseur and J. Dongarra. A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures. SIAM J. Sci. Comput., 20(6):2223–2236, 1999.
  • (39) M. Tygert. Fast algorithms for spherical harmonic expansions, ii. Journal of Computational Physics, 227:4260–4279, 2008.
  • (40) R. Vandebril, M. Van Barel, and N. Mastronardi. Matrix Computations and Semiseparable Matrices, Volume I: Linear Systems. Johns Hopkins University Press, 2008.
  • (41) J. Xia, S. Chandrasekaran, M. Gu, and X.S. Li. Fast algorithm for hierarchically semiseparable matrices. Numer. Linear Algebra Appl., 17:953–976, 2010.
  • (42) J. Xia and M. Gu. Robust approximate Cholesky factorization of rank-structured symmetric positive definite matrices. SIAM J. Matrix Anal. Appl., 31:2899–2920, 2010.
  • (43) J. Xia, Y. Xi, and M. Gu. A superfast structured solver for Toeplitz linear systems via randomized sampling. SIAM J. Matrix Anal. Appl., 33:837–858, 2012.