1 Introduction
The symmetric tridiagonal eigenvalue problem is usually solved by the divide and conquer (DC) algorithm, both on shared-memory multicore platforms and on distributed-memory parallel machines. The DC algorithm is fast and stable, and well-studied in numerous references Cuppen81; BNSRankone; IpsenTDC; Gueigenvalue; GatesArbenz; TisseurDC. It is now the default method in LAPACK anderson1999lapack and ScaLAPACK Scalapack when the eigenvectors of a symmetric tridiagonal matrix are required.
Recently, the authors of LSGNLAA used hierarchically semiseparable (HSS) matrices ChandrasekaranGu04 to accelerate the tridiagonal DC algorithm in LAPACK, and obtained about 6x speedups over the LAPACK routine for some large matrices on a shared-memory multicore platform. The bidiagonal and banded DC algorithms for the SVD problem were accelerated similarly ShengguoSIMAX2; Liaocamwa. The main point is that some intermediate eigenvector matrices are rank-structured matrices ChandrasekaranGu04; Hackbusch1999: HSS matrices are used to approximate them, and fast HSS algorithms are then used to update the eigenvectors. HSS is an important type of rank-structured matrix; others include H-matrices Hackbusch1999, H^2-matrices HackbuschSauter2000, quasiseparable matrices EidelmanGohberg1999 and sequentially semiseparable (SSS) matrices CGSSSreport; GuDSSS. In this paper, we extend the techniques used in ShengguoSIMAX2; LSGNLAA to the distributed-memory environment, aiming to accelerate the tridiagonal DC algorithm in ScaLAPACK Scalapack. To integrate HSS algorithms into ScaLAPACK routines, an efficient distributed HSS construction routine and an HSS matrix multiplication routine are required. In our experiments, we use the routines in the STRUMPACK (STRUctured Matrices PACKage) package Strumpack, which is designed for computations with both sparse and dense structured matrices. The current STRUMPACK has two main components: a dense matrix computation package, and a sparse direct solver and preconditioner. In this work we only use its dense matrix operation part (the current version is STRUMPACK-Dense 1.1.1, available at http://portal.nersc.gov/project/sparse/strumpack/). It is written in C++ with OpenMP and MPI parallelism, uses HSS matrices, and implements a parallel HSS construction algorithm with randomized sampling Martinssonrandhss10; MartinssonRev10.
Note that some codes are available for sequential HSS algorithms Xiarandom; SSSQR and for parallel HSS algorithms on shared-memory platforms, such as HSSPACK Liaocamwa (some Fortran and Matlab codes are available at Jianlin Xia's homepage, http://www.math.purdue.edu/~xiaj/, and HSSPACK is available on GitHub). But STRUMPACK is the only package available for distributed-memory parallel HSS algorithms. More details about it and about HSS matrices are given in section 2.
The ScaLAPACK routine PDSTEDC implements the rank-one update version of Cuppen's DC algorithm Cuppen81. We briefly introduce the main steps. Assume that T is an n x n symmetric tridiagonal matrix,

(1) T = \begin{pmatrix} a_1 & b_1 & & \\ b_1 & a_2 & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_n \end{pmatrix}.

Cuppen introduced the decomposition

(2) T = \begin{pmatrix} T_1 & \\ & T_2 \end{pmatrix} + \beta v v^T,

where \beta = b_k and v = (e_k^T, e_1^T)^T is the vector with ones at the kth and (k+1)st entries and zeros elsewhere; T_1 and T_2 are the leading k x k and trailing (n-k) x (n-k) diagonal blocks of T with their corner entries a_k and a_{k+1} reduced by \beta. Let T_i = Q_i D_i Q_i^T (i = 1, 2) be eigendecompositions; then (1) can be written as

(3) T = Q (D + \beta z z^T) Q^T,

where Q = diag(Q_1, Q_2), D = diag(D_1, D_2), and z = Q^T v. The problem is reduced to computing the spectral decomposition of the diagonal plus rank-one matrix

(4) M = D + \beta z z^T.
By Theorem 2.1 in Cuppen81, the eigenvalues \lambda_i of the matrix M are the roots of the secular equation

f(\lambda) = 1 + \beta \sum_{k=1}^{n} \frac{z_k^2}{d_k - \lambda} = 0,

where z_k and d_k are the kth component of z and the kth diagonal entry of D, respectively, and the eigenvector corresponding to \lambda_i is given by q_i = (D - \lambda_i I)^{-1} z / \|(D - \lambda_i I)^{-1} z\|_2. The eigenvectors simply computed this way may lose orthogonality. To ensure orthogonality, Sorensen and Tang Sorensensina91 proposed to use extended precision. The implementation in ScaLAPACK, however, uses the Löwner theorem approach instead of the extended precision approach Sorensensina91. The extra precision approach was used by Gates and Arbenz GatesArbenz in their implementation.
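As a concrete illustration, the secular equation can be solved interval by interval with any safeguarded root finder. The sketch below uses illustrative data and a generic bracketing solver; it is not the ScaLAPACK implementation, which relies on the specialized LAPACK solver DLAED4.

```python
import numpy as np
from scipy.optimize import brentq

def secular_eig(d, z, rho, eps=1e-9):
    """Spectral decomposition of diag(d) + rho*z*z^T (d strictly increasing,
    rho > 0, all z_k nonzero) via the roots of the secular equation
    f(lam) = 1 + rho * sum_k z_k^2 / (d_k - lam)."""
    n = len(d)
    f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))
    # For rho > 0 the i-th root lies in (d_i, d_{i+1}); the largest root
    # lies in (d_n, d_n + rho*||z||^2].
    right = np.append(d[1:] - eps, d[-1] + rho * (z @ z) + eps)
    lam = np.array([brentq(f, d[i] + eps, right[i]) for i in range(n)])
    Q = z[:, None] / (d[:, None] - lam[None, :])  # columns (D - lam_i I)^{-1} z
    Q /= np.linalg.norm(Q, axis=0)                # normalize each eigenvector
    return lam, Q

d = np.arange(1.0, 9.0)                           # distinct poles d_1 < ... < d_8
z = np.array([0.3, -0.5, 0.7, 0.2, -0.4, 0.6, -0.1, 0.8])
lam, Q = secular_eig(d, z, 1.0)
```

The result can be checked against a dense symmetric eigensolver; with well-separated poles the residuals are tiny, which is exactly the situation where the naive eigenvector formula is harmless.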
Remark 1. The extra precision approach is "embarrassingly" parallel, with each eigenvalue and eigenvector computed without communication, but it is not portable on some platforms. The Löwner approach requires information about all the eigenvalues, requiring a broadcast. However, the length of the communication message is only O(n), which is trivial compared with the communication of the eigenvectors.
The excellent performance of the DC algorithm is partially due to deflation BNSRankone; Cuppen81, which happens in two cases. If an entry z_i of z is negligible or zero, the corresponding pair (d_i, e_i) is already an eigenpair of M. Similarly, if two eigenvalues in D are identical, then one of the corresponding entries of z can be transformed to zero by applying a sequence of plane rotations. All the deflated eigenvalues are permuted to the back of D by a permutation matrix, and so are the corresponding eigenvectors. Then, after deflation, (3) reduces to

(5) T = Q G P \begin{pmatrix} \hat{D} + \beta \hat{z} \hat{z}^T & \\ & \Omega \end{pmatrix} P^T G^T Q^T,

where G is the product of all plane rotations, P is a permutation matrix, and \Omega is a diagonal matrix containing the deflated eigenvalues.
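The second deflation case can be sketched numerically: when two diagonal entries of D coincide, a plane rotation zeroes one entry of z without changing D (the data below is illustrative).

```python
import numpy as np

# When two diagonal entries of D coincide (here d[1] == d[2]), a rotation G
# in the (i, j) plane zeroes z[j] while leaving D = diag(d) unchanged,
# because the corresponding 2x2 block of D is a multiple of the identity.
d = np.array([1.0, 3.0, 3.0, 7.0])
z = np.array([0.5, 0.6, 0.8, 0.2])
rho = 0.9
i, j = 1, 2
r = np.hypot(z[i], z[j])
c, s = z[i] / r, z[j] / r
G = np.eye(4)
G[[i, i, j, j], [i, j, i, j]] = c, s, -s, c     # Givens rotation

z2 = G @ z                                      # z2[j] vanishes (up to rounding)
A = np.diag(d) + rho * np.outer(z, z)
# G A G^T = diag(d) + rho * z2 z2^T, so (d[j], e_j) is an eigenpair of the
# rotated matrix and the secular problem deflates to size n - 1.
A2 = G @ A @ G.T
```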
According to (4), the spectral decomposition \hat{D} + \beta \hat{z} \hat{z}^T = \hat{S} \hat{\Lambda} \hat{S}^T is computed via the secular equation, and the eigenvectors of T are obtained as

(6) X = (Q G P) \begin{pmatrix} \hat{S} & \\ & I \end{pmatrix}.

To improve efficiency, Gu Guthesis suggested a permutation strategy for reorganizing the data structure of the orthogonal matrices, which has been adopted in ScaLAPACK. The matrix Q G P in (6) is permuted into four block columns: the first and third block columns contain the eigenvectors of the two subproblems that have not been affected by deflation, the fourth block column contains the deflated eigenvectors, and the second block column contains the remaining columns. Then the computation of (6) can be done by two parallel matrix-matrix products (calling the PBLAS routine PDGEMM) involving parts of \hat{S} and the nonzero blocks of Q_1 and Q_2. Another factor that contributes to the excellent performance of DC is that most operations can take advantage of highly optimized matrix-matrix products.
When there are few deflations, the size of the matrix \hat{S} in (6) is large, and most of the DC time is spent on the matrix-matrix multiplication in (6); this is confirmed by the results of Example 2 in section 2.2. Furthermore, it is well known that the eigenvector matrix of M in (4) is a Cauchy-like matrix with the off-diagonal low-rank property, see Gueigenvalue; LSGNLAA. Therefore, we simply use an HSS matrix to approximate \hat{S} and use the HSS matrix multiplication routine in STRUMPACK to compute the eigenvector matrix in (6). Since HSS matrix multiplication requires far fewer floating-point operations than the plain matrix-matrix multiplication PDGEMM, this approach makes the DC algorithm in ScaLAPACK much faster. More details are given in section 2.2, and numerical results are shown in section 3.
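The off-diagonal low-rank property is easy to observe numerically. In the following sketch (toy size and data, not the matrices used in the later experiments), the singular values of an off-diagonal block of the unnormalized eigenvector matrix of D + z z^T decay geometrically:

```python
import numpy as np

n = 200
rng = np.random.default_rng(5)
d = (np.arange(n) + 1.0) / n                   # distinct diagonal entries
z = rng.standard_normal(n)
z /= np.linalg.norm(z)                         # normalized rank-one vector
lam = np.linalg.eigvalsh(np.diag(d) + np.outer(z, z))
# Unnormalized eigenvector matrix: entry (i, j) = z_i / (d_i - lam_j),
# a Cauchy-like matrix (the lam_j strictly interlace the d_i).
C = z[:, None] / (d[:, None] - lam[None, :])
s = np.linalg.svd(C[: n // 2, n // 2 :], compute_uv=False)
numrank = int(np.sum(s > 1e-6 * s[0]))         # far smaller than n/2
```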
2 HSS matrices and STRUMPACK
The HSS matrix is an important type of rank-structured matrix. A matrix is called rank-structured if the ranks of all its off-diagonal blocks are relatively small compared to the size of the matrix. Other rank-structured matrices include H-matrices Hackbusch1999; Hackbusch2000, H^2-matrices HackbuschSauter2000; HackbuschBorm2002, quasiseparable matrices EidelmanGohberg1999; Vandebrilbook1, and sequentially semiseparable (SSS) matrices Chandrasekaran03; ChandrasekaranGu05. We mostly follow the notation used in Martinssonrandhss10 and XiaFast09; LSGNLAA to introduce HSS matrices.
Both HSS representations and algorithms rely on a tree T. For simplicity, we assume it is a binary tree, named the HSS tree; the generalization is straightforward. Assume that I = {1, 2, ..., n}, where n is the dimension of the matrix A. Each node ν of T is associated with a contiguous subset I_ν of I, satisfying the following conditions:

I_{ν1} ∪ I_{ν2} = I_ν and I_{ν1} ∩ I_{ν2} = ∅, for a parent node ν with left child ν1 and right child ν2;

∪_{ν∈L} I_ν = I, where L denotes the set of all leaf nodes;

I_{root(T)} = I, where root(T) denotes the root of T.

We assume T is postordered. That means the ordering of a nonleaf node ν satisfies ν1 < ν2 < ν, where ν1 is its left child and ν2 is its right child. Figure 1(a) shows an HSS tree with three levels and the index set I_ν associated with each node ν.
A block row or column excluding the diagonal block is called an HSS block row or column, denoted by

A_ν^r = A(I_ν, I \ I_ν),   A_ν^c = A(I \ I_ν, I_ν),

associated with node ν. We simply call them HSS blocks. For an HSS matrix, all HSS blocks are assumed to be numerically low-rank. Figure 1(b) shows the HSS blocks corresponding to node 2. The maximum (numerical) rank over all the HSS blocks is called the HSS rank.
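As a toy illustration of HSS blocks and the HSS rank, one can scan the numerical ranks of all HSS block rows. The kernel below is chosen because it is exactly semiseparable (every HSS block row has rank at most two); it is unrelated to the DC matrices discussed later.

```python
import numpy as np

# A_ij = exp(-|x_i - x_j|) is semiseparable: each HSS block row
# A(I_v, I \ I_v) has rank <= 2 (rank <= 1 to the left of the diagonal
# block plus rank <= 1 to the right, since exp(-|x-y|) = exp(-x)exp(y)
# for x > y).
n = 256
x = np.linspace(0.0, 4.0, n)
A = np.exp(-np.abs(x[:, None] - x[None, :]))

def numrank(B, tol=1e-12):
    s = np.linalg.svd(B, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# HSS block rows for the leaves and level-1 nodes of a two-level partition
ranks = [numrank(np.hstack([A[lo:hi, :lo], A[lo:hi, hi:]]))
         for lo, hi in [(0, 64), (64, 128), (128, 192), (192, 256),
                        (0, 128), (128, 256)]]
# max(ranks) is the (numerical) HSS rank of A for this partition
```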
Each node ν of T is associated with four generators U_ν, V_ν, R_ν and W_ν, which are matrices such that, for a nonleaf node ν with children ν1 and ν2,

(7) U_ν = \begin{pmatrix} U_{ν1} R_{ν1} \\ U_{ν2} R_{ν2} \end{pmatrix}, \quad V_ν = \begin{pmatrix} V_{ν1} W_{ν1} \\ V_{ν2} W_{ν2} \end{pmatrix}.

For a leaf node ν, the generators U_ν and V_ν are stored explicitly and D_ν = A(I_ν, I_ν). Together with the matrices B_ν that couple sibling blocks, A(I_{ν1}, I_{ν2}) = U_{ν1} B_{ν1} V_{ν2}^T, a (4 x 4 block) HSS matrix can be written as

(8) A = \begin{pmatrix} A_3 & U_3 B_3 V_6^T \\ U_6 B_6 V_3^T & A_6 \end{pmatrix}, \quad A_3 = \begin{pmatrix} D_1 & U_1 B_1 V_2^T \\ U_2 B_2 V_1^T & D_2 \end{pmatrix}, \quad A_6 = \begin{pmatrix} D_4 & U_4 B_4 V_5^T \\ U_5 B_5 V_4^T & D_5 \end{pmatrix},

and Figure 1(a) shows its corresponding postordered HSS tree (leaves 1, 2, 4, 5; nonleaf nodes 3 and 6; root 7). The HSS representation (7)-(8) is equivalent to the representations in Strumpack; Hssulv; ChandrasekaranGu04; XiaHSSChol.
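A one-level version of the block form in (8) can be built directly with truncated SVDs. The sketch below (toy kernel; no nested R, W generators, which is where a real HSS code differs) compresses the two off-diagonal blocks and reassembles the matrix:

```python
import numpy as np

n, m = 64, 32                                  # matrix size, split point
x = np.linspace(0.0, 1.0, n)
A = np.exp(-np.abs(x[:, None] - x[None, :]))   # off-diagonal blocks: rank 1

def trunc(B, k):
    """Truncated SVD: B ~= U_k S_k V_k^T with k terms."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k].T

D1, D2 = A[:m, :m], A[m:, m:]
U1, B1, V2 = trunc(A[:m, m:], 1)               # A(I1, I2) = U1 B1 V2^T
U2, B2, V1 = trunc(A[m:, :m], 1)               # A(I2, I1) = U2 B2 V1^T
A_hss = np.block([[D1, U1 @ B1 @ V2.T],
                  [U2 @ B2 @ V1.T, D2]])
# the reconstruction is exact up to rounding, since the blocks have rank 1
```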
To take advantage of fast HSS algorithms, we first need to construct an HSS approximation. There exist many HSS construction algorithms, for example based on the SVD Hssulv or on RRQR (rank-revealing QR) XiaFast09. Most of these algorithms cost O(rn^2) flops, where n is the dimension of the matrix and r is its HSS rank. In Martinssonrandhss10, a randomized HSS construction algorithm (RandHSS) is proposed, which combines random sampling with the interpolative decomposition (ID), see ChengRandom; MartinssonRev10; MartinssonPNAS07. The cost of RandHSS can be reduced to O(r^2 n) flops if there exists a fast matrix-vector multiplication algorithm of order O(n) or O(n log n)
flops. STRUMPACK also uses this algorithm to construct HSS matrices. For completeness, RandHSS is restated in Algorithm 2.

Algorithm 2 (Randomized HSS construction). Given an n x n matrix A and integers r (the estimated HSS rank) and p (the oversampling parameter), let d = r + p; generate two n x d Gaussian random matrices Ω_r and Ω_c, and compute the sample matrices S_r = A Ω_r and S_c = A^T Ω_c.

for level ℓ = L, L-1, ..., 1 do
    for each node ν at level ℓ do
        if ν is a leaf node then
            D_ν = A(I_ν, I_ν);
            compute the local samples S_ν = S_r(I_ν, :) - D_ν Ω_r(I_ν, :) and T_ν = S_c(I_ν, :) - D_ν^T Ω_c(I_ν, :);
            compute the IDs S_ν = U_ν S_ν(J_ν, :) and T_ν = V_ν T_ν(J'_ν, :);
        else (ν with children ν1 and ν2)
            store the generators B_{ν1} and B_{ν2}, which are subblocks of A selected by the children's index sets J_{ν1}, J'_{ν1}, J_{ν2}, J'_{ν2};
            merge the children's compressed samples and subtract the contributions of B_{ν1} and B_{ν2};
            compute the IDs of the merged row and column samples to obtain the transfer generators R_{ν1}, R_{ν2} and W_{ν1}, W_{ν2};
        end if
    end for
end for
For the root node, store the generators B of its two children.
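The interpolative decomposition used at each step of Algorithm 2 expresses a matrix in terms of a subset of its own rows. A common way to compute it (a sketch, not STRUMPACK's implementation) is via column-pivoted QR:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def row_id(S, k):
    """Row interpolative decomposition S ~= U @ S[J, :] with |J| = k,
    computed from a column-pivoted QR of S^T (a standard ID construction)."""
    Q, R, piv = qr(S.T, mode='economic', pivoting=True)
    J = piv[:k]
    # S^T[:, piv] = Q R, so T = R11^{-1} R12 expresses the remaining
    # rows of S in terms of the k pivoted ones.
    T = solve_triangular(R[:k, :k], R[:k, k:], lower=False)
    U = np.empty((S.shape[0], k))
    U[piv[:k]] = np.eye(k)
    U[piv[k:]] = T.T
    return U, J

rng = np.random.default_rng(1)
S = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 10))  # rank 3
U, J = row_id(S, 3)      # exact for an exactly rank-3 matrix
```

A key property, exploited in the HSS setting, is that the factor S[J, :] consists of actual rows of S, so index sets can be propagated up the tree.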
The parameter r in Algorithm 2 is an estimate of the HSS rank of A, which is chosen adaptively in STRUMPACK Strumpack, and p is the oversampling parameter, usually a small constant, see MartinssonRev10. After a matrix is represented in its HSS form, there exist fast algorithms for multiplying it with a vector in O(rn) flops Lyonsthesis; Chandrasekaran03. Therefore, multiplying an HSS matrix with another n x n matrix costs only O(rn^2) flops, whereas the general matrix-matrix multiplication routine PDGEMM costs O(n^3) flops.

2.1 STRUMPACK
STRUMPACK is an abbreviation of STRUctured Matrices PACKage, designed for computations with sparse and dense structured matrices. It is based on some parallel HSS algorithms using randomization Strumpack ; Strumpacksparse .
Compared with ScaLAPACK, STRUMPACK requires more memory, since besides the original matrix it also stores the random vectors, the samples, and the generators of the HSS matrix. The memory overhead increases as the HSS rank increases. Therefore, STRUMPACK is suitable for matrices with low off-diagonal ranks. For our eigenvalue problems, we are fortunate that the HSS ranks of the intermediate eigenvector matrices appearing in the DC algorithm are not large, usually less than 100. Throughout the experiments in section 3, we use a tight compression threshold when constructing the HSS approximation, in order to preserve the orthogonality of the computed eigenvectors. One advantage of STRUMPACK is that it requires far fewer operations than the classical algorithms, by exploiting the off-diagonal low-rank property.
Another property of STRUMPACK, shared with other packages that exploit low-rank structures (H-matrices Hlib and Block Low-Rank representations), is its irregular computational pattern: it deals with irregular and imbalanced task flows and manipulates collections of small matrices instead of one large one. HSS algorithms have lower asymptotic complexity, but the flop rate with HSS is often lower than with traditional matrix-matrix multiplications (BLAS-3 kernels). We therefore expect HSS algorithms to perform well only for sufficiently large problems, and we only use them when the problem size is large enough, just as in ShengguoSIMAX2; LSGNLAA.
The HSS construction algorithm in STRUMPACK uses randomized sampling combined with the interpolative decomposition (ID) MartinssonPNAS07; ChengRandom, first proposed in Martinssonrandhss10. For this randomized algorithm, the HSS rank needs to be estimated in advance, which is difficult to do accurately. An adaptive sampling mechanism is therefore provided in STRUMPACK; its basic idea Strumpack is "to start with a low number of random vectors d, and whenever the rank found during the interpolative decomposition is too large, d is increased". To control the communication cost, we use slightly more sample vectors and set inc_rand_HSS, a parameter for constructing HSS matrices in STRUMPACK, to a relatively large value.
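The adaptive mechanism can be mimicked with a simple randomized range finder that enlarges the sample count until the detected rank is strictly smaller than the number of samples. The sketch below illustrates the idea only; it is not STRUMPACK's actual code, and the parameter names are illustrative.

```python
import numpy as np

def adaptive_range(A, d0=8, inc=8, tol=1e-10, seed=2):
    """Randomized range finder with adaptive sampling: start with d0 random
    vectors; whenever the numerical rank found uses all the samples, add
    `inc` more columns and retry."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    d = d0
    while True:
        Y = A @ rng.standard_normal((n, d))       # sample matrix
        s = np.linalg.svd(Y, compute_uv=False)
        rank = int(np.sum(s > tol * s[0]))
        if rank < d:                              # the samples sufficed
            return rank, d
        d += inc                                  # rank too large: enlarge d

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 100))  # rank 20
rank, d = adaptive_range(A)          # d grows until it exceeds the true rank
```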
In this work, we mainly use two STRUMPACK driver routines: the parallel HSS construction routine and the parallel HSS matrix multiplication routine. We use these two routines to replace the general matrix-matrix multiplication routine PDGEMM, in the hope of achieving good speedups for large matrices. We first test the efficiency and scalability of STRUMPACK using matrices that appear in the STRUMPACK test routines Strumpack.
Example 1. We use two Toeplitz matrices which have been used in Strumpack. The first one is diagonally dominant and yields a very low HSS rank (about 2). The second one is a kinetic energy matrix from quantum chemistry Strumpack, defined in terms of a discretization parameter, and has a slightly larger HSS rank.
Table 1: Execution times (in seconds) of PDGEMM and of STRUMPACK's HSS construction (Const) and HSS matrix multiplication (Multi) for the two matrices in Example 1; each column corresponds to an increasing number of processes. Speedup denotes the PDGEMM time divided by the sum of Const and Multi.

PDGEMM           185.33   53.58   13.42   8.02   3.79   1.89
Mat1   Const       6.18    2.39    1.21   1.24   1.29   2.89
       Multi      16.64   11.65    3.72   2.46   1.78   1.75
       Speedup     8.12    3.82    2.72   2.17   1.23   0.41
Mat2   Const       4.57    1.80    0.96   1.20   1.23   2.68
       Multi      15.42   11.10    3.46   2.29   1.71   1.58
       Speedup     9.27    4.15    3.04   2.30   1.29   0.44
From the results in Table 1, the HSS matrix multiplication implemented in STRUMPACK can be more than 1.2x faster than PDGEMM when around 256 processes are used, and the speedups are even better with fewer processes. However, the HSS construction and matrix multiplication algorithms are not as scalable as PDGEMM, and they become slower than PDGEMM when more processes are used, for example more than 676. This suggests using no more than 256 processes when combining STRUMPACK with ScaLAPACK. The execution times depend strongly on several parameters, such as NB and block_HSS. The results in Table 1 were obtained with fixed choices of these parameters; we did not try to find the optimal ones.
2.2 Combining DC with HSS
In this section we give more details on combining HSS matrix techniques with ScaLAPACK. As mentioned before, the central idea is to replace PDGEMM by HSS matrix multiplication. The eigenvectors are updated in the ScaLAPACK routine PDLAED1, and therefore we modify it to call the STRUMPACK routines instead of PDGEMM.
Note that after applying the permutations to Q G P in (6), the matrix \hat{S} would also have to be permuted accordingly. From the results in Gueigenvalue; LSGNLAA; ShengguoSIMAX2, we know that \hat{S} is a Cauchy-like matrix with the off-diagonal low-rank property, and its numerical rank is usually small relative to its dimension. When combining with HSS, we do not use Gu's permutation idea, since the permutation may destroy the off-diagonal low-rank structure of \hat{S} in (6). We also modify the ScaLAPACK routine PDLAED2 so that HSS techniques are used only when the size of the deflated matrix in (5) is large enough; otherwise Gu's idea is used. The size threshold depends on the architecture of the particular parallel computer and may be different on different computers.
Most of the work of DC is spent on the matrix-matrix multiplications at the top two levels. The top level alone takes nearly 50% of the total time when using 256 processes or fewer; see the results in Table 2 and also those in Elpa. Therefore, we can expect at most a 2x speedup from replacing PDGEMM with parallel HSS matrix multiplication.
In this subsection, we first use an example to show the off-diagonal low-rank property of the eigenvector matrix of M in (4). Assume that the dimension of M is n, the diagonal entries of D are distinct, and z is a normalized random vector in (4). The singular values of an off-diagonal block of the eigenvector matrix are illustrated in Figure 2(b), from which we see that the singular values decay very quickly.

Example 2. We use a symmetric tridiagonal matrix to show the percentage of time spent on the top-level matrix-matrix multiplications. The diagonal entries of the tridiagonal matrix are all two and its off-diagonal entries are all one.
Table 2: Time (in seconds) spent on the top-level matrix-matrix multiplication (Top One) and the total time of PDSTEDC (Total) for the matrix in Example 2; each column corresponds to an increasing number of processes.

Top One      59.80   44.13   30.29   16.92    9.23    4.77   1.09   0.53
Total       108.04   77.06   52.78   30.04   17.70   10.00   6.20   6.22
Percent(%)   55.35   57.2    57.39   56.32   52.15   47.70  17.58   8.52
The results in Table 2 were obtained using the optimization flags "-O2 -qopenmp -mavx" and linking with the multithreaded Intel MKL. From these results, we can see that the top-level matrix-matrix multiplications can take more than half of the total time of PDSTEDC in some cases. Since PDGEMM in MKL has very good scalability, the percentage taken by the top-level multiplications decreases as the number of processes increases. This example also implies that it is better not to use more than 256 processes in our numerical experiments.
3 Numerical results
All the results are obtained on the Tianhe-2 supercomputer LiaoTH2; LiaoHPCG, located in Guangzhou, China. It employs an accelerator-based architecture: each compute node is equipped with two Intel Xeon E5-2692 CPUs and three Intel Xeon Phi accelerators based on the many-integrated-core (MIC) architecture. In our experiments we only use the CPU cores.
Example 3. We use some "difficult" matrices A880 for the DC algorithm, for which few or no eigenvalues are deflated. Examples include the Clement-type, Hermite-type and Toeplitz-type matrices, which are defined as follows.
The Clement-type matrix has zero diagonal entries and off-diagonal entries b_i = \sqrt{i(n-i)}. The Hermite-type matrix is given as A880,

(10) a_i = 0, \quad b_i = \sqrt{i}, \quad i = 1, ..., n-1.

The Toeplitz-type matrix is defined as A880,

(11) a_i = 2, \quad b_i = 1, \quad i = 1, ..., n-1.
For the strong-scaling results, the matrix dimension is fixed, and HSS techniques are used only when the size of the secular equation exceeds the chosen threshold. The results for strong scaling are shown in Figure 3(a). The speedups of PHDC over ScaLAPACK are reported in Table 3 and shown in Figure 3(b). We can see that PHDC is up to about 2.4 times faster than PDSTEDC in MKL when using 120 processes or fewer.
Table 3: Speedups of PHDC over PDSTEDC for the matrices in Example 3; each column corresponds to an increasing number of processes.

Clement    2.45   1.91   1.58   1.42   1.14
Hermite    1.92   1.43   1.30   1.22   1.00
Toeplitz   2.30   1.92   1.58   1.43   1.26
The orthogonality of the eigenvectors computed by PHDC is of the same order as that of the eigenvectors computed by ScaLAPACK, as shown in Table 4. The orthogonality of a matrix Q is measured by \|Q^T Q - I\|_{\max}, where \|A\|_{\max} denotes the maximum absolute value of the entries of A.
Matrix  Number of Processes  

Clement           
Hermite           
Toeplitz           
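The orthogonality measure above can be computed directly; `orth_error` below is an illustrative helper, not a routine from the packages discussed.

```python
import numpy as np

def orth_error(Q):
    """Max-norm departure from orthogonality, ||Q^T Q - I||_max."""
    return np.max(np.abs(Q.T @ Q - np.eye(Q.shape[1])))

Q, _ = np.linalg.qr(np.random.default_rng(6).standard_normal((300, 300)))
err = orth_error(Q)      # close to machine precision for a computed Q factor
```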
Example 4. In TygertSHT2, Tygert shows that the spherical harmonic transform (SHT) can be accelerated using the tridiagonal DC algorithm, where a certain symmetric tridiagonal matrix arises; see TygertSHT2 for its entries. We apply PHDC and PDSTEDC to one such matrix of large dimension, and the execution times are reported in Table 5.
Table 5: Execution times (in seconds) of PDSTEDC and PHDC for the matrix in Example 4; each column corresponds to an increasing number of processes.

PDSTEDC   475.00   139.09   39.66   25.82   16.87
PHDC      224.97    88.58   28.53   20.12   14.71
Speedup     2.11     1.57    1.39    1.28    1.15
3.1 Comparison with other implementations
Different from ScaLAPACK, the ELPA routines Elpa; elpalibrary do not rely on BLACS; all communication between processes is handled by direct calls to the MPI library. (ELPA stands for Eigenvalue soLvers for Petascale Applications.) For its communication, ELPA relies on two separate sets of MPI communicators: row communicators and column communicators, connecting the processes that handle the same rows or the same columns, respectively. For the tridiagonal eigensolver, ELPA implements its own matrix-matrix multiplication and does not use the PBLAS routine PDGEMM. It is known that ELPA has better scalability and is faster than MKL Elpa; elpalibrary.
Example 5. We use the same matrices as in Example 3 to test ELPA, and compare it with the newly proposed algorithm PHDC. The running times of ELPA are shown in Table 6, and Figure 4 shows the speedups of PHDC over ELPA. PHDC is faster than ELPA since it requires fewer floating-point operations. However, its scalability is worse than ELPA's, and PHDC becomes slower than ELPA when more than 200 processes are used.
Table 6: Execution times (in seconds) of ELPA's tridiagonal solver for the matrices in Example 3; each column corresponds to an increasing number of processes.

Clement    487.63   137.43   40.36   25.61   12.49
Hermite    363.94   117.61   34.58   21.18   10.89
Toeplitz   509.32   141.57   39.55   25.62   12.67
4 Conclusions
By combining ScaLAPACK with STRUMPACK, we propose a hybrid tridiagonal DC algorithm, PHDC, for the symmetric eigenvalue problem, which can be faster than the classical DC algorithm implemented in ScaLAPACK when up to about 200 processes are used. The central idea is to replace PDGEMM by HSS matrix multiplication, since HSS algorithms require far fewer flops than PDGEMM. Numerical results show that the scalability of the HSS algorithms is not as good as that of PDGEMM, and the proposed PHDC algorithm becomes slower than the classical DC algorithm when more processes are used.
Acknowledgement
The authors would like to acknowledge many helpful discussions with Xiangke Liao, Sherry Li and Shuliang Lin. This work is partially supported by the National Natural Science Foundation of China (Nos. 11401580, 91530324, 91430218 and 61402495).
References
 (1) E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
 (2) T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Computing, 37(12):783–794, 2011.
 (3) S. Börm and L. Grasedyck. HLib – a library for H- and H^2-matrices, 1999.
 (4) J. R. Bunch, C. P. Nielsen, and D. C. Sorensen. Rank one modification of the symmetric eigenproblem. Numer. Math., 31:31–48, 1978.
 (5) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A. J. van der Veen, and D. White. Fast stable solvers for sequentially semiseparable linear systems of equations and least squares problems. Technical report, University of California, Berkeley, CA, 2003.
 (6) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A. J. van der Veen, and D. White. Some fast algorithms for sequentially semiseparable representations. SIAM J. Matrix Anal. Appl., 27:341–364, 2005.
 (7) S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, and A. J. van der Veen. Fast stable solvers for sequentially semiseparable linear systems of equations. Technical report, UC, Berkeley, CA, 2003.
 (8) S. Chandrasekaran, M. Gu, and T. Pals. Fast and stable algorithms for hierarchically semiseparable representations. Technical report, University of California, Berkeley, CA, 2004.
 (9) S. Chandrasekaran, M. Gu, and T. Pals. A fast ULV decomposition solver for hierarchical semiseparable representations. SIAM J. Matrix Anal. Appl., 28:603–622, 2006.
 (10) S. Chandrasekaran, M. Gu, J. Xia, and J. Zhu. A fast QR algorithm for companion matrices. In I. A. Ball et al., editors, Recent Advances in Matrix and Operator Theory, pages 111–143. Birkhäuser, Basel, 2008.
 (11) H. Cheng, Z. Gimbutas, P. Martinsson, and V. Rokhlin. On the compression of low rank matrices. SIAM J. Sci. Comput., 26(4):1389–1404, 2005.
 (12) J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: a portable linear algebra library for distributed memory computers: design issues and performance. Computer Physics Communications, 97:1–15, 1996.
 (13) J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer. Math., 36:177–195, 1981.
 (14) Y. Eidelman and I. Gohberg. On a new class of structured matrices. Integral Equations and Operator Theory, 34:293–324, 1999.
 (15) K. Gates and P. Arbenz. Parallel divide and conquer algorithms for the symmetric tridiagonal eigenproblem. Technical report, Institute for Scientific Comouting, ETH Zurich, Zurich, Switzerland, 1994.
 (16) P. Ghysels, X. S. Li, F.-H. Rouet, and S. Williams. An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling, 2014. Submitted to SIAM J. Sci. Comput.
 (17) M. Gu. Studies in Numerical Linear Algebra. PhD thesis, Yale University, New Haven, CT, 1993.
 (18) M. Gu and S. C. Eisenstat. A divideandconquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl., 16:172–191, 1995.
 (19) M. Gu, X. S. Li, and P. S. Vassilevski. Directionpreserving and schurmonotonic semiseparable approximations of symmetric positive definite matrices. SIAM J. Matrix Anal. Appl., 31:2650–2664, 2010.
 (20) W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62:89–108, 1999.
 (21) W. Hackbusch and S. Börm. Data-sparse approximation by adaptive H^2-matrices. Computing, 69:1–35, 2002.
 (22) W. Hackbusch, B. Khoromskij, and S. Sauter. On H^2-matrices. In H. Bungartz, R. H. W. Hoppe, and C. Zenger, editors, Lectures on Applied Mathematics, pages 9–29. Springer, Berlin, 2000.
 (23) W. Hackbusch and B. N. Khoromskij. A sparse matrix arithmetic based on H-matrices. Part II: Application to multi-dimensional problems. Computing, 64:21–47, 2000.
 (24) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
 (25) I. Ipsen and E. Jessup. Solving the symmetric tridiagonal eigenvalue problem on the hypercube. SIAM J. Sci. Statist. Comput., 11:203–229, 1990.
 (26) S. Li, M. Gu, L. Cheng, X. Chi, and M. Sun. An accelerated divideandconquer algorithm for the bidiagonal SVD problem. SIAM J. Matrix Anal. Appl., 35(3):1038–1057, 2014.
 (27) S. Li, X. Liao, J. Liu, and H. Jiang. New fast divideandconquer algorithm for the symmetric tridiagonal eigenvalue problem. Numer. Linear Algebra Appl., 23:656–673, 2016.
 (28) X. Liao, S. Li, L. Cheng, and M. Gu. An improved divideandconquer algorithm for the banded matrices with narrow bandwidths. Comput. Math. Appl., 71:1933–1943, 2016.
 (29) X. Liao, L. Xiao, C. Yang, and Y. Lu. MilkyWay-2 supercomputer: system and application. Frontiers of Computer Science, 8(3):345–356, 2014.
 (30) E. Liberty, F. Woolfe, P. G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the lowrank approximation of matrices. PNAS, 104(51):20167–20172, 2007.
 (31) Y. Liu, C. Yang, F. Liu, X. Zhang, Y. Lu, Y. Du, C. Yang, M. Xie, and X. Liao. 623 Tflop/s HPCG run on Tianhe-2: leveraging millions of hybrid cores. International Journal of High Performance Computing Applications, 30(1):39–54, 2016.
 (32) W. Lyons. Fast algorithms with applications to PDEs. PhD thesis, University of California, Santa Barbara, 2005.
 (33) A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter, 26:1–15, 2014.
 (34) O. A. Marques, C. Voemel, J. W. Demmel, and B. N. Parlett. Algorithm 880: A testing infrastructure for symmetric tridiagonal eigensolvers. ACM Trans. Math. Softw., 35:8:1–13, 2008.
 (35) P. G. Martinsson. A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix. SIAM J. Matrix Anal. Appl., 32:1251–1274, 2011.
 (36) F. Rouet, X. Li, P. Ghysels, and A. Napov. A distributedmemory package for dense hierarchically semiseparable matrix computations using randomization. ACM Transactions on Mathematical Software, 42(4):27:1–27:35, 2016.
 (37) D. C. Sorensen and P. T. P. Tang. On the orthogonality of eigenvectors computed by divideandconquer techniques. SIAM J. Numer. Anal., 28:1752–1775, 1991.
 (38) F. Tisseur and J. Dongarra. A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures. SIAM J. Sci. Comput., 20(6):2223–2236, 1999.
 (39) M. Tygert. Fast algorithms for spherical harmonic expansions, II. Journal of Computational Physics, 227:4260–4279, 2008.
 (40) R. Vandebril, M. Van Barel, and N. Mastronardi. Matrix Computations and Semiseparable Matrices, Volume I: Linear Systems. Johns Hopkins University Press, 2008.
 (41) J. Xia, S. Chandrasekaran, M. Gu, and X.S. Li. Fast algorithm for hierarchically semiseparable matrices. Numer. Linear Algebra Appl., 17:953–976, 2010.
 (42) J. Xia and M. Gu. Robust approximate Cholesky factorization of rank-structured symmetric positive definite matrices. SIAM J. Matrix Anal. Appl., 31:2899–2920, 2010.
 (43) J. Xia, Y. Xi, and M. Gu. A superfast structured solver for Toeplitz linear systems via randomized sampling. SIAM J. Matrix Anal. Appl., 33:837–858, 2012.