1. Introduction
Large-scale Hermitian eigenvalue problems arise in many scientific computing applications (e.g., condensed matter (Kohn, 1999), thermoacoustics (Salas et al., 2015)). In electronic structure calculations based on Kohn-Sham density functional theory, a large nonlinear eigenvalue problem is solved iteratively through the self-consistent field (SCF) procedure, in which a sequence of related linear eigenvalue problems needs to be solved. Depending on the discretization method and basis used, the matrices can be sparse, dense or banded. In this work, we focus on the dense case. Dense eigenvalue problems are usually solved by direct methods, and well-known packages include ScaLAPACK (Blackford et al., 1997), ELPA (Marek et al., 2014), EigenExa (Imamura et al., 2011), etc. When solving sequences of related eigenvalue problems, these eigensolvers cannot exploit any knowledge of the properties of the sequence, and thus they solve each problem in complete isolation.
Compared to direct methods, one advantage of iterative methods is that the approximate eigenvectors from previous SCF iterations can be used as initial guesses to improve the convergence rate of subspace iteration (SI) for the current SCF iteration. Based on SI accelerated by Chebyshev polynomials, an efficient eigensolver (ChASE) has been proposed for sequences of dense Hermitian eigenvalue problems (Winkelmann et al., 2019). Compared to Elemental's direct eigensolver (Poulson et al., 2013), ChASE is faster when computing a small portion of the extremal spectrum, but it becomes slower when computing a large number of eigenpairs (Winkelmann et al., 2019). This is a common problem for other iterative methods as well, such as the implicitly restarted Lanczos algorithm (Lehoucq et al., 1998) and LOBPCG (Knyazev, 2001), since they have to solve a projected dense eigenvalue problem as part of the Rayleigh-Ritz procedure.
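ChASE's filter is a matrix polynomial, but its effect on each eigenvalue can be seen in the scalar case. The sketch below (illustrative only, not ChASE's actual code; the interval [0, 2] of unwanted eigenvalues and the test points are made-up values) shows that a Chebyshev polynomial mapped onto that interval stays bounded by 1 inside it while growing rapidly outside, which is why repeated filtering amplifies the wanted eigenvector components:

```python
import math

def cheb(m, t):
    # Chebyshev polynomial C_m(t), valid for all real t
    if abs(t) <= 1:
        return math.cos(m * math.acos(t))
    # hyperbolic form for |t| > 1; the sign factor handles t < -1
    s = 1 if t >= 0 else (-1) ** m
    return s * math.cosh(m * math.acosh(abs(t)))

def cheb_filter(t, a, b, m):
    """Degree-m Chebyshev filter: small on [a, b] (the unwanted part
    of the spectrum), large outside it (the wanted eigenvalues)."""
    x = (2.0 * t - a - b) / (b - a)   # map [a, b] onto [-1, 1]
    return cheb(m, x)

# unwanted eigenvalues in [0, 2]; a wanted eigenvalue at 2.5 (made up)
inside = abs(cheb_filter(1.0, 0.0, 2.0, 20))    # stays bounded by 1
outside = abs(cheb_filter(2.5, 0.0, 2.0, 20))   # grows very large
```

After a few filtered iterations the components on [0, 2] are damped relative to those outside, so the subspace rotates toward the wanted eigenvectors.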
When computing a large number of eigenpairs with iterative algorithms, the most common approach is the spectrum slicing technique (Bai et al., 2000; Saad, 2011), which computes the eigenvalues in chunks. Often, spectrum slicing is based on the shift-invert spectral transformation, which involves costly matrix factorizations. Some related works (most of them for sparse eigenproblems) include:

Polynomial filtering based methods. These use matrix polynomial filters to amplify the spectral components associated with the eigenvalues of interest (Banerjee et al., 2016; Li et al., 2016), avoiding matrix factorizations; packages include EVSL (Eigenvalue Slicing Library) (Li et al., 2019) and ChASE (Winkelmann et al., 2019).

The shift-invert subspace iteration method. A new method (SISLICE) was recently proposed in (Williams-Young et al., 2020), which combines shift-invert subspace iteration with spectrum slicing.
Iterative methods are rarely used to solve dense Hermitian eigenvalue problems, since both dense matrix-vector multiplication and solving a dense linear system are expensive. In this paper, iterative methods are combined with direct methods in a novel way. The central idea is that the iterative methods are applied to a banded matrix with small bandwidth instead of the original matrix. In this way, iterative methods can be used to compute a partial or full eigendecomposition of a dense Hermitian or symmetric matrix. This is a new framework for solving dense Hermitian eigenvalue problems, and it will be named PDESHEP in this paper, short for a Parallel Direct Eigensolver for Sequences of Hermitian Eigenvalue Problems with no tridiagonalization.

The classical direct methods consist of three steps; see Fig. 1 for the main procedure. First, a dense matrix is reduced to tridiagonal form via orthogonal transformations. One efficient reduction algorithm for tridiagonalization is the two-step approach (Bischof et al., 1994), which first reduces a dense matrix to band form and then further reduces it to a tridiagonal matrix. The symmetric band reduction (SBR) from band to tridiagonal form consists entirely of memory-bound operations, and its scalability is limited. Then, the eigendecomposition of the tridiagonal matrix is computed by classical algorithms, and finally the eigenvectors are computed via backtransform. The details are introduced in the next section. The starting point of this work is to get around SBR and propose a more efficient algorithm for the symmetric banded eigenvalue problem. From the results of Example 1 in section 4, we see that step (II) takes about half of the total time when using many processes, so it is worthwhile to replace step (II) by a more efficient and faster solver.
Many eigenvalue algorithms have been proposed for symmetric banded matrices. The most commonly used approach is to reduce the banded matrix to tridiagonal form and then apply the DC, QR or MRRR (Dhillon et al., 2006) algorithm to it, just as done in ELPA (Auckenthaler et al., 2011). Some works suggest using the banded DC algorithm (Arbenz, 1992; Gansterer et al., 2002; Liao et al., 2016) when the bandwidth is small, although it requires more floating point operations. As shown in (Li et al., 2021b), the banded DC algorithm is slower than ELPA when the bandwidth is large. Another, completely different, approach is to use iterative methods. In this work, we apply spectrum slicing methods to the symmetric banded matrix instead of the original full (dense) matrix. Since the semi-bandwidth of the intermediate banded matrix is very small, it is very cheap to solve a banded linear system. Our method is similar to (Williams-Young et al., 2020; Zhang et al., 2007), but the matrix in our problem is banded, which makes the linear systems easy to solve. Similarly to previous methods (SIESTA-SIPs, SISLICE, ChASE), our algorithm is designed for SCF eigenvalue problems, and prior knowledge of eigenvalues and eigenvectors is used for spectrum partitioning and as initial vectors, respectively. The main advantage of PDESHEP is that it is suitable for computing many eigenpairs of dense matrices, not only of sparse ones.
For spectrum slicing algorithms, the spectrum partition is very important for efficiency and robustness. It affects the load balance and the orthogonality of the computed eigenvectors. The ideal case is that every slice has nearly the same number of eigenvalues and the gaps between eigenvalues in different slices are large, as shown in Fig. 2. Since we are dealing with self-consistent field (SCF) eigenvalue problems, we assume the eigenvalue distribution is known from previous iterations. Otherwise, spectral density estimation (also known as density of states, DOS) (Lin et al., 2016) and Sylvester's inertia theorem (Sylvester, 1852) can be used to estimate or compute the distribution of eigenvalues of a banded matrix. In this work, we use a simplified k-means algorithm to partition the spectrum, which exploits the one-dimensional structure of the eigenvalues, stored in increasing order. Therefore, the simplified version requires much fewer operations than the general k-means algorithm. We first partition the eigenvalues equally into k groups and use their centers as the initial centroids of our simplified k-means algorithm, where k is the number of slices. We find that this initialization usually leads to a more 'balanced' partition, i.e., the variance of the number of eigenvalues over all slices is small. The simplified k-means algorithm is introduced in section 3.1.

To combine direct methods with iterative methods, we need efficient data redistribution routines, since direct methods such as those in ScaLAPACK and ELPA store matrices in a 2D block cyclic data distribution (BCDD), while iterative methods such as FEAST store matrices in a 1D block data distribution (BDD). We need two data redistribution algorithms. One redistributes a banded matrix from 2D BCDD to the 1D compact form used in LAPACK. The other redistributes a dense matrix stored in an irregular 1D BDD to 2D BCDD with a given block size. By irregular we mean that different processes may have different numbers of block rows or columns; none of the BLACS routines works for the irregular case. We use the data redistribution algorithm proposed in (Li et al., 2021a). For the cases that BLACS supports, the new data redistribution algorithm can be considerably faster than the BLACS routine PXGEMR2D.
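To make the two distributions concrete, the following sketch (with hypothetical helper names, restricted to one dimension; ScaLAPACK's 2D BCDD applies the same arithmetic to rows and columns independently) maps a global row index to its owner process and local index under block-cyclic and block distributions:

```python
def bcdd_owner(i, nb, p):
    """Owner process and local index of global row i under a 1D
    block-cyclic distribution with block size nb over p processes
    (one dimension of ScaLAPACK-style 2D BCDD)."""
    blk = i // nb                # global block index
    proc = blk % p               # blocks are dealt out cyclically
    local_blk = blk // p         # position among this process's blocks
    return proc, local_blk * nb + i % nb

def bdd_owner(i, n, p):
    """Owner process and local index of global row i under a 1D block
    (contiguous) distribution of n rows over p processes."""
    base, extra = divmod(n, p)   # first `extra` processes get one extra row
    if i < extra * (base + 1):
        return i // (base + 1), i % (base + 1)
    j = i - extra * (base + 1)
    return extra + j // base, j % base
```

For example, with block size 2 and 3 processes, global row 7 belongs to block 3, which is owned by process 0 as its second block, so it is local row 3 there.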
PDESHEP can be seen as a new framework for direct methods, and it can be combined with any spectrum slicing method such as FEAST, SIPs or EVSL. In our current version, it is implemented by combining ELPA (Marek et al., 2014) with FEAST (Kestyn et al., 2016). Later, we will try to include SLEPc (Campos and Roman, 2012), EVSL (Li et al., 2019) and others. In summary, the main contributions of this work are:

We propose a new framework for solving (dense) Hermitian or symmetric eigenvalue problems, which combines direct and iterative methods and is suitable for computing many eigenpairs.

We propose a simplified 1D k-means algorithm for partitioning the spectrum, whose cost is linear in the number of points and which is therefore faster than the general k-means algorithm, whose cost also grows with the number of clusters.

We perform numerous experiments to test PDESHEP, using dense Hermitian matrices from real applications and large sparse matrices from the SuiteSparse collection (Davis and Hu, 2011).
The remainder of this paper is organized as follows. In section 2, we briefly review the main steps of direct methods for solving dense symmetric or Hermitian eigenvalue problems and describe the iterative methods based on contour integration, especially the one implemented in FEAST. In section 3, we present our new algorithm (PDESHEP) for sequences of symmetric or Hermitian eigenvalue problems, together with related techniques such as spectrum partitioning, data redistribution and validation. We present numerical results in section 4 and future work in section 5.
2. Preliminaries
In this section, we give a brief introduction to direct methods and contour integral based iterative methods for eigenvalue problems. As mentioned above, classical direct eigensolvers cannot exploit any knowledge of the properties of related eigenvalue problems. Thus, the algorithms in this section are presented for a single eigenproblem; how they are combined and used for solving sequences of related eigenproblems is introduced in section 3.
2.1. Direct Methods for (Dense) Eigenvalue Problems
In this section, we introduce the main steps of direct methods for solving a standard eigenvalue problem,

(1) $A Z = Z \Lambda$,

where $A$ is an $n \times n$ Hermitian or real symmetric matrix, $Z$ is the eigenvector matrix, and the diagonal matrix $\Lambda$ contains the eigenvalues $\lambda_i$, $i = 1, \dots, n$. Throughout this paper we will focus on the Hermitian case, but our algorithm is equally applicable to real symmetric matrices. Furthermore, the generalized eigenvalue problem $A x = \lambda B x$, where $A$ is Hermitian and $B$ is Hermitian and positive definite, can be reduced to the standard case via the Cholesky factorization $B = L L^H$,

(2) $\tilde{A} \tilde{Z} = \tilde{Z} \Lambda$,

where $\tilde{A} = L^{-1} A L^{-H}$ and $\tilde{Z} = L^{H} Z$.
Only the standard eigenvalue problem (1) is considered in this paper. We mainly follow the description in (Marek et al., 2014) and introduce the main stages of a direct method, which consists of three steps. The input dense matrix is first reduced to tridiagonal form using a one-step or two-step reduction method (Golub and Van Loan, 1996; Bischof et al., 1994). The one-step method transforms a dense matrix to tridiagonal form directly through a sequence of Householder transformations. In contrast, the two-step method (Bischof et al., 1994) first reduces a dense matrix to band form and then further reduces it to tridiagonal form, which has been shown to be more efficient (Auckenthaler et al., 2011). We only introduce the two-step reduction method here.

(I) Reduce $A$ to band form through a sequence of unitary transformations,

(3) $Q_1^H A Q_1 = B$,

where $Q_1$ is a unitary matrix. This step relies on efficient matrix-matrix operations (Bischof et al., 1994) and can be computed efficiently on high performance computers.
(II) Compute the eigendecomposition of the banded matrix $B$,

(4) $B V = V \Lambda$.

This is achieved by the following three steps.

Reduce the banded matrix to tridiagonal form via symmetric band reduction,

(5) $Q_2^H B Q_2 = T$.

Solve the tridiagonal eigenvalue problem:

(6) $T U = U \Lambda$.

It can be solved by QR (Francis, 1962), divide-and-conquer (DC) (Cuppen, 1981; Gu and Eisenstat, 1995; Tisseur and Dongarra, 1999), or the algorithm of Multiple Relatively Robust Representations (MRRR) (Dhillon, 1997; Willems and Lang, 2013; Petschow et al., 2013). ELPA implements an efficient DC algorithm (Auckenthaler et al., 2011).

Backtransform to obtain the eigenvectors of the banded matrix,

(7) $V = Q_2 U$.

(III) Backtransform to get the eigenvectors of the matrix $A$,

(8) $Z = Q_1 V$.
Steps (I)-(III) constitute the classical approach for solving dense Hermitian eigenvalue problems. Some other approaches have been proposed to compute the eigendecomposition of the intermediate banded matrix. When the bandwidth is small, it may be more efficient to use the banded DC (BDC) algorithm (Haidar et al., 2012; Liao et al., 2016; Li et al., 2021b), but it is very slow when the bandwidth is large (Li et al., 2021b). The package EigenExa (Imamura et al., 2011) also implements a BDC algorithm, working on a pentadiagonal matrix (a band matrix with semi-bandwidth 2).
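All of these steps rely on the fact that a unitary similarity transformation preserves the spectrum, so the eigenvalues of the reduced matrix are those of the original one. A small self-contained check (illustrative only, with a made-up 3x3 symmetric matrix; real codes use blocked Householder transformations, not single rotations) applies one Givens rotation of the kind used in band reduction and verifies that the targeted entry is annihilated while the trace, and hence the eigenvalue sum, is unchanged:

```python
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(X):
    return [list(row) for row in zip(*X)]

# symmetric test matrix with one entry outside the tridiagonal band
A = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 1.0]]

# Givens rotation in the (1,2) plane chosen to annihilate A[0][2]
r = math.hypot(A[0][1], A[0][2])
c, s = A[0][1] / r, A[0][2] / r
G = [[1.0, 0.0, 0.0],
     [0.0,   c,  -s],
     [0.0,   s,   c]]

T = matmul(transpose(G), matmul(A, G))   # similarity transform T = G^T A G
# T is tridiagonal, symmetric, and has the same eigenvalues as A
```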
Figure 1 shows the main steps of direct methods. Step (II) is traditionally based on tridiagonalization via symmetric band reduction (SBR) (Bischof et al., 2000), followed by the DC or MRRR algorithm. The starting point of this work is that step (II) has poor scalability and takes a large portion of the total time; see Table 2 in section 4 for more details. In this work, we instead replace step (II) by a spectrum slicing method, while steps (I) and (III) remain the same as in traditional direct methods. The advantage is that we get around SBR; since spectrum slicing methods are naturally parallel, PDESHEP is more efficient than classical direct methods when using many processes.
2.2. The FEAST Algorithm
The FEAST algorithm utilizes spectral projection and subspace iteration to obtain selected interior eigenpairs. Note that we still use the standard eigenvalue problem (1) to introduce FEAST, which works for generalized eigenvalue problems too. A Rayleigh-Ritz procedure is used to project the matrix $A$ onto a reduced search subspace spanned by $Q$, forming the reduced matrix

(9) $A_Q = Q^H A Q$.

Approximate eigenvalues and eigenvectors of the original matrix can be obtained from the much smaller eigenvalue problem,

(10) $A_Q W = W \Lambda_Q$,

as

(11) $\tilde{X} = Q W$, with approximate eigenvalues $\tilde{\Lambda} = \Lambda_Q$.
In theory, the ideal filter for the Hermitian problem would act as a projection operator onto the subspace spanned by the wanted eigenvectors, which can be expressed via the Cauchy integral formula

(12) $P = \frac{1}{2\pi i} \oint_{\Gamma} (zI - A)^{-1} \, dz = X X^H$,

where the eigenvalues associated with the orthonormal eigenvectors $X$ are located within an interval delimited by the closed curve $\Gamma$. In practice, the spectral projector is approximated by a quadrature rule using integration nodes $z_j$ and weights $\omega_j$,

(13) $Q = \rho(A) Y = \sum_{j=1}^{n_c} \omega_j (z_j I - A)^{-1} Y$,

where the columns of $Y$ span the search subspace and $\rho$ is a rational function determined by the quadrature nodes and weights. The computation of $Q$ amounts to solving a set of $n_c$ independent complex shifted linear systems with multiple right-hand sides,

(14) $(z_j I - A) Q_j = Y$, $\quad j = 1, \dots, n_c$,
and $Q = \sum_{j=1}^{n_c} \omega_j Q_j$. The matrix $Q$ is then used as the Rayleigh-Ritz projector to form the reduced matrix $A_Q$. The general outline of FEAST for computing the eigenpairs in a given interval is shown in Algorithm 1. The initial vectors are usually a set of random vectors, and good initial vectors can accelerate the convergence (Peter Tang and Polizzi, 2014). FEAST is mathematically equivalent to subspace iteration applied with a rational matrix function $\rho(A)$; see (Peter Tang and Polizzi, 2014). The rational function $\rho$ can be constructed by numerical approximation of the integral (12) using Gauss quadrature (Polizzi, 2009) or the trapezoidal rule (Sakurai and Sugiura, 2003). An improved rational approximation based on the work of Zolotarev is proposed in (Güttel et al., 2015), which improves both convergence robustness and load balancing when FEAST runs on multiple search intervals in parallel.
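The effect of the quadrature in (13) can already be seen in the scalar case. The sketch below (a scalar stand-in for the projector, with a made-up circular contour enclosing [0, 1] and midpoint-rule nodes; FEAST's actual Gauss nodes differ) evaluates the rational function at an eigenvalue inside and one far outside the contour:

```python
import cmath
import math

def rational_filter(lmbda, a, b, nc=8):
    """Scalar rational approximation rho(lambda) of the spectral
    projector, from an nc-point midpoint rule on the circle whose
    diameter is [a, b]; returns |rho(lambda)|."""
    center, radius = (a + b) / 2.0, (b - a) / 2.0
    rho = 0.0 + 0.0j
    for j in range(nc):
        theta = 2.0 * math.pi * (j + 0.5) / nc          # quadrature angle
        z = center + radius * cmath.exp(1j * theta)     # node on the contour
        dz = radius * 1j * cmath.exp(1j * theta) * (2.0 * math.pi / nc)
        rho += dz / (z - lmbda) / (2.0j * math.pi)      # weight * resolvent
    return abs(rho)

inside = rational_filter(0.5, 0.0, 1.0)    # eigenvalue inside: rho ~ 1
outside = rational_filter(3.0, 0.0, 1.0)   # eigenvalue outside: rho ~ 0
```

Eigencomponents inside the contour are passed through almost unchanged while those outside are strongly damped, which is exactly why subspace iteration with $\rho(A)$ converges to the eigenpairs inside $\Gamma$.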
There are three levels of parallelism inherent to the FEAST algorithm, denoted L1, L2 and L3; Figure 3 shows them. The first level, L1, divides the search interval into several slices, and the eigenvalues in each slice are computed by a subgroup of processes. The second level, L2, means that each shifted linear system in equation (14) can be solved independently by a different group of processes. The third level, L3, means that each linear system in equation (14) can be solved in parallel using multiple threads or processes. The total number of MPI processes that can be used equals the product of the number of slices in L1, the number of shifted linear systems solved simultaneously in L2, and the number of MPI processes used to solve each linear system in L3, i.e.,

(15) $\#\{\mathrm{MPI}\} = \#\{L1\} \cdot \#\{L2\} \cdot \#\{L3\}$,

where $\#\{\cdot\}$ denotes the number of entries in a set.
The parallelism of L2 is limited by the number of integration nodes. FEAST uses eight quadrature points per interval by default, and each quadrature point corresponds to a shifted linear system. However, the numerical results in (Aktulga et al., 2014) show that a moderate number of quadrature points per interval usually gives the best overall performance among the tested cases (8, 10, 12, 16, 20 and 24 quadrature points), and we limit the parallelism of L2 accordingly in our implementation. Different MPI process groups of L2 solve different shifted linear systems simultaneously; for example, 16 process groups can solve 16 shifted linear systems simultaneously when 16 quadrature points are used.
The L3 level of parallelism is not used in our experiments, since FEAST v4.0 (downloaded from http://www.ecs.umass.edu/~polizzi/feast/download.htm) does not fully support L3 parallelism for banded eigenvalue problems. Each linear system is solved by one MPI process in our numerical experiments. Because we currently cannot use L3 parallelism and the number of MPI processes used in L2 is limited by the number of quadrature points, the spectrum must be partitioned into many slices when using many processes; see Example 3 in section 4. Future work includes adding support for a distributed-memory parallel solver for banded linear systems.
3. Proposed algorithm
In this section, we introduce our new algorithm, PDESHEP, a hybrid algorithm that combines the techniques of direct and iterative methods.
To reduce the computational complexity, direct methods first reduce a dense matrix to a canonical form such as tridiagonal, bidiagonal or Hessenberg form, and then apply QR, DC or MRRR to compute the eigenvalue or singular value decomposition. PDESHEP follows this approach. It first reduces a dense Hermitian matrix to banded form via a sequence of unitary transformations. In contrast to classical direct algorithms, PDESHEP uses spectrum slicing methods to compute the eigendecomposition of the banded matrix. After obtaining the eigenpairs of the banded matrix, the eigenvectors of the original matrix are computed via backtransforms. The advantage is that it gets around the SBR process, which consists of memory-bound operations and has poor scalability, as shown in Example 1 in section 4.

The procedure of PDESHEP is shown in Algorithm 2; see also Figure 4. The structure of PDESHEP looks like a sandwich: the two outer layers remain unchanged from the classical direct methods, but the inner kernel is replaced by an iterative method. PDESHEP is more of a framework, and we do not specify which spectrum slicing algorithm to use; step 2 of PDESHEP can use any of the four types of spectrum slicing algorithms mentioned in section 1. In Algorithm 2, the initial vectors are constructed from the eigenvectors of the banded matrix reduced from the previous (dense) Hermitian matrix, possibly interlaced with some random vectors when more initial vectors are required. Note that these are the eigenvectors of the intermediate banded matrix instead of the original dense Hermitian matrix. Classical dense numerical linear algebra packages such as ScaLAPACK and ELPA store matrices in the 2D BCDD format, while standard iterative methods such as FEAST (Kestyn et al., 2016) and SLEPc (Hernandez et al., 2005) usually store matrices in the 1D BDD format. We therefore need one more data redistribution when using the previous eigenvectors as initial vectors.

Apply block Householder transformations to the dense matrix and reduce it to banded form.

Redistribute the banded matrix to each process and store it locally in compact form.

Apply a spectrum slicing algorithm to the banded matrix, using the previous eigenvectors as initial vectors, and compute the eigenpairs in parallel.

Redistribute the computed eigenvectors to the 2D BCDD form.

Compute the eigenvectors of the original matrix via backtransform.
In this work, the banded eigenvalue problem is solved by the contour integral based method implemented in FEAST (Kestyn et al., 2016). We currently focus on the standard Hermitian eigenvalue problem; for the generalized eigenvalue problem, we assume it has been reduced to the standard form. FEAST can be seen as a subspace iteration eigensolver accelerated by approximate spectral projection (Peter Tang and Polizzi, 2014), and good starting vectors can speed up its convergence. Since the input matrices are correlated in our problem, we use the eigenvectors of the previous intermediate banded matrix as the starting vectors for the current banded matrix. The intuition is that as the SCF converges, the matrices change very little, and so do the intermediate banded matrices over the SCF iterations. Numerical results verify this fact and show that using the intermediate eigenvectors computed in the previous SCF iteration as starting vectors can significantly improve the convergence, often by several times; the results are shown in Table 3 in section 4.
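A toy experiment illustrates the warm-start effect (pure-Python power iteration on a made-up 3x3 matrix, a drastic simplification of FEAST's subspace iteration; the matrix, starting vectors and tolerance are all made-up values). A starting vector close to the wanted eigenvector reaches the stopping tolerance in fewer iterations than a generic one:

```python
import math

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def normalize(v):
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

def power_iters(A, v0, tol=1e-8, maxit=500):
    """Iterations until the Rayleigh quotient changes by less than tol."""
    v = normalize(v0)
    prev = float('inf')
    for k in range(1, maxit + 1):
        w = matvec(A, v)
        lam = sum(vi * wi for vi, wi in zip(v, w))  # Rayleigh quotient
        if abs(lam - prev) < tol:
            return k
        prev = lam
        v = normalize(w)
    return maxit

A = [[2.0, 0.1, 0.0],
     [0.1, 1.0, 0.1],
     [0.0, 0.1, 0.5]]

cold = power_iters(A, [1.0, 1.0, 1.0])    # generic starting vector
warm = power_iters(A, [1.0, 0.07, 0.0])   # close to the dominant eigenvector
```

In the SCF setting the warm start is essentially free, since the previous iteration's eigenvectors are already available.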
Dense packages such as ScaLAPACK and ELPA store matrices in the 2D BCDD form. In contrast, iterative methods, which are usually developed for sparse matrices, use a 1D block data distribution. Since the semi-bandwidth of the intermediate banded matrix is small, we store the whole banded matrix locally on each process. Therefore, we first need a data redistribution that collects the banded matrix, originally stored in 2D BCDD, from all processes; this process is shown in Fig. 4(a), and the matrix is then stored compactly as in LAPACK. After applying the spectrum slicing algorithm, we redistribute the computed eigenvectors of the banded matrix from 1D BDD to 2D BCDD. This process is shown in Fig. 4(b), and the details of the data redistribution are described in (Li et al., 2021a).
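The compact form can be pictured with a small packer (a sketch of LAPACK-style symmetric band storage for the lower triangle, using 0-based indices; illustrative only): column j of the band is stored in column j of a (kd+1)-row array, with the diagonal in row 0.

```python
def to_band_lower(A, kd):
    """Pack the lower triangle of a symmetric banded matrix into
    compact storage: AB[i - j][j] = A[i][j] for
    j <= i <= min(n - 1, j + kd), with 0-based indices."""
    n = len(A)
    AB = [[0.0] * n for _ in range(kd + 1)]
    for j in range(n):
        for i in range(j, min(n, j + kd + 1)):
            AB[i - j][j] = A[i][j]
    return AB

# made-up tridiagonal example (semi-bandwidth kd = 1)
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 4.0],
     [0.0, 4.0, 5.0]]
AB = to_band_lower(A, 1)
# AB[0] holds the diagonal, AB[1] the subdiagonal (last entry unused)
```

The storage drops from n^2 entries to (kd+1)n, which is why each process can afford a full local copy when kd is tiny.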
3.1. Spectrum Partition
In spectrum slicing algorithms, the spectrum is partitioned into several slices, and the eigenvalues in each slice are computed by a group of processes; this corresponds to the first level (L1) of parallelism. One classical way of partitioning the spectrum is based on Sylvester's law of inertia, which requires computing $LDL^T$ factorizations of shifted matrices (Golub and Van Loan, 1996). A more modern approach is to estimate the density of states (DOS) (Lin et al., 2016; Xi et al., 2018), which relies on matrix-vector multiplications. The DOS has been used to partition the spectrum for the first SCF iteration in (Williams-Young et al., 2020) and in EVSL (Li et al., 2019). In our setting, we could instead call ELPA or ScaLAPACK for the first several SCF iterations without estimating the DOS. Since the intermediate banded matrix has a narrow semi-bandwidth, the cost of its factorizations is small, and the spectrum can also be partitioned using Sylvester's law of inertia. To reduce the storage cost, we store the banded matrix in the LAPACK compact form.
After the first several SCF iterations, we have prior knowledge of the eigenvalues, and we apply a k-means clustering method to the previous eigenvalues to partition the spectrum, just as done in (Keçeli et al., 2018; Williams-Young et al., 2020). Since the eigenvalues are stored as a one-dimensional increasing array, the data structure is very simple, and we propose a simplified k-means algorithm in the following subsection, which turns out to work well for load balance. In this work, the aim of using k-means is to find a good partition of the spectrum, i.e., to separate the clustered eigenvalues.
3.1.1. K-means Clustering
When we have prior knowledge of the distribution of the eigenvalues, we can use a clustering algorithm to divide the spectrum into slices based on the previously computed eigenvalues. One famous clustering algorithm is the k-means algorithm (Lloyd, 1982). The goal is to group any cluster of eigenvalues into a single slice, so that the boundaries of slices do not cross any cluster of eigenvalues. Figure 7 shows the boundaries of slices when partitioning eigenvalues into eight groups, where the red vertical lines mark the boundaries. From it we can see that the k-means algorithm captures the clusters of eigenvalues and separates them well. One potential problem of k-means partitioning is that different slices may be quite imbalanced, i.e., some slices may have many eigenvalues while others have very few, just as shown in Figure 6, especially when partitioning into many slices. Figure 6 shows the result of partitioning eigenvalues of the matrix H2O from the SuiteSparse collection (Davis and Hu, 2011) into slices, where the y-axis represents the number of eigenvalues in each slice: one slice contains only two eigenvalues while another has 40. Finding the optimal clustering of eigenvalues is a difficult task. In this work, we focus more on the goodness of separation than on the load imbalance problem. The results of Example 2 in section 4 show that the convergence rate of spectrum slicing algorithms is more sensitive to the relative gap between eigenvalues in different slices than to the number of eigenvectors to be computed.
A good clustering should satisfy the following objectives, see also (Keçeli et al., 2018):

Clustered eigenvalues (those separated by only tiny gaps) should be kept in the same slice;

The number of eigenvalues in different slices should be nearly equal;

The eigenvalue range within each slice should be minimized.
Objective (1) is important for the orthogonality of the computed eigenvectors, and is therefore considered first. Objectives (2) and (3) concern the load imbalance problem, which is considered secondary in this paper.
The classical k-means algorithm is designed for multidimensional and complicated data sets, and does not exploit the simplicity of a one-dimensional point set. Our problem is very simple: we have a one-dimensional set of points, ordered increasingly, and we want to partition it into k groups such that each group has a nearly equal number of points. We propose a simplified and fast one-dimensional k-means algorithm, which has the following advantages compared with the traditional k-means algorithm.

This method has lower computational complexity: it requires a number of floating point operations (flops) that is linear in the number of points, whereas the classical k-means algorithm requires flops proportional to the number of points times the number of groups per iteration.

This method usually produces a partition with a smaller variance of the number of points across slices.
Our algorithm solves the same problem as the classical k-means method,

(16) $\min \sum_{j=1}^{k} \sum_{\lambda_i \in \Lambda_j} |\lambda_i - c_j|^2$,

where $c_j$ is the centroid of group $\Lambda_j$. The whole process is shown in Algorithm 3. In our implementation, we use the Euclidean norm, and the centroid of group $\Lambda_j$ is computed as the midpoint of its leftmost and rightmost points. Algorithm 3 is allowed to adaptively adjust the number of slices during its iterations; therefore, the returned number of slices may be less than the input value $k$. When that happens, we randomly generate $k$ centroids and run the classical k-means algorithm to generate the partition. In all our experiments, however, Algorithm 3 always returned the same number of slices as specified by $k$.
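The idea can be sketched in pure Python (not the paper's Algorithm 3 verbatim; the boundary-sweep loop and the equal-count initialization are our reading of the description above, and the sample points are made up). Because only the k-1 boundaries between adjacent slices move, one pass over the boundaries costs O(n):

```python
def kmeans_1d(pts, k, maxit=100):
    """Simplified 1D k-means for an increasingly sorted point list.
    Slice j is pts[bnd[j]:bnd[j+1]]; starting from an equal-count
    split, each pass only moves the k-1 slice boundaries."""
    n = len(pts)
    bnd = [round(j * n / k) for j in range(k + 1)]  # equal-count start
    for _ in range(maxit):
        # centroid of each slice: midpoint of its extreme points
        centers = [(pts[bnd[j]] + pts[bnd[j + 1] - 1]) / 2.0
                   for j in range(k)]
        moved = False
        for j in range(1, k):
            b = bnd[j]
            # pull the boundary left while the last point of the left
            # slice is closer to the right centroid
            while b > bnd[j - 1] + 1 and \
                    abs(pts[b - 1] - centers[j]) < abs(pts[b - 1] - centers[j - 1]):
                b -= 1
                moved = True
            # push the boundary right while the first point of the
            # right slice is closer to the left centroid
            while b < bnd[j + 1] - 1 and \
                    abs(pts[b] - centers[j - 1]) < abs(pts[b] - centers[j]):
                b += 1
                moved = True
            bnd[j] = b
        if not moved:
            break
    return bnd

# two well-separated clusters: the boundary settles in the gap
pts = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
bnd = kmeans_1d(pts, 2)
```

On this sample, the returned boundaries `[0, 4, 7]` place the cut in the large gap between 0.3 and 5.0, which is exactly the separation objective (1) above.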
The spectrum is partitioned by recording the number of eigenvalues in each slice, and the boundary set is updated in the next SCF iteration: the boundary between two adjacent slices is taken as the midpoint between the largest eigenvalue of the left slice and the smallest eigenvalue of the right slice. Since Algorithm 3 only costs linear time, we could perform k-means clustering at every SCF iteration, but we find this is not necessary: for the last few SCF iterations before convergence, the spectrum partitions computed by Algorithm 3 are identical. When the spectrum is repartitioned, the number of eigenvalues in a slice may change; this happens in the first few SCF iterations. When the number is larger than before, the previous initial vectors can be interlaced with some random vectors; when it is smaller, some previous initial vectors can be removed.
We compare Algorithm 3 with two other partition approaches in section 4. The first, named Approach1, divides the eigenvalues equally into groups so that every group has nearly the same number of eigenvalues, with the boundaries of slices computed as in Algorithm 3, i.e., as the midpoint between the largest eigenvalue of one slice and the smallest eigenvalue of the next. Approach1 has perfect load balance when used in PDESHEP, since every slice has nearly the same number of eigenvalues. However, a boundary may fall inside a cluster of eigenvalues, i.e., there may be many eigenvalues near the boundary, which can introduce orthogonality problems and slow down the convergence. The second, Approach2, divides the spectrum interval into equal-length pieces, which is a rough partition. Approach2 may suffer from both load imbalance and orthogonality problems, and is not suggested for use in practice.
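For concreteness, the two baselines can be sketched as follows (hypothetical helper names; the sample spectrum, six clustered eigenvalues followed by two outliers, is made up). On this sample, Approach1 cuts inside the left cluster, while Approach2 cuts in the gap but leaves six eigenvalues in one slice and two in the other:

```python
def approach1(eigs, k):
    """Equal-count partition: the boundary between slices j and j+1 is
    the midpoint of the largest eigenvalue of slice j and the smallest
    eigenvalue of slice j+1."""
    n = len(eigs)
    cut = [round(j * n / k) for j in range(k + 1)]
    return [(eigs[cut[j] - 1] + eigs[cut[j]]) / 2.0 for j in range(1, k)]

def approach2(eigs, k):
    """Equal-length partition: divide [min, max] into k equal intervals."""
    lo, hi = eigs[0], eigs[-1]
    return [lo + j * (hi - lo) / k for j in range(1, k)]

eigs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 5.1]
b1 = approach1(eigs, 2)   # boundary 0.35: splits the left cluster
b2 = approach2(eigs, 2)   # boundary 2.55: in the gap, but 6 vs 2 eigenvalues
```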
3.2. Validation and Other Details
The contour integral based algorithm can compute all the wanted eigenvalues enclosed in a given interval. To guarantee that all wanted eigenpairs are computed, the lower bound and the upper bound must enclose all the wanted eigenvalues. (For simplicity, we assume that the smallest nev eigenvalues are required; the cases of the largest or interior nev eigenvalues are similar. The user must provide the number nev of required eigenvalues.) First, the upper bound $\mu$ should be larger than the nev-th eigenvalue of the current matrix. Since we know the previous eigenvalues, $\mu$ can be computed as the nev-th previous eigenvalue plus a small shift. The shift should not be too large; otherwise, the interval will enclose too many eigenvalues. The bound $\mu$ can be checked by computing the inertia of $B - \mu I$. If $\mu$ is not large enough, we iteratively enlarge the shift by a fixed ratio and check again until there are more than nev eigenvalues less than $\mu$. Algorithm 4 illustrates the process of computing the upper bound $\mu$. Since LAPACK does not support the $LDL^T$ factorization of banded symmetric matrices, we implemented a simplified version of the retraction algorithm (Kaufman, 2007) that only computes the inertia without storing the factors; the storage and computational costs are thereby further reduced and become very small.
Another approach is to enlarge $\mu$ by a multiple of the distance between the nev-th and (nev+1)-th previous eigenvalues; when $\mu$ is not large enough, it is enlarged by another such step. Its computation is similar to the first approach. Since we are solving sequences of eigenvalue problems, these parameters can be updated during the SCF iterations. For the first few SCF iterations, Algorithm 4 needs at most a few iterations to obtain the upper bound. Since the matrix is banded with small bandwidth, the computation of these inertias is very cheap.
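The inertia computation can be sketched in the tridiagonal case (semi-bandwidth 1); the banded case uses the same count of negative pivots from an $LDL^T$ factorization. The helper names and the parameters `tau` and `ratio` below are made up, and the loop mirrors the enlargement idea of Algorithm 4 rather than reproducing it:

```python
def count_below(diag, off, sigma):
    """Number of eigenvalues of a symmetric tridiagonal matrix smaller
    than sigma: the number of negative pivots in the LDL^T (Sturm
    sequence) recurrence applied to T - sigma*I."""
    count, d = 0, 1.0
    for i in range(len(diag)):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = diag[i] - sigma - b2 / d
        if d == 0.0:           # perturb an exact zero pivot slightly
            d = -1e-300
        if d < 0.0:
            count += 1
    return count

def upper_bound(diag, off, nev, mu0, tau=0.1, ratio=1.2):
    """Enlarge the trial bound mu0 until at least nev eigenvalues lie
    below it (the enlargement idea of Algorithm 4, parameters made up)."""
    mu = mu0 + tau
    while count_below(diag, off, mu) < nev:
        tau *= ratio
        mu += tau
    return mu

# T = tridiag(1, 2, 1) of order 5: eigenvalues 2 - 2*cos(k*pi/6), k=1..5
diag = [2.0] * 5
off = [1.0] * 4
```

For this matrix the eigenvalues are approximately 0.27, 1, 2, 3 and 3.73, so `count_below(diag, off, 2.5)` returns 3, and `upper_bound(diag, off, 4, 2.0)` keeps enlarging the bound until four eigenvalues lie below it.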
The lower bound can be computed similarly, making sure that it really is a lower bound. When computing the smallest few eigenvalues, the lower bound is relatively easy to choose, since any lower bound of the spectrum works. It can likewise be updated using the distances between neighboring previous eigenvalues, with a negative shift. When computing interior eigenvalues, the lower bound should be chosen carefully, lest the interval enclose too many eigenvalues. Since the process is similar to Algorithm 4, it is not included.
The spectrum is divided into slices, and one process group works on one slice. Note that only the first process group needs to compute the lower boundary point, and only the last process group needs to compute the upper boundary point. Each process group computes the eigenvalues in its own interval, and it can determine the exact number of eigenvalues in that interval by computing the inertia of the matrix shifted by the two interval endpoints. Therefore, every process group can check whether the computed eigenvalues are correct, and if some eigenvalues are missed, the corresponding process group reruns FEAST with more starting vectors.
In our problem, the exact number of eigenvalues in each slice is known, and it is used to check the correctness of the computed eigenpairs. For robustness, we further enlarge the number of starting vectors used by FEAST (the enlargement is chosen with some margin, to handle the case that the estimated count in a slice is very small). In our experiments, FEAST always computes more eigenvalues than required and no eigenvalues are missed.
4. Numerical results
All the experimental results are obtained on the Tianhe-2 supercomputer (Liao et al., 2014; Liu et al., 2016), located in Guangzhou, China. Each compute node is equipped with two Intel Xeon E5-2692 v2 CPUs. The details of the test platform and environment of the compute nodes are shown in Table 1. For all the numerical experiments we use plain MPI, running one MPI process per core. Our codes are written in Fortran 90 and part of them will be released on GitHub (https://github.com/shengguolsg/).
Items             Values
CPU               Intel Xeon E5-2692 v2 @ 2.2 GHz
Memory size       64 GB (DDR3)
Operating system  Linux 3.10.0
Compiler          Intel ifort
Optimization      -O3 -mavx
Example 1. For the experiments in this example, we use an eigenvalue problem obtained from a DFT simulation with the FLEUR code (Blügel et al., 2017) that produces a matrix labeled NaCl, which has also been used in (Winkelmann et al., 2019). This matrix is Hermitian. We use ELPA to compute partial and full eigendecompositions of this matrix; the point is to show that step (II) has poor scalability. The ELPA code that we use is obtained from the ELSI package v2.5.0 (Yu et al., 2020). The performance of ELPA depends strongly on the block size used for storing matrices in the 2D block-cyclic data distribution (BCDD) format. Many works discuss good choices of the block size, and values that are too small degrade the performance; see (Auckenthaler et al., 2011) and (Li et al., 2021b) for more details. On the Tianhe-2 supercomputer there is no big difference for ELPA among the good candidate block sizes, and all the performance results of ELPA in this paper are obtained with one fixed block size.
We use the matrix NaCl to construct a standard eigenvalue problem, solve it with ELPA using 64 to 4096 processes, and measure the times of steps (I)–(III); the results for computing partial and full eigendecompositions are shown in Table 2. From the results, we can see that

- the timings of step (II) nearly stop decreasing beyond a certain number of processes;

- step (II) takes about half of the total time, especially when the number of processes is large.
Therefore, it is better to replace step (II) with a more efficient and faster solver. In this paper, we use a spectrum slicing approach for this step. The next example shows that this approach can reduce the time of step (II) by half, and that our final algorithm (PDESHEP) can be significantly faster than ELPA.
Table 2: timings (in seconds) of steps (I), (II) and (III) of ELPA with 64, 256, 1024 and 4096 processes, for three different numbers of computed eigenpairs.
Example 2. In this example, we use a sequence of eigenvalue problems: the NaCl sequence used in the previous example and in (Winkelmann et al., 2019). We first use PDESHEP to compute the partial eigendecomposition of a matrix near the end of the sequence by using the eigenpairs of the previous one; the purpose is to show that PDESHEP works well for highly correlated eigenvalue problems. The smallest eigenvalues of this matrix are shown in Figure 7. Since the SCF loop is nearly converged, the eigenvalues of consecutive matrices are quite similar, and the starting vectors obtained from the previous matrix are also very good approximations.
The ChASE algorithm (Winkelmann et al., 2019) computes the smallest eigenpairs of NaCl by using Chebyshev accelerated subspace iterations. When computing many eigenpairs, ChASE becomes much slower since it does not use a spectrum slicing strategy. The advantage of our algorithm is that it can compute many (even all) eigenpairs efficiently. In this example, we compute two different numbers of eigenpairs for this matrix.
We choose a block size that is nearly optimal on our test platform. We first reduce the NaCl matrix to a banded Hermitian matrix with small semi-bandwidth, and call FEAST to compute the eigenpairs in all slices simultaneously. Since steps (I) and (III) are the same as in ELPA, we mainly compare step (II) of ELPA and PDESHEP. The timing results of FEAST shown in Fig. 8 are obtained when using the eigenvectors of the banded matrix from the previous SCF iteration. With random starting vectors, FEAST requires more time, as illustrated in Table 3. From the results in Table 3, we can see that using the previous eigenvectors as starting vectors can be more than four times faster than using random starting vectors at large process counts. The results show that using previous eigenvectors as initial vectors provides a significant benefit when the eigenvalue problems are highly correlated. We therefore suggest using PDESHEP for the last few SCF iterations before convergence.
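For illustration, a minimal dense FEAST-style iteration can be written as below: the contour integral of the resolvent is discretized by a trapezoidal rule on a circle around the target interval, and the previous eigenvectors (when available) are used as the starting block instead of random vectors. This is a toy sketch of the general idea, not the FEAST library's implementation or its quadrature.

```python
import numpy as np

def apply_filter(A, X, a, b, nq=16):
    """Apply an approximate spectral projector for [a, b] to the block X.

    P = (1/(2*pi*i)) * contour integral of (zI - A)^{-1} dz is
    discretized with nq trapezoidal nodes on the circle through a and b.
    """
    n = A.shape[0]
    c, r = (a + b) / 2.0, (b - a) / 2.0
    Q = np.zeros_like(X, dtype=complex)
    for k in range(nq):
        theta = 2.0 * np.pi * (k + 0.5) / nq
        z = c + r * np.exp(1j * theta)
        Q += r * np.exp(1j * theta) * np.linalg.solve(z * np.eye(n) - A, X)
    return (Q / nq).real   # the projector is real for real symmetric A

def feast(A, a, b, m, X0=None, iters=3, nq=16, seed=0):
    """FEAST-style subspace iteration with m >= #eigenvalues in [a, b].

    X0 (e.g. eigenvectors from the previous SCF iteration) is used as
    the starting block when provided; otherwise random vectors are used.
    """
    n = A.shape[0]
    X = X0 if X0 is not None else np.random.default_rng(seed).standard_normal((n, m))
    for _ in range(iters):
        Q, _ = np.linalg.qr(apply_filter(A, X, a, b, nq))
        w, V = np.linalg.eigh(Q.T @ A @ Q)   # Rayleigh-Ritz step
        X = Q @ V
    return w, X
```

With a good starting block the loop needs fewer iterations to reach the residual tolerance, which is exactly the benefit observed in Table 3 for warm-started FEAST.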
Table 3: timings of FEAST with and without the previous eigenvectors as starting vectors, with 64, 256, 1024 and 4096 processes.
When comparing with step (II) of ELPA, we add the time of FEAST to the time spent in data redistributions. The speedups over step (II) of ELPA when using different numbers of processes, for both partial and full eigendecompositions, are shown in Fig. 7(a); the speedup is largest at high process counts. The total speedups of PDESHEP over ELPA are shown in Fig. 7(b): PDESHEP can be more than 20% faster than ELPA when computing the full or partial eigendecomposition.
During the first few SCF iterations, the matrices can be quite different, and so are the eigenvalues and eigenvectors. We now apply PDESHEP to a matrix from an early stage of the SCF iterations and compare it with ELPA. We choose an earlier NaCl matrix in the sequence and use the same parameters. We first compute the eigendecomposition of the preceding NaCl matrix and use its eigenvalues to partition the spectrum of the current matrix. Then, we use the eigenvectors of the preceding matrix as initial vectors for FEAST. We see that the 'good' estimated initial vectors do help the convergence of FEAST, but the benefit is not as significant as that observed near convergence. The results are shown in Table 4, where the column 'With' shows the times (in seconds) spent in step (II) by FEAST when the initial vectors are constructed from the eigenvectors of the preceding matrix; the column 'Without' shows the times when using random initial vectors; and the column 'ELPA' shows the times spent in step (II) by ELPA. For this matrix, we use at most 1024 processes. From the results, we can see that PDESHEP can still be faster than ELPA for this earlier matrix.
Table 4: timings of step (II) for FEAST with and without estimated initial vectors, and for ELPA, with 64, 256 and 1024 processes.
For matrices from even earlier stages of the SCF iterations, the 'good' estimated initial vectors and eigenvalues help very little, and FEAST may cost more than step (II) of ELPA when computing the eigenpairs to full accuracy. However, at the early stages of the SCF iterations, the eigenvalue problems only need to be solved approximately. Therefore, we relax the residual tolerance and find that step (II) of PDESHEP is again faster than that of ELPA; the results for such a matrix are shown in Table 5.
Table 5: timings of step (II) for FEAST with and without estimated initial vectors, and for ELPA, with 64, 256 and 1024 processes, for a matrix from an earlier SCF stage.
In this paper, the orthogonality is measured as

(17)   || Q̂^H Q̂ - I ||_F / n,

where n is the dimension of the matrix, Q̂ is the computed eigenvector matrix, and I is the identity matrix whose size equals the number of computed eigenvectors. The orthogonality of the computed eigenvectors is shown in Fig. 9; it is at an acceptable level for most material simulations in all cases. Though the orthogonality of the eigenvectors computed by PDESHEP is not as good as that obtained by direct methods, which is usually around the machine precision, it should be accurate enough for the SCF iterations. The computed residuals of the eigenpairs are all below the residual tolerance chosen for FEAST in this paper. In the previous work (Aktulga et al., 2014), the residual tolerance was a little larger than that used in this paper.

Since ChASE does not implement spectrum slicing techniques, it is suggested to compute only a small fraction of the eigenpairs of a matrix (Aktulga et al., 2014). We use ChASE to compute the smallest eigenpairs of the same NaCl matrix; the orthogonality of the eigenvectors computed by ChASE is better than that of PDESHEP, but ChASE costs much more time. If we tighten the residual tolerance of FEAST, the orthogonality of the eigenvectors computed by PDESHEP also improves, for all tested numbers of MPI processes. The results are shown in Table 6. The convergence of FEAST depends on the residual tolerance, and we find that FEAST may take many iterations to converge if the tolerance is chosen too tight.
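This orthogonality measure can be computed directly; the sketch below assumes the Frobenius norm normalized by the dimension n, which is one common convention for this quantity.

```python
import numpy as np

def orthogonality(Q):
    """Departure from orthonormality of the computed eigenvector
    matrix Q (n x k): ||Q^H Q - I||_F / n. The Frobenius norm and
    the normalization by n are one common convention, assumed here."""
    n, k = Q.shape
    G = Q.conj().T @ Q
    return np.linalg.norm(G - np.eye(k)) / n
```

For eigenvectors produced by direct methods this value is near machine precision; for the slicing approach it is larger but still acceptable for SCF iterations.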
Table 6: orthogonality of the eigenvectors computed by PDESHEP and ChASE.
The results of this example show that PDESHEP can compute both partial and full eigendecompositions efficiently, and the computed eigenvectors are orthogonal to an acceptable accuracy when the slices are constructed carefully. From our experimental results, we find that the k-means based spectrum partition strategy (Algorithm 3) is helpful for the orthogonality of the computed eigenvectors. We compared Algorithm 3 with two other partition approaches, Approach-1 and Approach-2, and Algorithm 3 performs best.
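A one-dimensional Lloyd-type iteration over the previous (approximate) eigenvalues conveys the idea of such a partition: cluster centers migrate toward dense parts of the spectrum, so the midpoints between adjacent centers tend to fall in spectral gaps. This is a simplified sketch; the paper's Algorithm 3 additionally accounts for load balance across slices.

```python
import numpy as np

def kmeans_slices(evals, nslices, iters=50):
    """Partition approximate eigenvalues into nslices clusters with a
    1-D Lloyd iteration; the slice boundaries are the midpoints
    between adjacent cluster centers."""
    evals = np.sort(np.asarray(evals, dtype=float))
    # initialize the centers at evenly spaced quantiles
    centers = np.quantile(evals, (np.arange(nslices) + 0.5) / nslices)
    for _ in range(iters):
        # assign each eigenvalue to its nearest center
        labels = np.argmin(np.abs(evals[:, None] - centers[None, :]), axis=1)
        for j in range(nslices):
            if np.any(labels == j):
                centers[j] = evals[labels == j].mean()
        centers = np.sort(centers)
    # boundaries between consecutive slices
    return (centers[:-1] + centers[1:]) / 2.0
```

Placing boundaries in gaps avoids cutting through a cluster of eigenvalues, which is exactly the situation that harms convergence and orthogonality.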
Example 3. In this example, we use some relatively large matrices to test our algorithm. The first class is AuAg, which originates from simulations using FLAPW (Jansen and Freeman, 1984) and has been used in (Winkelmann et al., 2019). These matrices have a different spectral distribution compared to the NaCl matrices. We compute the eigendecomposition of one matrix in the sequence by using the information from the previous one.
The other two matrices are from the SuiteSparse collection (Davis and Hu, 2011): one is Si5H12 and the other is the H2O matrix, both of which are real symmetric and sparse. Since there is only one such matrix of each type, we construct two further matrices by adding a small random symmetric perturbation to the nonzero entries, such that the perturbed matrices have the same structure and similar eigenvalues. A nonzero entry a_ij is perturbed as

(18)   ã_ij = a_ij (1 + ε r_ij),

where ε is a small constant and r_ij = r_ji is a random real number drawn from a fixed interval. Note that the contour integral based algorithms always operate in complex arithmetic. Therefore, FEAST may require more flops than other iterative methods for real matrices.
In this example, we compute the smallest eigenpairs and use 256, 1024, and 4096 processes. The results of ELPA are shown in Table 7. For these experiments, the convergence residual tolerance is the same as before, and the dimension of the FEAST subspace is chosen proportional to the estimated number of eigenvalues in the current slice. We note that the subspace dimension can affect the performance considerably. When it is small, the linear systems have fewer right-hand sides and are much easier to solve; when it is too small, however, FEAST may fail to converge. There is a trade-off between speed and robustness.
Table 7: timings of steps (I), (II) and (III) of ELPA for the AuAg, Si5H12 and H2O matrices with 256, 1024 and 4096 processes.
The speedups of FEAST over step (II) of ELPA are shown in Fig. 9(a). The results show that it is difficult to predict the speedup of spectrum slicing algorithms over step (II) of ELPA, because the partition of the spectrum affects the performance of spectrum slicing algorithms. It is also very hard to know the best number of slices, which depends on the distribution of the eigenvalues and on the spectrum slicing algorithm used. Figure 6 shows the numbers of eigenvalues in the slices for H2O. They are not well balanced: most slices contain a similar number of eigenvalues, but one slice contains far fewer and another far more. Compared with the load imbalance problem, we find that a good separation of the eigenvalues may be more important: if a slice boundary falls inside a cluster of eigenvalues, it may cause convergence and orthogonality problems. We use our simplified k-means algorithm to partition the spectrum, and the load balance problem is also taken into account. The total speedups of PDESHEP over ELPA are shown in Fig. 9(b); they are always larger than one. The wall times of FEAST for the second step, step (II), are shown in Table 8.
Table 8: wall times of FEAST (step (II) of PDESHEP) for the AuAg, Si5H12 and H2O matrices with 256, 1024 and 4096 processes.
5. Conclusions and future work
In this paper, we propose a new framework, PDESHEP, for computing the partial or full eigendecomposition of dense Hermitian or real symmetric matrices, which can be easily extended to generalized eigenvalue problems. PDESHEP combines classical direct methods with spectrum slicing algorithms; in this work, we combine ELPA (Auckenthaler et al., 2011) with FEAST (Kestyn et al., 2016). PDESHEP can also be combined with other spectrum slicing algorithms, such as those implemented in SLEPc (Campos and Roman, 2012) or EVSL (Li et al., 2019); this is a topic for future work.
In PDESHEP, a dense matrix is first reduced to banded form, and spectrum slicing algorithms are then used to compute its partial eigendecomposition instead of further reducing it to tridiagonal form. Therefore, the symmetric bandwidth reduction (SBR) process, which consists of memory-bound operations, is avoided. For the correlated eigenvalue problems arising in self-consistent field (SCF) iterations, PDESHEP is considerably faster than ELPA. Numerical results are obtained on the Tianhe-2 supercomputer using up to 4096 processes. The tested matrices include dense Hermitian matrices from real applications and large real symmetric sparse matrices downloaded from the SuiteSparse collection (Davis and Hu, 2011). Another line of future work is to incorporate PDESHEP into ELSI (Yu et al., 2020) or into electronic structure packages such as QE (Giannozzi et al., 2009) to run real simulations.
Acknowledgements.
The authors would like to acknowledge many helpful discussions with Chao Yang from LBNL in the USA, and with Chun Huang, Min Xie and Tao Tang from NUDT in China, and also want to thank Dr. Di Napoli from Jülich Supercomputing Centre for providing some of the matrices used in the experiments. Most of this work was done while the first author was visiting Universitat Politècnica de València during 2020, and he wants to thank J. Roman for his hospitality. This work is in part supported by the National Key R&D Program of China (2017YFB0202104), the National Natural Science Foundation of China (No. NNW2019ZT6B20, NNW2019ZT6B21, NNW2019ZT5A10), the NSF of Hunan (No. 2019JJ40339), and the NSF of NUDT (No. ZK180301). Xinzhe Wu was in part supported by the European Union (EU) under the PRACE-6IP WP8 Performance Portable Linear Algebra project (grant agreement ID 823767). Jose E. Roman was supported by the Spanish Agencia Estatal de Investigación (AEI) under project SLEPc-DA (PID2019-107379RB-I00).

References
 Aktulga et al. (2014) Hasan Metin Aktulga, Lin Lin, Christopher Haine, Esmond G Ng, and Chao Yang. 2014. Parallel eigenvalue calculation based on multiple shift–invert Lanczos and contour integral based spectral projection method. Parallel Comput. 40, 7 (2014), 195–212.
 Arbenz (1992) P. Arbenz. 1992. Divide-and-conquer algorithms for the band symmetric eigenvalue problem. Parallel Comput. 18 (1992), 1105–1128.
 Auckenthaler et al. (2011) T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. 2011. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Comput. 37, 12 (2011), 783–794.
 Bai et al. (2000) Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. 2000. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Banerjee et al. (2016) Amartya S Banerjee, Lin Lin, Wei Hu, Chao Yang, and John E Pask. 2016. Chebyshev polynomial filtered subspace iteration in the discontinuous Galerkin method for large-scale electronic structure calculations. The Journal of Chemical Physics 145, 15 (2016), 154101.
 Bischof et al. (1994) C. Bischof, B. Lang, and X. Sun. 1994. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference. IEEE Computer Society Press, Los Alamitos, CA, 23–27.
 Bischof et al. (2000) C. H. Bischof, B. Lang, and X. B. Sun. 2000. Algorithm 807: The SBR toolbox: software for successive band reduction. ACM Trans. Math. Softw. 26, 4 (2000), 602–616.
 Blackford et al. (1997) J. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Blügel et al. (2017) Stefan Blügel, Gustav Bihlmayer, and Daniel Wortmann. 2017. The FLEUR Code. Website: http://www.flapw.de.
 Campos and Roman (2012) Carmen Campos and Jose E Roman. 2012. Strategies for spectrum slicing based on restarted Lanczos methods. Numerical Algorithms 60, 2 (2012), 279–295.
 Cuppen (1981) J. J. M. Cuppen. 1981. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer. Math. 36 (1981), 177–195.
 Davis and Hu (2011) Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1–25.
 Dhillon (1997) I. S. Dhillon. 1997. A New Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem. Ph.D. Dissertation. Computer Science Division, University of California, Berkeley, California.
 Dhillon et al. (2006) I. S. Dhillon, B. N. Parlett, and C. Vömel. 2006. The design and implementation of the MRRR algorithm. ACM Trans. Math. Softw. 32, 4 (2006), 533–560.
 Francis (1962) John GF Francis. 1962. The QR transformation–part 2. Comput. J. 4, 4 (1962), 332–345.
 Gansterer et al. (2002) W. N. Gansterer, R. C. Ward, and R. P. Muller. 2002. An extension of the divide-and-conquer method for a class of symmetric block-tridiagonal eigenproblems. ACM Trans. Math. Softw. 28, 1 (2002), 45–58.

 Giannozzi et al. (2009) Paolo Giannozzi, Stefano Baroni, Nicola Bonini, Matteo Calandra, Roberto Car, Carlo Cavazzoni, Davide Ceresoli, Guido L Chiarotti, Matteo Cococcioni, Ismaila Dabo, et al. 2009. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. Journal of Physics: Condensed Matter 21, 39 (2009), 395502.
 Golub and Loan (1996) G. H. Golub and C. F. Van Loan. 1996. Matrix Computations (3rd ed.). The Johns Hopkins University Press, Baltimore, MD.
 Grimes et al. (1994) Roger G Grimes, John G Lewis, and Horst D Simon. 1994. A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems. SIAM J. Matrix Anal. Appl. 15, 1 (1994), 228–272.
 Gu and Eisenstat (1995) M. Gu and S. C. Eisenstat. 1995. A divideandconquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl. 16 (1995), 172–191.
 Guttel et al. (2015) Stefan Guttel, Eric Polizzi, Ping Tak Peter Tang, and Gautier Viaud. 2015. Zolotarev quadrature rules and load balancing for the FEAST eigensolver. SIAM Journal on Scientific Computing 37, 4 (2015), A2100–A2122.
 Haidar et al. (2012) A. Haidar, H. Ltaief, and J. Dongarra. 2012. Toward a high performance tile divide and conquer algorithm for the dense symmetric eigenvalue problem. SIAM J. Sci. Comput. 34, 6 (2012), C249–C274.
 Hernandez et al. (2005) V. Hernandez, J. E. Roman, and V. Vidal. 2005. SLEPc: A Scalable and Flexible Toolkit for the solution of eigenvalue problems. ACM Trans. Math. Softw. 31, 3 (2005), 351–362.
 Imamura et al. (2011) T. Imamura, S. Yamada, and M. Yoshida. 2011. Development of a high-performance eigensolver on a petascale next-generation supercomputer system. Prog. Nucl. Sci. Technol. 2 (2011), 643–650.
 Jansen and Freeman (1984) HJF Jansen and Arthur J Freeman. 1984. Total-energy full-potential linearized augmented-plane-wave method for bulk solids: Electronic and structural properties of tungsten. Physical Review B 30, 2 (1984), 561.
 Kaufman (2007) Linda Kaufman. 2007. The retraction algorithm for factoring banded symmetric matrices. Numerical Linear Algebra with Applications 14, 3 (2007), 237–254.
 Keçeli et al. (2018) Murat Keçeli, Fabiano Corsetti, Carmen Campos, Jose E Roman, Hong Zhang, Álvaro Vázquez-Mayagoitia, Peter Zapol, and Albert F Wagner. 2018. SIESTA-SIPs: Massively parallel spectrum-slicing eigensolver for an ab initio molecular dynamics package. Journal of Computational Chemistry 39, 22 (2018), 1806–1814.
 Keçeli et al. (2016) Murat Keçeli, Hong Zhang, Peter Zapol, David A Dixon, and Albert F Wagner. 2016. Shift-and-invert parallel spectral transformation eigensolver: Massively parallel performance for density-functional based tight-binding. Journal of Computational Chemistry 37, 4 (2016), 448–459.
 Kestyn et al. (2016) James Kestyn, Vasileios Kalantzis, Eric Polizzi, and Yousef Saad. 2016. PFEAST: a high performance sparse eigenvalue solver using distributed-memory linear solvers. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 178–189.
 Knyazev (2001) Andrew V Knyazev. 2001. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM journal on scientific computing 23, 2 (2001), 517–541.
 Kohn (1999) Walter Kohn. 1999. Nobel Lecture: Electronic structure of matter—wave functions and density functionals. Reviews of Modern Physics 71, 5 (1999), 1253.
 Lehoucq et al. (1998) Richard B Lehoucq, Danny C Sorensen, and Chao Yang. 1998. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM.
 Li et al. (2019) Ruipeng Li, Yuanzhe Xi, Lucas Erlandson, and Yousef Saad. 2019. The eigenvalues slicing library (EVSL): Algorithms, implementation, and software. SIAM Journal on Scientific Computing 41, 4 (2019), C393–C415.
 Li et al. (2016) Ruipeng Li, Yuanzhe Xi, Eugene Vecharynski, Chao Yang, and Yousef Saad. 2016. A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems. SIAM Journal on Scientific Computing 38, 4 (2016), A2512–A2534.
 Li et al. (2021a) Shengguo Li, Hao Jiang, Dezun Dong, Chun Huang, Jie Liu, Xia Liao, and Xuguang Chen. 2021a. Efficient data redistribution algorithms from irregular to block cyclic data distribution. submitted to IEEE Transactions on parallel and distributed systems (Minor revision).
 Li et al. (2021b) Shengguo Li, Xia Liao, Yutong Lu, Jose E. Roman, and Xiaoqiang Yue. 2021b. A parallel structured banded DC algorithm for symmetric eigenvalue problems. Submitted to CCF Transactions on High Performance Computing (2021).
 Liao et al. (2016) X. Liao, S. Li, L. Cheng, and M. Gu. 2016. An improved divideandconquer algorithm for the banded matrices with narrow bandwidths. Comput. Math. Appl. 71 (2016), 1933–1943.
 Liao et al. (2014) X. Liao, L. Xiao, C. Yang, and Y. Lu. 2014. MilkyWay-2 supercomputer: system and application. Front. Comput. Sci. 8, 3 (2014), 345–356.
 Lin et al. (2016) Lin Lin, Yousef Saad, and Chao Yang. 2016. Approximating spectral densities of large matrices. SIAM review 58, 1 (2016), 34–65.
 Liu et al. (2016) Y. Liu, C. Yang, F. Liu, X. Zhang, Y. Lu, Y. Du, C. Yang, M. Xie, and X. Liao. 2016. 623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores. International Journal of High Performance Computing Applications 30, 1 (2016), 39–54.
 Lloyd (1982) Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129–137.
 Marek et al. (2014) A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer. 2014. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. J. Phys.: Condens. Matter 26 (2014), 1–15.
 Parlett (1998) B. N. Parlett. 1998. The symmetric eigenvalue problem. SIAM, Philadelphia.
 Peter Tang and Polizzi (2014) Ping Tak Peter Tang and Eric Polizzi. 2014. FEAST as a subspace iteration eigensolver accelerated by approximate spectral projection. SIAM J. Matrix Anal. Appl. 35, 2 (2014), 354–390.
 Petschow et al. (2013) M. Petschow, E. Peise, and P. Bientinesi. 2013. High-performance solvers for dense Hermitian eigenproblems. SIAM J. Sci. Comput. 35 (2013), C1–C22.
 Polizzi (2009) Eric Polizzi. 2009. Density-matrix-based algorithm for solving eigenvalue problems. Physical Review B 79, 11 (2009), 115112.
 Polizzi and Sameh (2007) Eric Polizzi and Ahmed Sameh. 2007. SPIKE: A parallel environment for solving banded linear systems. Computers & Fluids 36, 1 (2007), 113–120.
 Poulson et al. (2013) J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Software 39, 2 (2013), 13:1–13:24.
 Saad (2011) Y. Saad. 2011. Numerical Methods for Large Eigenvalue Problems. SIAM, Philadelphia, PA.
 Sakurai and Sugiura (2003) T. Sakurai and H. Sugiura. 2003. A projection method for generalized eigenvalue problems using numerical integration. J. Comput. Appl. Math. 159 (2003), 119–128.
 Sakurai and Tadano (2007) Tetsuya Sakurai and Hiroto Tadano. 2007. CIRR: a Rayleigh-Ritz type method with contour integral for generalized eigenvalue problems. Hokkaido Mathematical Journal 36, 4 (2007), 745–757.
 Salas et al. (2015) Pablo Salas, Luc Giraud, Yousef Saad, and Stéphane Moreau. 2015. Spectral recycling strategies for the solution of nonlinear eigenproblems in thermoacoustics. Numerical Linear Algebra with Applications 22, 6 (2015), 1039–1058.
 Silver and Röder (1994) RN Silver and H Röder. 1994. Densities of states of megadimensional Hamiltonian matrices. International Journal of Modern Physics C 5, 04 (1994), 735–753.
 Sylvester (1852) James Joseph Sylvester. 1852. XIX. A demonstration of the theorem that every homogeneous quadratic polynomial is reducible by real orthogonal substitutions to the form of a sum of positive and negative squares. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 4, 23 (1852), 138–142.
 Tisseur and Dongarra (1999) F. Tisseur and J. Dongarra. 1999. A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures. SIAM J. Sci. Comput. 20, 6 (1999), 2223–2236.

 Wang (1994) Lin-Wang Wang. 1994. Calculating the density of states and optical-absorption spectra of large quantum systems by the plane-wave moments method. Physical Review B 49, 15 (1994), 10154.
 Willems and Lang (2013) Paul R Willems and Bruno Lang. 2013. A Framework for the MR³ Algorithm: Theory and Implementation. SIAM Journal on Scientific Computing 35, 2 (2013), A740–A766.
 Williams-Young et al. (2020) David B Williams-Young, Paul G Beckman, and Chao Yang. 2020. A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation. ACM Transactions on Mathematical Software (TOMS) 46, 4 (2020), 1–31.
 Winkelmann et al. (2019) Jan Winkelmann, Paul Springer, and Edoardo Di Napoli. 2019. ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems. ACM Transactions on Mathematical Software (TOMS) 45, 2 (2019), 1–34.
 Xi et al. (2018) Yuanzhe Xi, Ruipeng Li, and Yousef Saad. 2018. Fast computation of spectral densities for generalized eigenvalue problems. SIAM Journal on Scientific Computing 40, 4 (2018), A2749–A2773.
 Yu et al. (2020) Victor Wenzhe Yu, Carmen Campos, William Dawson, Alberto García, Ville Havu, Ben Hourahine, William P Huhn, Mathias Jacquelin, Weile Jia, Murat Keçeli, et al. 2020. ELSI–An open infrastructure for electronic structure solvers. Computer Physics Communications 256 (2020), 107459.
 Zhang et al. (2007) Hong Zhang, Barry Smith, Michael Sternberg, and Peter Zapol. 2007. SIPs: Shiftandinvert parallel spectral transformations. ACM Transactions on Mathematical Software (TOMS) 33, 2 (2007), 1–19.