GCGE: A Package for Solving Large Scale Eigenvalue Problems by Parallel Block Damping Inverse Power Method

11/12/2021
by Yu Li, et al.
Chinese Academy of Sciences

We propose an eigensolver and the corresponding package, GCGE, for solving large scale eigenvalue problems. The method combines the damping idea, the subspace projection method and the inverse power method with dynamic shifts. To reduce the dimensions of the projection subspaces, a moving mechanism is developed for the case where the number of desired eigenpairs is large. The numerical methods, implementing techniques and the structure of the package are presented. Plenty of numerical results are provided to demonstrate the efficiency, stability and scalability of the eigensolver and the package GCGE for computing many eigenpairs of large symmetric matrices arising from applications.


1. Introduction

A fundamental and challenging task in modern science and engineering is to solve large scale eigenvalue problems. Although high-dimensional eigenvalue problems are ubiquitous in the physical sciences, data and imaging sciences, and machine learning, there are not nearly as many classes of eigensolvers as there are of linear solvers. Compared with linear systems, there are fewer efficient numerical methods for solving large scale eigenvalue problems, which poses significant challenges for scientific computing (Bai et al., 2000). In particular, the eigenvalue problems arising from complicated systems create a strong demand for eigensolvers with good efficiency, stability and scalability at the same time (Fan et al., 2014, 2015; Yu et al., 2018).

Krylov subspace methods such as the Arnoldi and Lanczos methods are commonly used to design eigensolvers (Saad, 1992). In order to use explicitly and implicitly restarted techniques for generalized eigenvalue problems, it is necessary to solve the included linear equations exactly to produce upper Hessenberg matrices. But meeting this requirement is very difficult for large scale sparse matrices with poor conditioning. Based on this consideration, LOBPCG is built on iteration processes that do not need to solve the included linear equations exactly (Knyazev and Neymeyr, 2003; Knyazev, 2006; Hetmaniuk and Lehoucq, 2006; Knyazev et al., 2007; Duersch et al., 2018). This property makes LOBPCG a reasonable candidate for solving large scale eigenvalue problems on parallel computers. But its subspace generating method and orthogonalization scheme lead to the instability of the LOBPCG algorithm (Li et al., 2020; Zhang et al., 2020).

The advent of high performance computers raises new issues for computing many eigenpairs of large scale matrices, a task which does not yet enjoy the efficiency and scalability of solving large scale linear systems. Solving eigenvalue problems on high performance computers requires new considerations about the stability and scalability of the orthogonalization of many vectors, and about the efficiency and memory cost of solving Rayleigh-Ritz problems. The aim of this paper is to develop a method and the corresponding package for solving symmetric eigenvalue problems. The method combines the damping idea, the subspace projection method and the inverse power method with dynamic shifts. The package GCGE (Generalized Conjugate Gradient Eigensolver) is written in C and constructed in a matrix-free and vector-free manner. In order to improve efficiency, stability and scalability, we also introduce new efficient implementing techniques for the orthogonalization and for computing Rayleigh-Ritz problems. A recursive orthogonalization method with SVD (Singular Value Decomposition) is proposed in order to improve parallel efficiency. In addition, we provide a moving mechanism to reduce the dimensions of the projection subspaces when solving Rayleigh-Ritz problems. The source code can be downloaded from GitHub at

https://github.com/Materials-Of-Numerical-Algebra/GCGE.

The rest of the paper is organized as follows. In Section 2, we present the proposed algorithm for eigenvalue problems. The implementing techniques are described in Section 3. In Section 4, plenty of numerical tests are provided to demonstrate the efficiency, stability and scalability of the proposed algorithm and the associated package. Concluding remarks are given in the last section.

2. GCG algorithm

For simplicity, in this paper we are concerned with the following generalized algebraic eigenvalue problem: Find an eigenvalue λ and an eigenvector x such that

(1)    Ax = λBx,

where A is a real symmetric matrix and B is a real symmetric positive definite (SPD) matrix.

The generalized conjugate gradient (GCG) algorithm is a type of subspace projection method which uses the block damping inverse power idea to generate the triple blocks [X, P, W], where X saves the current eigenvector approximations, P saves the information from the previous iteration step, and W saves the vectors produced from X by the inverse power iteration with some CG steps. We refer to this method as the generalized conjugate gradient algorithm since the structure of the triple blocks [X, P, W] is similar to that of the conjugate gradient method. Assuming that the smallest numEigen eigenpairs are desired, the corresponding GCG algorithm is defined by Algorithm 1, where numEigen stands for the number of desired eigenpairs.

1. Choose numEigen vectors to build the block X and two null blocks P, W.
2. Define V = [X, P, W] and orthogonalize V in the sense of the inner product induced by the matrix B.
3. Solve the Rayleigh-Ritz problem to obtain approximate eigenvalues and eigenvectors in the projection subspace, then form the new approximate eigenvectors X.
4. Check the convergence of the eigenpair approximations. If the smallest numEigen eigenpairs have converged, the iteration stops.
5. Otherwise, compute P and update V = [X, P, W].
6. Generate W by approximately solving the shifted linear equations (A − θB)W = BX with some CG steps and the columns of X as initial guesses, where the shift θ is selected dynamically.
7. Then go to STEP 2.
Algorithm 1 GCG algorithm

The main difference between Algorithm 1 and LOBPCG lies in the way W is generated and V = [X, P, W] is orthogonalized. The GCG algorithm uses the inverse power method with dynamic shifts to generate W. Meanwhile, the full orthogonalization of V is carried out in order to guarantee numerical stability. In addition, a new type of recursive orthogonalization method with SVD is designed in the next section.

In STEP 3 of Algorithm 1, solving the Rayleigh-Ritz problem is a sequential process which cannot be accelerated by normal parallel computing. Furthermore, it is well known that the computing time depends superlinearly on the number of desired eigenpairs (Saad, 1992). In order to accelerate this part, reducing the dimensions of the Rayleigh-Ritz problems is a reasonable approach. We compute the desired eigenpairs in batches when the number of desired eigenpairs is large. In each iteration step, the dimensions of P and W are set to be at most blockSize. Moreover, a moving mechanism is presented for computing a large number of desired eigenpairs. These two strategies not only reduce the time proportion of the sequential process of solving Rayleigh-Ritz problems but also reduce the amount of memory required by STEP 3. In addition, the Rayleigh-Ritz problem is distributed to multiple computing processes and each process only computes a small part of the desired eigenpairs. In other words, the Rayleigh-Ritz problem is solved in parallel. More details of the implementing techniques are given in Sections 3.2 and 3.3.

In STEP 6 of Algorithm 1, although the shifted matrix A − θB may not be SPD, the CG iteration method is adopted for solving the included linear equations thanks to the warm start and the choice of the shift θ. Please see Section 3.2 for more details. Furthermore, it is suggested to use the algebraic multigrid method as the preconditioner for STEP 6 of Algorithm 1 with the shift θ, when the concerned matrices are sparse and come from the discretization of partial differential operators by finite element, finite difference or finite volume methods.

3. Implementing techniques

In this section, we introduce the implementing techniques used to improve the efficiency, scalability and stability of the concerned eigensolver. Based on the discussion in the previous sections, we focus on the methods for doing the orthogonalization and computing Rayleigh-Ritz problems. A recursive orthogonalization method with SVD and a moving mechanism are presented. In addition, the package GCGE is introduced, which is written in C and constructed in a matrix-free and vector-free manner.

3.1. Improvements for orthogonalization

This subsection introduces the orthogonalization methods supported by GCGE. So far, we provide a modified block orthogonalization method and a recursive orthogonalization method with SVD. The choice between the orthogonalization methods should be based on the number of desired eigenpairs and the scale of the concerned matrices. The aim is to keep the balance among efficiency, stability and scalability.

The modified Gram-Schmidt method (Stewart, 2008) is designed to improve the stability of the classical orthogonalization method. The modified block orthogonalization method is the block version of the modified Gram-Schmidt method and is defined by Algorithm 2. The two have the same accuracy and stability, but the modified block orthogonalization method has better efficiency and scalability.

Let us consider the orthogonalization for and assume in Algorithm 2. We divide into blocks, i.e., , where , . The orthogonalization process is to make be orthogonal to and do orthogonalization for itself, where has already been orthogonalized, i.e., .

Firstly, in order to maintain the numerical stability, the process of deflating components in from is repeated until the norm of is small enough. Secondly, the columns of in blocks of columns are orthogonalized through the modified Gram-Schmidt method. For each in Algorithm 2, when is linearly dependent, the rearmost vectors of are copied to the corresponding location. In addition, Algorithm 2 needs global communications in each iteration. In other words, the total number of global communications is

In fact, in the modified block orthogonalization method, we deflate the components along the previously orthogonalized vectors successively for all unorthogonalized vectors in each iteration step. This means Algorithm 2 uses a block treatment of the unorthogonalized vectors to improve efficiency and scalability without loss of stability (a simplified C sketch of this block treatment is given after Algorithm 2). As default, is set to be .

1:repeat
2:     Compute ;
3:until the norm of is small enough;
4:for   do
5:     Orthogonalize by modified Gram-Schmidt method;
6:     if  then
7:         break;
8:     end if
9:     repeat
10:         Compute ;
11:         Compute ;
12:     until the norm of are small enough;
13:end for
Algorithm 2 Modified block orthogonalization
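
To make the block treatment concrete, the following C sketch is our own illustration, not the GCGE source: it uses the Euclidean inner product instead of the B-inner product, stores blocks column-major, performs a fixed two deflation passes instead of the norm-based repetition of Algorithm 2, and simply compacts the accepted columns instead of copying the rearmost vectors into dropped positions. It orthonormalizes a block W against an already orthonormal block Q and within itself by modified Gram-Schmidt.

    #include <math.h>
    #include <stddef.h>

    /* Euclidean inner product of two length-n vectors. */
    static double dot(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += x[i] * y[i];
        return s;
    }

    /* Q: n-by-k block with orthonormal columns (column-major),
     * W: n-by-m block to be orthogonalized against Q and within itself.
     * Returns the number of accepted columns of W; accepted columns are
     * compacted to the front of W. */
    size_t block_orthogonalize(const double *Q, size_t k,
                               double *W, size_t m, size_t n)
    {
        const double drop_tol = 1e-14;
        size_t accepted = 0;

        /* Block deflation of W against Q, repeated twice for stability. */
        for (int pass = 0; pass < 2; ++pass)
            for (size_t j = 0; j < m; ++j)
                for (size_t i = 0; i < k; ++i) {
                    double c = dot(Q + i * n, W + j * n, n);
                    for (size_t r = 0; r < n; ++r)
                        W[j * n + r] -= c * Q[i * n + r];
                }

        /* Modified Gram-Schmidt within W. */
        for (size_t j = 0; j < m; ++j) {
            double *wj = W + j * n;
            for (size_t i = 0; i < accepted; ++i) {
                double c = dot(W + i * n, wj, n);
                for (size_t r = 0; r < n; ++r) wj[r] -= c * W[i * n + r];
            }
            double nrm = sqrt(dot(wj, wj, n));
            if (nrm > drop_tol) {                /* keep this column */
                for (size_t r = 0; r < n; ++r) wj[r] /= nrm;
                if (accepted != j)
                    for (size_t r = 0; r < n; ++r)
                        W[accepted * n + r] = wj[r];
                ++accepted;
            }                                    /* else drop it */
        }
        return accepted;
    }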

In order to further improve efficiency and scalability, we design a type of recursive orthogonalization method with SVD; the corresponding scheme is defined by Algorithm 3. The aim here is to take full advantage of level-3 BLAS operations. We note that the paper (Yokozawa et al., 2006) discusses a similar orthogonalization method without SVD. The contribution here is to combine the recursive orthogonalization method with SVD to improve the scalability.

1:Compute ;
2:if  then
3:     repeat
4:         Compute ;
5:         Compute SVD of ;
6:         Compute ;
7:     until the norm of is small enough;
8:else
9:     ; ;
10:     ; ;
11:     Call ;
12:     repeat
13:         Compute ;
14:         Compute ;
15:     until the norm of are small enough;
16:     Call ;
17:end if
Algorithm 3 Recursive orthogonalization with SVD

Let us consider and in Algorithm 3. The orthogonalization of is completed by calling recursively. We use to stand for the s-th column to e-th column of . When , SVD is applied to computing , where is set to be as default. In order to maintain the numerical stability, computing with SVD is repeated until the matrix

is close to the identity matrix. In general, the above condition is satisfied after two or three iterations. If

has eigenvalues close to zero, i.e., the set of vectors is linearly dependent, the subsequent vectors will be copied to the corresponding location.

If and we compute with SVD three times when , the total number of global communications is

which is much less than the total number of global communications of Algorithm 2.

The recursive orthogonalization method with SVD is recommended and is the default choice in our package for the orthogonalization of long vectors. In fact, Algorithms 2 and 3 both reach the required accuracy for all numerical examples in this paper. In the case of solving generalized eigenvalue problems, B-orthogonalization should be considered. Algorithm 3 is more efficient than Algorithm 2 in most cases, which will be shown in Section 4.5.
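
The kernel of Algorithm 3, orthonormalizing a short block through the factorization of its small Gram matrix, can be sketched as follows. This is our own illustration under simplifying assumptions: it uses the Euclidean inner product rather than the B-inner product, replaces the SVD of the block by a symmetric eigendecomposition of X^T X, performs a single pass instead of the repeated passes of Algorithm 3, and merely zeroes nearly dependent columns instead of replacing them. It requires CBLAS and LAPACKE.

    #include <cblas.h>
    #include <lapacke.h>
    #include <math.h>
    #include <stdlib.h>

    /* X is n-by-m, column-major with leading dimension n. */
    void orth_by_svd(double *X, int n, int m)
    {
        double *M = malloc((size_t)m * m * sizeof(double));
        double *S = malloc((size_t)m * sizeof(double));
        double *Y = malloc((size_t)n * m * sizeof(double));

        /* M = X^T X (in parallel this is the single global reduction). */
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    m, m, n, 1.0, X, n, X, n, 0.0, M, m);

        /* Symmetric eigendecomposition M = U diag(S) U^T; U overwrites M. */
        LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', m, M, m, S);

        /* Scale eigenvector columns by 1/sqrt(S): M <- U diag(S^{-1/2}). */
        for (int j = 0; j < m; ++j) {
            double s = (S[j] > 1e-14) ? 1.0 / sqrt(S[j]) : 0.0;
            for (int i = 0; i < m; ++i) M[j * m + i] *= s;
        }

        /* Y = X * U * diag(S^{-1/2}); then copy back into X. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, m, m, 1.0, X, n, M, m, 0.0, Y, n);
        cblas_dcopy(n * m, Y, 1, X, 1);

        free(M); free(S); free(Y);
    }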

3.2. Computation reduction for Algorithm 1

In this subsection, let us continue with the whole computation procedure of Algorithm 1. The aim here is to design efficient ways to compute the Rayleigh-Ritz problem in STEP 3, which includes

  • Orthogonalizing to ;

  • Computing the small scale matrix ;

  • Solving the standard eigenvalue problem .

Except for the moving mechanism shown in Section 3.3 and the inverse power method with dynamic shifts for solving the linear equations, the techniques here are almost the same as those in (Li et al., 2020; Zhang et al., 2020). But for easier understanding and completeness, we also introduce them here using more concise expressions. In summary, the following main optimization techniques are implemented:

  • The converged eigenpairs do not participate in the subsequent iterations;

  • The sizes of P and W are set to be blockSize, which is equal to numEigen/5 as default;

  • The shift is selected dynamically when solving ;

  • The large scale orthogonalization to is transformed into the small scale orthogonalization to and a large scale orthogonalization to ;

  • The submatrix of corresponding to can be obtained by ;

  • The submatrix of corresponding to can be computed by multiplication of small scale dense matrices;

  • The Rayleigh-Ritz problem is solved in parallel;

  • The moving mechanism is presented to reduce the dimension of further.

According to STEP 2 of Algorithm 1, we decompose into three parts

where denotes the converged eigenvectors and denotes the unconverged ones. The number of vectors in is blockSize. Based on the structure of , the block version has the following structure

with . And the eigenpairs and can be decomposed into the following form

(2)

where is the diagonal matrix.

Then in STEP 3 of Algorithm 1, the small scale eigenvalue problem

has the following form

(3)

where , and , , have following structures

(4)

In addition, the Ritz vectors are updated as

In STEP 4 of Algorithm 1, the convergence of the eigenpairs is checked. Due to (2), we set

and the diagonal of includes the newly converged eigenvalues. Then all converged eigenvectors are in

and the unconverged ones are in

If is equal to numEigen, the iteration will stop. Otherwise, and the iteration will continue. Here, the length of is

In STEP 5 of Algorithm 1, in order to produce for the next GCG iteration, from the definition of in (4) and the orthonormality of , i.e., , we first set

where

(5)

In order to compute to satisfy , we perform the orthogonalization for the small scale vectors in according to the inner product. Since the vectors in are already orthonormal, the orthogonalization only needs to be done for against to get new vectors . Thus let denote the orthogonalized block, i.e.,

(6)

Then,

Moreover, it is easy to check that

and

In STEP 6 of Algorithm 1, is obtained by some CG iterations for the linear equations

(7)

with the initial guess , where the shift is set to be the largest converged eigenvalue in the convergence process. It is noted that the shift is not fixed and the matrix may not be SPD, but the initial guess is perpendicular to the eigenvectors of corresponding to all negative eigenvalues, i.e.,

since reaches the convergence criterion. In other words, is SPD in the orthogonal complement space of . Then the CG iteration method can be adopted for solving the included linear equations. Due to the shift , the matrix-vector multiplication of each CG iteration takes more time, but the convergence of the GCG algorithm is accelerated. In addition, there is no need to solve the linear equations (7) with high accuracy; a few CG iterations are enough during each GCG iteration. In Remark 3.1, an example is presented to explain why the convergence of the GCG algorithm with dynamic shifts is accelerated after one CG iteration. In Section 4.1, we give some numerical results to show the performance of GCGE with dynamic shifts and the convergence procedure under different numbers of CG iterations. In order to produce for the next GCG iteration, using Algorithm 3, we need to do the orthogonalization to according to , i.e.,

Remark 3.1 ().

We give an example to present the accelerating convergence of GCG algorithm with dynamic shifts after one CG iteration. Assuming that the first eigenpair has been found for the standard eigenvalue problem

we have the approximate eigenvector of the second eigenvector, where

For the linear equations

we can obtain the new approximate eigenvector

after the first CG iteration with the initial guess , where . It is noted that the convergence rate is

which is less than the case of .

Returning to STEP 2 of Algorithm 1, we denote

When solving the Rayleigh-Ritz problem, we need to assemble the small scale matrices and . Since the orthogonalization of the vectors in has been done with respect to the inner product induced by the matrix , is an identity matrix. Then we only need to compute the matrix , which is equal to

(8)

From (3), the submatrix does not need to be computed explicitly since it satisfies the following formula

(9)

Based on the basis in and (6), we have

(10)

and

(11)

Thus from (8), (9), (10) and (11), we know the matrix has the following structure

(12)

where

It is noted that since reaches the convergence criterion, we assume the equation

is satisfied. Then

is satisfied approximately since .

After assembling matrix , the next task is to solve the new small scale eigenvalue problem:

(13)

in STEP 3. Due to the converged eigenvectors in , there are already converged eigenvectors of and they all have the form

( stays in the position of the associated converged eigenvalue). We only need to compute the unconverged eigenpairs corresponding to for the eigenvalue problem (13). The subroutine dsyevx from LAPACK (Anderson et al., 1999) is called to compute only the eigenvalues in the required index range and their associated eigenvectors.

In order to reduce the time consumed by this part, the task is distributed to multiple computing processes and each process only computes a small part of the desired eigenpairs. After all processes finish their tasks, a gathering subroutine is used to collect the eigenpairs from all processes and deliver them to all. This leads to an obvious time reduction for computing the desired eigenpairs of (13). Since more processes lead to more communication time, we choose the number of processes used for solving (13) such that each process computes at least a minimum number of eigenpairs.
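
The distribution of the small eigenvalue problem can be sketched as follows. This is a simplified illustration, not the package's code: it assumes the projected matrix is dense, symmetric and replicated in every process, that the standard projected problem is solved, and that the index splitting and the gathering routine (MPI_Allgatherv) are our own choices; dsyevx overwrites the replicated matrix. Each process computes a contiguous index range of eigenpairs with LAPACKE_dsyevx (range 'I') and the pieces are then delivered to all processes.

    #include <lapacke.h>
    #include <mpi.h>
    #include <stdlib.h>

    /* H: dim-by-dim symmetric dense matrix (column-major, replicated, destroyed),
     * nev: number of wanted (smallest) eigenpairs,
     * theta: length nev, C: dim-by-nev (column-major), replicated on exit. */
    void solve_projected_eigenproblem(double *H, int dim, int nev,
                                      double *theta, double *C, MPI_Comm comm)
    {
        int rank, nproc;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nproc);

        /* Split the index range 1..nev among the processes. */
        int base = nev / nproc, rem = nev % nproc;
        int my_cnt = base + (rank < rem ? 1 : 0);
        int my_first = rank * base + (rank < rem ? rank : rem);   /* 0-based */
        int il = my_first + 1, iu = my_first + my_cnt;            /* 1-based */

        double *w = malloc((size_t)dim * sizeof(double));
        double *z = malloc((size_t)dim * (my_cnt > 0 ? my_cnt : 1) * sizeof(double));
        lapack_int *ifail = malloc((size_t)dim * sizeof(lapack_int));
        lapack_int found = 0;

        if (my_cnt > 0)    /* range 'I' computes only eigenpairs il..iu */
            LAPACKE_dsyevx(LAPACK_COL_MAJOR, 'V', 'I', 'U', dim, H, dim,
                           0.0, 0.0, il, iu, 2.0 * LAPACKE_dlamch('S'),
                           &found, w, z, dim, ifail);

        /* Gather eigenvalues and eigenvectors from all processes to all. */
        int *cnt_val = malloc(nproc * sizeof(int)), *dsp_val = malloc(nproc * sizeof(int));
        int *cnt_vec = malloc(nproc * sizeof(int)), *dsp_vec = malloc(nproc * sizeof(int));
        for (int p = 0; p < nproc; ++p) {
            int c = base + (p < rem ? 1 : 0);
            int f = p * base + (p < rem ? p : rem);
            cnt_val[p] = c;        dsp_val[p] = f;
            cnt_vec[p] = c * dim;  dsp_vec[p] = f * dim;
        }
        MPI_Allgatherv(w, my_cnt, MPI_DOUBLE, theta, cnt_val, dsp_val, MPI_DOUBLE, comm);
        MPI_Allgatherv(z, my_cnt * dim, MPI_DOUBLE, C, cnt_vec, dsp_vec, MPI_DOUBLE, comm);

        free(w); free(z); free(ifail);
        free(cnt_val); free(dsp_val); free(cnt_vec); free(dsp_vec);
    }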

Remark 3.2 ().

In order to accelerate the convergence, the size of X, sizeX, is always chosen to be greater than numEigen; as default, it is set to be the minimum of and the dimension of .

Remark 3.3 ().

Since the converged eigenpairs do not participate in the subsequent iterations, in real implementation, is computed as follows

and the corresponding eigenpairs have the forms

In other words, internal locking (deflation) is implemented to avoid recomputing the eigenpairs which have already been found.

3.3. The moving mechanism

In Algorithm 1, the small scale eigenvalue problem (13) needs to be solved, in which the dimension of the dense matrix is

where the size of X, sizeX, is equal to . When numEigen is large, e.g., , with , dsyevx should be called to solve eigenpairs of a dense matrix of dimension . In this case, the time of STEP 3 of Algorithm 1 always dominates.

In order to further improve efficiency in the above case, we present a moving mechanism. Firstly, the maximum projection dimension is set to be in the moving procedure, i.e., the size of is set to be and the sizes of and are both . Secondly, when eigenpairs have converged, all the eigenpairs of will be solved, i.e., is decomposed into

where

In addition, the new is equal to , and can be used to construct the new in the next STEP 3. In other words, and have been integrated into . Then, the new and will be computed and stored behind the new . When there are new converged eigenpairs again, and will be integrated into again, and so on. The above process is shown in Figure 1. It is noted that the dimension of the dense matrix is at most in the small scale eigenvalue problem (13).

Figure 1. The moving mechanism, applied when eigenpairs have converged.


Moreover, the moving mechanism greatly reduces the memory requirements, which allows more eigenpairs to be computed. Specifically, a double-precision array, of which the size is

(14)

is required to be stored in each process. The first two terms denote the sizes of the two arrays which are used to store the eigenpairs and the dense matrix in the small scale eigenvalue problem (13). The third term is the size of workspace for dsyevx. The last term is the size of the array which is used in STEP 5. In Figure 2, the required memory computed by (14) is shown with and without the moving mechanism.

Figure 2. Required memory in each process

3.4. Matrix-free and vector-free operations

Based on Algorithm 1 and the implementing techniques presented in the previous sections, we develop the package GCGE, which is written in C and constructed in a matrix-free and vector-free manner. So far, the package includes eigensolvers for matrices which are stored in dense format or compressed row/column sparse format, or which are supported in MATLAB, Hypre (Falgout et al., 2006), PETSc (Balay et al., 1997), PHG (Zhang, 2009) and SLEPc (Hernandez et al., 2005). Table 1 presents the currently supported matrix-vector structures. It is noted that there is no need to copy the built-in matrices and vectors from these software libraries to the GCGE package.

matrix structure name vector structure name
MATLAB sparse distributed matrix full stored matrix
Hypre hypre_ParCSRMatrix hypre_ParVector
PETSc Mat Vec
PHG MAT VEC
SLEPc Mat BV
Table 1. Supported matrix-vector structures

A user can also build their own eigensolver by providing the matrix and vector structures and their operations. The following six matrix-vector operations should be provided by the user:

  • VecCreateByMat

  • VecDestroy

  • VecLocalInnerProd

  • VecSetRandomValue

  • VecAxpby

  • MatDotVec

They realize, respectively, creating and destroying a vector compatible with a given matrix, computing the local inner product of two vectors, setting random values for a vector, computing y = a x + b y, and computing y = A x. VecInnerProd, i.e., computing the global inner product of two vectors, is provided through calling VecLocalInnerProd and MPI_Allreduce.
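
For illustration, the matrix-free interface can be thought of as a table of function pointers like the one sketched below. The operation names are those listed above, but the struct name and the exact signatures here are our own assumptions for the sketch, not the package's actual header; the matrix and vector types are opaque pointers so that any backend from Table 1 can be plugged in.

    #include <stddef.h>

    /* Opaque handles supplied by the user's backend (PETSc Mat/Vec, Hypre
     * structures, plain arrays, ...). */
    typedef void user_mat;
    typedef void user_vec;

    /* Hypothetical operation table; names follow the list above. */
    typedef struct {
        void (*VecCreateByMat)(user_vec **x, const user_mat *A);
        void (*VecDestroy)(user_vec **x);
        /* local (per-process) part of the inner product (x, y) */
        void (*VecLocalInnerProd)(const user_vec *x, const user_vec *y, double *value);
        void (*VecSetRandomValue)(user_vec *x);
        /* y = a*x + b*y */
        void (*VecAxpby)(double a, const user_vec *x, double b, user_vec *y);
        /* y = A*x */
        void (*MatDotVec)(const user_mat *A, const user_vec *x, user_vec *y);
    } eigensolver_ops;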

The default matrix-multi-vector operations are built on the above matrix-vector operations and two additional operations, GetVecFromMultiVec and RestoreVecForMultiVec, which get/restore one vector from/to a multi-vector. For higher efficiency, it is strongly recommended that users provide the following six matrix-multi-vector operations:

  • MultiVecCreateByMat

  • MultiVecDestroy

  • MultiVecLocalInnerProd

  • MultiVecSetRandomValue

  • MultiVecAxpby

  • MatDotMultiVec

In addition, if the user-defined multi-vector is stored in dense format, the BLAS library can be used to implement the first five operations easily, and such implementations are already provided in the GCGE package. In other words, only one operation, i.e., the multiplication of a matrix and a multi-vector, needs to be provided by users.
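
As an example of the dense case, a MultiVecAxpby-style update y_j = a*x_j + b*y_j over a column range reduces to level-1 BLAS calls when the multi-vectors are stored as contiguous column-major arrays; the sketch below is illustrative (the function name and column-range convention are our own), and a level-3 dgemm call can be used in the same spirit for the inner products.

    #include <cblas.h>
    #include <stddef.h>

    /* X, Y: column-major n-by-(at least 'end') arrays; update columns
     * start..end-1 of Y as y_j = a*x_j + b*y_j. */
    void dense_multivec_axpby(double a, const double *X, double b, double *Y,
                              int n, int start, int end)
    {
        for (int j = start; j < end; ++j) {
            double *yj = Y + (size_t)j * n;
            const double *xj = X + (size_t)j * n;
            cblas_dscal(n, b, yj, 1);        /* y_j <- b * y_j       */
            cblas_daxpy(n, a, xj, 1, yj, 1); /* y_j <- y_j + a * x_j */
        }
    }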

In order to improve the parallel efficiency of computing the inner products of two multi-vectors, i.e., the operation MultiVecInnerProd, a new MPI data type with the corresponding reduction operation has been created by

            MPI_Type_vector, MPI_Op_create.

The variable MPI_IN_PLACE is used as the value of sendbuf in MPI_Allreduce at all processes.

Although SLEPc (Hernandez et al., 2005) provides an inner product operation for the BV structure, we still recommend using our own multi-vector inner product operation. Let us give an example to illustrate the reason. Suppose we need to compute the inner products

and the results are stored in the following submatrix

(15)

In general, the vectors involved come from the multi-vector

and the dense matrix (15) is one submatrix of the following matrix

which is stored by columns. Thus, the above mentioned submatrix (15) is not stored contiguously.

The result of SLEPc's inner product operation, BVDot, must be stored in a sequential dense matrix with dimensions at least those of the full product matrix. In other words, regardless of the values of , , and , additional memory space is required in each process, of which the size is . In general, and are set to be in the GCG algorithm, while and are much less than and , respectively.

In the GCGE package, the operation MultiVecInnerProd is implemented as follows:

  • Through MultiVecLocalInnerProd, local inner products are calculated and stored in the above mentioned submatrix for each process;

  • A new MPI_Datatype named SUBMAT is created by

    int MPI_Type_vector( int count, int length, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

    with

        count=q-p+1, length=j-i+1, stride=s;
    
  • Through MPI_Op_create, the operation of sum of SUBMAT is created, which is named as SUM_SUBMAT;

  • Then

    int MPI_Allreduce(
        void *sendbuf, void *recvbuf, int count,
        MPI_Datatype datatype, MPI_Op op,
        MPI_Comm comm)
    

    is called with

            sendbuf=MPI_IN_PLACE, count=1,
            datatype=SUBMAT, op=SUM_SUBMAT
    

    to gather values from all processes and distribute the results back to all processes.

Obviously, no extra workspace is needed here. The memory requirements are reduced for each process.
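
Putting the above steps together, a compilable sketch of the in-place reduction looks as follows. It is a simplified illustration rather than the package's code: the layout parameters (count = q-p+1, length = j-i+1, stride = s) are kept in file-scope variables so that the user-defined reduction can traverse the non-contiguous SUBMAT layout, and MPI_Allreduce is called with MPI_IN_PLACE and count = 1 exactly as described above.

    #include <mpi.h>

    static int g_count, g_length, g_stride;   /* q-p+1, j-i+1, s */

    /* Element-wise sum of two SUBMAT blocks; called with count = 1. */
    static void sum_submat(void *in, void *inout, int *len, MPI_Datatype *dtype)
    {
        (void)len; (void)dtype;
        double *a = (double *)in, *b = (double *)inout;
        for (int c = 0; c < g_count; ++c)
            for (int l = 0; l < g_length; ++l)
                b[c * g_stride + l] += a[c * g_stride + l];
    }

    /* submat points to the first entry of the local submatrix inside the
     * larger column-major array with leading dimension 'stride'. */
    void allreduce_submat(double *submat, int count, int length, int stride,
                          MPI_Comm comm)
    {
        MPI_Datatype SUBMAT;
        MPI_Op SUM_SUBMAT;

        g_count = count; g_length = length; g_stride = stride;

        MPI_Type_vector(count, length, stride, MPI_DOUBLE, &SUBMAT);
        MPI_Type_commit(&SUBMAT);
        MPI_Op_create(sum_submat, 1 /* commutative */, &SUM_SUBMAT);

        MPI_Allreduce(MPI_IN_PLACE, submat, 1, SUBMAT, SUM_SUBMAT, comm);

        MPI_Op_free(&SUM_SUBMAT);
        MPI_Type_free(&SUBMAT);
    }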

4. Numerical results

The numerical experiments in this section are carried out on LSSC-IV in the State Key Laboratory of Scientific and Engineering Computing, Chinese Academy of Sciences. Each computing node has two 18-core Intel Xeon Gold 6140 processors at 2.3 GHz and 192 GB memory. For more information, please check http://lsec.cc.ac.cn/chinese/lsec/LSSC-IVintroduction.pdf. We use numProc to denote the number of processes in numerical experiments.

In this section, the GCG algorithm defined by Algorithm 1 and the implementing techniques in Section 3 are investigated for thirteen standard eigenvalue problems and one generalized eigenvalue problem. The first thirteen matrices are available in the SuiteSparse Matrix Collection (https://sparse.tamu.edu); they have clustered eigenvalues and many negative eigenvalues. The first matrix, named Andrews, is provided by Stuart Andrews at Brown University and has a seemingly random sparsity pattern. The second to the thirteenth matrices are generated by the pseudo-potential algorithm for real-space electronic structure calculations (Kronik et al., 2006; Natan et al., 2008; Saad et al., 2010). The FEM matrices come from the finite element discretization of the following Laplace eigenvalue problem: Find (λ, u) such that

(16)    −Δu = λu in Ω,    u = 0 on ∂Ω,

where Ω denotes the computational domain. The discretization of the eigenvalue problem (16) by the conforming cubic finite element (P3 element) with 3,145,728 elements leads to the stiffness matrix and the mass matrix. The concerned matrices are listed in Table 2, where the density is defined as the number of non-zero entries divided by the square of the dimension.

The proposed GCG algorithm given by Algorithm 1, implemented on the BV structure from SLEPc, is adopted to compute eigenpairs of the concerned matrices in Table 2.

ID Matrix Dimension Non-zero Entries Density
1 Andrews 60,000 760,154 2.11e-4
2 CO 221,119 7,666,057 1.57e-4
3 Ga10As10H30 113,081 6,115,633 4.78e-4
4 Ga19As19H42 133,123 8,884,839 5.01e-4
5 Ga3As3H12 61,349 5,970,947 1.59e-3
6 Ga41As41H72 268,096 18,488,476 2.57e-4
7 Ge87H76 112,985 7,892,195 6.18e-4
8 Ge99H100 112,985 8,451,395 6.62e-4
9 Si34H36 97,569 5,156,379 5.42e-4
10 Si41Ge41H72 185,639 15,011,265 4.36e-4
11 Si5H12 19,896 738,598 1.87e-3
12 Si87H76 240,369 10,661,631 1.85e-4
13 SiO2 155,331 11,283,503 4.68e-4
14 FEM matrices and 14,045,759 671,028,055 3.40e-6
Table 2. Testing matrices

The convergence criterion is set to be

for the first thirteen matrices and

for FEM matrices, where the tolerance, tol, is set to be as default. Moreover, we set for the first thirteen matrices and for FEM matrices.

In order to confirm the efficiency, stability and scalability of GCGE, we present a numerical comparison between GCGE and LOBPCG. We will see that GCGE has better efficiency and stability than LOBPCG, while the two have almost the same scalability. In addition, the Krylov-Schur method is also compared in Sections 4.2 and 4.5.

4.1. About dynamic shifts and the number of CG iterations

In this subsection, we give some numerical results to show the performance of GCGE with dynamic shifts and the convergence procedure under different numbers of CG iterations.

In STEP 6 of Algorithm 1, the linear equations (7) are solved by some CG iterations. Due to the shift, the matrix-vector multiplication of each CG iteration takes more time, but the convergence of the GCG algorithm is accelerated. For standard eigenvalue problems, i.e., B = I, the additional computation consists only of linear operations on vectors, so each GCG iteration with dynamic shifts takes only a little more time than the case of no shift. As shown in Figure 3, the performance of GCGE with dynamic shifts is greatly improved. In addition, the total number of GCG iterations is presented in Table 3.

Figure 3. , , and
ID Matrix Dynamic Shifts No Shift Ratio
1 Andrews 102 281 36.29%
2 CO 97 195 49.74%
3 Ga10As10H30 105 213 49.29%
4 Ga19As19H42 110 216 50.92%
5 Ga3As3H12 81 165 49.09%
6 Ga41As41H72 133 236 56.35%
7 Ge87H76 78 212 36.79%
8 Ge99H100 77 206 37.37%
9 Si34H36 79 207 38.16%
10 Si41Ge41H72 87 208 41.82%
11 Si5H12 86 201 42.78%
12 Si87H76 89 232 38.36%
13 SiO2 90 164 54.87%
Table 3. The total number of GCG iterations

For generalized eigenvalue problems, the additional matrix-vector multiplications mean that there is no significant improvement in the overall performance of GCGE with dynamic shifts. When the matrix can be modified, we recommend that users generate the shifted matrix explicitly and apply the CG steps to it directly. In this case, GCGE with dynamic shifts performs better for the generalized eigenvalue problem, and the results for and numEigen = 5000 are shown in Tables 4 and 7, respectively.

Method          Total Number of GCG Iterations   CPU Time (in seconds)
Dynamic Shifts  83                               1669.19
No Shift        88                               1777.87
Ratio           94.31%                           93.88%
Table 4. FEM matrices with , and

In addition, the GCG algorithm does not need to solve the linear equations exactly in STEP 6. In the rest of this subsection, the total time of GCGE and the average time per GCG iteration are presented under different numbers of CG iterations. Because the first thirteen matrices have similar density, we choose SiO2 with and the FEM matrices with as test objects.

For SiO2 with and , as shown in Figures 4 and 5, when the number of CG iterations per GCG iteration is increased, the number of GCG iterations decreases and the average time per GCG iteration increases. The total time reaches a minimum at an intermediate number of CG iterations according to Figure 6. In fact, similar conclusions hold for all matrices from Andrews to SiO2.

Figure 4. Convergence procedure for SiO2 with
Figure 5. Average time for SiO2 with
Figure 6. Total time for SiO2 with

Figures 7 and 8 show the corresponding results for the FEM matrices with and . When the number of CG iterations is increased, the number of GCG iterations decreases and the average time per GCG iteration increases. The best performance is achieved at an intermediate number of CG iterations, as shown in Figure 9.

Figure 7. Convergence procedure for FEM matrices with
Figure 8. Average time for FEM matrices with
Figure 9. Total time for FEM matrices with

It is noted that the number of CG iterations in each GCG iteration strongly affects the efficiency of the algorithm, as presented in Figures 5 and 8. The average time per GCG iteration is linearly related to the number of CG iterations. So the number of CG iterations is a key parameter for trading off the number of GCG iterations against the average time per GCG iteration. In fact, the total time of the GCG algorithm is nearly equal to the product of the number of GCG iterations and the average time per GCG iteration. In other words, although increasing the number of CG iterations accelerates convergence, it takes more time in each GCG iteration.

In fact, the sparsity, the condition number and the dimension of the matrix all affect the convergence rate of the CG iteration. In the GCGE package, we set two stopping conditions for the CG iteration: when the residual is less than one percent of the initial residual, or when the number of CG iterations exceeds the prescribed maximum, the CG iteration is stopped.
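
For reference, a minimal matrix-free CG sketch with exactly these two stopping rules is given below. It is our own illustration, not the GCGE solver: no preconditioning, no projection onto the orthogonal complement of the converged eigenvectors, sequential vectors, and the maximum iteration count max_it is a placeholder for the package's prescribed bound. The operator callback is assumed to apply the shifted matrix, y = (A − θB)x.

    #include <math.h>
    #include <stdlib.h>

    typedef void (*apply_op)(const double *x, double *y, int n, void *ctx);

    /* Solve op(x) = b approximately, starting from the warm-start x. */
    void cg_solve(apply_op op, void *ctx, const double *b, double *x,
                  int n, int max_it)
    {
        double *r  = malloc(n * sizeof(double));
        double *p  = malloc(n * sizeof(double));
        double *ap = malloc(n * sizeof(double));

        op(x, ap, n, ctx);                              /* r = b - op(x)        */
        for (int i = 0; i < n; ++i) { r[i] = b[i] - ap[i]; p[i] = r[i]; }

        double rr = 0.0;
        for (int i = 0; i < n; ++i) rr += r[i] * r[i];
        const double stop = 1e-4 * rr;                  /* (0.01 * ||r0||)^2    */

        for (int it = 0; it < max_it && rr > stop; ++it) {
            op(p, ap, n, ctx);
            double pap = 0.0;
            for (int i = 0; i < n; ++i) pap += p[i] * ap[i];
            double alpha = rr / pap;
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * ap[i]; }
            double rr_new = 0.0;
            for (int i = 0; i < n; ++i) rr_new += r[i] * r[i];
            double beta = rr_new / rr;
            rr = rr_new;
            for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        }
        free(r); free(p); free(ap);
    }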

4.2. About different tolerances

In this subsection, we will compare the performance of GCGE, LOBPCG and Krylov-Schur methods under different tolerances.

In Figures 10 and 11, the GCGE, LOBPCG and Krylov-Schur methods with are compared under the tolerances and , respectively. Under the tolerance , LOBPCG does not converge after iterations, which indicates that LOBPCG does not have good stability. So only the performances of the GCGE and Krylov-Schur methods are compared under this tolerance, and the results are presented in Figure 12. Here, MUMPS (Amestoy et al., 2001, 2019) is used as the linear solver for the Krylov-Schur method.

Figure 10. , , and
Figure 11. , , and
Figure 12. , , and

Obviously, GCGE is always more efficient than LOBPCG under different tolerances. In addition, when and , GCGE is much faster than the Krylov-Schur method. Under the tolerance , the CPU times of GCGE and the Krylov-Schur method are similar, with GCGE slightly faster.

In addition, the convergence procedure of GCG algorithm with for the first thirteen matrices is shown in Figure 13. As the number of GCG iterations increases, the number of converged eigenpairs increases. In Figure 14, the absolute residual of the first unconverged eigenpair is presented.

For the FEM matrices, the performance of GCGE is shown in Figures 15 and 16. Due to , there are four noticeable pauses for the case of when the number of converged eigenpairs is close to , , , and , at around the 40th, 60th, 80th and 100th GCG iterations. Roughly speaking, a batch of eigenpairs converges about every twenty GCG iterations.

Figure 13. , , and
Figure 14. , , and
Figure 15. and
Figure 16. and

4.3. Scaling for the number of eigenpairs

Here, we investigate the dependence of computing time on the number of desired eigenpairs. For this aim, we compute the first - eigenpairs of matrices listed in Table 2.

The test for the first thirteen matrices is performed on a single node with 36 processes. The results in Figures 17 and 18 show that, just like LOBPCG, GCGE has an almost linear scaling property, which means the computing time depends linearly on the number of desired eigenpairs. Moreover, GCGE has better efficiency than LOBPCG. From Andrews to SiO2, the total time ratios of GCGE to LOBPCG are

Figure 17. GCGE with and
Figure 18. LOBPCG with and

Since the FEM matrices are large, the test is performed with processes on nodes. The dependence of the CPU time (in seconds) for the FEM matrices on the number of eigenpairs is shown in Figure 19, which implies that GCGE has better efficiency than LOBPCG for large scale matrices. Moreover, GCGE and LOBPCG both have an almost linear scaling property for the large scale FEM matrices.

Figure 19. CPU time for FEM matrices with and
Remark 4.1 ().

In fact, the Krylov-Schur method is inefficient for the FEM matrices on multiple nodes. In Table 5, for , the generalized eigenvalue problem arising from the discretization of the eigenvalue problem (16) by the conforming linear finite element (P1 element) with 3,145,728 elements is tested. The dimensions of the matrices are both 512,191.

Method 50 100 200
GCGE 20.15 38.98 71.49
Krylov-Schur 1032.33 1360.56 2180.28
LOBPCG 63.99 114.65 286.67
Table 5. FEM matrices (P1 element) with and

4.4. Scalability test

In order to do the scalability test, we use - processes to compute the first eigenpairs of the first thirteen matrices listed in Table