Multigrid/multilevel methods, including geometric multigrid (GMG) and algebraic multigrid (AMG), are among the most popular approaches for solving a large sparse linear system of equations, $Ax = b$, arising from the discretization of partial differential equations (Smith et al. (2004); Stuben (1999)). These methods generate a sequence of linear systems of decreasing size during the setup phase, and iteratively improve the solution to the original system using the sequence of coarse systems in the solve phase. One critical component of multigrid/multilevel methods is forming a coarse operator through a sparse matrix triple product of the restriction $R$, the fine operator $A$, and the interpolation $P$. Here $R$ is the transpose of $P$ when using the Galerkin method, and $P$ can be generated either geometrically (Kong and Cai (2016, 2017, 2018); Kong et al. (2019a)) or algebraically (Kong et al. (2018b, 2019b)). It is challenging to design an efficient parallel algorithm, and to develop the corresponding software, for the sparse matrix triple product since we have to consider both memory efficiency and compute-time efficiency. Besides multigrid/multilevel methods, the sparse matrix triple product is also used in other areas such as Schur complement algorithms, so developing an efficient triple product algorithm has become an active research topic.
Let us briefly review a few of these developments; interested readers are referred to (Ballard et al. (2016b); Buluç and Gilbert (2012); McCourt et al. (2013)) for a more complete literature review. Many previous works have considered parallel algorithms for sparse matrix-matrix multiplications, both for general-purpose use (Akbudak and Aykanat (2014)) and for particular applications (Challacombe (2000)). In (Ballard et al. (2016b)), the authors propose and analyze a sparse version of the SUMMA algorithm originally designed for dense matrices. (Borštnik et al. (2014)) uses a similar idea to convert a dense matrix algorithm (Cannon) to a parallel sparse version for quantum-chemical applications with special optimizations and tuning. In the context of multigrid/multilevel methods, (Tuminaro and Tong (2000)) focuses on smoothed aggregation algebraic multigrid on distributed-memory supercomputers, where a row-wise algorithm is used for the sparse matrix-matrix multiplications of the two-step matrix triple product. (McCourt et al. (2013)) proposes a matrix coloring technique to cast a sparse matrix-matrix multiplication as a sparse-matrix dense-matrix operation, and the method is used in multigrid/multilevel methods. (Ballard et al. (2016a)) uses a hypergraph model to represent the communication costs of parallel algorithms for general sparse matrix-matrix multiplications, where a Galerkin triple product is employed as a case study, and the authors conclude that the row-wise algorithm is communication efficient for the first matrix-matrix multiplication but inefficient for the second one.
In most of the literature, the sparse matrix triple product, $B = RAP$, is formed using two separate sparse matrix-matrix multiplications, $AP = A P$ and $B = R\,(AP)$ (Ballard et al. (2016a, b); Balay et al. (2019)). This two-step method works well for many applications, but it is difficult to use for certain memory-intensive applications such as neutron transport problems (Kong et al. (2018b, 2019b)) since the two-step method needs to create auxiliary matrices, causing a high memory overhead, in order to carry out the calculations efficiently. To overcome this difficulty, we develop and study two all-at-once algorithms (including a merged version) that form the coarse operator with one pass through $R$, $A$, and $P$ without creating any auxiliary matrices. Here a row-wise algorithm is employed for the first matrix-matrix multiplication, and an outer product is adopted for the second matrix-matrix multiplication. Note that in the new all-at-once approaches the two matrix-matrix multiplications are carried out simultaneously, which is discussed in more detail in Section 3. Compared with the traditional two-step method, the new all-at-once algorithms and their implementations reduce the memory usage by a factor of for a model problem and a factor of for a realistic neutron transport application, while maintaining good computational efficiency. We numerically show that the proposed algorithms work efficiently for both the model problem and the realistic problem with up to 27 billion unknowns on supercomputers using up to processor cores.
The rest of this paper is organized as follows. In Section 2, a row-wise matrix-matrix multiplication that serves as part of the calculations in a triple product is described in detail. Two new all-at-once algorithms based on hash tables for the sparse matrix triple products in multigrid/multilevel methods are described in Section 3, and numerical results for a model problem and a realistic neutron transport problem with up to processor cores are given and discussed in Section 4. The results are summarized and conclusions are drawn in Section 5.
2 Sparse matrix-matrix multiplications
There are different types of algorithms for sparse matrix-matrix multiplications. The algorithms are generally classified into 1D, 2D, and 3D according to the partitioning schemes used to assign the computation of the output matrix to the processor cores. Based on the current infrastructure in PETSc, we consider 1D algorithms only in this work. In the 1D approaches, each processor core owns either a chunk of the rows of the matrix under a row partition or a chunk of the columns under a column partition. The 1D matrix-matrix multiplication algorithms are further divided into row-wise, column-wise, and outer-product variants. A matrix in PETSc is distributed row-wise, so a row-wise algorithm is chosen. Next we describe the row-wise algorithm that is employed in the first matrix-matrix multiplication of our new all-at-once approaches.
Let $A \in \mathbb{R}^{m \times m}$ and $P \in \mathbb{R}^{m \times n}$, where $m$ is associated with the number of unknowns on the fine mesh and $n$ is determined by the coarse mesh. A matrix-matrix multiplication operation $C = AP$ is defined such that the entry in the $j$th column of the $i$th row of the output matrix $C$, denoted as $c_{ij}$, is calculated as $c_{ij} = \sum_{k=1}^{m} a_{ik}\, p_{kj}$.
The atomic task in the row-wise algorithm is the calculation of a row of $C$, that is, $c_{i*} = \sum_{k=1}^{m} a_{ik}\, p_{k*}$.
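To make the row-wise atomic task concrete, the following is a minimal sequential sketch in Python, assuming CSR storage as (indptr, indices, values) triples; the function name and layout are illustrative and not the PETSc implementation. Each output row is accumulated in a hash table, mirroring the hash-table-based algorithms described later.

```python
# Minimal sequential sketch of the row-wise algorithm: each row of C = A*P
# is a linear combination of rows of P, weighted by the nonzeros of that
# row of A. Matrices are in CSR form: (indptr, indices, values).
# All names here are illustrative, not the PETSc implementation.

def rowwise_spgemm(a_indptr, a_indices, a_values,
                   p_indptr, p_indices, p_values, n_rows):
    c_indptr, c_indices, c_values = [0], [], []
    for i in range(n_rows):
        row = {}  # hash table accumulating one row of C
        for idx in range(a_indptr[i], a_indptr[i + 1]):
            k, a_ik = a_indices[idx], a_values[idx]
            for jdx in range(p_indptr[k], p_indptr[k + 1]):
                j = p_indices[jdx]
                row[j] = row.get(j, 0.0) + a_ik * p_values[jdx]
        for j in sorted(row):
            c_indices.append(j)
            c_values.append(row[j])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_values
```

The inner hash table plays the role of the insert-or-add accumulator discussed in Section 2; a production implementation would preallocate the output via a separate symbolic pass.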
Here the $i$th row of $C$, $c_{i*}$, is a linear combination of the rows of $P$, some of which are not local, corresponding to the nonzero columns of the $i$th row of $A$. In the row-wise algorithm, the matrix $A$ is equally distributed to the processors in block rows, where each processor owns consecutive rows as shown in Eq. (1). Let $np$ denote the number of processor cores. For simplicity of discussion, we assume $m$ is divisible by $np$. In reality, $m$ may not be divisible by $np$, in which case some processor cores own a few more rows than others. Let $A_p$ denote the rows of $A$ owned by the $p$th processor core, that is,
For simplicity of description, we also define the $q$th block column of $A_p$ as $A_{p,q}$, as shown in Eq. (1).
Note that $A_{p,q}$ is owned by the $p$th processor core and represents its coupling with the $q$th processor core. If $A_{p,q} = 0$, there is no communication from processor $q$ to processor $p$. A good partition of $A$ should make most of the submatrices $A_{p,q}$ zero to minimize communication. We do not partition $A$ directly; instead, the partition of $A$ is derived from the finite element mesh partition that is implemented using the approaches in (Karypis and Schloegel (2013); Kong et al. (11-16 November, 2018)) since $A$ arises from the discretization of partial differential equations on a finite element mesh. $P$ is partitioned in the same way. For the $p$th processor core, the computation of the local part $C_p$ of the output matrix (which has the same row partition as $A$ and the same column partition as $P$) is written as
As stated earlier, some of the submatrices $A_{p,q}$ have no nonzero elements since $A$ is a sparse matrix, and therefore the calculation of $C_p$ does not involve all processor cores; instead, processor $p$ communicates only with the processor cores corresponding to the nonzero submatrices. Define the number of nonzero submatrices of $A_p$ as $n_p$, together with an index function that returns the index of the $s$th nonzero submatrix of $A_p$. Eq. (2) is then reformulated as
To explain this clearly, we present a simple example shown in Eq. (4).
Here the horizontal lines represent the row partition of a matrix, and the vertical dashed lines are the virtual column partition that matches the row partition of the right matrix, if it exists. For instance, in the matrix $A$ (the first matrix on the right side of Eq. (4)), the first processor core takes the first and second rows, the second processor core owns the third and fourth rows, and the third processor core has the fifth and sixth rows. In this toy example, the first row of $C$ is the linear combination of the first, second, and fifth rows of $P$ since the first, second, and fifth columns of the first row of $A$ are nonzero. To calculate the first row of $C$, a remote row of $P$, the fifth row, is needed, which requires communication. For the calculation of different rows of the local part of $C$, different remote rows of $P$ are involved. If we exchanged a message for each row calculation individually, it would be inefficient. Thus, we extract up front all the required remote rows of $P$ (forming a matrix) that correspond to the nonzero off-processor columns of the local part of $A$. For example, in Eq. (4), the third and fifth remote rows of $P$ are extracted for the first processor core because the nonzero off-processor columns of its local part of $A$ are the third and fifth columns. In PETSc (Balay et al. (2019)), the local part of a matrix, e.g., $A_p$, is physically implemented as two blocks, a diagonal block and an off-diagonal block, which are stored as two sequential matrices in a compressed sparse row (CSR) format. We drop the subindex $p$ hereafter without confusion. The formulation of the calculation of $C_p$ is rewritten as follows
The calculation of the matrix-matrix multiplication is generally split into two parts, a symbolic calculation and a numeric calculation. In the symbolic calculation, the task is to accurately preallocate the memory for the output matrix $C$ by going through the matrices $A$ and $P$ without performing any floating-point operations. In the numeric phase, the actual numeric calculation is carried out and the results are added to the allocated space of $C$. The symbolic calculation is summarized in Algs. 1 and 2.
In Alg. 1, the column-index containers are mathematically integer sets, and they can be implemented using different data structures such as linked lists, binary trees, hash tables, etc. In this work, we use hash tables to implement them for our new all-at-once algorithms since a hash table has an average $O(1)$ lookup time and also simplifies the implementation. Other implementations, such as the linked list, are also available in PETSc. To simplify the description, we assume that the hash table is used throughout the following discussion.
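As a sketch of how the symbolic pass can use hash sets, the following Python fragment counts, for each local row of C = A*P, the nonzero columns falling inside the locally owned column range [cstart, cend) (the diagonal block) and outside it (the off-diagonal block). All names are illustrative assumptions, not PETSc identifiers.

```python
# Illustrative sketch of the symbolic pass (in the spirit of Algs. 1/2):
# for each local row of C = A*P we collect the nonzero column indices in
# two hash sets and record only their counts, so C can be preallocated
# exactly, without any floating-point operations.
# cstart/cend delimit the locally owned column range (the diagonal block);
# these names are assumptions, not the actual PETSc identifiers.

def symbolic_rowwise(a_indptr, a_indices, p_indptr, p_indices,
                     n_rows, cstart, cend):
    dnz, onz = [], []  # nonzeros per row: diagonal / off-diagonal parts
    for i in range(n_rows):
        diag_cols, offdiag_cols = set(), set()
        for idx in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[idx]
            for jdx in range(p_indptr[k], p_indptr[k + 1]):
                j = p_indices[jdx]
                (diag_cols if cstart <= j < cend else offdiag_cols).add(j)
        dnz.append(len(diag_cols))
        onz.append(len(offdiag_cols))
        # the sets are cleared (reused) for the next row
    return dnz, onz
```

The two count arrays correspond to the per-row preallocation information for the diagonal and off-diagonal blocks of the output matrix.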
In Alg. 2, the two integer arrays store the numbers of nonzero columns for the diagonal and the off-diagonal parts of $C$, respectively, and $|\cdot|$ represents the number of elements in a set. The memory of the two sets can be reused for each row of $C$; "clear" simply resets a flag in the data structure so that the memory is ready for the next row. Note that line 2 of Alg. 2 involves one message exchange, and all other parts proceed independently on each processor core. The numeric calculation of the matrix-matrix multiplication using the row-wise algorithm is shown in Algs. 3 and 4.
In Alg. 3, the accumulator is a hash table that associates a key with its numeric value. "+=" represents that if the key already exists in the table then the value is added to the current value; otherwise a pair consisting of the key and its value is inserted into the hash table.
In Alg. 4, we extract the keys and their corresponding values from the hash table, and call the matrix function MatSetValues in PETSc directly to add the values to the allocated space in $C$. "Clear" at line 8 of Alg. 4 does not deallocate memory; instead, it resets a flag so that the memory can be reused for the next row.
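The insert-or-add and flush steps can be modeled with a Python dict standing in for the hash table; here flush_row is a hypothetical stand-in for handing one row's (columns, values) pair to a MatSetValues-style routine, and clearing the table reuses its storage for the next row.

```python
# Toy model of the numeric pass (in the spirit of Algs. 3/4): a hash table
# maps a column index to its accumulated value; "+=" inserts the key on
# first sight and accumulates afterwards. flush_row hands the (keys,
# values) pair to the preallocated matrix, much as MatSetValues does in
# PETSc; flush_row itself is a stand-in, not a PETSc routine.

def accumulate(row_table, col, val):
    row_table[col] = row_table.get(col, 0.0) + val  # insert-or-add

def flush_row(row_table):
    cols = sorted(row_table)
    vals = [row_table[c] for c in cols]
    row_table.clear()  # reuse the table for the next row
    return cols, vals
```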
3 All-at-once algorithms for sparse matrix triple products
In multigrid/multilevel methods, to construct a scalable parallel solver, coarse spaces are included, which can be built either geometrically (Kong and Cai (2016, 2017)) or algebraically (Kong et al. (2018b, 2019b)). One of the critical tasks is to form a coarse operator, $B$, based on the interpolation, $P$, and the fine-level operator, $A$, as follows
where the notation $B$ is reused to denote the output matrix of the sparse matrix triple product without confusion. It is straightforward to apply Algs. 2 and 4 twice to compute the sparse matrix triple product, that is,
where $AP$ is an auxiliary matrix. More precisely, $AP = A P$ is explicitly formed using Alg. 2 for the symbolic calculation and Alg. 4 for the numeric computation, and then the second product, $B = R\,(AP)$, is formed using the same algorithms again. The procedure (referred to as the "two-step" method) is summarized in Alg. 5 for the symbolic calculation and Alg. 6 for the numeric calculation.
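For illustration, a dense toy version of the two-step method might look as follows; it makes the two auxiliary objects, the product A*P and the explicit transpose of P, visible as the source of the memory overhead. This is a sketch under dense list-of-lists storage, not the sparse PETSc implementation.

```python
# Sketch of the traditional two-step method: explicitly form the
# auxiliary matrix AP = A*P, then the explicit transpose of P, and
# finally B = P^T * (AP). The two intermediate objects are exactly the
# memory overhead the all-at-once algorithms avoid.
# Dense list-of-lists storage is for illustration only.

def two_step_ptap(a, p):
    m, n = len(a), len(p[0])
    ap = [[sum(a[i][k] * p[k][j] for k in range(m)) for j in range(n)]
          for i in range(m)]                               # auxiliary A*P
    pt = [[p[i][j] for i in range(m)] for j in range(n)]   # explicit P^T
    return [[sum(pt[i][k] * ap[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]
```

For a 2-point toy problem with A = [[2, 0], [0, 3]] and the one-column interpolation P = [[1], [1]], the coarse operator is the 1x1 matrix [[5]].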
Lines 4 and 6 of Alg. 5 are computed using a similar algorithm to Alg. 2 but without extracting any remote rows of $P$. The communication at lines 5 and 7 is implemented using nonblocking MPI techniques so that the computation and the communication are overlapped. "+=" at line 8 of Alg. 5 represents adding the received data to the local result.
Similarly, lines 4 and 6 of Alg. 6 are implemented using a similar algorithm to Alg. 4 without extracting any remote rows of $P$. The "+=" operator at line 8 is implemented using MatSetValues in PETSc, which efficiently adds values to the diagonal and off-diagonal matrices. The advantage of the two-step algorithm is that it can be efficiently implemented using the row-wise algorithm. However, this algorithm needs to build auxiliary matrices such as $R$ (the explicit transpose of $P$) and $AP$, which leads to a high memory overhead. The two-step algorithm works well for applications that do not require much memory, but it is not suitable for memory-intensive problems such as seven-dimensional neutron transport simulations. To overcome this difficulty, in this work we introduce new all-at-once algorithms that do not involve the auxiliary matrices, thereby saving memory. The basic idea of the all-at-once algorithms is to form $B$ with one pass through $R$, $A$, and $P$, all at once, without any auxiliary steps. We also want to mention that a similar idea has been used in (Adams et al. (2004); Van Emden Henson (2002)). However, we propose a new version of the all-at-once algorithms based on hash tables that is implemented in PETSc as part of this work, where the second matrix-matrix multiplication is based on an outer-product algorithm instead of the row-wise approach. More precisely, the $j$th element of the $i$th row of $B$ is obtained using the following formula
where we do not need to explicitly form $R$ or $AP$. The atomic task of the all-at-once algorithms is to compute the $i$th row of $B$ as follows
Eq. (8) is potentially memory efficient, but there is no good way to access $P$ column by column since $P$ is distributed in a row-wise manner. To overcome this difficulty, we use an outer product for the second matrix-matrix multiplication, $P^T (AP)$, to access $P$ row by row instead of column by column. $B$ is formed all at once as a summation of outer products,
In Eq. (9), $B$ is the summation over $k$ of the outer product of the $k$th row of $P$ and the $k$th row of $AP$. Note that (Ballard et al. (2016b)) shows that, in their two-step approach, the row-wise algorithm is suitable for the first matrix-matrix multiplication, and the outer product is the best choice for the second one in terms of communication cost. We adopt the outer product in the second matrix-matrix multiplication not only to reduce the communication cost but also to save memory. The detailed algorithm is shown in Algs. 7 and 8.
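A sequential sketch of the all-at-once idea under toy dense storage: for each row k, the row-wise product forms the k-th row of A*P on the fly, and the outer product with the k-th row of P is immediately accumulated into B, so neither A*P nor the transpose of P is ever stored. Names and storage here are illustrative assumptions, not the parallel PETSc implementation.

```python
# Sequential sketch of the all-at-once triple product B = P^T * A * P:
# for each row k we form the row (A*P)_k on the fly (row-wise product)
# and immediately add the outer product of the k-th row of P with it,
# so neither A*P nor an explicit transpose of P is ever stored.
# Dense list-of-lists storage is for illustration only.

def all_at_once_ptap(a, p):
    n = len(p[0])                      # coarse dimension
    b = [[0.0] * n for _ in range(n)]  # B = P^T A P, n x n
    for k, a_row in enumerate(a):
        # row-wise first product: (A*P)_k = sum_l a_kl * P_l
        ap_row = [0.0] * n
        for l, a_kl in enumerate(a_row):
            if a_kl != 0.0:
                for j in range(n):
                    ap_row[j] += a_kl * p[l][j]
        # outer product: B += (P_k)^T (A*P)_k
        for i in range(n):
            if p[k][i] != 0.0:
                for j in range(n):
                    b[i][j] += p[k][i] * ap_row[j]
    return b
```

On the same toy inputs as the two-step sketch, this produces the same coarse operator while holding only one row of A*P at a time.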
At line 3 of Alg. 7, the symbolic structure of $B$ is implemented as a set of hash sets that stores the indices of the nonzero columns of each row. The outer product at line 11 is carried out by inserting the $k$th row of $AP$ into the rows of $B$ corresponding to the nonzero columns of the $k$th row of $P$. The first loop, consisting of lines 5-13, computes the remote rows that are sent to their remote owners. The second loop, consisting of lines 16-25, calculates the local part of $B$. The calculation is split into two parts to overlap the computation and the communication.
In Alg. 8, adding the outer product at line 10 is implemented by inserting the numerical values directly into the preallocated space, and the outer product at line 21 is added to the preallocated matrix using the PETSc routine MatSetValues. Again, the communications at lines 14 and 26 of Alg. 7 are based on nonblocking MPI techniques. Observe that Algs. 7 and 8 do not involve the auxiliary matrices $R$ and $AP$. Generally, Algs. 7 and 8 work well when the amount of calculation in the first loop is small (which is true for most sparse matrices arising from the discretization of partial differential equations). If the computation of the first loop is heavy, the algorithm can be modified to save time on the calculations. In this case, we propose an alternative that merges the calculations of the first loop and the second loop together. This modified algorithm is referred to as the "merged all-at-once" algorithm, and it is described in detail in Algs. 9 and 10.
The performance differences between the all-at-once algorithm and its merged version are entirely problem dependent. For some applications, the performance of the two may be close. If the communication in the first loop is expensive, we may prefer the all-at-once algorithm to the merged all-at-once algorithm.
4 Numerical results
In this section, we report the performance of the proposed algorithms in terms of the memory usage and the compute time. To understand the algorithms, we study two test cases: a model problem based on structured meshes, mimicking geometric multigrid/multilevel methods, and a realistic neutron transport simulation in which AMG is employed. Let us define some notation used in the rest of the discussion for convenience. "two-step" represents the approach described in Algs. 5 and 6, "allatonce" is the algorithm described in Algs. 7 and 8, and "merged" denotes the method in Algs. 9 and 10. "np" is the number of processor cores, "Mem" is the estimated memory usage per processor core, in megabytes, for the matrix triple products, "Mem" is the estimated total memory per processor core, in megabytes, used in the entire simulation, "Time" is the compute time, in seconds, of the matrix triple products, "Time" is the total compute time, in seconds, of the simulation, "Time" is the time spent on the symbolic calculations of the triple products, "Time" is the time spent on the numeric computations of the triple products, and "EFF" is the parallel efficiency of the overall simulation. The proposed algorithms, all-at-once and its merged version, for the matrix triple products are implemented in PETSc (Balay et al. (2019)) as part of this work.
4.1 Model problem
This test is conducted for processor counts between 8,192 and 32,768 on the Theta supercomputer at Argonne National Laboratory. Theta is a massively parallel, many-core system with second-generation Intel Xeon Phi processors. Each compute node has a 64-core processor with 16 gigabytes (GB) of high-bandwidth in-package memory (MCDRAM), 192 GB of DDR4 RAM, and a 128 GB SSD. The test is designed to mimic a geometric two-level method based on a fine mesh and a coarse mesh. A 3D structured grid is employed as the coarse mesh, and the fine mesh is a uniform refinement of the coarse mesh. Each grid point is assigned one unknown. An interpolation is created from the coarse mesh to the fine mesh using a linear function. The dimensions of $A$ on the fine mesh are , and those of $P$ from the coarse mesh to the fine mesh are . One symbolic and eleven numeric triple products are performed to mimic the use case in which the numeric calculation needs to be repeated multiple times; "Time" is the time spent on all eleven numeric triple products. We compare the performance of our new algorithms in terms of the compute time and the memory usage with that of the traditional two-step method. Numerical results are summarized in Table 1.
From Table 1, we observe that the memory usage of the new algorithms is roughly of that consumed by the two-step method, which indicates that the new algorithms incur a very low memory overhead. For example, the all-at-once algorithm takes only M of memory at 8,192 processor cores, while the two-step method uses M. For all cases, the all-at-once and the merged all-at-once approaches use exactly the same amount of memory. All the algorithms are scalable in terms of the memory usage in the sense that the memory usage decreases proportionally when the number of processor cores is increased. The memory usages are halved to M for the all-at-once algorithm and M for the two-step method, respectively, when the number of processor cores is doubled from to , and they are further reduced to M and M, respectively, when we increase the number of processor cores to . The memory (denoted as "Mem" in Table 1) spent on the triple products includes the storage for the output matrix $B$. In order to understand how much memory overhead the triple product algorithms incur, the memory used to store the matrices $A$, $P$, and $B$ using , , and processor cores is shown in Table 2. It is easily found that $B$ takes M at processor cores, so the overhead of the all-at-once algorithm is M since the corresponding total memory of the triple product for the all-at-once approach is M. On the other hand, the two-step method has a much higher memory overhead, M ( M - M). The memory usages for storing all the matrices are scalable, that is, they are ideally halved when we double the number of processor cores. For example, the memory required to store $A$ is M at processor cores, and it is reduced to M when we increase the number of processor cores from to . It is further reduced to M and M when we increase the number of processor cores to and . We observe the same trend in the memory usages for storing $P$ and $B$, that is, they are perfectly halved when the number of processor cores is doubled.
Similarly, the compute times on the triple products are also perfectly scalable for all the algorithms. The compute times using are s for the all-at-once method and s for the two-step method, and they are reduced to s and s when processor cores are used. This leads to parallel efficiencies of for the all-at-once method and for the two-step method. When we continue increasing the number of processor cores to and , the compute times are reduced to s and s for the all-at-once algorithm, and s and s for the two-step method. A perfect scalability with parallel efficiencies of above for all the algorithms is achieved even when the number of processor cores is up to . The compute time of the all-at-once approach is very close to that using the merged all-at-once algorithm, and there are no meaningful differences. The two-step method is slightly faster than the all-at-once and the merged all-at-once algorithms. These speedups and parallel efficiencies are also shown in Fig. 1.
Another interesting observation is that the time spent on the symbolic calculation for the all-at-once method is less than that spent by the two-step method. For example, for the -core case, s is used for the symbolic calculations in the all-at-once approach while the two-step method takes s. The times spent on both the symbolic and the numeric calculations are ideally scalable. In summary, the new algorithms are perfectly scalable in terms of the memory usage and the compute time, and they also use much less memory than the traditional two-step method. The memory comparison between all the algorithms is also summarized in Fig. 2, where it is easily seen that the two-step method consumes much more memory than the all-at-once and the merged all-at-once methods for all processor counts.
To further confirm the performance of the new algorithms, larger meshes are used in the following test. A 3D structured grid is used as the coarse mesh, and the fine mesh is a uniform refinement of the coarse mesh. The dimensions of $A$ on the fine mesh are , and the dimensions of $P$ from the coarse mesh to the fine mesh are . The same configuration as before is adopted. The numerical results are shown in Table 3, and the memory used to store the matrices $A$, $P$, and $B$ is recorded in Table 4.
The two-step method was not able to run using processor cores since it attempted to allocate more memory than the physical memory available. The parallel efficiencies for the two-step method are therefore computed based on the compute time using processor cores. Again, the memory usage of the all-at-once algorithm is the same as that of the merged all-at-once algorithm. It is M at processor cores, and it is reduced to M, M, and M when the number of processor cores is increased to , , and . In fact, it is ideally halved when we double the number of processor cores. The two-step method consumes nine times as much memory as the all-at-once algorithm. The detailed comparison of the memory usages for all the algorithms is drawn in Fig. 4, where it is clear that the memory used in the two-step method is significantly more than that used in the all-at-once and the merged all-at-once algorithms. The times spent on the symbolic and the numeric calculations are scalable for all the algorithms with up to processor cores. For example, the all-at-once algorithm uses s for the symbolic calculations at processor cores, and the time is reduced to s, s, and s when we increase the number of processor cores to , , and . The time for the numeric calculations of the all-at-once algorithm at processor cores is s, and it is perfectly halved to s when we double the number of processor cores to . When we continue increasing the number of processor cores to and , the time for the numeric calculations is further reduced to s and s. For the two-step method, the time for the symbolic computations at processor cores is s, and it is reduced to s and s when we use and processor cores. For the numeric calculations of the two-step method, the time is s at processor cores, and it is reduced to s and s when the number of processor cores is increased to and . Again, the merged all-at-once algorithm has performance similar to the all-at-once algorithm in terms of the memory usage and the compute time.
The all-at-once and the merged all-at-once algorithms are faster than the two-step method in the symbolic calculations, and a little slower in the numeric calculations. The overall calculation time is scalable because both the symbolic and the numeric calculations are perfectly scalable. Perfect scalability in terms of the compute time is obtained for all the algorithms at all processor counts. The corresponding speedups and parallel efficiencies are also drawn in Fig. 3.
4.2 Neutron transport problem
Next, we present the performance of the proposed algorithms in a realistic neutron transport simulation. The numerical experiments in this section are carried out on a supercomputer at INL (Idaho National Laboratory), where each compute node has two 20-core processors running at 2.4 GHz, and the compute nodes are connected by an OmniPath network.
4.2.1 Problem description
We consider the multigroup neutron transport equations here; the neutron transport equations are a memory-intensive application because many variables are associated with each mesh vertex, arising from the discretization of the neutron flying direction. The multigroup neutron transport equations, defined in $\mathcal{D} \times \mathcal{S}$ (where $\mathcal{D}$ is the 3D spatial domain, e.g., as shown in Fig. 5, and $\mathcal{S}$ is the 2D unit sphere representing the neutron flying directions), read as
where $g = 1, 2, \dots, G$, and $G$ is the number of energy groups. The fundamental quantity of interest is the neutron angular flux $\Psi_g$. In Eq. (10a), $\vec{\Omega}$ denotes the independent angular variable, $\vec{r}$ is the independent spatial variable, $\Sigma_{t,g}$ is the macroscopic total cross section, $\Sigma_{s,g' \to g}$ is the macroscopic scattering cross section from group $g'$ to group $g$, $\Sigma_{f,g'}$ is the macroscopic fission cross section, $\Phi_{g'}$ is the scalar flux defined as $\Phi_{g'} = \int_{\mathcal{S}} \Psi_{g'} \, d\Omega$, $f$ is the scattering phase function, $\nu$ is the average number of neutrons emitted per fission, $\chi_g$ is the prompt fission spectrum, and $k_{\mathrm{eff}}$ is the eigenvalue (sometimes referred to as the multiplication factor). In Eq. (10b), $\vec{n}$ is the outward unit normal vector on the boundary $\partial\mathcal{D}$ of $\mathcal{D}$, $\alpha^s$ is the specular reflectivity on $\partial\mathcal{D}$, and $\alpha^d$ is the diffusive reflectivity on $\partial\mathcal{D}$. The first term of (10a) is the streaming term, and the second is the collision term. The first term on the right-hand side of Eq. (10a) is the scattering term, which couples the angular fluxes of all directions and energy groups together. The second term on the right-hand side of (10a) is the fission term, which also couples the angular fluxes of all directions and energy groups together. Eq. (10b) describes the boundary conditions. The multigroup neutron transport equations are discretized using first-order Lagrange finite elements in space and the discrete ordinates method (which can be thought of as a collocation method) in angle (Wang et al. (2018); Lewis and Miller (1984); Kong et al. (2019b)). The discretizations are implemented based on RattleSnake (Wang et al. (2018)), MOOSE (Multiphysics Object Oriented Simulation Environment) (Peterson et al. (2018)), and libMesh (Kirk et al. (2006)). After the angular and spatial discretizations, we obtain a large-scale sparse eigenvalue system that is solved using an inexact Newton-Krylov method (Cai et al. (1998); Knoll and Keyes (2004); Kong (2016); Kong et al. (2018a, 2019b)). During each Newton iteration, the Jacobian system is solved using GMRES (Saad and Schultz (1986)), and to speed up the convergence of GMRES, the multilevel Schwarz preconditioner (Kong et al. (2019b)) is employed. The triple products are performed in the setup phase of the multilevel Schwarz preconditioner. We next report the performance of the new all-at-once algorithms for the triple products within the multilevel method.
4.2.2 Memory usage and scalability study
Below, we study the memory usage of the all-at-once and the merged all-at-once algorithms compared with the two-step approach. In this test, a mesh with 25,856,505 nodes and 26,298,300 elements is employed, and 96 variables are associated with each mesh vertex. The large-scale sparse eigenvalue system has 2,482,224,480 unknowns, which are computed with 4,000, 6,000, 8,000, and 10,000 processor cores, respectively. A flux moment of the physics solution is shown in Fig. 6. The preconditioning matrix is coarsened algebraically, and a twelve-level method is generated using 11 sparse matrix triple products, where the interpolations and operator matrices are formed. Interested readers are referred to (Kong et al. (2019b)) for the coarsening algorithms and the preconditioner setup. The details of the 12 operator matrices and the 11 interpolation matrices for the twelve-level method are shown in Tables 5 and 6. Here the levels are numbered from fine to coarse, that is, "level 0" is the finest level and "level 11" is the coarsest level. "rows" is the number of rows of a matrix, "cols" is the number of columns of a matrix, "nonzeros" denotes the number of nonzero elements of a matrix, "cols" is the minimum number of nonzero columns among all rows, "cols" is the maximum number of nonzero columns among all rows, and "cols" is the average number of nonzero columns over all rows.
The numerical results are summarized in Tables 7 and 8. It is easily observed in Table 7 that the traditional two-step algorithm uses twice as much memory as the all-at-once and the merged all-at-once algorithms for all processor counts. For instance, the two-step algorithm uses M of memory at 4,000 processor cores, while the all-at-once and the merged all-at-once algorithms use M, which is only of that used in the two-step method. This behavior is similar for all processor counts. The memory usages for all the algorithms are scalable when the number of processor cores is increased. They are M, M, and M at 4,000 processor cores for the two-step, the all-at-once, and the merged all-at-once algorithms, respectively, and are proportionally reduced to M, M, and M when we increase the number of processor cores to 6,000. The memory usages continue to decrease to M, M, and M when the number of processor cores is increased to . The memory usages at processor cores are slightly increased, but this does not affect the overall memory scalability. Meanwhile, the compute times of the all-at-once and the merged all-at-once approaches are almost the same as that of the two-step method. More importantly, the compute times for all three algorithms are aggressively optimized so that they account for a negligible portion of the total compute time; the triple products do not affect the overall scalability and performance of the simulations. The compute times spent on the sparse matrix triple products are sometimes within the machine noise range, so we occasionally observe a slightly larger total compute time even when the compute time of the triple products is smaller. For example, the total compute time using the merged all-at-once algorithm is s at 6,000 processor cores while that of the two-step method is s, even though the compute time of the triple products using the two-step method is smaller than that of the merged all-at-once algorithm.
We therefore conclude that the proposed new algorithms, the all-at-once and the merged all-at-once approaches, significantly save memory without any compromise on the compute time. The total compute time at 4,000 processor cores for the all-at-once algorithm is s, and it is reduced to s, s and s at 6,000, 8,000 and 10,000 processor cores, respectively. The compute times of the merged all-at-once and the two-step algorithms are close to that of the all-at-once approach for all processor counts; that is, they are scalable as well. Good scalability and parallel efficiency in terms of the compute time are maintained for all the algorithms, as shown in Fig. 7. A parallel efficiency of at least is achieved at 10,000 processor cores.
The memory usage comparison for the different triple product algorithms is also shown in Fig. 8, where it is obvious that the memory used in the two-step method is twice that used in the new algorithms for all processor counts.
In the previous test, the intermediate data is freed after the preconditioner setup. In some applications, however, the preconditioner setup is repeated for each nonlinear iteration, and the corresponding triple products are repeated as well. In this case, we can choose to cache the necessary intermediate data so that the symbolic calculations do not need to be redone from scratch, which saves compute time. In Table 8, we examine the amount of memory required by all the algorithms when the intermediate data is cached. The memory usage is almost doubled compared with that used without storing temporary data. For example, at 4,000 processor cores, M memory is allocated for the two-step method when caching the intermediate data, while it is M when temporary data is not stored. The memory usage of the new algorithms increases similarly. However, the memory usages of the new algorithms are still much lower than that of the two-step method; more precisely, they are half of that of the two-step method for all processor counts. While the compute time spent on the triple products is a tiny portion of the total compute time, the corresponding memory usage takes a large chunk of the total memory, as shown in Fig. 10. For example, at 10,000 processor cores, the triple products account for a large fraction of the total memory for the two-step method, and a correspondingly smaller fraction for the new algorithms. This is exactly the motivation for optimizing the memory usage of the triple products.
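The symbolic/numeric split described above can be sketched as follows. This is a minimal serial illustration in Python, not the paper's parallel implementation; the matrix representation (dicts keyed by (row, col)) and all function names are hypothetical. Production libraries expose the same split, e.g. PETSc's MatPtAP can be called with MAT_INITIAL_MATRIX once and MAT_REUSE_MATRIX on later passes.

```python
def symbolic_pattern(R, A, P):
    """One-time symbolic pass: the set of (i, j) positions of C = R*A*P
    that can be nonzero. R, A, P are dicts mapping (row, col) -> value."""
    pattern = set()
    for (i, k) in R:
        for (k2, l) in A:
            if k2 != k:
                continue
            for (l2, j) in P:
                if l2 == l:
                    pattern.add((i, j))
    return pattern

def numeric_product(R, A, P, pattern):
    """Repeated numeric pass: accumulate values into the cached pattern,
    skipping all sparsity discovery."""
    C = {key: 0.0 for key in pattern}
    for (i, k), r in R.items():
        for (k2, l), a in A.items():
            if k2 != k:
                continue
            for (l2, j), p in P.items():
                if l2 == l:
                    C[(i, j)] += r * a * p
    return C

# Tiny example: R is 1x2, A is 2x2 diagonal, P is 2x1, so C is 1x1.
R = {(0, 0): 1.0, (0, 1): 1.0}
A = {(0, 0): 2.0, (1, 1): 3.0}
P = {(0, 0): 1.0, (1, 0): 1.0}
pat = symbolic_pattern(R, A, P)      # computed once, cached
C = numeric_product(R, A, P, pat)    # repeated each nonlinear iteration
print(C)  # -> {(0, 0): 5.0}
```

The cached pattern is exactly the "intermediate data" whose storage roughly doubles the memory footprint reported in Table 8, in exchange for cheaper repeated numeric passes.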
Again, for all cases, the simulation is scalable in terms of the compute time and the memory usage. The corresponding parallel efficiencies and speedups are drawn in Fig. 9. We finally conclude that the newly proposed algorithms, the all-at-once and the merged all-at-once, require compute times similar to those of the two-step method while consuming much less memory.
5 Final remarks
Two memory-efficient all-at-once algorithms are introduced, implemented and discussed for the sparse matrix triple products in multigrid/multilevel methods. The all-at-once triple product methods, based on hash tables, form the output matrix with one pass through the input matrices, where the first matrix-matrix multiplication is carried out using the row-wise approach and the second multiplication is accomplished with an outer product. To save memory and reduce communication, the all-at-once approach and its merged version neither explicitly transpose the interpolation matrix nor create any auxiliary matrices. Compared with the traditional two-step method, the proposed new algorithms use much less memory while maintaining the computational efficiency; e.g., they consume only 10% of the memory used in the two-step method for the model problem and 30% for the realistic neutron transport simulation. We have shown that the all-at-once algorithms are scalable in terms of the compute time and the memory usage for the model problem with 27 billion unknowns on a supercomputer with up to processor cores and for the realistic neutron transport problem with 2 billion unknowns using processor cores.
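The one-pass structure summarized above can be illustrated with a minimal serial sketch, assuming a Galerkin coarse operator C = P^T A P with R = P^T. This is not the authors' parallel implementation; matrices are lists of per-row hash tables (Python dicts), and the function name is hypothetical. The first product is row-wise (a row of A combines rows of P), and the second is an outer product: column k of P^T is just row k of P, so P is never explicitly transposed and no intermediate matrix A*P is stored.

```python
def galerkin_triple_product(A, P, n_coarse):
    """Rows of C = P^T * A * P, formed in one pass over A.

    A and P are lists of row dicts {col: value}; n_coarse is the
    number of coarse rows (columns of P).
    """
    C = [dict() for _ in range(n_coarse)]  # one hash table per output row
    for k, A_row in enumerate(A):
        # Row-wise first product: row k of (A * P), kept only transiently.
        AP_row = {}
        for l, a in A_row.items():
            for j, p in P[l].items():
                AP_row[j] = AP_row.get(j, 0.0) + a * p
        # Outer-product second step: column k of P^T is row k of P,
        # so scatter AP_row into every coarse row i with P[k][i] != 0.
        for i, r in P[k].items():
            C_row = C[i]
            for j, v in AP_row.items():
                C_row[j] = C_row.get(j, 0.0) + r * v
    return C

# Tiny example: A = diag(2, 3), P aggregates both fine rows into one
# coarse unknown, so C = P^T A P = [2 + 3] = [5].
A = [{0: 2.0}, {1: 3.0}]
P = [{0: 1.0}, {0: 1.0}]
print(galerkin_triple_product(A, P, 1))  # -> [{0: 5.0}]
```

Each row of the transient product A*P is consumed immediately by the scatter step, which is the source of the memory savings relative to a two-step method that materializes the full intermediate matrix.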
This manuscript has been authored by Battelle Energy Alliance, LLC under Contract No. DE-AC07-05ID14517 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, and worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
This research made use of the resources of the High-Performance Computing Center at Idaho National Laboratory, which is supported by the Office of Nuclear Energy of the U.S. Department of Energy and the Nuclear Science User Facilities under Contract No. DE-AC07-05ID14517. This research also made use of the resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
- Adams et al. (2004) Adams MF, Bayraktar HH, Keaveny TM and Papadopoulos P (2004) Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom. In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, p. 34.
- Akbudak and Aykanat (2014) Akbudak K and Aykanat C (2014) Simultaneous input and output matrix partitioning for outer-product–parallel sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing 36(5): C568–C590.
- Balay et al. (2019) Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Buschelman K, Dalcin L, Dener A, Eijkhout V, Gropp WD, Karpeyev D, Kaushik D, Knepley MG, May DA, McInnes LC, Mills RT, Munson T, Rupp K, Sanan P, Smith BF, Zampini S, Zhang H and Zhang H (2019) PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.11, Argonne National Laboratory. URL http://www.mcs.anl.gov/petsc.
- Ballard et al. (2016a) Ballard G, Druinsky A, Knight N and Schwartz O (2016a) Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC) 3(3): 18.
- Ballard et al. (2016b) Ballard G, Siefert C and Hu J (2016b) Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM Journal on Scientific Computing 38(3): C203–C231.
- Borštnik et al. (2014) Borštnik U, VandeVondele J, Weber V and Hutter J (2014) Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Computing 40(5-6): 47–58.
- Buluç and Gilbert (2012) Buluç A and Gilbert JR (2012) Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing 34(4): C170–C191.
- Cai et al. (1998) Cai XC, Gropp WD, Keyes DE, Melvin RG and Young DP (1998) Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM Journal on Scientific Computing 19(1): 246–265.
- Challacombe (2000) Challacombe M (2000) A general parallel sparse-blocked matrix multiply for linear scaling SCF theory. Computer Physics Communications 128(1-2): 93–107.
- Karypis and Schloegel (2013) Karypis G and Schloegel K (2013) PARMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library. Technical Report Version 4.0, University of Minnesota. URL http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.
- Kirk et al. (2006) Kirk BS, Peterson JW, Stogner RH and Carey GF (2006) libMesh: A C++ Library for Parallel Adaptive Mesh Refinement/Coarsening Simulations. Engineering with Computers 22(3–4): 237–254. https://doi.org/10.1007/s00366-006-0049-3.
- Knoll and Keyes (2004) Knoll DA and Keyes DE (2004) Jacobian-free Newton-Krylov methods: a survey of approaches and applications. Journal of Computational Physics 193(2): 357–397.
- Kong (2016) Kong F (2016) A Parallel Implicit Fluid-structure Interaction Solver with Isogeometric Coarse Spaces for 3D Unstructured Mesh Problems with Complex Geometry. PhD Thesis, University of Colorado Boulder.
- Kong and Cai (2016) Kong F and Cai XC (2016) A highly scalable multilevel Schwarz method with boundary geometry preserving coarse spaces for 3D elasticity problems on domains with complex geometry. SIAM Journal on Scientific Computing 38(2): C73–C95.
- Kong and Cai (2017) Kong F and Cai XC (2017) A scalable nonlinear fluid-structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D. Journal of Computational Physics 340: 498–518.
- Kong and Cai (2018) Kong F and Cai XC (2018) Scalability study of an implicit solver for coupled fluid-structure interaction problems on unstructured meshes in 3D. The International Journal of High Performance Computing Applications 32(2): 207–219.
- Kong et al. (2018a) Kong F, Kheyfets V, Finol E and Cai XC (2018a) An efficient parallel simulation of unsteady blood flows in patient-specific pulmonary artery. International Journal for Numerical Methods in Biomedical Engineering 34(4): 29–52.
- Kong et al. (2019a) Kong F, Kheyfets V, Finol E and Cai XC (2019a) Simulation of unsteady blood flows in a patient-specific compliant pulmonary artery with a highly parallel monolithically coupled fluid-structure interaction algorithm. International Journal for Numerical Methods in Biomedical Engineering URL https://doi.org/10.1002/cnm.3208.
- Kong et al. (2018) Kong F, Stogner RH, Gaston DR, Peterson JW, Permann CJ, Slaughter AE and Martineau RC (2018) A general-purpose hierarchical mesh partitioning method with node balancing strategies for large-scale numerical simulations. In: 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 11-16 November 2018, Dallas, Texas.
- Kong et al. (2019b) Kong F, Wang Y, Gaston DR, Permann CJ, Slaughter AE, Lindsay AD and Martineau RC (2019b) A highly parallel multilevel Newton-Krylov-Schwarz method with subspace-based coarsening and partition-based balancing for the multigroup neutron transport equations on 3D unstructured meshes. arXiv preprint arXiv:1903.03659.
- Kong et al. (2018b) Kong F, Wang Y, Schunert S, Peterson JW, Permann CJ, Andrš D and Martineau RC (2018b) A fully coupled two-level Schwarz preconditioner based on smoothed aggregation for the transient multigroup neutron diffusion equations. Numerical Linear Algebra with Applications 25(3): 21–62.
- Lewis and Miller (1984) Lewis EE and Miller WF (1984) Computational Methods of Neutron Transport. New York, NY: John Wiley and Sons Inc.
- McCourt et al. (2013) McCourt M, Smith B and Zhang H (2013) Efficient sparse matrix-matrix products using colorings. SIAM Journal on Matrix Analysis and Applications.
- Peterson et al. (2018) Peterson JW, Lindsay AD and Kong F (2018) Overview of the incompressible Navier–Stokes simulation capabilities in the MOOSE framework. Advances in Engineering Software 119: 68–92.
- Saad and Schultz (1986) Saad Y and Schultz MH (1986) GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7(3): 856–869.
- Smith et al. (2004) Smith B, Bjorstad P, Gropp WD and Gropp W (2004) Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge university press.
- Stuben (1999) Stuben K (1999) Algebraic multigrid (AMG): An Introduction with Applications. GMD-Forschungszentrum Informationstechnik.
- Tuminaro and Tong (2000) Tuminaro RS and Tong C (2000) Parallel smoothed aggregation multigrid: Aggregation strategies on massively parallel machines. In: SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. IEEE, pp. 5–5.
- Henson and Yang (2002) Henson VE and Yang UM (2002) BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics 41(1): 155–177.
- Wang et al. (2018) Wang Y, Schunert S and Laboure V (2018) Rattlesnake Theory Manual. Technical report, Idaho National Laboratory (INL), Idaho Falls, ID (United States).