1 Introduction
In fluid flow simulations, the incompressibility constraint is often used to bypass fast acoustic waves and thereby allow the solution to evolve on a convective timescale that is most relevant for many engineering problems. This model leads to a Poisson problem for the pressure that is invariably the stiffest substep in timeadvancement of the NavierStokes equations. For large 3D problems, which mandate the use of iterative methods, the pressure solve thus typically encompasses the majority of the solution time.
For the spectral element (SE) discretization, the Poisson system matrix contains nonzeros for elements with a polynomial degree of (i.e., approximately
unknowns.) Through the use of tensorproductsum factorization, however, the SE matrixvector product can be effected in only
reads and operations, even in the case of complex geometries [parter_spectral_1979, deville_highorder_2002]. Consequently, the key to fast SEbased flow simulations is to find effective preconditioners tailored to this discretization. Many methods, including geometric multigrid approaches with pointwise Jacobi and Chebyshevaccelerated Jacobi smoothers [sundar_comparison_2015, adams_parallel_2003, Kronbichler2019], geometric multigrid with overlapping Schwarz smoothers [lottes_hybrid_2005, stiller_nonuniformly_2017, loisel_hybrid_2008], and preconditioning via loworder discretizations [parter_spectral_1979, olson_algebraic_2007, bellomaldonado_scalable_2019, canuto_finiteelement_2010], have been considered.In this work, we explore the parallel performance of these methods and extend the Schwarzsmoothing based multigrid of [lottes_hybrid_2005] to support restrictiveadditive Schwarz ideas of Cai and Sarkis [ras99] as well as Chebyshevaccelerated smoothing. In addition, we extend the loworder preconditioning strategy of [bellomaldonado_scalable_2019] to run on GPU architectures through the use of AmgX [naumov_amgx_2015]
as an algebraric multigrid (AMG) solver for the sparse system. Numerical results are shown for the SEbased pressure Poisson problem as well as for the NavierStokes equation. All methods considered are implemented by the authors in the scalable opensource CFD code, nekRS
[fischer_nekrs_2021]. nekRS started as a fork of libParanumal [libp] and uses highly optimized kernels based on the Open Concurrent Compute Abstraction (OCCA) [medina2014occa]. Special focus is given to performance and scalability on largescale GPUbased platforms such as OLCF’s Summit.The structure of this paper is as follows. Section 2 outlines the spectral element formulation for the Poisson problem in , as well as describe the various preconditioners considered. A brief description of several model probems of interest are presented in section 3. Numerical results are highlighted in section 4. Finally, a brief summary of the survey of solver techniques considered is provided in section 6.
2 Background and Implementation
We introduce here basic aspects of the SE Poisson discretization and associated preconditioners.
2.1 SE Poisson Discretization
Consider the Poisson equation in ,
(1) 
Boundary conditions for the pressure Poisson equation are either periodic, Neumann, or Dirichlet, with the latter typically applicable only on a small subset of the domain boundary, , correspondong to an outflow condition. Consequently, Neumann conditions apply over the majority (or all) of the domain boundary, which makes the pressure problem more challenging than the standard allDirichlet case.
The SE discretization of (1) is based on the weak form: Find such that,
(2) 
where is a finitedimensional approximation comprising the basis functions used in the SE discretization, , , that vanish on and represents the discrete inner product based on GaussLobattoLegendre quadrature in the reference element, . The basis functions allow us to represent the solution, , as , leading to a linear system of unknown basis coefficients,
(3) 
with respective mass and stiffnessmatrix entries, , and .
is tesellated into nonoverlapping hexahedral elements, , for , with isoparametric mappings from to provided by , for . Each is a thorder Lagrange cardinal polynomial on the GaussLobattoLegendre (GLL) qudrature points, . Similarly, the test and trial functions are written in local form as . Continuity is ensured across the interface between adjacent elements by enforcing when
. From this, a globaltolocal degreeoffreedom mapping,
, can be represented by a Boolean matrix , such that . The assembled stiffness matrix is then , where comprises the local stiffness matrices, . Similarly, . The SE formulation uses coincident GLL quadrature and nodal points, such that is diagonal. Moreover, is never formed, as it would contain nonzeros in the general case. Rather, the tensorproductsum factorization [parter_spectral_1979] allows for to be evaluated in time with storage, as described in detail in [deville_highorder_2002].2.2 Preconditioners
In the current study, all preconditioners are applied in the context of restarted GMRES. Although is symmetric positive definite (SPD), many of the preconditioners are asymmetric. Further, a recent study [fischer_highly_2021] has shown the benefits of projectionbased GMRES over flexible conjugate gradients (FCG) because the effectiveness of the overall pressure solution strategy (which includes projection onto prior solutions [fischer_projection_1998]) generally ensures a low enough iteration count, , such that the costs in GMRES are not overly onerous. We note, however, that FCG may yield very low iteration counts in certain cases (e.g., 1–2 iterations per step, as in [fischer_nekrs_2021]), in which case we use FCG rather than GMRES if it yields faster runtimes.
2.2.1 Semfem
In [parter_spectral_1979], Orszag suggested that constructing a sparse preconditioner based on the loworder discretizations with nodes coinciding with those of the highorder discretization would yield bounded condition numbers and, under certain constraints, can yield for secondorder Dirichlet problems. This observation has led to the development of preconditioning techniques based on solving the resulting loworder system [olson_algebraic_2007, bellomaldonado_scalable_2019, canuto_finiteelement_2010].
In the current work, we employ the same loworder discretization
considered in [bellomaldonado_scalable_2019]. Each of the vertices of
the hexahedral element is used to form one loworder, tetrahedral element,
resulting in a total of eight loworder elements for each GLL subvolume
in each of the highorder hexahedral elements.
This loworder discretization is then used to form the sparse operator, .
The socalled weak preconditioner, , is used to precondition the system.
Algebraric multigrid (AMG), implemented in CUDA in AmgX [naumov_amgx_2015],
is used with the following setup to solve the low order system:

PMIS coarsening

0.25 strength threshold

Damped Jacobi relaxation ()

One Vcycle for preconditioning

Smoothing on the coarsest level
We denote this preconditioning strategy as SEMFEM.
2.2.2 multigrid, Schwarz Smoothers
Another preconditioning strategy for the SEbased Poisson problem is to use geometric multigrid (pMG). The classical single pass Vcycle is summarized in Algorithm 1. In our application, we limit pMG preconditioning to a single Vcycle pass. Alternative strategies, such as F or Wcycle multigrid are not considered.
The SEbased additive Schwarz method (ASM) presented in [lottes_hybrid_2005, loisel_hybrid_2008] solves local Poisson problems on subdomains that are extensions of the spectral elements. The formal definition of the ASM preconditioner (or pMG smoother) is
(4) 
where is the restriction matrix that extracts nodal values of the residual vector that correspond to each overlapping domain, as indicated in Fig. 1b. To improve the smoothing properties of the ASM, we introduce the diagonal weight matrix, , which scales each nodal value by the inverse of the number of subdomains that share that node. Although it compromises symmetry, postmultiplication by was found to yield superior results to pre and postmultiplication by [stiller_nonuniformly_2017, lottes_hybrid_2005].
In a standard Galerkin ASM formulation, one would use , but such an approach would compromise the storage complexity of the SE method. To construct fast inverses for , we approximate each deformed element as a simple boxlike geometry, as demonstrated in Fig. 1a. These boxes are then extended by a single degreeoffreedom in each spatial dimension to form overlapping subdomains with interior degreesoffreedom in each domain. (See Fig. 1). The approximate box domain enables the use of the fast diagonalization method (FDM) to solve for each of the overlapping subdomains, which can be applied in time in [lottes_hybrid_2005]. The extendedbox Poisson operator is separable, with the 3D form
(5) 
where each represents the extended 1D massstiffness matrix pairs along the given dimension [lottes_hybrid_2005]. The FDM begins with a preprocessing step of solving a series of small,
, generalized eigenvalue problems,
(6) 
and defining and , to yield the similarity transforms
(7) 
From these, the inverse of the local Schwarz operator is
(8) 
where
(9) 
is a diagonal matrix. This process is repeated for each element, at each multigrid level save for the coarsest one. Note that the perelement storage is only for the matrices and for . At each multigrid level, the local subdomain solves are used as a smoother. On the coarsest level (), however, BoomerAMG [henson_boomeramg_2002] is used to solve the system with the same parameters as the AmgX solver in section 2.2.1, except using Chebyshev smoothing. Unless otherwise noted, all pMG preconditioners use a single BoomerAMG Vcycle iteration in the coarsegrid solve.
Presently, we also consider a restrictive additive Schwarz (RAS) version of (4), wherein overlapping values are not added after the action of the local FDM solve, following [ras99]. RAS has the added benefit of reducing the amount of communication required in the smoother. (Formally, RAS can be implemented by simply changing .) Note that, in the case of ASM and RAS smoothers without Chebyshevacceleration, an additive Vcycle with no postsmoothing is used to avoid residual reevaluation in Algorithm 1 [fischer04].
2.2.3 Chebyshev Acceleration
A notable improvement over standard Jacobismoothed multigrid is to use Chebyshevacceleration [sundar_comparison_2015, adams_parallel_2003], as described in Algorithm 2 for a given surrogate smoother, . While is typically based on Jacobi smoothing (e.g.,[Kronbichler2019]), it is also possible to consider the use of overlappingSchwarz (4) as the smoother, which is a new approach that we explore here.
Algorithm 2 requires approximate spectral bounds, , of the smoothed operator, . Using 10 rounds of Arnoldi iteration, we generate as an initial proxy for . The bounds employed in Algorithm 2 are then determined through sensitivity analysis similar to that performed by Adams and coworkers [adams_parallel_2003]. The analysis is conducted using a challenging model problem, namely the Kershaw case of subsection 3.1, with , , and relative (norm) residual tolerance of . A secondorder Chebyshevaccelerated ASM smoother is used in the pMG Vcycle preconditioner, which we denote as the current default preconditioner in nekRS.
The results of the sensitivity study are shown in Table 1. The required iteration counts are seen to be sensitive to underestimation of , as also found by Adams and coworkers [adams_parallel_2003]
. On the other hand, results are less sensitive to the minimum eigenvalue estimate, provided
, as this delimits between the low frequencies that are handled by the coarse grid correction and the high frequencies that are eliminated by the action of the smoother. The choice of the bounds (relative to ) is not universal: (1/30,1.1) [adams_parallel_2003], (0.3,1) [baker_multigrid_2011], (0.25,1) [sundar_comparison_2015], and (1/6,1) [zhukov_multigrid_2015] have all been considered. Based on the results of Table 1, we set as a conservative choice.            
      110  64  48  
      50  40  38  
    124  40  38  38  
159  45  43  42  43  44  
47  45  44  44  45  46 
3 Test Cases
We describe four model problems that are used to test the SE preconditioners. The first is a standalone Poisson solve, using variations of the Kershaw mesh. The others are modestscale NavierStokes problems, where the pressure Poisson problem is solved over multiple timesteps. The problem sizes are listed in Table 2 and range from relatively small (=21M points) to moderately large (=645M).^{1}^{1}1Larger cases for recent fullscale runs on Summit with =51B are reported in [fischer_highly_2021].
3.1 Poisson
The Kershaw family of meshes [kolev_ceed_2021, kershaw_differencing_1981]. has been proposed as the basis for a highorder Poissonsolver benchmark by Center for Efficient Exascale Discretization (CEED) within the DOE Exascale Computing Project (ECP). This family is parametrized by an anisotropy measure, , that determines the degree of deformation in the and directions. As decreases, the mesh deformation and aspect ratio increase along with it. The Kershaw mesh is shown in Fig. 2 for several values of . The domain with Dirichlet boundary conditions on . The right hand side for (1) is set to
(10) 
where is a random, continous vector vanishing on . The linear solver terminates after reaching a relative residual reduction of . Since this test case solves the Poisson equation, there is no timestepper needed for the model problem.
3.2 NavierStokes
Case Name  

146 pebble (Fig. 3a)  62K  7  21M 
1568 pebble (Fig. 3b)  524K  7  180M 
67 pebble (Fig. 3c)  122K  7  42M 
Speed bump (Fig. 3d)  885K  9  645M 
For the pressurePoisson tests, four flow cases are considered, as depicted in Fig. 3. The first three cases corresponds to turbulent flow through a cylindrical packedbed with 146, 1568, and 67 spherical pebbles. The 146 and 1568 pebble cases are from Lan and coworkers [lan_allhex_2021]. The 67 pebble case is constructed using a tettohex meshing strategy by Yuan and coworkers [yuan2020spectral]. The first two bed flows are at Reynolds number , based on sphere diameter, , while the 67 pebble case is at Reynolds number . Time advancement is based on a twostage 2ndorder characteristics timestepper with CFL=4 ( , and for the 146, 1568, and 67 pebble cases). An absolute pressure solver tolerance of is used. A restart at , , and convective time units is used for the 146, 1568, and 67 pebble cases, respectively, to provide an initially turbulent flow.
The fourth case, shown in Fig. 3d, is a direct numerical simulation (DNS) of seperated turbulent flow over a speed bump at . This test case was designed by Boeing to provide a flow that exhibits separation. A DNS of the full 3D geometry, however, remains difficult [shur_direct_2021]. Therefore, this smaller example proves a useful application for benchmarking solver performance. This case uses a 2ndorder timestepper with CFL=0.8 () and an absolute pressuresolve tolerance of . A restart at convective time units is used for the initial condition.
In all cases, solver results are collected over 2000 timesteps. At each step, the solution is projected onto a space of up to 10 prior solution vectors to generate a highquality initial guess, . Projection is standard practice in nekRS as it can reduce the initial residual by orders of magnitude at the cost of just one or two matrixvector products in per step [fischer_projection_1998].
The perturbation solution, , is typically devoid of slowly evolving low wavenumber content. Moreover, the initial residual is oftentimes sufficiently small that the solution converges in iterations, such that the overhead of GMRES is small. Testing the preconditioners under these conditions ensures that the conclusions drawn are relevant to the application space.
4 Results
Here we consider the solver performance results for the test cases of Section 3. We assign a single MPI rank to each GPU and denote the number of ranks as . All runs are on Summit. Each node on Summit consists of 42 IBM Power9 CPUs and 6 NVIDIA V100 GPUs. We use 6 GPUs per node unless . In the following, we denote a pMG preconditioner using order Chebyshevaccelerated smoother with a multigrid schedule of as Cheby(),. A wide range of preconditioning strategies is considered.
4.1 Kershaw Mesh
The Kershaw study comprises six tests. For each of two studies, we consider the regular box case (
), a moderately skewed case (
), and a highly skewed case (). The first study is a standard weakscale test, where and are increased, while the polynomial order is fixed at and the number of gridpoints per GPU is set to . The range of processors is =6 to 384. The second study is a test of the influence of polynomial order on conditioning, with and fixed, while ranges from 3 to 10. Both cases use GMRES(20).The results of the weakscaling study are shown in Fig. 4. For all values of , the iteration count exhibits a dependence on problem size, as seen in Fig. 4a,d,g, especially in the highly skewed case (). The timepersolve also increases with , in part due to the increase in iteration count, but also due to increased communication overhead as increases. This trend is not necessarily monotonic, as shown in Fig. 4c,f. In the case of ChebyRAS(2),(7,3,1), a minor fluctuation in the iteration count from the to case causes the greater than unity parallel efficiency. For ChebyJac(2),(7,5,3,1), the effects of system noise on the run causes the greater than unity parallel efficiency for the run.
Table 3 indicates the maximum number of of neighboring processors for the assembly () graph of , which increases with . In addition, the number of grid points per GPU () is relatively low. These two factors cause an increased sensitivity of the problem to the additional communication overhead as the number of GPUs is initially increased. The number of neighbors, however, will saturate at larger (i.e., productionlevel) processor counts.
Lastly, the relative preconditioner performance depends on . Fig. 4b, e, h show that, for the easy case, a pMG scheme with a smoother that is cheap to apply is best, such as ChebyRAS(1),(7,3,1), ChebyJac(2),(7,5,3,1), and ASM,(7,3,1). However, as decreases, more robust pMG smoothers such as ChebyRAS(2),(7,3,1) and SEMFEM result in lower time to solution. Once (Fig. 4h), the problem is sufficiently challenging that SEMFEM overtakes the pMG based preconditioning schemes. This indicates that, in the highly skewed case in which the maximum element aspect ratio increases, the pMG preconditioner is not as effective as the SEMFEM preconditioner.
Nodes  1  2  4  8  16  32  64 
Max Neighbors  5  11  16  24  20  24  29 
The influence of polynomial order is illustrated in Fig. 5. For , iteration counts are essentially independent, as seen in Fig. 5a. For , however, a slight upward trend in the iteration count is observed for SEMFEM and for the pMG preconditioners with ASM, RAS, and ChebyshevJacobi smoothing (Fig. 5b). Similarly, exhibits a dependence between the iteration count and the polynomial order for all preconditioners (Fig. 5c). Figures 5d, e and f demonstrate that the time per pressure solve is strongly dependent polynomial order. For SEMFEM, there is an increase in time as a result of increased overhead for the underlying AMG solver. For the pMGbased methods, higher orders are generally faster. This performance gain can be attributed to surfacetovolume effects in the evaluation of the operators and smoothers. Application of and (smoother) is highly vectorizable, whereas application of (assembly) involves a significant amount of indirect addressing. The discrete surface to volume ratio for a spectral element of order is 296/512 60%. For lower this value is larger. This situation is exacerbated in the case of Schwarzbased smoothers, where the overlap contributes substantially to the work and communication for the smallelement (i.e., low) cases. In addition, nekRS [fischer_nekrs_2021] uses a single element per thread block, which limits the amount of work available for a streaming multiprocessor for relatively small polynomial orders.
4.2 NavierStokes Results
We consider scalability of nekRS for the cases of Fig. 3. All simulations except one use GMRES(15) with an initial guess generated by conjugate projection onto 10 prior solutions [fischer_projection_1998]. Due to memory constraints, the 1568pebble case with uses GMRES(10) with only 5 solutionprojection vectors. For each case, two pMG schedules are considered: and for ; and and for . Other parameters, such as the Chebyshev order and the number of coarse grid BoomerAMG Vcycles are also varied. Results are shown in Figs. 6,7. The plots relate the effective work rate per node, measured as the gridpoints (as shown in Table 2) solved per second per node, to the timetosolution. The yaxis notes the drop in the relative work rate, which corresponds to a lower parallel efficiency, as the strong scale limit is reached, while the xaxis denotes the timetosolution. Each node consists of 6 GPUs, hence .
In all the performance tests conducted, the pMG preconditioner with ChebyshevJacobi smoothing is outperformed by the other preconditioners, whether using one or two Vcycle iterations in the AMG coarsegrid solve. For each case, the fastest preconditioner scheme varies. In the 146 pebble case (Fig. 7a), using ChebyRAS(2),(7,5,3,1) yields the smallest time per pressure solve. However, in the 1568 pebble case (Fig. 7b), SEMFEM is a moderate improvement over the second best preconditioner, ChebyASM(2),(7,5,3,1). pMG with a schedule and ChebyshevRAS (of any order) yield the best scalability and lowest time per pressure solve for the Boeing speed bump case (Fig. 6). The Chebyshevaccelerated Schwarz schemes are not always the fastest, however. For the 67 pebble case (Fig. 7c), ChebyJac(2),(7,5,3,1) is comparable to ChebyRAS(2),(7,5,3,1) and are the two fastest pMG based preconditioners. However, SEMFEM is significantly faster than the other preconditioners for this case.
Also considered is a hyrbid twolevel pMG/SEMFEM approach wherein SEMFEM is used as the solver for the coarse level. For , a schedule with 2nd order Chebyshevaccelerated ASM smoothing on the level and SEMFEM solver on the level, denoted as ChebyASM(2),(7,6) + SEMFEM, is used. Similarly, ChebASM(2),(7,5) + SEMFEM and ChebASM(2),(7,3) + SEMFEM are considered. For , a , , , and hybrid pMG/SEMFEM approach is considered, denoted as ChebASM(2),(9,8) + SEMFEM, ChebASM(2),(9,7) + SEMFEM, ChebASM(2),(9,5) + SEMFEM, and ChebASM(2),(9,3) + SEMFEM, respectively. In the pebble cases shown in Fig. 7a,c, this hybrid approach performs somewhere between the SEMFEM and ChebyASM(2),(7,3,1) preconditioners. In the Boeing speed bump case, however, Fig. 6 demonstrates that this approach is not as performant as either the SEMFEM or ChebyASM(2),(9,5,1) preconditioners.
Solver performance degrades whenever is sufficiently small, regardless of the solver considered. For the various preconditioners considered, at 2 nodes (), the strongscale limit of 80% efficiency is far surpassed in the 146 pebble case. This leaves the effective strong scale limit at 12 nodes ( to ). For the 1568 pebble case, 12 nodes () yields a parallel efficiency around 6070%, depending on the specific solver. The 67 pebble cases reaches 70% effieciency on 3 nodes (). The parallel efficiency for the Boeing speed bump case for the fastest timetosolution preconditioners drops below 70% when using more than 48 nodes ().
While the effect of mesh quality metrics, such as the max aspect ratio and scaled Jacobian, on solver convergence has been studied by Mittal et al. [mittal2019mesh], predicting the optimal preconditioner settings from mesh quality metrics is not obvious. While the maximum aspect ratio is an important metric for mesh quality, it alone cannot explain the apparent poor performance of the pMG preconditioners in the 67 pebble case, Fig. 7c. Consider, for example, that the Boeing speed bump case has a larger maximum aspect ratio (255) than the 67 pebble case (204), but does not exhibit this poor performance in the pMG preconditioners. The minimum scaled Jacobian, however, is at least an order of magnitude smaller in the 67 pebble case () as compared to the other cases (e.g., .996 for the Boeing speed bump case, and for the 146, 1568 pebble cases, respectively). This may, in turn, explain why the pMG preconditioners in the 67 pebble case were signficiantly suboptimal compared to the SEMFEM preconditioner. This demonstrates, however, that a user cannot simply rely on, e.g., the maximum aspect ratio when deciding whether or not to use SEMFEM as a preconditioner. This inability to correctly identify preconditioner settings based on mesh quality metrics alone motivates the introduction of an autotuner to choose preconditioner parameters during runtime.
Case Name  GLL Spacing (min/max)  Scaled Jacobian (min/max/avg)  Aspect Ratio (min/max/avg) 

146 pebble  / .32  / .977 / .419  1.07 / 56.9 / 7.14 
1568 pebble  / .3  / .99 / .371  1.12 / 108 / 12.6 
67 pebble  / .145  / .970 / .38  1.17 / 204 / 13.2 
Boeing Speed Bump  /  .996 / 1 / 0.999  6.25 / 255 / 28.1 
5 Autotuning for Production Simulations
The results of the preceding section provides a small window into the varieties of performance behavior encountered in actual production cases, which span a large range of problem sizes, domain topologies, mesh qualities. Moreover, production simulations are solved across a range of architectures having varying onnode and network performance, interconnect topologies, and processor counts. One frequently encounters situations where certain communication patterns might be slower under a particular MPI version on one platform versus another. Unless a developer has access to that platform, it’s difficult to measure and quantify the communication overhead. Processor count alone can be a major factor in preconditioner selection: large processor counts have relatively high coarsegrid solve costs that can be mitigated by doing more smoothing at the fine and intermediate levels. How much more is the open question. The enormity of parameter space, particularly “in the field” (i.e., users working on unknown platforms) limits the effectiveness of standard complexity analysis in selecting the optimal preconditioner for a given user’s application.
From the user’s perspective, there is only one application (at a time, typically), and one processor count of interest. Being able to provide optimized performance—tuned to the application at hand, which includes the processor count—is thus of paramount importance. Autotuning provides an effective way to deliver this performance. Autotuning of preconditioners has been considered in early work by Imakura at al. [imakura_autotuning_2012] and more recently by Yamada at al. [yamada2018preconditioner] and by Brown et al. [brown_tuning_2021]. The latter work couples their autotuning with local Fourier analysis to guide the tuning process.
Parameters  

Solver  preconditioned GMRES 
Preconditioner  pMG, SEMFEM 
pMG  : ChebyASM(2),(7,3,1), ChebyRAS(2), (7,3,1), ChebyJac(2), (7,5,3,1) 
: ChebyASM(2),(9,5,1), ChebyRAS(2), (9,5,1), ChebyJac(2), (9,7,5,1)  
Coarse grid  single boomerAMG Vcycle 
SEMFEM  single AmgX Vcycle 
In largescale fluid mechanics applications, autotuning overhead is typically amortized over – timesteps (i.e., pressure solves) per run (and more, over an entire simulation campaign). Moreover, autotuning is of particular importance for problems at large processor counts because these cases often have long queue times, which preclude making multiple job submissions in order to tweak parameter settings. Failure to optimize, however, can result in significant opportunity costs. For example, in [fischer_highly_2021] the authors realized a factor of 2.8 speedup in timeperstep for a 352,000pebblebed simulation (=51B gridpoints) through a sequence of tuning steps. Even if there were 100 configurations in the preconditioner parameter space, an autotuner could visit each of these in succession 5 times each within the first 500 steps and have expended only a modest increment in overhead when compared to the cost of submitting many jobs (for tuning) or to the cost of 10,000 steps for a production run.
As a preliminary step, the authors consider constructing a small subset of the true search space (which could include, e.g., other cycles, schedules, or smoothers) for use in a nascent autotuner. Across all cases, with exception to the 67 pebble case, pMG preconditioning with Chebyshevaccelerated ASM or RAS smoothing is the fastest solver or is comparable to SEMFEM. The choice of Chebyshev order and multigrid schedule, moreover, contributes only a modest 1020% improvement to the overall timetosolution in most cases, all else being equal. This makes the default ChebyASM(2),(7,3,1) or ChebyASM(2),(9,5,1) preconditioner reasonably performant. However, in order to avoid the situation encountered in the 67 pebble case, SEMFEM is added to the considered search space. The parameters in Table 5 are chosen as they include optimal or nearoptimal preconditioner settings for the results in Figs. 6, 7, while still restricting the search space to something that is amenable to exhaustive search. The authors note, however, that this parameter space may not reflect the various factors affecting the performance of the preconditioners at especially large or on different machine architectures.
During the first timestep, our simple autotuner performs an exhaustive search over the small parameter space identified in Table 5. While this space is limited compared to the true search space, the resultant preconditioners selected are effective. In the 146 pebble case (Fig. 7a), ChebyRAS(2),(7,3,1) is identified as the preconditioner on each of the processor counts, which was comparable in peformance to the best preconditioner. The autotuner chose the optimal SEMFEM for the 1568 pebble case (Fig. 7b). SEMFEM was identified at the preconditioner for the 67 pebble case on all processor counts, which Fig. 7c confirms. In the Boeing speed bump case (Fig. 6), the autotuner chose ChebyRAS(2),(9,5,1) across all processor counts, which was either the optimal or nearoptimal preconditioner for the problem.
Note that the autotuner selects the fastest method at a fixedprocessor count (i.e., whatever the user has selected). Consider, for example, the 67 pebble case on GPUs (dashed black line, Fig. 7c). The goal of the autotuner is to select the preconditioner with lowest timetosolution along the userspecified isoprocessor count line. However, in the results discussed above, there was no change with respect to a change in the processor count. We note that our principal objective is not to squeeze out a few percent over a raft of good choices, but rather to ensure that a case does not run with an unfortunate set of parameters for which the performance is significantly substandard. This primitive autotuning technique proves effective at preventing the selection of highly suboptimal preconditioners. We have implemented this for production use and will continue to update and refine the strategy moving forward.
6 Conclusions and Future Work
In this work, we introduce Chebyshevaccelerated ASM and RAS smoothers for use in multigrid preconditioners for the spectral element Poisson problem. Further, we compare the performance of Schwarzbased smoothers, Chebyshevaccelerated Jacobi smoothing, and SEMFEMbased preconditioning for a suite of challenging test problems. We conclude that the Chebyshevaccelerated Schwarz smoothers with multigrid, as well as the loworder SEMFEM preconditioner solved with AmgX, are performant on GPU machines such as OLCF’s Summit. The authors propose a runtime autotuner preconditioner strategy that, while primitive, can choose reasonable solver parameters.
The authors plan on conducting similar parallel scalability studies on other machines, such as OLCF’s Spock, an AMD MI100 machine with similar hardware and software as the upcoming Frontier system. Additional preconditioner options, such as varying the AMG solver settings for the coarse grid solve as well as the solver location (CPU/GPU), varying the Chebyshev order and number of sweeps on each level of the multigrid heirarchy, and varying the Chebyshev eigenvalue bounds, are avenues for future performance optimization. Local Fourier analysis similar to that done by Thompson et al. in [thompson2021local] is further needed to understand the smoothing properties of the Chebyshevaccelerated Schwarz smoothers, allowing for robust optimization of parameters, as in the work by Brown et al. [brown_tuning_2021].
7 Acknowledgements
This research is supported by the Exascale Computing Project (17SC20SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative. This research also used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DEAC0500OR22725.
The authors thank YuHsiang Lan, Ramesh Balakrishnan, David Alan Reger, and Haomin Yuan for providing visualizations and mesh files. The authors thank the reviewers of this work for their insightful comments and suggestions.
Comments
There are no comments yet.