Tuning Spectral Element Preconditioners for Parallel Scalability on GPUs

The Poisson pressure solve resulting from the spectral element discretization of the incompressible Navier-Stokes equation requires fast, robust, and scalable preconditioning. In the current work, a parallel scaling study of Chebyshev-accelerated Schwarz and Jacobi preconditioning schemes is presented, with special focus on GPU architectures, such as OLCF's Summit. Convergence properties of the Chebyshev-accelerated schemes are compared with alternative methods, such as low-order preconditioners combined with algebraic multigrid. Performance and scalability results are presented for a variety of preconditioner and solver settings. The authors demonstrate that Chebyshev-accelerated-Schwarz methods provide a robust and effective smoothing strategy when using p-multigrid as a preconditioner in a Krylov-subspace projector. The variety of cases to be addressed, on a wide range of processor counts, suggests that performance can be enhanced by automated run-time selection of the preconditioner and associated parameters.



There are no comments yet.


page 5


NekRS, a GPU-Accelerated Spectral Element Navier-Stokes Solver

The development of NekRS, a GPU-oriented thermal-fluids simulation code ...

Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems

We present new results on the strong parallel scaling for the OpenACC-ac...

Low-order preconditioning of the Stokes equations

Low-order finite-element discretizations are well-known to provide effec...

A GPU Accelerated Discontinuous Galerkin Incompressible Flow Solver

We present a GPU-accelerated version of a high-order discontinuous Galer...

Parallel Performance of Algebraic Multigrid Domain Decomposition (AMG-DD)

Algebraic multigrid (AMG) is a widely used scalable solver and precondit...

GPU Acceleration of a High-Order Discontinuous Galerkin Incompressible Flow Solver

We present a GPU-accelerated version of a high-order discontinuous Galer...

Parallel Element-based Algebraic Multigrid for H(curl) and H(div) Problems Using the ParELAG Library

This paper presents the use of element-based algebraic multigrid (AMGe) ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In fluid flow simulations, the incompressibility constraint is often used to bypass fast acoustic waves and thereby allow the solution to evolve on a convective time-scale that is most relevant for many engineering problems. This model leads to a Poisson problem for the pressure that is invariably the stiffest substep in time-advancement of the Navier-Stokes equations. For large 3D problems, which mandate the use of iterative methods, the pressure solve thus typically encompasses the majority of the solution time.

For the spectral element (SE) discretization, the Poisson system matrix contains nonzeros for elements with a polynomial degree of (i.e., approximately

unknowns.) Through the use of tensor-product-sum factorization, however, the SE matrix-vector product can be effected in only

reads and operations, even in the case of complex geometries [parter_spectral_1979, deville_high-order_2002]. Consequently, the key to fast SE-based flow simulations is to find effective preconditioners tailored to this discretization. Many methods, including geometric -multigrid approaches with pointwise Jacobi and Chebyshev-accelerated Jacobi smoothers [sundar_comparison_2015, adams_parallel_2003, Kronbichler2019], geometric -multigrid with overlapping Schwarz smoothers [lottes_hybrid_2005, stiller_nonuniformly_2017, loisel_hybrid_2008], and preconditioning via low-order discretizations [parter_spectral_1979, olson_algebraic_2007, bello-maldonado_scalable_2019, canuto_finite-element_2010], have been considered.

In this work, we explore the parallel performance of these methods and extend the Schwarz-smoothing based -multigrid of [lottes_hybrid_2005] to support restrictive-additive Schwarz ideas of Cai and Sarkis [ras99] as well as Chebyshev-accelerated smoothing. In addition, we extend the low-order preconditioning strategy of [bello-maldonado_scalable_2019] to run on GPU architectures through the use of AmgX [naumov_amgx_2015]

as an algebraric multigrid (AMG) solver for the sparse system. Numerical results are shown for the SE-based pressure Poisson problem as well as for the Navier-Stokes equation. All methods considered are implemented by the authors in the scalable open-source CFD code, nekRS

[fischer_nekrs_2021]. nekRS started as a fork of libParanumal [libp] and uses highly optimized kernels based on the Open Concurrent Compute Abstraction (OCCA) [medina2014occa]. Special focus is given to performance and scalability on large-scale GPU-based platforms such as OLCF’s Summit.

The structure of this paper is as follows. Section 2 outlines the spectral element formulation for the Poisson problem in , as well as describe the various preconditioners considered. A brief description of several model probems of interest are presented in section 3. Numerical results are highlighted in section 4. Finally, a brief summary of the survey of solver techniques considered is provided in section 6.

2 Background and Implementation

We introduce here basic aspects of the SE Poisson discretization and associated preconditioners.

2.1 SE Poisson Discretization

Consider the Poisson equation in ,


Boundary conditions for the pressure Poisson equation are either periodic, Neumann, or Dirichlet, with the latter typically applicable only on a small subset of the domain boundary, , correspondong to an outflow condition. Consequently, Neumann conditions apply over the majority (or all) of the domain boundary, which makes the pressure problem more challenging than the standard all-Dirichlet case.

The SE discretization of (1) is based on the weak form: Find such that,


where is a finite-dimensional approximation comprising the basis functions used in the SE discretization, , , that vanish on and represents the discrete inner product based on Gauss-Lobatto-Legendre quadrature in the reference element, . The basis functions allow us to represent the solution, , as , leading to a linear system of unknown basis coefficients,


with respective mass- and stiffness-matrix entries, , and .

is tesellated into nonoverlapping hexahedral elements, , for , with isoparametric mappings from to provided by , for . Each is a th-order Lagrange cardinal polynomial on the Gauss-Lobatto-Legendre (GLL) qudrature points, . Similarly, the test and trial functions are written in local form as . Continuity is ensured across the interface between adjacent elements by enforcing when

. From this, a global-to-local degree-of-freedom mapping,

, can be represented by a Boolean matrix , such that . The assembled stiffness matrix is then , where comprises the local stiffness matrices, . Similarly, . The SE formulation uses coincident GLL quadrature and nodal points, such that is diagonal. Moreover, is never formed, as it would contain nonzeros in the general case. Rather, the tensor-product-sum factorization [parter_spectral_1979] allows for to be evaluated in time with storage, as described in detail in [deville_high-order_2002].

2.2 Preconditioners

In the current study, all preconditioners are applied in the context of restarted GMRES. Although is symmetric positive definite (SPD), many of the preconditioners are asymmetric. Further, a recent study [fischer_highly_2021] has shown the benefits of projection-based GMRES over flexible conjugate gradients (FCG) because the effectiveness of the overall pressure solution strategy (which includes projection onto prior solutions [fischer_projection_1998]) generally ensures a low enough iteration count, , such that the costs in GMRES are not overly onerous. We note, however, that FCG may yield very low iteration counts in certain cases (e.g., 1–2 iterations per step, as in [fischer_nekrs_2021]), in which case we use FCG rather than GMRES if it yields faster runtimes.

2.2.1 Semfem

In [parter_spectral_1979], Orszag suggested that constructing a sparse preconditioner based on the low-order discretizations with nodes coinciding with those of the high-order discretization would yield bounded condition numbers and, under certain constraints, can yield for second-order Dirichlet problems. This observation has led to the development of preconditioning techniques based on solving the resulting low-order system [olson_algebraic_2007, bello-maldonado_scalable_2019, canuto_finite-element_2010].

In the current work, we employ the same low-order discretization considered in [bello-maldonado_scalable_2019]. Each of the vertices of the hexahedral element is used to form one low-order, tetrahedral element, resulting in a total of eight low-order elements for each GLL sub-volume in each of the high-order hexahedral elements. This low-order discretization is then used to form the sparse operator, . The so-called weak preconditioner, , is used to precondition the system. Algebraric multigrid (AMG), implemented in CUDA in AmgX [naumov_amgx_2015], is used with the following setup to solve the low order system:

  • PMIS coarsening

  • 0.25 strength threshold

  • Extended + i interpolation (


  • Damped Jacobi relaxation ()

  • One V-cycle for preconditioning

  • Smoothing on the coarsest level

We denote this preconditioning strategy as SEMFEM.

2.2.2 -multigrid, Schwarz Smoothers

Another preconditioning strategy for the SE-based Poisson problem is to use geometric -multigrid (pMG). The classical single pass V-cycle is summarized in Algorithm 1. In our application, we limit pMG preconditioning to a single V-cycle pass. Alternative strategies, such as F- or W-cycle multigrid are not considered.

   // smooth
   // re-evaluate residual
   // coarsen
   // solve/re-apply V-cycle
   // prolongate
   // update solution
   // post smoothing
Algorithm 1 Single pass multigrid V-cycle
Figure 1: (a) Box-like approximation (b) overlapping domain

The SE-based additive Schwarz method (ASM) presented in [lottes_hybrid_2005, loisel_hybrid_2008] solves local Poisson problems on subdomains that are extensions of the spectral elements. The formal definition of the ASM preconditioner (or pMG smoother) is


where is the restriction matrix that extracts nodal values of the residual vector that correspond to each overlapping domain, as indicated in Fig. 1b. To improve the smoothing properties of the ASM, we introduce the diagonal weight matrix, , which scales each nodal value by the inverse of the number of subdomains that share that node. Although it compromises symmetry, post-multiplication by was found to yield superior results to pre- and post-multiplication by [stiller_nonuniformly_2017, lottes_hybrid_2005].

In a standard Galerkin ASM formulation, one would use , but such an approach would compromise the storage complexity of the SE method. To construct fast inverses for , we approximate each deformed element as a simple box-like geometry, as demonstrated in Fig. 1a. These boxes are then extended by a single degree-of-freedom in each spatial dimension to form overlapping subdomains with interior degrees-of-freedom in each domain. (See Fig. 1). The approximate box domain enables the use of the fast diagonalization method (FDM) to solve for each of the overlapping subdomains, which can be applied in time in [lottes_hybrid_2005]. The extended-box Poisson operator is separable, with the 3D form


where each represents the extended 1D mass-stiffness matrix pairs along the given dimension [lottes_hybrid_2005]. The FDM begins with a preprocessing step of solving a series of small,

, generalized eigenvalue problems,


and defining and , to yield the similarity transforms


From these, the inverse of the local Schwarz operator is




is a diagonal matrix. This process is repeated for each element, at each multigrid level save for the coarsest one. Note that the per-element storage is only for the matrices and for . At each multigrid level, the local subdomain solves are used as a smoother. On the coarsest level (), however, BoomerAMG [henson_boomeramg_2002] is used to solve the system with the same parameters as the AmgX solver in section 2.2.1, except using Chebyshev smoothing. Unless otherwise noted, all pMG preconditioners use a single BoomerAMG V-cycle iteration in the coarse-grid solve.

Presently, we also consider a restrictive additive Schwarz (RAS) version of (4), wherein overlapping values are not added after the action of the local FDM solve, following [ras99]. RAS has the added benefit of reducing the amount of communication required in the smoother. (Formally, RAS can be implemented by simply changing .) Note that, in the case of ASM and RAS smoothers without Chebyshev-acceleration, an additive V-cycle with no post-smoothing is used to avoid residual re-evaluation in Algorithm 1 [fischer04].

2.2.3 Chebyshev Acceleration

A notable improvement over standard Jacobi-smoothed multigrid is to use Chebyshev-acceleration [sundar_comparison_2015, adams_parallel_2003], as described in Algorithm 2 for a given surrogate smoother, . While is typically based on Jacobi smoothing (e.g.,[Kronbichler2019]), it is also possible to consider the use of overlapping-Schwarz (4) as the smoother, which is a new approach that we explore here.

  , , ,
  , ,
  for  do
  end for
Algorithm 2 Chebyshev smoother

Algorithm 2 requires approximate spectral bounds, , of the smoothed operator, . Using 10 rounds of Arnoldi iteration, we generate as an initial proxy for . The bounds employed in Algorithm 2 are then determined through sensitivity analysis similar to that performed by Adams and coworkers [adams_parallel_2003]. The analysis is conducted using a challenging model problem, namely the Kershaw case of subsection 3.1, with , , and relative (-norm) residual tolerance of . A second-order Chebyshev-accelerated ASM smoother is used in the pMG V-cycle preconditioner, which we denote as the current default preconditioner in nekRS.

The results of the sensitivity study are shown in Table 1. The required iteration counts are seen to be sensitive to underestimation of , as also found by Adams and coworkers [adams_parallel_2003]

. On the other hand, results are less sensitive to the minimum eigenvalue estimate, provided

, as this delimits between the low frequencies that are handled by the coarse grid correction and the high frequencies that are eliminated by the action of the smoother. The choice of the bounds (relative to ) is not universal: (1/30,1.1) [adams_parallel_2003], (0.3,1) [baker_multigrid_2011], (0.25,1) [sundar_comparison_2015], and (1/6,1) [zhukov_multigrid_2015] have all been considered. Based on the results of Table 1, we set as a conservative choice.

- - - - - -
- - - 110 64 48
- - - 50 40 38
- - 124 40 38 38
159 45 43 42 43 44
47 45 44 44 45 46
Table 1: Iteration counts for the () sensitivity study. Omitted entries failed to converge in 1000 iterations.

3 Test Cases

We describe four model problems that are used to test the SE preconditioners. The first is a stand-alone Poisson solve, using variations of the Kershaw mesh. The others are modest-scale Navier-Stokes problems, where the pressure Poisson problem is solved over multiple timesteps. The problem sizes are listed in Table 2 and range from relatively small (=21M points) to moderately large (=645M).111Larger cases for recent full-scale runs on Summit with =51B are reported in [fischer_highly_2021].

3.1 Poisson

The Kershaw family of meshes [kolev_ceed_2021, kershaw_differencing_1981]. has been proposed as the basis for a high-order Poisson-solver benchmark by Center for Efficient Exascale Discretization (CEED) within the DOE Exascale Computing Project (ECP). This family is parametrized by an anisotropy measure, , that determines the degree of deformation in the and directions. As decreases, the mesh deformation and aspect ratio increase along with it. The Kershaw mesh is shown in Fig. 2 for several values of . The domain with Dirichlet boundary conditions on . The right hand side for (1) is set to


where is a random, continous vector vanishing on . The linear solver terminates after reaching a relative residual reduction of . Since this test case solves the Poisson equation, there is no timestepper needed for the model problem.

3.2 Navier-Stokes

Figure 2: Kershaw, . .
Case Name
146 pebble (Fig. 3a) 62K 7 21M
1568 pebble (Fig. 3b) 524K 7 180M
67 pebble (Fig. 3c) 122K 7 42M
Speed bump (Fig. 3d) 885K 9 645M
Table 2: Problem discretization parameters.
Figure 3: Navier-Stokes cases: pebble-beds with (a) 146, (b) 1568, and (c) 67 spheres; (d) Boeing speed bump.

For the pressure-Poisson tests, four flow cases are considered, as depicted in Fig. 3. The first three cases corresponds to turbulent flow through a cylindrical packed-bed with 146, 1568, and 67 spherical pebbles. The 146 and 1568 pebble cases are from Lan and coworkers [lan_all-hex_2021]. The 67 pebble case is constructed using a tet-to-hex meshing strategy by Yuan and coworkers [yuan2020spectral]. The first two bed flows are at Reynolds number , based on sphere diameter, , while the 67 pebble case is at Reynolds number . Time advancement is based on a two-stage 2nd-order characteristics timestepper with CFL=4 ( , and for the 146, 1568, and 67 pebble cases). An absolute pressure solver tolerance of is used. A restart at , , and convective time units is used for the 146, 1568, and 67 pebble cases, respectively, to provide an initially turbulent flow.

The fourth case, shown in Fig. 3d, is a direct numerical simulation (DNS) of seperated turbulent flow over a speed bump at . This test case was designed by Boeing to provide a flow that exhibits separation. A DNS of the full 3D geometry, however, remains difficult [shur_direct_2021]. Therefore, this smaller example proves a useful application for benchmarking solver performance. This case uses a 2nd-order timestepper with CFL=0.8 () and an absolute pressure-solve tolerance of . A restart at convective time units is used for the initial condition.

In all cases, solver results are collected over 2000 timesteps. At each step, the solution is projected onto a space of up to 10 prior solution vectors to generate a high-quality initial guess, . Projection is standard practice in nekRS as it can reduce the initial residual by orders of magnitude at the cost of just one or two matrix-vector products in per step [fischer_projection_1998].

The perturbation solution, , is typically devoid of slowly evolving low wave-number content. Moreover, the initial residual is oftentimes sufficiently small that the solution converges in iterations, such that the overhead of GMRES is small. Testing the preconditioners under these conditions ensures that the conclusions drawn are relevant to the application space.

4 Results

Here we consider the solver performance results for the test cases of Section 3. We assign a single MPI rank to each GPU and denote the number of ranks as . All runs are on Summit. Each node on Summit consists of 42 IBM Power9 CPUs and 6 NVIDIA V100 GPUs. We use 6 GPUs per node unless . In the following, we denote a pMG preconditioner using -order Chebyshev-accelerated smoother with a multigrid schedule of as Cheby(),. A wide range of preconditioning strategies is considered.

4.1 Kershaw Mesh

The Kershaw study comprises six tests. For each of two studies, we consider the regular box case (

), a moderately skewed case (

), and a highly skewed case (). The first study is a standard weak-scale test, where and are increased, while the polynomial order is fixed at and the number of gridpoints per GPU is set to . The range of processors is =6 to 384. The second study is a test of the influence of polynomial order on conditioning, with and fixed, while ranges from 3 to 10. Both cases use GMRES(20).

The results of the weak-scaling study are shown in Fig. 4. For all values of , the iteration count exhibits a dependence on problem size, as seen in Fig. 4a,d,g, especially in the highly skewed case (). The time-per-solve also increases with , in part due to the increase in iteration count, but also due to increased communication overhead as increases. This trend is not necessarily monotonic, as shown in Fig. 4c,f. In the case of Cheby-RAS(2),(7,3,1), a minor fluctuation in the iteration count from the to case causes the greater than unity parallel efficiency. For Cheby-Jac(2),(7,5,3,1), the effects of system noise on the run causes the greater than unity parallel efficiency for the run.

Table 3 indicates the maximum number of of neighboring processors for the assembly () graph of , which increases with . In addition, the number of grid points per GPU () is relatively low. These two factors cause an increased sensitivity of the problem to the additional communication overhead as the number of GPUs is initially increased. The number of neighbors, however, will saturate at larger (i.e., production-level) processor counts.

Lastly, the relative preconditioner performance depends on . Fig. 4b, e, h show that, for the easy case, a pMG scheme with a smoother that is cheap to apply is best, such as Cheby-RAS(1),(7,3,1), Cheby-Jac(2),(7,5,3,1), and ASM,(7,3,1). However, as decreases, more robust pMG smoothers such as Cheby-RAS(2),(7,3,1) and SEMFEM result in lower time to solution. Once (Fig. 4h), the problem is sufficiently challenging that SEMFEM overtakes the pMG based preconditioning schemes. This indicates that, in the highly skewed case in which the maximum element aspect ratio increases, the pMG preconditioner is not as effective as the SEMFEM preconditioner.

Figure 4: Kershaw weak scaling, GMRES(20). .
Figure 5: Kershaw order dependence, GMRES(20). .
Nodes 1 2 4 8 16 32 64
Max Neighbors 5 11 16 24 20 24 29
Table 3: Max neighbors, Kershaw.

The influence of polynomial order is illustrated in Fig. 5. For , iteration counts are essentially -independent, as seen in Fig. 5a. For , however, a slight upward trend in the iteration count is observed for SEMFEM and for the pMG preconditioners with ASM, RAS, and Chebyshev-Jacobi smoothing (Fig. 5b). Similarly, exhibits a dependence between the iteration count and the polynomial order for all preconditioners (Fig. 5c). Figures 5d, e and f demonstrate that the time per pressure solve is strongly dependent polynomial order. For SEMFEM, there is an increase in time as a result of increased overhead for the underlying AMG solver. For the pMG-based methods, higher orders are generally faster. This performance gain can be attributed to surface-to-volume effects in the evaluation of the operators and smoothers. Application of and (smoother) is highly vectorizable, whereas application of (assembly) involves a significant amount of indirect addressing. The discrete surface to volume ratio for a spectral element of order is 296/512 60%. For lower this value is larger. This situation is exacerbated in the case of Schwarz-based smoothers, where the overlap contributes substantially to the work and communication for the small-element (i.e., low-) cases. In addition, nekRS [fischer_nekrs_2021] uses a single element per thread block, which limits the amount of work available for a streaming multiprocessor for relatively small polynomial orders.

4.2 Navier-Stokes Results

We consider scalability of nekRS for the cases of Fig. 3. All simulations except one use GMRES(15) with an initial guess generated by -conjugate projection onto 10 prior solutions [fischer_projection_1998]. Due to memory constraints, the 1568-pebble case with uses GMRES(10) with only 5 solution-projection vectors. For each case, two pMG schedules are considered: and for ; and and for . Other parameters, such as the Chebyshev order and the number of coarse grid BoomerAMG V-cycles are also varied. Results are shown in Figs. 6,7. The plots relate the effective work rate per node, measured as the gridpoints (as shown in Table 2) solved per second per node, to the time-to-solution. The y-axis notes the drop in the relative work rate, which corresponds to a lower parallel efficiency, as the strong scale limit is reached, while the x-axis denotes the time-to-solution. Each node consists of 6 GPUs, hence .

In all the performance tests conducted, the pMG preconditioner with Chebyshev-Jacobi smoothing is outperformed by the other preconditioners, whether using one or two V-cycle iterations in the AMG coarse-grid solve. For each case, the fastest preconditioner scheme varies. In the 146 pebble case (Fig. 7a), using Cheby-RAS(2),(7,5,3,1) yields the smallest time per pressure solve. However, in the 1568 pebble case (Fig. 7b), SEMFEM is a moderate improvement over the second best preconditioner, Cheby-ASM(2),(7,5,3,1). pMG with a schedule and Chebyshev-RAS (of any order) yield the best scalability and lowest time per pressure solve for the Boeing speed bump case (Fig. 6). The Chebyshev-accelerated Schwarz schemes are not always the fastest, however. For the 67 pebble case (Fig. 7c), Cheby-Jac(2),(7,5,3,1) is comparable to Cheby-RAS(2),(7,5,3,1) and are the two fastest pMG based preconditioners. However, SEMFEM is significantly faster than the other preconditioners for this case.

Also considered is a hyrbid two-level pMG/SEMFEM approach wherein SEMFEM is used as the solver for the coarse level. For , a schedule with 2nd order Chebyshev-accelerated ASM smoothing on the level and SEMFEM solver on the level, denoted as Cheby-ASM(2),(7,6) + SEMFEM, is used. Similarly, Cheb-ASM(2),(7,5) + SEMFEM and Cheb-ASM(2),(7,3) + SEMFEM are considered. For , a , , , and hybrid pMG/SEMFEM approach is considered, denoted as Cheb-ASM(2),(9,8) + SEMFEM, Cheb-ASM(2),(9,7) + SEMFEM, Cheb-ASM(2),(9,5) + SEMFEM, and Cheb-ASM(2),(9,3) + SEMFEM, respectively. In the pebble cases shown in Fig. 7a,c, this hybrid approach performs somewhere between the SEMFEM and Cheby-ASM(2),(7,3,1) preconditioners. In the Boeing speed bump case, however, Fig. 6 demonstrates that this approach is not as performant as either the SEMFEM or Cheby-ASM(2),(9,5,1) preconditioners.

Solver performance degrades whenever is sufficiently small, regardless of the solver considered. For the various preconditioners considered, at 2 nodes (), the strong-scale limit of 80% efficiency is far surpassed in the 146 pebble case. This leaves the effective strong scale limit at 1-2 nodes ( to ). For the 1568 pebble case, 12 nodes () yields a parallel efficiency around 60-70%, depending on the specific solver. The 67 pebble cases reaches 70% effieciency on 3 nodes (). The parallel efficiency for the Boeing speed bump case for the fastest time-to-solution preconditioners drops below 70% when using more than 48 nodes ().

Figure 6: Strong scaling results on Summit for the Navier-Stokes case of Fig. 3d.
Figure 7: Strong scaling results on Summit for the Navier-Stokes cases of Fig. 3a,b,c. Iso-processor count line illustrated in (c). A user running on a specified number of processors should use the lowest time-to-solution preconditioner along this line.
Figure 8: Same as Fig. 7c, pMG preconditioners only.

While the effect of mesh quality metrics, such as the max aspect ratio and scaled Jacobian, on solver convergence has been studied by Mittal et al. [mittal2019mesh], predicting the optimal preconditioner settings from mesh quality metrics is not obvious. While the maximum aspect ratio is an important metric for mesh quality, it alone cannot explain the apparent poor performance of the pMG preconditioners in the 67 pebble case, Fig. 7c. Consider, for example, that the Boeing speed bump case has a larger maximum aspect ratio (255) than the 67 pebble case (204), but does not exhibit this poor performance in the pMG preconditioners. The minimum scaled Jacobian, however, is at least an order of magnitude smaller in the 67 pebble case () as compared to the other cases (e.g., .996 for the Boeing speed bump case, and for the 146, 1568 pebble cases, respectively). This may, in turn, explain why the pMG preconditioners in the 67 pebble case were signficiantly suboptimal compared to the SEMFEM preconditioner. This demonstrates, however, that a user cannot simply rely on, e.g., the maximum aspect ratio when deciding whether or not to use SEMFEM as a preconditioner. This inability to correctly identify preconditioner settings based on mesh quality metrics alone motivates the introduction of an auto-tuner to choose preconditioner parameters during runtime.

Case Name GLL Spacing (min/max) Scaled Jacobian (min/max/avg) Aspect Ratio (min/max/avg)
146 pebble / .32 / .977 / .419 1.07 / 56.9 / 7.14
1568 pebble / .3 / .99 / .371 1.12 / 108 / 12.6
67 pebble / .145 / .970 / .38 1.17 / 204 / 13.2
Boeing Speed Bump / .996 / 1 / 0.999 6.25 / 255 / 28.1
Table 4: Mesh quality metrics for cases from Fig. 3.

5 Auto-tuning for Production Simulations

The results of the preceding section provides a small window into the varieties of performance behavior encountered in actual production cases, which span a large range of problem sizes, domain topologies, mesh qualities. Moreover, production simulations are solved across a range of architectures having varying on-node and network performance, interconnect topologies, and processor counts. One frequently encounters situations where certain communication patterns might be slower under a particular MPI version on one platform versus another. Unless a developer has access to that platform, it’s difficult to measure and quantify the communication overhead. Processor count alone can be a major factor in preconditioner selection: large processor counts have relatively high coarse-grid solve costs that can be mitigated by doing more smoothing at the fine and intermediate levels. How much more is the open question. The enormity of parameter space, particularly “in the field” (i.e., users working on unknown platforms) limits the effectiveness of standard complexity analysis in selecting the optimal preconditioner for a given user’s application.

From the user’s perspective, there is only one application (at a time, typically), and one processor count of interest. Being able to provide optimized performance—tuned to the application at hand, which includes the processor count—is thus of paramount importance. Auto-tuning provides an effective way to deliver this performance. Auto-tuning of preconditioners has been considered in early work by Imakura at al. [imakura_auto-tuning_2012] and more recently by Yamada at al. [yamada2018preconditioner] and by Brown et al. [brown_tuning_2021]. The latter work couples their auto-tuning with local Fourier analysis to guide the tuning process.

Solver preconditioned GMRES
Preconditioner pMG, SEMFEM
     pMG : Cheby-ASM(2),(7,3,1), Cheby-RAS(2), (7,3,1), Cheby-Jac(2), (7,5,3,1)
: Cheby-ASM(2),(9,5,1), Cheby-RAS(2), (9,5,1), Cheby-Jac(2), (9,7,5,1)
          Coarse grid single boomerAMG V-cycle
     SEMFEM single AmgX V-cycle
Table 5: Solver parameter space considered in the auto-tuner. A pMG preconditioner using an -order Chebyshev-accelerated smoother with a multigrid schedule of is denoted as Cheby-(),.

In large-scale fluid mechanics applications, auto-tuning overhead is typically amortized over timesteps (i.e., pressure solves) per run (and more, over an entire simulation campaign). Moreover, auto-tuning is of particular importance for problems at large processor counts because these cases often have long queue times, which preclude making multiple job submissions in order to tweak parameter settings. Failure to optimize, however, can result in significant opportunity costs. For example, in [fischer_highly_2021] the authors realized a factor of 2.8 speedup in time-per-step for a 352,000-pebble-bed simulation (=51B gridpoints) through a sequence of tuning steps. Even if there were 100 configurations in the preconditioner parameter space, an auto-tuner could visit each of these in succession 5 times each within the first 500 steps and have expended only a modest increment in overhead when compared to the cost of submitting many jobs (for tuning) or to the cost of 10,000 steps for a production run.

As a preliminary step, the authors consider constructing a small subset of the true search space (which could include, e.g., other cycles, schedules, or smoothers) for use in a nascent auto-tuner. Across all cases, with exception to the 67 pebble case, pMG preconditioning with Chebyshev-accelerated ASM or RAS smoothing is the fastest solver or is comparable to SEMFEM. The choice of Chebyshev order and multigrid schedule, moreover, contributes only a modest 10-20% improvement to the overall time-to-solution in most cases, all else being equal. This makes the default Cheby-ASM(2),(7,3,1) or Cheby-ASM(2),(9,5,1) preconditioner reasonably performant. However, in order to avoid the situation encountered in the 67 pebble case, SEMFEM is added to the considered search space. The parameters in Table 5 are chosen as they include optimal or near-optimal preconditioner settings for the results in Figs. 6, 7, while still restricting the search space to something that is amenable to exhaustive search. The authors note, however, that this parameter space may not reflect the various factors affecting the performance of the preconditioners at especially large or on different machine architectures.

During the first timestep, our simple auto-tuner performs an exhaustive search over the small parameter space identified in Table 5. While this space is limited compared to the true search space, the resultant preconditioners selected are effective. In the 146 pebble case (Fig. 7a), Cheby-RAS(2),(7,3,1) is identified as the preconditioner on each of the processor counts, which was comparable in peformance to the best preconditioner. The auto-tuner chose the optimal SEMFEM for the 1568 pebble case (Fig. 7b). SEMFEM was identified at the preconditioner for the 67 pebble case on all processor counts, which Fig. 7c confirms. In the Boeing speed bump case (Fig. 6), the auto-tuner chose Cheby-RAS(2),(9,5,1) across all processor counts, which was either the optimal or near-optimal preconditioner for the problem.

Note that the auto-tuner selects the fastest method at a fixed-processor count (i.e., whatever the user has selected). Consider, for example, the 67 pebble case on GPUs (dashed black line, Fig. 7c). The goal of the auto-tuner is to select the preconditioner with lowest time-to-solution along the user-specified iso-processor count line. However, in the results discussed above, there was no change with respect to a change in the processor count. We note that our principal objective is not to squeeze out a few percent over a raft of good choices, but rather to ensure that a case does not run with an unfortunate set of parameters for which the performance is significantly substandard. This primitive auto-tuning technique proves effective at preventing the selection of highly suboptimal preconditioners. We have implemented this for production use and will continue to update and refine the strategy moving forward.

6 Conclusions and Future Work

In this work, we introduce Chebyshev-accelerated ASM and RAS smoothers for use in -multigrid preconditioners for the spectral element Poisson problem. Further, we compare the performance of Schwarz-based smoothers, Chebyshev-accelerated Jacobi smoothing, and SEMFEM-based preconditioning for a suite of challenging test problems. We conclude that the Chebyshev-accelerated Schwarz smoothers with -multigrid, as well as the low-order SEMFEM preconditioner solved with AmgX, are performant on GPU machines such as OLCF’s Summit. The authors propose a runtime auto-tuner preconditioner strategy that, while primitive, can choose reasonable solver parameters.

The authors plan on conducting similar parallel scalability studies on other machines, such as OLCF’s Spock, an AMD MI100 machine with similar hardware and software as the upcoming Frontier system. Additional preconditioner options, such as varying the AMG solver settings for the coarse grid solve as well as the solver location (CPU/GPU), varying the Chebyshev order and number of sweeps on each level of the multigrid heirarchy, and varying the Chebyshev eigenvalue bounds, are avenues for future performance optimization. Local Fourier analysis similar to that done by Thompson et al. in [thompson2021local] is further needed to understand the smoothing properties of the Chebyshev-accelerated Schwarz smoothers, allowing for robust optimization of parameters, as in the work by Brown et al. [brown_tuning_2021].

7 Acknowledgements

This research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative. This research also used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC05-00OR22725.

The authors thank YuHsiang Lan, Ramesh Balakrishnan, David Alan Reger, and Haomin Yuan for providing visualizations and mesh files. The authors thank the reviewers of this work for their insightful comments and suggestions.