In fluid flow simulations, the incompressibility constraint is often used to bypass fast acoustic waves and thereby allow the solution to evolve on a convective time-scale that is most relevant for many engineering problems. This model leads to a Poisson problem for the pressure that is invariably the stiffest substep in time-advancement of the Navier-Stokes equations. For large 3D problems, which mandate the use of iterative methods, the pressure solve thus typically encompasses the majority of the solution time.
For the spectral element (SE) discretization, the Poisson system matrix contains nonzeros for elements with a polynomial degree of (i.e., approximatelyreads and operations, even in the case of complex geometries [parter_spectral_1979, deville_high-order_2002]. Consequently, the key to fast SE-based flow simulations is to find effective preconditioners tailored to this discretization. Many methods, including geometric -multigrid approaches with pointwise Jacobi and Chebyshev-accelerated Jacobi smoothers [sundar_comparison_2015, adams_parallel_2003, Kronbichler2019], geometric -multigrid with overlapping Schwarz smoothers [lottes_hybrid_2005, stiller_nonuniformly_2017, loisel_hybrid_2008], and preconditioning via low-order discretizations [parter_spectral_1979, olson_algebraic_2007, bello-maldonado_scalable_2019, canuto_finite-element_2010], have been considered.
In this work, we explore the parallel performance of these methods and extend the Schwarz-smoothing based -multigrid of [lottes_hybrid_2005] to support restrictive-additive Schwarz ideas of Cai and Sarkis [ras99] as well as Chebyshev-accelerated smoothing. In addition, we extend the low-order preconditioning strategy of [bello-maldonado_scalable_2019] to run on GPU architectures through the use of AmgX [naumov_amgx_2015]
as an algebraric multigrid (AMG) solver for the sparse system. Numerical results are shown for the SE-based pressure Poisson problem as well as for the Navier-Stokes equation. All methods considered are implemented by the authors in the scalable open-source CFD code, nekRS[fischer_nekrs_2021]. Special focus is given to performance and scalability on large-scale GPU-based platforms such as OLCF’s Summit. A significant contribution is the introduction of a poly-algorithmic auto-tuning strategy that is designed to ensure that users realize optimized performance over a broad range of problem formulations, archictectures, and processor counts.
The structure of this paper is as follows. Section 2 outlines the spectral element formulation for the Poisson problem in , as well as describe the various preconditioners considered. A brief description of several model probems of interest are presented in section 3. Numerical results are highlighted in section 4. Finally, a brief summary of the survey of solver techniques considered is provided in section 5.
2 Background and Implementation
We introduce here basic aspects of the SE Poisson discretization and associated preconditioners along with an important autotuning strategy.
2.1 SE Poisson Discretization
Consider the Poisson equation in ,
Boundary conditions for the pressure Poisson equation are either periodic, Neumann, or Dirichlet, with the latter typically applicable only on a small subset of the domain boundary, , correspondong to an outflow condition. Consequently, Neumann conditions apply over the majority (or all) of the domain boundary, which makes the pressure problem more challenging than the standard all-Dirichlet case.
The SE discretization of (1) is based on the weak form, Find such that,
where is a finite-dimensional approximation comprising the basis functions used in the SE discretization, , , that vanish on and represents the discrete inner product based on Gauss-Lobatto-Legendre quadrature in the reference element, . The basis functions allow us to represent the solution, , as , leading to a linear system of unknown basis coefficients,
with respective mass- and stiffness-matrix entries, , and .
is tesellated into nonoverlapping hexahdedral elements, , for , with isoparametric mappings from to provided by , for . Each is a th-order Lagrange cardinal polynomial on the Gauss-Lobatto-Legendre (GLL) qudrature points, . Similarly, the test and trial functions are written in local form as . Continuity is ensured across the interface between adjacent elements by enforcing when
. From this, a global-to-local degree-of-freedom mapping,, can be represented by a Boolean matrix , such that . The assembled stiffness matrix is then , where comprises the local stiffness matrices, . Similarly, . The SE formulation uses coincident GLL quadrature and nodal points, such that is diagonal. Moreover, is never formed, as it would contain nonzeros in the general case. Rather, the tensor-product-sum factorization [parter_spectral_1979] allows for to be evaluated in time with storage, as described in detail in [deville_high-order_2002].
In the current study, all preconditioners are applied in the context of restarted GMRES. Although is symmetric positive definite (SPD), many of the preconditioners are asymmetric. Further, a recent study [fischer_highly_2021] has shown the benefits of projection-based GMRES over flexible conjugate gradients because the effectiveness of the overall pressure solution strategy (which includes projection onto prior solutions [fischer_projection_1998]) generally ensures a low enough iteration count, , such that the costs in GMRES are not overly onerous.
In [parter_spectral_1979], Orszag suggested that constructing a sparse preconditioner based on the low-order discretizations with nodes coinciding with those of the high-order discretization would yield bounded condition numbers and, under certain constraints, can yield for second-order Dirichlet problems. This observation has led to the development of preconditioning techniques based on solving the resulting low order system [olson_algebraic_2007, bello-maldonado_scalable_2019, canuto_finite-element_2010].
In the current work, we employ the same low-order discretization
considered in [bello-maldonado_scalable_2019]. Each of the vertices of
the hexahedral element is used to form one low-order, tetrahedral element,
resulting in a total of eight low-order elements for each GLL sub-volume
in each of the high-order hexahedral elements.
This low-order discretization is then used to form the sparse operator, .
The so-called weak preconditioner, , is used to precondition the system.
Algebraric multigrid (AMG), implemented in CUDA in AmgX [naumov_amgx_2015],
is used with the following setup to solve the low order system:
0.25 strength threshold
Extended + i interpolation ()
Damped Jacobi relaxation ()
One V-cycle for preconditioning
Smoothing on the coarsest level
We denote this preconditioning strategy as SEMFEM.
2.2.2 -multigrid, Schwarz Smoothers
Another preconditioning strategy for the SE-based Poisson problem is to use geometric -multigrid (pMG). The classical single pass V-cycle is summarized in Algorithm 1. In our application, we limit pMG preconditioning to a single V-cycle pass. Alternative strategies, such as F- or W-cycle multigrid are not considered.
The SE-based additive Schwarz method (ASM) presented in [lottes_hybrid_2005, loisel_hybrid_2008] solves local Poisson problems on subdomains that are extensions of the spectral elements. The formal definition of the ASM preconditioner (or pMG smoother) is
where is the restriction matrix that extracts nodal values of the residual vector that correspond to each overlapping domain, as indicated in Fig. 1b. To improve the smoothing properties of the ASM, we introduce the diagonal weight matrix, , which scales each nodal value by the inverse of the number of subdomains that share that node. Although it compromises symmetry, post-multiplication by was found to yield superior results to pre- and post-multiplication by [stiller_nonuniformly_2017, lottes_hybrid_2005].
In a standard Galerkin ASM formulation, one would use , but such an approach would compromise the storage complexity of the SE method. To construct fast inverses for , we approximate each deformed element as a simple box-like geometry, as demonstrated in Fig. 1a. These boxes are then extended by a single degree-of-freedom in each spatial dimension to form overlapping subdomains with interior degrees-of-freedom in each domain. (See Fig. 1). The approximate box domain enables the use of the fast diagonalization method (FDM) to solve for each of the overlapping subdomains, which can be applied in time in [lottes_hybrid_2005]. The extended-box Poisson operator is separable, with the 3D form
where each represents the extended 1D mass-stiffness matrix pairs along the given dimension [lottes_hybrid_2005]. The FDM begins with a preprocessing step of solving a series of small,
, generalized eigenvalue problems,
and defining and , to yield the similarity transforms
From these, the inverse of the local Schwarz operator is
is a diagonal matrix. This process is repeated for each element, at each multigrid level save for coarsest one. Note that the per-element storage is only for the matrices and for . At each multigrid level, the local subdomain solves are used as a smoother. On the coarsest level (), however, BoomerAMG [henson_boomeramg_2002] is used to solve the system with the same parameters as the AmgX solver in section 2.2.1, except using Chebyshev smoothing. Unless otherwise noted, all pMG preconditioners use a single BoomerAMG V-cycle iteration in the coarse-grid solve.
Presently, we also consider a restrictive additive Schwarz (RAS) version of (4), wherein overlapping values are not added after the action of the local FDM solve, following [ras99]. RAS has the added benefit of reducing the amount of communication required in the smoother. (Formally, RAS can be implemented by simply changing .) Note that, in the case of ASM and RAS smoothers without Chebyshev-acceleration, an additive V-cycle with no post-smoothing is used to avoid residual re-evaluation in Algorithm 1 [fischer04].
2.2.3 Chebyshev Acceleration
A notable improvement over standard Jacobi-smoothed multigrid is to use Chebyshev-acceleration [sundar_comparison_2015, adams_parallel_2003], as described in Algorithm 2 for a given surrogate smoother, . While is typically based on Jacobi smoothing (e.g.,[Kronbichler2019]), it is also possible to consider the use of overlapping-Schwarz (4) as the smoother, which is a new approach that we explore here.
Algorithm 2 requires approximate spectral bounds, , of the smoothed operator, . Using 10 rounds of Arnoldi iteration, we generate as an initial proxy for . The bounds employed in Algorithm 2 are then determined through sensitivity analysis similar to that performed by Adams and coworkers [adams_parallel_2003]. The analysis is conducted using a challenging model problem, namely the Kershaw case of subsection 3.2, with , , and relative (-norm) residual tolerance of . A second-order Chebyshev-accelerated ASM smoother is used in the pMG V-cycle preconditioner, which we denote as the current default preconditioner in nekRS.
The results of the sensitivity study are shown in Table 1. The required iteration counts are seen to be sensitive to underestimation of , as also found by Adams and coworkers [adams_parallel_2003]
. On the other hand, results are less sensitive to the minimum eigenvalue estimate, provided, as this delimits between the low frequencies that are handled by the coarse grid correction and the high frequencies that are eliminated by the action of the smoother. The choice of the bounds (relative to ) is not universal: (1/30,1.1) [adams_parallel_2003], (0.3,1) [baker_multigrid_2011], (0.25,1) [sundar_comparison_2015], and (1/6,1) [zhukov_multigrid_2015] have all been considered. Based on the results of Table 1, we set as a conservative choice. Lastly, we denote a pMG preconditioner using -order Chebyshev-accelerated smoother with a multigrid schedule of as Cheb-(), for the remainder of this work.
2.3 On-line Preconditioner Tuner
For any given preconditioner, iteration counts can vary dramatically across the range of fluid dynamics applications and computational meshes used with nekRS. Moreover, even for a fixed problem, the optimal choice of parameters for minimum-time-to-solution can vary with the number of processors, , because the coarse-grid solve costs can dominate the solution for large . In such cases, additional smoothing, which reduces the number of -cycles and thus the number of coarse-grid solves, can be beneficial. Identifying an optimal pMG strategy can be a daunting and expensive task even for sophisticated users, particularly if it requires queuing multiple large jobs, where such optimization is most warranted. Further, as shown in Table 2, the space of solver parameters is large.
Consequently, it makes sense to automate the tuning process, so that the pMG parameters are selected in context, at the correct processor count. We have developed an autotuner where the user can prescribe when the tuning should take place (e.g., on step 1000 or every 500 steps)111In turbulent flow simulations, it may take several hundred or thousand timesteps before the solution is rich in high wavenumber content; tuning too early might lead to suboptimal parameters.. Two tuning strategies are currently available: either an exhaustive search at the prescribed step(s) or a set of exhaustive searches over multiple timesteps. The former ensures that all trials solve the same problem, collecting several trials of each pMG preconditioner for the same step to mitigate the effects of system noise. The latter performs the exhaustive search over multiple timesteps, minimizing the cost,
where and denote the iterations and time per iteration for preconditioner at sample . This approach seeks to minimize the effects of system noise by considering the minimum time per iteration for a candidate method, while also taking into account the convergence rate of the method. The use of projection during the exhaustive search, moreover, is also considered. Once determined, the optimal parameters are used through the rest of the simulation and are saved to a file so that subsequent restarts or simulations in the same class can leverage the results of the initial autotuning.
In the sequel, we consider the on-line tuner with Jacobi, ASM, and RAS as candidate smoothers for Chebyshev-accelerated pMG-based preconditioning. The Chebyshev order is varied between one and three. An exhaustive search is performed over this parameter space in order to minimize the pressure-solve time-to-solution. To mitigate the effects of system noise on the results, three trials of each pMG preconditioner are performed over the search space. We note that our principal objective is not to squeeze out a few percent over a raft of good choices, but rather to ensure that a case does not run with an unfortunate set of parameters for which the performance is significantly substandard.
We further remark that the overall solver space is vast—we list many possible considerations in Table 2—and that the subset chosen here, while based on significant practical experience, is just a first step towards comprehensive coverage for any problem on any architecture.
|Krylov Solver||flexible CG, GMRES|
|GMRES||restart size, classical/modified Gram-Schmidt|
|pMG||cycle type, schedule, smoother, coarse grid solver, coarse grid discretization|
|Schedule||all levels, by halfs, two-level,|
|Smoother||Gauss-Seidel, Jacobi, ASM, RAS, Chebyshev-acceleration|
|ASM/RAS||amount of overlap, local subdomain solve techniques (FDM), weights|
|Chebyshev-acceleration||Order, spectral bounds|
|Coarse grid||direct solve, AMG, smooth|
|Coarse grid discretization||Galerkin, non-Galerkin|
|SEMFEM||low-order discretization, AMG solver settings|
|AMG||cycle type, smoothers, interpolators, coarsening type, strength threshold|
3 Test Cases
We describe four model problems that are used to test the SE preconditioners. The first is a stand-alone Poisson solve, using variations of the Kershaw mesh. The others are modest-scale Navier-Stokes problems, where the pressure Poisson problem is solved over multiple timesteps. The problem sizes are listed in Table 3 and range from relatively small (=16M points) to moderately large (=645M).222Larger cases for recent full-scale runs on Summit with =51B are reported in [fischer_highly_2021].
|Boeing Speed Bump||885K||9||645M|
The Kershaw family of meshes [kolev_ceed_2021, kershaw_differencing_1981]. has been proposed as the basis for a high-order Poisson-solver benchmark by Center for Efficient Exascale Discretization (CEED) within the DOE Exascale Computing Project (ECP). This family is parametrized by an anisotropy measure, , that determines the degree of deformation in the and directions. As decreases, the mesh deformation and aspect ratio increase along with it. The Kershaw mesh is shown in Fig. 2 for several values of . The domain with Dirichlet boundary conditions on . The right hand side for (1) is set to
where is a random, continous vector vanishing on . The linear solver terminates after reaching a relative residual reduction of . Since this test case solves the Poisson equation, there is no timestepper needed for the model problem.
For the pressure-Poisson tests, three flow cases are considered, as depicted in Fig. 3. The first case corresponds to turbulent flow through a cylindrical packed-bed with 146 spherical pebbles. The second is a similar configuration with 1568 pebbles. Both bed flows are at Reynolds number , based on sphere diameter, . Time advancement is based on a two-stage 2nd-order characteristics timestepper with CFL=4 ( for the 146-pebble case, for the 1568-pebble case). An absolute pressure solver tolerance of is used. A restart at and convective time units is used for the 146- and 1568-pebble cases, respectively, to provide an initially turbulent flow.
The third case, shown in Fig. 3c, is direct numerical simulation (DNS) of seperated turbulent flow over a speed bump at . This test case was designed by Boeing to provide a flow that exhibits separation. A DNS of the full 3D geometry, however, remains difficult [shur_direct_2021]. Therefore, this smaller example proves a useful application for benchmarking solver performance. This case uses a 2nd-order timestepper with CFL=0.8 () and an absolute pressure-solve tolerance of . A restart at convective time units is used for the initial condition.
In all cases, solver results are collected over 2000 timesteps. At each step, the solution is projected onto a space of up to 10 prior solution vectors to generate a high-quality initial guess, . Projection is standard practice in nekRS as it can reduce the initial residual by orders of magnitude at the cost of just one or two matrix-vector products in per step [fischer_projection_1998]. The perturbation solution, , is typically devoid of slowly evolving low wave-number content. Moreover, the initial residual is oftentimes sufficiently small that the solution converges in iterations, such that the overhead of GMRES is small. Testing the preconditioners under these conditions ensures that the conclusions drawn are relevant to the application space.
Here we consider the solver performance results for the test cases of Section 3. We assign a single MPI rank to each GPU and denote the number of ranks as . All runs are on Summit. Each node on Summit consists of 42 IBM Power9 CPUs and 6 NVIDIA V100 GPUs. We use 6 GPUs per node unless .
4.1 Kershaw Mesh
The Kershaw study comprises four tests. For each of two studies, we consider the regular box case (
) and a highly skewed case (). The first study is a standard weak-scale test, where and are increased, while the polynomial order is fixed at and the number of gridpoints per GPU is set to . The range of processors is =6 to 384. The second study is a test of the influence of polynomial order on conditioning, with and fixed, while ranges from 3 to 10. Both cases use GMRES(20).
The results of the weak-scaling study are shown in Fig. 5. For both =1.0 and , the iteration count exhibits a dependence on problem size, as seen in Fig. 5a,d. The time-per-solve also increases with , in part due to the increase in iteration count, but also due to increased communication overhead as increases beyond unity. Table 4 indicates the maximum number of of neighboring processors for the assembly () graph of , which increases with . In addition, the number of grid points per GPU () is relatively low. These two factors cause an increased sensitivity of the problem to the additional communication overhead as the number of GPUs is initially increased. The number of neighbors, however, will saturate at larger (i.e., production-level) processor counts.
The influence of polynomial order is illustrated in Fig. 4. For , iteration counts are essentially -independent, as seen in Fig. 4a. For , however, a slight upward trend in the iteration count is observed for SEMFEM and for the pMG preconditioners with ASM, RAS, and Chebyshev-Jacobi smoothing (Fig. 4b). Figures 4c and d demonstrate that the time per pressure solve is strongly dependent polynomial order. For SEMFEM, there is an increase in time as a result of increased overhead for the underlying AMG solver. For the pMG-based methods, higher orders are generally faster. This performance gain can be attributed to surface-to-volume effects in the evaluation of the operators and smoothers. Application of and (smoother) is highly vectorizable, whereas application of (assembly) involves a significant amount of indirect addressing. The discrete surface to volume ratio for a spectral element of order is 296/512 60%. For lower this value is larger. This situation is exacerbated in the case of Schwarz-based smoothers, where the overlap contributes substantially to the work and communication for the small-element (i.e., low-) cases. In addition, nekRS [fischer_nekrs_2021] uses a single element per thread block, which limits the amount of work available for a streaming multiprocessor for relatively small polynomial orders.
4.2 Navier-Stokes Results
We consider scalability of nekRS for the cases of Fig. 3. All simulations except one use GMRES(15) with an initial guess generated by -conjugate projection onto 10 prior solutions [fischer_projection_1998]. Due to memory constraints, the 1568-pebble case with uses GMRES(10) with only 5 solution-projection vectors. For each case, two pMG schedules are considered: and for ; and and for . Other parameters, such as the Chebyshev order and the number of coarse grid BoomerAMG V-cycles are also varied. Results are shown in Fig. 6.
In all the performance tests conducted, the pMG preconditioner with Chebyshev-Jacobi smoothing is outperformed by the other preconditioners, whether using one or two V-cycle iterations in the AMG coarse-grid solve. For each case, the fastest preconditioner scheme varies. In the 146-pebble case, using Cheb-RAS(2),(7,5,3,1) yields the smallest time per pressure solve. However, in the 1568-pebble case, SEMFEM is a moderate improvement over the second best preconditioner, Cheb-ASM(2),(7,5,3,1). Lastly, pMG with a schedule and Chebyshev-RAS (of any order) yield the best scalability and lowest time per pressure solve for the Boeing speed bump case.
Across all cases, pMG with a good coarsening schedule, Chebyshev order, and smoother choice, such ASM or RAS, is better or at least comparable to SEMFEM. The optimal coarsening schedule and smoother choice, however, is not always obvious, especially when the relative ranking depends on the number of nodes used for the simulation. For example, Cheb-ASM(2),(7,5,3,1) is faster than Cheb-RAS(2),(7,5,3,1) for larger node counts. For smaller node counts, however, this order is reversed. These results make apparent the benefit of developing an autotuner, however rudimentary.
Lastly, solver performance degrades whenever is sufficiently small, regardless of the solver considered. For the various preconditioners considered, at 2 nodes (), the strong-scale limit of 80% efficiency is far surpassed in the 146-pebble case. This leaves the effective strong scale limit in the neighborhood of 1-2 nodes ( to ). For the 1568-pebble case, running on 12 nodes () yields a parallel efficiency around 60-70%, depending on the specific solver. Similarly, the parallel efficiency for the Boeing speed bump case for the fastest time-to-solution preconditioners drops below 70% when using more than 48 nodes, placing the strong scale limit around .
4.3 On-line Tuner
Auto-tuning results are presented in Table 5. Auto-tuning trials are performed on Step 1000 and timings are based on the last 1000 timesteps of the simulation, therefore excluding the cost of the tuning step.333The overhead of the parametric search for production simulations will be amortized over the remainder of the run, typically comprising – steps, and over the full simulation campaign, which might require mutiple runs. The table reports , the minimum time per solve from the auto-tuner. Also shown are , the auto-tuning speedup relative to the default multigrid smoother; , the speedup relative to the fastest preconditioner in Fig. 6; and , the speedup relative to the worst preconditioner in Fig. 6. Here, the default multigrid smoother is Cheb-ASM(2),(7,3,1) for and Cheb-ASM(2),(9,5,1) for .
In all cases, Cheb-Jac(2),(7,5,3,1) () Cheb-Jac(2),(9,7,5,1) () with two coarse grid iterations was the worst performer.444Presently, the tuner always uses a single coarse grid iteration. In many cases, the default preconditioner performance is close to the optimal found by the auto-tuner. In the 146-pebble case, switching from the default to Cheb-RAS(2),(7,5,3,1) yields speedups of 15%, 20%, and 10% for 1,2, and 3 nodes, respectively. Moreover, the auto-tuner typically chose a preconditioner comparable to the default. However, in the 1568-pebble case with solution projection turned on, the autotuner selected the rather poor options of Cheby-Jac(2),(7,3,1) or Cheby-Jac(3),(7,3,1). This choice was due to the initial residual being close enough to the specified tolerance that these suboptimal preconditioner happened to provide the fastest solution for that particular timestep. This projection-dependence is evident in Tables 4(a) and 4(b). This sensitivity suggests that one should sample over multiple timesteps, rather than taking several samples from the same timestep. However, as seen in the differences between Tables 4(c) and 4(d), this strategy alone is not always sufficient to avoid erroneous preconditioner selection.
5 Conclusions and Future Work
In this work, we introduce Chebyshev-accelerated ASM and RAS smoothers for use in -multigrid preconditioners for the spectral element Poisson problem. Further, we compare the performance of Schwarz-based smoothers, Chebyshev-accelerated Jacobi smoothing, and SEMFEM-based preconditioning for a suite of challenging test problems. We conclude that the Chebyshev-accelerated Schwarz smoothers with -multigrid, as well as the low-order SEMFEM preconditioner solved with AmgX, are performant on GPU machines such as OLCF’s Summit. We also introduce an on-line auto-tuner for the purposes of preconditioner selection and smoother optimization. Additional tuning options, such as using the SEMFEM discretization as the coarse grid solver, varying the AMG solver settings for the coarse grid solve as well as the solver location (CPU/GPU), varying the Chebyshev order and number of sweeps on each level of the multigrid heirarchy, experimenting with F- and W-cycle multigrid, and varying the Chebyshev eigenvalue bounds, are avenues for future performance optimization.
This research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative. This research also used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC05-00OR22725. The authors thank YuHsiang Lan for providing visualizations and mesh files for the pebble cases. The Ramesh Balakrishnan for providing mesh files for the Boeing speed bump.