1 Introduction
The computational landscape in high-performance computing (HPC) has undergone a dramatic change in recent years. General-purpose CPUs with large caches and sophisticated pipelined execution have been largely replaced by low-power, data-parallel GPU accelerators in the vast majority of leadership-class US supercomputers. While this hardware shift has paved the way for large-scale simulation codes to approach exascale computing [ceed], it has come at a significant development cost, since HPC applications have had to rethink their numerical algorithms and incorporate new programming models and memory management strategies in order to run efficiently on these advanced architectures.
In this paper we discuss our experience with the GPU port of one such HPC application, the MARBL code, and review our co-design strategy to address the challenges of performance and portability across different computing architectures. We present a twofold approach, wherein new hardware is used to motivate both new algorithms and abstraction layers, resulting in a single-source application code suitable for a variety of platforms. Our work includes the development of GPU-oriented matrix-free algorithms, porting and optimization via mini-apps, and the utilization of open-source software libraries, such as MFEM (finite element algorithms) [anderson2021mfem], RAJA (kernel abstraction) [beckingsale2019raja] and Umpire (memory management) [beckingsale2019umpire]. We demonstrate scalability on different platforms and highlight that many of our innovations have been contributed back to the MFEM and RAJA libraries.
Our focus in this work is on MARBL, a next-generation multiphysics simulation code developed at Lawrence Livermore National Laboratory for simulating high energy density physics (HEDP) and focused experiments driven by high-explosive, magnetic, or laser-based energy sources for applications including pulsed power and inertial confinement fusion (ICF) [marbl]. MARBL is built on modular physics and computer science components and makes extensive use of high-order finite element numerical methods. Compared to standard low-order finite volume schemes, high-order numerical methods provide more resolution/accuracy per unknown and have higher FLOP/byte ratios, meaning that more floating-point operations are performed for each piece of data retrieved from memory. This leads to improved strong parallel scalability, better throughput on GPU platforms, and increased computational efficiency.
Prior to the work described in this paper, MARBL’s Arbitrary Lagrangian-Eulerian (ALE) multi-material compressible hydrodynamics capability was an MPI-only massively parallel code based on full matrix assembly. In the remaining sections, we describe the process of restructuring MARBL’s ALE component into an MPI + X code, where X is a shared-memory threading model. In addition, we discuss the algorithmic refactoring which was done to convert the full matrix assembly approach into a more efficient and higher-performance version based on “matrix-free” partial assembly techniques. A key detail in our success was the co-design effort between the application and tool libraries (MFEM, RAJA and Umpire) to steer future directions in library developments. One of the main lessons learned in this effort was that customization of established portability tools was crucial to developer productivity and application performance.
The rest of this article is organized as follows: We begin with an overview of the high-order finite element ALE hydrodynamics approach in MARBL, and describe key concepts of the finite element formulation and algorithmic ideas in Section 2. We then discuss the approach for achieving performance and portability in Sections 3 and 4. Our two-pronged approach includes algorithmic improvements, in particular matrix-free partial assembly, as well as performance improvements achieved by exposing key programming features through our portable kernel layer abstraction and memory-based optimizations through the Umpire library. We demonstrate the overall performance and scalability of MARBL across both GPU and CPU platforms in Section 5, review our mini-app driven development in Section 6, and showcase these developments in the context of high-order 3D multi-material simulations in Section 7. Finally, we provide some concluding remarks in Section 8.
2 Overview of ALE
MARBL is a multiphysics code including three-temperature radiation-magnetohydrodynamics with thermonuclear burn for fusion plasmas. Central to it is the multi-material Arbitrary Lagrangian-Eulerian (ALE) compressible hydrodynamics capability (based on the BLAST code, see for example [anderson2018high]), which we will focus on in this paper. The high-order finite element ALE approach of MARBL splits the computation into three phases, as illustrated in Figure 1. The first phase (Lagrange phase) evolves the simulation within a Lagrangian frame of reference until the mesh reaches an unacceptable level of distortion; the mesh is then relaxed (mesh optimization phase), and the solution is remapped onto the optimized mesh (remap phase). We provide a high-level overview of each of the phases in this section and review key algorithmic implementation details in Section 3.
2.1 High-Order Lagrange Phase
The Lagrange phase solves the time-dependent Euler equations of compressible hydrodynamics within a Lagrangian frame of reference [dobrev2012high]. In the case of multiple materials, indexed by $k$, we solve the following conservation laws from hydrodynamics:
(1) $\rho \, \dfrac{d\bm{v}}{dt} = \nabla \cdot \bm{\sigma}$
(2) $\dfrac{1}{\eta_k}\dfrac{d(\eta_k \rho_k)}{dt} = -\rho_k \, \nabla \cdot \bm{v}$
(3) $\dfrac{d\eta_k}{dt} = \dot{\eta}_k$
(4) $\eta_k \rho_k \, \dfrac{d e_k}{dt} = \eta_k \, \bm{\sigma}_k : \nabla \bm{v}$
(5) $\dfrac{d\bm{x}}{dt} = \bm{v}$
Here $\rho_k$ corresponds to material density, $\bm{v}$ is the common material velocity, $\eta_k$ serves as an indicator function which keeps track of material volume fraction per zone, $\dot{\eta}_k$ is the rate of change in time of $\eta_k$, $e_k$ corresponds to specific internal material energy, $p$ is a common thermodynamic pressure, see [dobrev2016closure], and $\bm{x}$ is the mesh node location. The formulation makes use of the material derivative
$\dfrac{d\,\cdot}{dt} = \dfrac{\partial\,\cdot}{\partial t} + \bm{v}\cdot\nabla(\cdot),$
as well as a total stress tensor for each material,
$\bm{\sigma}_k = \bm{\sigma}_k^{\mathrm{phys}} + \bm{\sigma}_k^{\mathrm{av}},$
which is the sum of the physical stresses and the stress due to artificial viscosity. Building on the MFEM finite element library, MARBL uses arbitrary high-order finite elements to solve Equations (1)–(5). A continuous Galerkin discretization is used for the kinematic variables (mesh velocity and position), while a discontinuous Galerkin discretization is used for the thermodynamic variables (indicator functions, density, and energy). The approach presented here can be viewed as a generalization of the traditional staggered-grid discretization for hydrodynamics (SGH) [barlow2016arbitrary].
The finite element algorithm used to solve Equations (1)–(5) requires solving a global linear system for the momentum conservation equation. The key operators in this algorithm are the “mass” matrix and a “force” matrix connecting the kinematic and thermodynamic variables. Computation of a hyperviscosity term for shock capturing requires an additional “stiffness matrix” operator, as described in [maldonado2020]. The traditional approach of fully assembling the corresponding matrices for these operators has been found to have poor scaling characteristics, motivating the development of a matrix-free approach.
2.2 Mesh Optimization
The mesh optimization algorithm in MARBL is based on the Target-Matrix Optimization Paradigm (TMOP) [dobrev2019target, knupp2012introducing] for high-order meshes. The method is based on node movement and does not alter the mesh topology. Optimal mesh node positions are determined by minimizing a nonlinear variational functional. One of the main advantages of this approach is that all computations can be expressed through finite element operations, making the algorithm well-suited for a matrix-free partial assembly implementation, as discussed in Section 3.4. The TMOP objective function has the form:
(6) $F(\bm{x}) = \int_{\Omega}\mu(T(\bm{x}))\,d\bm{x} \;+\; w_0\int_{\Omega}\dfrac{|\bm{x}-\bm{x}_0|^2}{d^2}\,d\bm{x}$
where $\bm{x}$ is the vector of mesh positions, $T$ is a Jacobian matrix that represents the transformation between user-specified target elements and physical elements, and $\mu$ is a nonlinear function that measures mesh quality. The second term in (6) is used to limit the node displacements based on the maximum allowed displacements $d$ that are defined with respect to the initial mesh position $\bm{x}_0$. The normalization constant $w_0$ balances the magnitudes of the two terms; its value is computed once on the initial mesh. The objective is minimized by solving the nonlinear problem $\partial F / \partial \bm{x} = 0$ with Newton's method. This requires the ability to compute the gradient and to apply/invert the Hessian of the objective functional. In addition, TMOP has the ability to perform simulation-based mesh adaptivity by querying the values of a discrete function $\sigma$ provided by the simulation [dobrev2020adaptivity]. For example, $\sigma$ can prescribe local mesh size based on the current state of the physical system. Since $\sigma$ is only defined with respect to the initial mesh, the adaptivity process includes a procedure to solve a pseudo-time advection equation, namely,
(7) $\dfrac{\partial \sigma}{\partial \tau} = \bm{u}\cdot\nabla\sigma$
which is a method to advect the function $\sigma$ between different meshes, as done in the remap phase of the code, see Section 2.3. All methods described in this section are based on standard finite element operations that can be performed in a matrix-free manner.
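The Newton-based minimization described above has a simple structure: evaluate the gradient, apply/invert the Hessian, and update the positions. A minimal one-dimensional sketch of that structure follows; the objective, function names, and interface here are hypothetical stand-ins, not TMOP's actual metric or API.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Hypothetical 1D stand-in for the structure of the TMOP solve: minimize F
// by driving dF/dx to zero with Newton's method. The caller supplies the
// gradient and the Hessian, mirroring TMOP's requirement to compute the
// gradient and apply/invert the Hessian of the objective functional.
double newtonMinimize(const std::function<double(double)>& grad,
                      const std::function<double(double)>& hess,
                      double x, int maxIter = 50, double tol = 1e-12) {
  for (int it = 0; it < maxIter && std::fabs(grad(x)) > tol; ++it) {
    x -= grad(x) / hess(x); // Newton update: solve H * dx = -g, set x += dx
  }
  return x;
}
```

In the actual mesh optimization, $x$ is the full vector of node positions, and the Hessian is only applied matrix-free rather than formed, as discussed in Section 3.4.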
2.3 Remap Phase
Following the mesh optimization, the ALE remap phase is responsible for transferring field quantities from the current Lagrangian mesh to the optimized mesh. The transfer process is posed as the advection of field values between the two meshes over a fictitious time interval $\tau \in [0, 1]$. For each material $k$, we form and solve the following set of advection equations:
(8) $\dfrac{d\bm{v}}{d\tau} = \bm{u}\cdot\nabla\bm{v}$
(9) $\dfrac{d\eta_k}{d\tau} = \bm{u}\cdot\nabla\eta_k$
(10) $\dfrac{d(\eta_k\rho_k)}{d\tau} = \bm{u}\cdot\nabla(\eta_k\rho_k)$
(11) $\dfrac{d(\eta_k\rho_k e_k)}{d\tau} = \bm{u}\cdot\nabla(\eta_k\rho_k e_k)$
(12) $\dfrac{d\bm{x}}{d\tau} = \bm{u}$
Here $\bm{u}$ corresponds to the mesh displacement, and the field variables are the same as those in the Lagrange phase, Equations (1)–(5).
For the transfer of the momentum field (8) we follow a standard continuous Galerkin discretization. Since only the action of the operators is needed, we bypass the assembly of the global matrices. For the transfer of the remaining fields, it is imperative that the transfer procedure is conservative, accurate, and bounds-preserving, i.e., the solution remains in a set of admissible bounds. In our target application of remapping multi-material fields, preservation of solution bounds plays a critical role at material interfaces in order to achieve high-fidelity simulations. For ALE remap we use the approach described in [anderson2015monotonicity]. The underlying strategy is to begin with a low-order solution which is guaranteed to be within bounds, and to use a separate high-order solution to make corrections to the low-order solution. The correction strategy draws on principles from the flux-corrected transport (FCT) literature [kuzmin2012flux, anderson2015monotonicity]. These nonlinear corrections require algebraic modifications to the fully assembled advection matrix in the DG-FCT scheme, so the current implementation cannot be done matrix-free. However, we are actively researching and developing a new DG remap method which retains all of the numerical benefits but can be carried out in a high-performance, matrix-free manner.
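The low-order/high-order blending idea above can be sketched per degree of freedom as follows. This is a conceptual illustration only, with invented names; the actual FCT scheme limits antidiffusive fluxes conservatively rather than clipping nodal values independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Conceptual sketch of the bounds-preservation step: start from the
// bounds-safe low-order value and add as much of the high-order
// correction as the admissible bounds [lo, hi] allow at each dof.
std::vector<double> limitedBlend(const std::vector<double>& lowOrder,
                                 const std::vector<double>& highOrder,
                                 const std::vector<double>& lo,
                                 const std::vector<double>& hi) {
  std::vector<double> out(lowOrder.size());
  for (std::size_t i = 0; i < lowOrder.size(); ++i) {
    const double d = highOrder[i] - lowOrder[i]; // desired correction
    double alpha = 1.0;                          // fraction we can keep
    if (d > 0.0)      alpha = std::min(1.0, (hi[i] - lowOrder[i]) / d);
    else if (d < 0.0) alpha = std::min(1.0, (lo[i] - lowOrder[i]) / d);
    out[i] = lowOrder[i] + alpha * d;            // stays within [lo, hi]
  }
  return out;
}
```

Where the high-order value is already admissible it is kept unchanged, so accuracy is only sacrificed where the bounds would otherwise be violated.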
3 Matrix-Free Partial Assembly
For many applications, the assembly of large global sparse matrices can be a performance bottleneck. Computation with the common CSR matrix format also leads to indirect memory access, which further limits performance. In most cases, the assembly of the finite element operators in the ALE phases can be avoided, as only the action of the operator is needed. We leverage this by using matrix-free algorithms based on the partial-assembly approach described in [anderson2021mfem] and available in the MFEM library. Under partial assembly, FEM operators are decomposed into a sequence of operators with cascading scope, requiring only quadrature-specific data to be stored. For completeness we provide a brief overview of the approach; we refer the reader to [anderson2021mfem] for additional details and to the MFEM library [mfemweb] for an open-source implementation.
3.1 Operator Decomposition
Following the nomenclature introduced in [anderson2021mfem], for a given global finite element operator $A$ distributed over a number of processors, we recognize that its assembly or evaluation can be decomposed into a product of operators with cascading scope corresponding to parallel ($P$), mesh ($G$), basis ($B$), and quadrature-specific ($D$) parts: $A = P^T G^T B^T D B G P$.
Starting with a distributed mesh (as depicted in Figure 2), the role of the $P$ operators is to perform the transfer from a global to a processor-local domain. The $G$ operators perform an additional restriction/prolongation at the element level, while $B$ holds the values of the basis functions sampled at quadrature points. For simplicity we assume that the test and trial basis functions are the same, but note that this is not required. Lastly, $D$ corresponds to element-specific quadrature data. Under the partial-assembly methodology, each of these operators is precomputed and stored.
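The decomposition can be read as a recipe for applying the operator: each factor's action is applied in sequence, right to left, so the assembled global matrix never exists. A minimal sketch of that composition pattern (illustrative types, not MFEM's classes):

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;

// Apply a factored operator such as A = P^T G^T B^T D B G P by composing
// the action of each factor right-to-left; only actions are needed, so no
// global matrix is ever formed.
Vec applyFactored(const std::vector<Op>& factors, Vec x) {
  for (auto it = factors.rbegin(); it != factors.rend(); ++it) {
    x = (*it)(x); // apply the next factor's action
  }
  return x;
}
```

In MFEM, the individual factors are either precomputed (e.g., $B$ and $D$ under partial assembly) or realized as communication/gather operations ($P$, $G$).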
3.2 Accelerated Matrix-Vector Evaluation Using Sum Factorization
Building on the concepts of operator decomposition, we restrict our discussion to quadrilateral and hexahedral elements, thereby enabling us to exploit the tensor product structure of the basis functions. Continuing with the nomenclature introduced in Section 3.1, we focus on the element-local matrix-vector product,
(13) $\bm{y}_E = B^T D B\, \bm{x}_E.$
Using the element-level mass matrix, $M_E = B^T D B$, to lead our discussion, we express its construction as
$(M_E)_{ij} = \sum_q w_q\, |J(\hat{\bm{x}}_q)|\, \varphi_j(\hat{\bm{x}}_q)\, \varphi_i(\hat{\bm{x}}_q).$
Here $w_q$ corresponds to the quadrature weights, $\varphi_i$ are the basis functions sampled at the quadrature points $\hat{\bm{x}}_q$, and $|J|$ is the geometric factor of the mapping between the reference and physical element. When using quad and hex elements, the multi-dimensional basis functions are simply tensor products of their one-dimensional counterparts. In 2D the basis functions have the form
$\varphi_i(\hat{x},\hat{y}) = \hat{\varphi}_{i_1}(\hat{x})\, \hat{\varphi}_{i_2}(\hat{y}).$
Recognizing this decomposition enables us to decompose the $B$ operator in Equation (13) into
$B = B^{1d} \otimes B^{1d},$
where $B^{1d}_{q i} = \hat{\varphi}_i(\hat{x}_q)$ are the one-dimensional basis functions sampled at the one-dimensional quadrature points. Finally, by arranging $\bm{x}_E$ as a two-dimensional array $u_{i_1 i_2}$, computing the action of $B$ may be expressed as a sequence of tensor contractions
$(B\, \bm{x}_E)_{q_1 q_2} = \sum_{i_1} B^{1d}_{q_1 i_1} \sum_{i_2} B^{1d}_{q_2 i_2}\, u_{i_1 i_2}.$
These tensor contractions can be implemented as a sequence of dense matrix-matrix multiplications suitable for single instruction, multiple data (SIMD) parallelism.
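The two-pass contraction can be made concrete with a small sketch (illustrative only, not MFEM's implementation; the sizes are hypothetical): applying the 2D interpolation $B = B^{1d} \otimes B^{1d}$ naively versus via sum factorization, where the inner index is contracted first.

```cpp
#include <cassert>
#include <cmath>

constexpr int P = 3; // 1D dofs per element (hypothetical)
constexpr int Q = 4; // 1D quadrature points (hypothetical)

// Naive action: Bu[q1][q2] = sum_{i,j} B1d[q1][i] * B1d[q2][j] * u[i][j],
// costing O(Q^2 P^2) per element.
void applyNaive(const double B1d[Q][P], const double u[P][P], double Bu[Q][Q]) {
  for (int q1 = 0; q1 < Q; ++q1)
    for (int q2 = 0; q2 < Q; ++q2) {
      Bu[q1][q2] = 0.0;
      for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
          Bu[q1][q2] += B1d[q1][i] * B1d[q2][j] * u[i][j];
    }
}

// Sum-factored action: contract the j index first, then the i index.
// The cost drops to O(Q P^2 + Q^2 P) dense matrix-matrix products.
void applySumFactored(const double B1d[Q][P], const double u[P][P],
                      double Bu[Q][Q]) {
  double tmp[P][Q]; // tmp[i][q2] = sum_j B1d[q2][j] * u[i][j]
  for (int i = 0; i < P; ++i)
    for (int q2 = 0; q2 < Q; ++q2) {
      tmp[i][q2] = 0.0;
      for (int j = 0; j < P; ++j) tmp[i][q2] += B1d[q2][j] * u[i][j];
    }
  for (int q1 = 0; q1 < Q; ++q1)
    for (int q2 = 0; q2 < Q; ++q2) {
      Bu[q1][q2] = 0.0;
      for (int i = 0; i < P; ++i) Bu[q1][q2] += B1d[q1][i] * tmp[i][q2];
    }
}
```

Both routines produce identical results; only the operation count differs, and in 3D the savings grow to a full power of the polynomial order.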
To illustrate the advantages of the partial assembly approach, Table 1 compares storage and algorithmic complexity between full assembly and partial assembly of matrix-vector evaluations per element, for elements of order $p$ in $d$ dimensions. As an example, when $p = 2$ and $d = 3$, the storage for 1000 elements would require around 729K floating-point values, while the partial assembly approach would only require storing 27K floating-point values; similar trends follow when considering floating-point operations for assembly and for applying the action of the operator. In summary, partial assembly provides clear advantages in terms of both memory requirements and floating-point operations.
Method                                 | Storage    | Assembly     | Evaluation
Traditional full assembly + matvec     | O(p^{2d})  | O(p^{3d})    | O(p^{2d})
Sum-factorized full assembly + matvec  | O(p^{2d})  | O(p^{2d+1})  | O(p^{2d})
Partial assembly + matrix-free action  | O(p^{d})   | O(p^{d})     | O(p^{d+1})
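The storage figures quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes one stored value per dof-pair for full assembly and one per quadrature point for partial assembly, with the number of quadrature points taken equal to the number of dofs, $(p+1)^d$, per element; the function names are illustrative.

```cpp
#include <cassert>

// Dense element matrices: (p+1)^d dofs squared, per element.
long fullAssemblyValues(long elems, long p, long d) {
  long dofs = 1;
  for (long k = 0; k < d; ++k) dofs *= (p + 1); // (p+1)^d dofs per element
  return elems * dofs * dofs;                   // dense element matrix
}

// Partial assembly: one value per quadrature point, per element
// (assuming the quadrature count matches the dof count).
long partialAssemblyValues(long elems, long p, long d) {
  long pts = 1;
  for (long k = 0; k < d; ++k) pts *= (p + 1);
  return elems * pts;
}
```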
3.3 Partial Assembly in Lagrange Phase
Applying the finite element discretization to Equations (1)–(5) leads to the so-called semi-discrete set of equations which, under the appropriate initial and boundary conditions, can be integrated to a final time with standard numerical time-integration techniques:
(14) $\mathbf{M}_{\bm{v}}\, \dfrac{d\bm{v}}{dt} = -\bm{F}\cdot\bm{1}$
(15) $\dfrac{d\eta_k}{dt} = \dot{\eta}_k$
(16) $\dfrac{d(\eta_k\rho_k)}{dt} = -\eta_k\rho_k\, \nabla\cdot\bm{v}$
(17) $\mathbf{M}_{e_k}\, \dfrac{d\bm{e}_k}{dt} = \bm{F}_k^{T}\cdot\bm{v}$
(18) $\dfrac{d\bm{x}}{dt} = \bm{v}$
The key operators in Equations (14)–(18) are the mass matrix, $\mathbf{M}_{\bm{v}}$, arising from the conservation of momentum in (14), and the force matrix, $\bm{F}$, connecting the kinematic and thermodynamic spaces. There is an additional operator required for computing the hyperviscosity term, as discussed in [maldonado2020]. Both $\mathbf{M}_{\bm{v}}$ and $\bm{F}$ are global operators, and applying their action requires global communication in a distributed memory setting, for example using domain decomposition with MPI. Based on the memory and algorithmic complexity requirements presented in Table 1, our algorithms bypass the full matrix assembly and rely on the partial-assembly approach when applying the action of these operators.
3.4 Matrix-Free Mesh Optimization Phase
The computations in this phase utilize a fully matrix-free approach, i.e., all data that is needed at quadrature points is computed on the fly and no global matrix is ever assembled or stored. This decision is motivated by the complex structure of the Hessian of the objective $F$ and the fact that stored quadrature data cannot be reused in consecutive calculations. The major computational kernels in this phase are related to the first and second derivatives of the integral in (6). These are given here to provide the relevant context for the discussion:
(19) $\dfrac{\partial F}{\partial x_i^a} = \sum_{q=1}^{N_q} w_q\, \dfrac{\partial \mu}{\partial T}\big(T(\bm{x}_q)\big) : \dfrac{\partial T(\bm{x}_q)}{\partial x_i^a}$
(20) $\dfrac{\partial^2 F}{\partial x_i^a\, \partial x_j^b} = \sum_{q=1}^{N_q} w_q\, \dfrac{\partial T(\bm{x}_q)}{\partial x_j^b} : \dfrac{\partial^2 \mu}{\partial T^2}\big(T(\bm{x}_q)\big) : \dfrac{\partial T(\bm{x}_q)}{\partial x_i^a}$
The above derivatives are taken with respect to the mesh position vector $\bm{x} = \{x_i^a\}$, where $a = 1,\dots,d$, $d$ is the dimension, and each component's expansion is $x^a = \sum_i x_i^a\, \bar{w}_i(\hat{\bm{x}})$ in terms of the mesh basis functions $\bar{w}_i$. These integral computations use a standard Gauss-Legendre quadrature rule with $N_q$ points, $w_q$ are the corresponding quadrature weights, $\partial\mu/\partial T$ is a $d \times d$ matrix, and $\delta$ is the Kronecker delta, which arises when the derivatives $\partial T/\partial x_i^a$ are expanded componentwise.
In addition to the standard coefficients and geometric data that are required to compute the integrals, the quadrature-based GPU calculations include the following: the target matrices $W$ and the mesh quality metric $\mu$; the first derivatives $\partial\mu/\partial T$, which are $d \times d$ matrices; and the second derivatives $\partial^2\mu/\partial T^2$, which are 4-tensors of dimension $d \times d \times d \times d$.
Matrixfree GPU kernels were developed for the computation of the objective (6), the assembly of the nonlinear residual vector (19), and the action of the Hessian (20). Furthermore, diagonal Jacobi preconditioning of the Hessian operator is performed by a matrixfree computation of the diagonal of (20). All of the above kernels utilize the tensor structure of the basis functions to perform efficient tensor contractions and achieve the optimal operation complexity, see Section 3.2.
GPU support is also provided for TMOP's mesh adaptivity capabilities. In this case, the construction of the target matrices $W$ depends on a discrete function $\sigma$ that must be advected between different meshes through (7). As (7) is a standard advection equation, we make direct calls to MFEM's existing internal partial assembly kernels for the action of the mass and convection operators.
3.5 Partial Assembly in Remap Phase
Lastly, for the remap phase the set of semi-discrete equations takes the form
(21) $\mathbf{M}\, \dfrac{d\bm{v}}{d\tau} = \mathbf{K}\, \bm{v}$
(22) $\tilde{\mathbf{M}}\, \dfrac{d\bm{\eta}_k}{d\tau} = \tilde{\mathbf{K}}\, \bm{\eta}_k$
(23) $\tilde{\mathbf{M}}\, \dfrac{d(\bm{\eta}_k\bm{\rho}_k)}{d\tau} = \tilde{\mathbf{K}}\, (\bm{\eta}_k\bm{\rho}_k)$
(24) $\tilde{\mathbf{M}}\, \dfrac{d(\bm{\eta}_k\bm{\rho}_k\bm{e}_k)}{d\tau} = \tilde{\mathbf{K}}\, (\bm{\eta}_k\bm{\rho}_k\bm{e}_k)$
(25) $\dfrac{d\bm{x}}{d\tau} = \bm{u}$
where $\mathbf{M}$ and $\mathbf{K}$ are the “mass” and “convection” matrices corresponding to the continuous Galerkin discretization. As only the action is required, we follow the operator decomposition in Section 3.1 to bypass matrix assembly. The remaining operators, $\tilde{\mathbf{M}}$ and $\tilde{\mathbf{K}}$, are algebraically modified “mass” and “convection” matrices from the algebraic DG-FCT discretization discussed in Section 2.3. Due to these algebraic modifications, the operators currently require full matrix assembly. An active area of research is the development of a matrix-free remap framework for MARBL.
4 Platform Portability Through MFEM, RAJA, and Umpire
The effort to refactor MARBL’s ALE algorithms and data structures was a multi-year process that required a number of cross-project collaborations; it was an iterative process that provided many opportunities for innovation and application-driven development. In this section, we describe the integration of technology and the restructuring of MARBL to improve productivity and code performance.
Since MFEM serves as a toolkit for developing algorithms within MARBL, introducing GPU capabilities into MFEM was a natural starting point for MARBL's platform portability. With the release of MFEM 4.0, there were a number of features that provided the foundation for applications to begin leveraging GPUs, including fundamental finite element routines such as the computation of geometric factors and quadrature interpolation, as well as matrix-free action for a subset of its finite element operators.
MFEM also introduced intermediate layers between memory allocations and kernel execution to support GPU acceleration. The intermediate layer for kernels enables users to choose a backend at runtime; supported backends include the RAJA abstraction layer [beckingsale2019raja], native OpenMP, CUDA, HIP, and OCCA [medina2014occa]. Additionally, data structures in MFEM such as vectors, arrays, and matrices are equipped with an intermediate layer which keeps track of host and device pointers and coordinates transfers between the different memory spaces.
4.1 Development with RAJA in MARBL
After updating to MFEM 4.0, MARBL gained the capability to perform basic memory movement between host and device as well as the ability to offload a select number of MFEM routines to a device. To further enable GPU capabilities, the RAJA abstraction layer was chosen to express computational kernels within MARBL.
With the RAJA abstraction layer, loop bodies are captured in C++ lambdas, and template specialization via “execution policies” selects an underlying programming model, i.e., Sequential, OpenMP, CUDA, or HIP. To complement MFEM’s runtime selectivity, development in MARBL is done through a macro layer that forwards the lambda to the RAJA method matching the MFEM-configured backend (i.e., CPU, OpenMP, CUDA, or HIP). Listings LABEL:C_style_for, LABEL:RAJA_style_for and LABEL:Marbl_style_for compare a basic loop expressed using standard C, native RAJA, and the MARBL-RAJA abstraction layer, respectively.
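The macro-layer idea can be sketched in a few lines. The names below (`forallBackend`, `MARBL_FORALL`) are invented for illustration, and only a sequential backend is shown; a real build would forward the lambda to `RAJA::forall` with the execution policy matching MFEM's configured backend.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of a MARBL-style macro layer: the loop body is
// captured in a C++ lambda and forwarded to whichever backend is selected.
template <typename Body>
void forallBackend(int n, Body&& body) {
  for (int i = 0; i < n; ++i) body(i); // sequential stand-in for RAJA::forall
}

// The macro hides the lambda boilerplate from kernel authors.
#define MARBL_FORALL(i, N, ...) forallBackend(N, [=](int i) __VA_ARGS__)
```

Capturing by value (`[=]`) copies raw data pointers into the lambda, which is the pattern required for device execution, where host-side containers cannot be dereferenced directly.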
For most of the initial refactoring in MARBL, the RAJA “forall” method was the workhorse for developing portable kernels. The simplicity of RAJA forall made it easy to integrate, but that simplicity also limited the exposed parallelism, which constrained performance and motivated the search for abstractions that could better utilize the hardware. While more detailed threading was available through the RAJA “kernel” interface, we determined that the associated per-kernel policies were too complex to maintain for our ALE hydro application. Through a co-design effort that leveraged expertise from the RAJA, MFEM, and MARBL teams, we developed the “RAJA Teams” API, which is available in RAJA v0.12+ as an experimental feature.
4.2 Enhancing Performance with RAJA Teams
To develop the RAJA Teams abstraction layer, we used the following guiding principles:

- Simplicity: The API should be simple to understand and use so that developers can focus on algorithms.
- Expose key features: For GPUs, we found that the use of device shared memory improved performance, and an abstraction is needed so that algorithms remain correct when executed on the CPU.
- Lightweight: Abstraction layers should add minimal overhead and allow the compiler to optimize.
We developed the RAJA Teams framework using a bottom-up approach, in which we built an abstraction layer around loop patterns arising from partial-assembly based algorithms for high-order finite elements. Listing LABEL:Marbl_style_threads illustrates an example of the RAJA Teams API to load quadrature data from a DenseTensor into a shared memory array.
The RAJA Teams API is composed of three core concepts. The first is the Launch Method, which is responsible for creating a kernel execution space. The launch method uses a runtime value, ExecPlace, and a grid configuration of teams and threads to switch between host and device execution. The teams and threads concept is based on the CUDA and HIP programming models, in which computation is performed on a predefined grid composed of threads, which are grouped into blocks (teams in RAJA). Second, the Launch Context in the launch lambda is used for control flow within the kernel, e.g., to perform thread synchronizations as would be done with CUDA's or HIP's __syncthreads() routine. Lastly, within the kernel execution space, algorithms are expressed in terms of a nested loop hierarchy. Hierarchical parallelism is naturally expressed through RAJA Loop Methods, where, in CUDA, “outer” loops are mapped to block-based policies and inner loops may be mapped to thread-based policies.
In addition, because the utilization of GPU shared memory is an important cornerstone for kernel optimizations, the RAJA Teams API includes a macro, RAJA_TEAM_SHARED, which expands to device shared memory or host stack memory, depending on the desired execution space.
To simplify development within MARBL, we also developed a macro layer over the RAJA Teams API, similar to our RAJA forall macro wrapper. Listings LABEL:CUDA_style_threads, LABEL:RAJA_style_threads and LABEL:Marbl_style_threads compare the native, RAJA, and MARBL kernel APIs. As can be seen in Listing LABEL:Marbl_style_threads, the macros allow us to remove much of the boilerplate code that would be common to all Teams-based kernels over mesh elements and quadrature points.
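The shared-memory portability idea behind RAJA_TEAM_SHARED can be sketched as follows. The macro name and kernel here are hypothetical (this is not RAJA's implementation): in device code the storage would expand to `__shared__` memory, while on the host it becomes an ordinary stack array, so the same kernel body is correct in both execution spaces.

```cpp
#include <cassert>

// Hypothetical portability macro: device shared memory on the GPU,
// plain stack storage on the host.
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
#define TEAM_SHARED __shared__
#else
#define TEAM_SHARED
#endif

// Stage element data through the per-team buffer before reducing it,
// mimicking the load-to-shared-then-compute pattern used in GPU kernels.
double sumViaTeamShared(const double* elemData, int n) {
  TEAM_SHARED double buffer[64]; // per-team scratch; assumes n <= 64
  for (int i = 0; i < n; ++i) buffer[i] = elemData[i]; // cooperative load
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += buffer[i];
  return s;
}
```

On a real device build, the loads and the reduction would be distributed over the team's threads with a synchronization (via the Launch Context) between the two loops.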
4.3 Enhancing Memory Management with Umpire
During MARBL’s initial GPU refactoring effort, we quickly realized that device memory allocations (e.g., device mallocs and frees) were drastically limiting runtime improvements; many kernels used local stack-allocated or fixed-lifetime temporary (scratch) memory that introduced expensive global synchronizations in every time step. With application run time in mind, we decided to move these large localized allocations into permanent memory (allocations lasting for the duration of the simulation). Finding performance-impacting allocations was tedious, as our approach relied on matching cudaMalloc API calls in the profiler timeline with our in-source annotations.
Although this initial approach eliminated the runtime impact of repeated allocations, it increased our overall device memory usage, which impacted our ability to run larger problems. Since our initial performant implementation resulted in many temporary buffers being moved into permanent memory, we devised a strategy in which we could reuse memory buffers in parts of the code where short-term scratch buffers were required. To assist in the effort, we integrated the Umpire memory resource library [beckingsale2019umpire]. With Umpire, developers can create multiple device allocators and configure them to use either the standard device malloc or special-purpose allocators like pools. With pools, a large chunk of memory is allocated up front and subsequent allocations are pulled from the pool. The total size of a pool in Umpire can change over time, either when more memory is needed than the pool currently holds or when a coalesce operation is requested. Pools are specifically designed to enable fast memory allocations and deallocations and are ideal for temporary or scratch memory that may be needed in kernels.
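The reuse idea behind pooled scratch allocations can be illustrated with a minimal stack-style pool (this is NOT Umpire's API; the class and method names are invented): one slab is obtained up front, suballocations are O(1) pointer bumps, and releasing the scratch allocations lets the same bytes be reused by the next kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal bump-pointer pool sketch illustrating the concept behind
// Umpire's pooled allocators for short-lived scratch buffers.
class ScratchPool {
  std::vector<unsigned char> slab_; // stands in for one large device malloc
  std::size_t offset_ = 0;
public:
  explicit ScratchPool(std::size_t bytes) : slab_(bytes) {}
  void* allocate(std::size_t bytes) {
    if (offset_ + bytes > slab_.size()) return nullptr; // pool exhausted
    void* p = slab_.data() + offset_;
    offset_ += bytes;
    return p;
  }
  void releaseAll() { offset_ = 0; } // scratch lifetimes end together
  std::size_t used() const { return offset_; }
};
```

A real pool must also handle alignment, out-of-order frees, and growth/coalescing, which is precisely what Umpire provides; the sketch shows only why pooled scratch memory avoids both repeated device mallocs and permanently held buffers.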
As previously mentioned, MARBL is built on MFEM, and nearly all of its data structures related to device computations are MFEM objects. With the MFEM 4.0 release, all MFEM data containers are built on an internally defined Memory<T> memory buffer container. Similar to MFEM's Device policy, the memory policies used are configurable at runtime. Each Memory object contains a host pointer and, if a compute device is being used, a device pointer. These Memory objects know how to allocate and copy memory between their associated spaces. They also track the validity of the underlying data in each space (host, device, or both), which greatly aids memory synchronization and debugging. As part of our porting effort, we enhanced MFEM's memory manager to support Umpire allocators. We also introduced finer-grained controls on per-allocation setup to make it easy to use a single permanent allocator by default, but a temporary or scratch allocator when appropriate.
In order to measure memory utilization, we ran a representative 3D shaped charge problem on 12 GPUs with ALE and TMOP. When compared to our initial performant implementation that used 181 GB of device memory, our memoryoptimized version used only 97.3 GB, a reduction of nearly 50%, of which 55% (53.6 GB) is now in a shareable temporary allocator. This temporary allocator can be used by other packages in MARBL that, when also configured to use Umpire, further reduce total memory utilization in multiphysics/multipackage simulations.
4.3.1 Thread Local Memory
After we started seeing diminishing returns on managing memory spaces and lifetimes, we began to optimize the usage of our allocated memory. The CUDA C API makes it easy to query runtime memory usage, but it does not reveal how your memory is actually being used. Notably, while much of the device memory is explicitly allocated by MARBL itself, some is allocated by the CUDA runtime, and for large problems we observed that our explicitly allocated device memory was only around 70% of the total allocated device memory. Further investigation revealed that this was due to CUDA runtime allocations for kernels with large amounts of spilled thread-local memory.
As discussed in Section 4.1, RAJA forall was used as the initial abstraction for device-friendly kernels, and while a number of kernels were revisited and ported to RAJA Teams for improved performance, many still existed in their original forall state. A subset of our remaining RAJA forall kernels used large iteration-local stack arrays of fixed size based on our maximum allowable polynomial order. Our maximum order requires much more iteration-local memory than is needed in many simulations and, crucially, the iteration-local arrays cause a CUDA runtime allocation, sized for the maximum, that is never released; if there is just one kernel with large thread-local storage requirements, those requirements will affect the memory availability for the entire simulation. Specifically, the CUDA runtime will allocate (bytes per thread) × (max threads per SM) × (number of SMs) bytes.
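The formula above makes the cost concrete. The helper below simply evaluates it; the V100 figures used in the check (2048 resident threads per SM, 80 SMs) are the published specifications for that part, and the 1 KiB-per-thread spill size is a hypothetical example.

```cpp
#include <cassert>

// Thread-local backing store held by the CUDA runtime, per the formula:
// (bytes per thread) x (max resident threads per SM) x (number of SMs).
long long localMemoryFootprintBytes(long long bytesPerThread,
                                    long long maxThreadsPerSM,
                                    long long numSMs) {
  return bytesPerThread * maxThreadsPerSM * numSMs;
}
```

For 1 KiB of spilled locals per thread on a V100-class GPU this comes to 160 MiB held by the runtime for a single kernel, independent of the problem size actually being run.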
To find the kernels that were causing noticeable increases in device memory usage, we intercepted calls to cudaLaunchKernel and measured the device memory availability before and after. We then incrementally refactored these kernels to a Teams-style shared memory implementation or, for kernels where this wasn't appropriate, we preallocated the thread-local memory in our Umpire temporary pool. Preallocating global memory resulted in a roughly 2x slowdown for each kernel modified this way, but for MARBL this was acceptable because the kernels in this group accounted for a negligible amount of runtime both before and after the refactor.
This refactoring removed the use of thread-local arrays owned by the CUDA runtime and freed up more than 2 GB of additional memory per device, a significant improvement considering that our GPUs have only 16 GB of memory.
Figure 3 shows our overall, temporary, permanent, and CUDA runtime allocations from when Umpire was introduced to after we implemented the thread-local-array refactoring.
5 Cross Platform Scaling Studies
Scaling studies help characterize how well performance scales across large problems and/or compute resources. This section describes a set of cross-platform scaling studies comparing MARBL's performance across three different architectures: a Commodity Technology Systems (CTS) cluster at LLNL, which we refer to as CTS1, and two Advanced Technology Systems (ATS), namely Sierra, a GPU-based system at LLNL, and Astra, an ARM-based system at SNL. These systems have the following characteristics:

- Sierra is LLNL's 125-petaflop ATS system. It has a Mellanox EDR InfiniBand interconnect and 4,320 compute nodes, each of which has two sockets containing 20-core POWER9 processors, 4 NVIDIA Volta V100 GPUs, and 256 GB of memory. Our scaling studies used up to 2,048 nodes, comprising half of the machine, and ran with 4 MPI ranks per node, with 1 GPU per rank.
- Astra is a 2.3-petaflop system deployed under the Sandia Vanguard program. It has a Mellanox EDR InfiniBand interconnect and is composed of 2,592 compute nodes, of which 2,520 are user accessible. Each node contains two sockets containing 28-core Cavium ThunderX2 64-bit Armv8 processors and 128 GB of memory. Our scaling studies used up to the full machine (2,496 nodes) and utilized RAJA's OpenMP policy, with 2 MPI ranks per node and 28 threads per rank.
- LLNL's CTS1 clusters are commodity systems with an Intel Omni-Path interconnect, dual 18-core Intel Xeon E5-2695 2.1 GHz processors, and 128 GB of memory per node. Our scaling studies used up to 256 nodes, with 36 MPI ranks per node, on a system with 1,302 compute nodes and a peak of 3.2 petaflops.
Since we are attempting to characterize and compare the performance of our code across different computer architectures, the unit of comparison is an entire compute node of a given architecture. Such studies, which we refer to as node-to-node scaling studies, differ from typical scaling studies, which measure baseline performance on a single processor. Node-to-node scaling studies allow a natural comparison of a code’s performance on different architectures, and their granularity matches how our users typically view their allocations.
Our study includes three types of comparisons: Strong scaling studies use a fixed global problem size and polynomial order and vary the number of compute nodes. Weak scaling studies use a fixed local problem size and polynomial order per compute node and vary the number of compute nodes. Throughput scaling studies use a fixed compute resource (e.g. 1 compute node) and vary the problem size and polynomial order.
Strong scaling demonstrates the ability of a code to solve a fixed problem faster as one adds compute resources, while weak scaling demonstrates the ability of a code to solve a proportionally larger problem in the same time as one adds more resources. In contrast, throughput scaling characterizes the machine performance and demonstrates the balance between parallelism and memory capacity; that is, it measures the work required to saturate memory bandwidth on the compute resources. Commodity systems have modest memory bandwidth and compute resources and can therefore accommodate only a relatively small amount of parallelism; they are easy to saturate, yielding lower overall performance. GPU-based systems such as Sierra, on the other hand, have very high memory bandwidth and compute resources; they require large amounts of parallel work to saturate, but can yield very high performance.
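The three study types reduce to simple ratios of measured time per cycle. A minimal sketch of the metrics (our illustrative helpers, not MARBL code; the weak-efficiency convention follows Section 5.2):

```cpp
// Strong scaling: fixed global problem on N nodes; ideal speedup equals N.
double strong_speedup(double t_one_node, double t_n_nodes) {
    return t_one_node / t_n_nodes;
}

// Weak scaling: fixed work per node; perfect scaling keeps time flat.
// Convention from Sec. 5.2: current time per cycle over the first run's.
double weak_efficiency(double t_first, double t_current) {
    return t_current / t_first;
}

// Throughput: units of work (e.g. DOFs) processed per second on a
// fixed compute resource.
double throughput(double dofs, double cycles, double seconds) {
    return dofs * cycles / seconds;
}
```

These definitions make the node-to-node comparisons above concrete: each metric is evaluated per compute node, not per core or per GPU.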
5.1 Strong Scaling
Our strong scaling study ran the 3D triple-point shock interaction (TriplePt) benchmark (see, for example, [dobrev2012high]) in ALE mode with TMOP, with a total of 112,896 quadratic elements and 7,225,344 quadrature points distributed among varying numbers of compute nodes. The study starts with one compute node and doubles the node count in successive runs. Under ideal strong scaling, the time per cycle is inversely proportional to the node count; specifically, each doubling of the node count halves the time per cycle.
Figure 4 depicts the code’s node-to-node strong scaling performance on Sierra (red), Astra (green) and CTS-1 (black). As the figure shows, MARBL exhibits very good strong scaling on the CPU-based architectures up to 128 CTS-1 and Astra nodes, beyond which communication costs begin to dominate the runtime. As expected, the code strong-scales less well at higher node counts on the GPU-based Sierra, where, as the node count increases, there is no longer sufficient work to feed each GPU. For this problem, strong scaling begins to level off beyond 32 nodes (128 GPUs).
5.2 Weak Scaling
Our weak scaling study was performed on the same TriplePt 3D problem as in Section 5.1, starting with two levels of uniform refinement for the first data point. Each subsequent run in the series maintained the same average number of quadrature points per rank by applying an extra level of uniform refinement, multiplying the global problem size by a factor of eight and increasing the node count by the same factor. As such, we have up to four runs in each weak scaling series, using 4, 32, 256 and 2,048 nodes. We note that 2,048 nodes is half of Sierra and 80% of Astra.
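The series construction above can be sketched with a trivial helper (ours, not from MARBL): one uniform refinement in 3D multiplies the element count by eight, so the node count must grow by the same factor to hold per-rank work fixed.

```cpp
#include <vector>

// Node counts for a 3D weak-scaling series: each run adds one uniform
// refinement (8x the elements), so nodes also scale by 8x per run.
std::vector<long> weak_scaling_nodes(long first_nodes, int num_runs) {
    std::vector<long> nodes;
    long n = first_nodes;
    for (int i = 0; i < num_runs; ++i) {
        nodes.push_back(n);
        n *= 8;
    }
    return nodes;
}
```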
Figure 5 depicts the code’s weak scaling performance on Sierra (red), Astra (green) and CTS-1 (black). Since each rank has the same local problem size, ideal weak scaling would maintain the same compute time per cycle as we scale out to more nodes. This is not generally achievable due to increased internode communication at larger node counts. In addition to plotting the compute time per cycle, we annotate each data point with the weak scaling efficiency, measured as the time per cycle of the current data point divided by that of the first data point in the series.
As can be seen in the figure, our code achieves good weak scaling performance as we scale out to the maximum number of nodes in the study. Comparing Sierra to the other platforms, we observe an 8–16× node-to-node speedup for the Sierra runs over those on the CPU-based platforms.
5.3 Throughput Scaling
Throughput studies compare the rate at which an application can process units of work at varying problem sizes. In this section we present throughput studies for linear, quadratic, and cubic elements by considering the total number of degrees of freedom for the kinematic variables (H1 basis functions). As can be seen in Figure 6, MARBL’s throughput is significantly higher on GPU-based platforms than on CPU-based ones. Further, throughput increases both with problem size and with increasing order. This is in agreement with results shown for finite element benchmark problems in [ceedms6] by the Center for Efficient Exascale Discretizations (CEED). Breaking the throughput out into the three large phases of MARBL’s ALE hydrodynamics reveals that both the Lagrange phase (Figure (a)) and the mesh optimization phase (Figure (b)) achieve even better throughput with increasing order. As can be seen in Figure 11, which is restricted to the Lagrange phase on Sierra, this trend continues for quartic runs. The remap phase (Figure (c)) sees a decrease in cubic throughput compared to quadratic, but that is expected in our current implementation, which requires full assembly. We are currently investigating a matrix-free remap implementation to improve throughput at cubic and higher orders.
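The unit of work counted in these studies can be made concrete with a back-of-the-envelope sketch (assuming a hypothetical n×n×n hex mesh, not MARBL's actual meshes): the H1 kinematic space on such a mesh at order p has (pn+1)³ nodes and three velocity components, so the DOF count grows rapidly with p, which is one reason higher orders supply more parallel work per node.

```cpp
#include <cstdint>

// H1 kinematic (velocity) DOFs on an n x n x n hex mesh with an
// order-p basis: (p*n + 1) nodes per direction, 3 vector components.
std::int64_t h1_kinematic_dofs(int n, int p) {
    std::int64_t d = static_cast<std::int64_t>(p) * n + 1;
    return 3 * d * d * d;
}
```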
6 Mini-App Driven Development
Because MARBL was originally developed as a CPU-only code based on full matrix assembly, the introduction of matrix-free methods required significant restructuring and rewriting of its major components. To devise a refactoring strategy for MARBL, we developed an MFEM-based mini-app, Laghos [laghos], which serves as a proxy application for MARBL’s Lagrangian phase. Laghos builds on the approach described in [dobrev2012high], modeling the principal computational kernels of explicit time-dependent shock-capturing compressible flow in a simplified setting, without the additional complication of physics-specific code. Laghos follows a simplified Lagrangian phase of MARBL consisting of four major computations: inversion of a global CG mass matrix; computation of physics- and material-specific quadrature point data; computation of artificial viscosity coefficients; and application of a force operator.
As an initial step for Laghos, we derived a code structure that separates the finite element assembly operations from the quadrature-point-based computations. Such separation is important in the context of matrix-free methods, as it allows the operator splitting discussed in Section 3.1. Based on the new code structure, Laghos was used to introduce the first implementation of MARBL’s partially assembled force and mass operators, which produced results equivalent to the fully assembled versions but with improved complexity; see Table 1. Once the matrix-free CPU algorithms were in place, support was added for hardware devices, such as GPUs, and programming models, such as CUDA, OCCA, RAJA and OpenMP, based on MFEM 4.0. Having these different options allowed us to perform detailed comparisons of performance and scaling (see, e.g., Figure 12) and to select the most appropriate options for MARBL based on the available hardware and the application’s preferences.
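The partial-assembly idea can be illustrated with a minimal 1D sketch (ours, not Laghos code): instead of storing the mass matrix M with entries M_ij = Σ_q B_qi w_q B_qj, the operator is applied as y = Bᵀ diag(w) B x, keeping only quadrature-point data in memory. The full-assembly path below exists purely for comparison.

```cpp
#include <cmath>

constexpr int NQ = 2;  // quadrature points
constexpr int ND = 2;  // basis functions (dofs)

struct MassOp {
    double B[NQ][ND];  // basis values at quadrature points
    double w[NQ];      // quadrature weights (absorbing Jacobian, density, ...)

    // Matrix-free action y = B^T diag(w) B x: no stored matrix.
    void apply(const double (&x)[ND], double (&y)[ND]) const {
        for (int i = 0; i < ND; ++i) y[i] = 0.0;
        for (int q = 0; q < NQ; ++q) {
            double xq = 0.0;  // interpolate x to quadrature point q
            for (int j = 0; j < ND; ++j) xq += B[q][j] * x[j];
            for (int i = 0; i < ND; ++i) y[i] += B[q][i] * w[q] * xq;
        }
    }

    // Full assembly, for comparison only.
    void assemble(double (&M)[ND][ND]) const {
        for (int i = 0; i < ND; ++i)
            for (int j = 0; j < ND; ++j) {
                M[i][j] = 0.0;
                for (int q = 0; q < NQ; ++q)
                    M[i][j] += B[q][i] * w[q] * B[q][j];
            }
    }
};

// Linear basis on [0,1] with 2-point Gauss quadrature (exact for the
// degree-2 integrand of the mass matrix).
MassOp linear_mass_on_unit_interval() {
    MassOp op;
    const double g = 1.0 / std::sqrt(3.0);
    const double xq[NQ] = {0.5 * (1.0 - g), 0.5 * (1.0 + g)};
    for (int q = 0; q < NQ; ++q) {
        op.B[q][0] = 1.0 - xq[q];
        op.B[q][1] = xq[q];
        op.w[q] = 0.5;
    }
    return op;
}
```

In higher dimensions and at higher orders, the same restructuring (with sum factorization) is what turns the stored-matrix cost into the improved complexity reported in Table 1.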
The work on the Laghos proxy app was of great benefit to MARBL, as it provided the foundational structure of the code and a baseline version of the major kernels. Consequently, Laghos served as a model for designing more complex kernels for MARBL. A key benefit that Laghos and MFEM provide to MARBL is the ability to drop in replacement kernels for standard finite element operations, for example, the action of the mass operator and the diffusion integrator kernels used in the artificial viscosity calculations [maldonado2020].
7 High-Performance Applications
We now consider two MARBL examples that illustrate the effectiveness of the algorithm and code improvements discussed above in the context of 3D high-order multi-material ALE simulations on GPUs. In each example, we exercise all three phases of ALE using partial assembly techniques on the GPU, including:


High-order Lagrange-phase multi-material hydrodynamics on a moving, unstructured, high-order (NURBS) mesh [dobrev2012high], including the use of GPU-accelerated third-party libraries for material equation of state (EOS) evaluation and material constitutive models

Nonlinear, material-adaptive, high-order mesh optimization using the TMOP method

High-order continuous (kinematic) and discontinuous (thermodynamic) Galerkin-based remap using flux-corrected transport (FCT)
7.1 Idealized ICF Implosion
Here we consider a 3D, two-material, idealized ICF implosion test problem based on the simplified ICF benchmark originally described in [GaleraMaireBreil10]. In this variant, we consider a quarter-symmetry 3D simulation domain using a NURBS mesh and apply a constant velocity drive to the outer radius of the problem (as described in [DobrevEllisKolevRieben12]). In addition, to generate vorticity during the implosion, we apply a single-mode cosine perturbation to the material interface radius, with the amplitude and wavelength as described in [maldonado2020]. This idealized test problem illustrates the impact of both the Rayleigh-Taylor and Richtmyer-Meshkov fluid instabilities in three dimensions at the interface between the two materials, as shown in Figure 13. The simulation was exercised on 256 GPUs (64 nodes, 4 NVIDIA V100s per node) of LLNL’s Lassen machine. To enhance the high-order mesh resolution near the unstable material interface, we employ the material-adaptive capabilities of the TMOP mesh optimization phase at the material interface with a 2:1 size ratio, as shown in Figure 14.
7.2 Shaped Charge
The BRL-81a shaped charge is a device which focuses the pressures of a high explosive onto a metal “liner” to form a hypervelocity jet, which can be used in many applications, including armor penetration, metal cutting and perforation of wells for the oil/gas industry [walters2008brief]. Modeling such a device requires multi-material compressible hydrodynamics with general equations of state for the various materials, high-explosive burn and elastoplasticity.
In Figure 15, we show results of a MARBL calculation of the BRL-81a with an off-axis high-explosive detonation point to illustrate 3D effects. We use the Jones-Wilkins-Lee (JWL) equation of state for the high explosive, tabular equations of state (provided by the LLNL LEOS library) for the remaining materials, and a combination of elastic-perfectly-plastic and Steinberg-Guinan strength models for the solid materials in the problem. All of the material constitutive models run on the GPU, either directly in MARBL or via a GPU-accelerated third-party library call. The material model libraries perform their operations on batches of element quadrature points in parallel on the GPU during the Lagrange phase.
This problem is a good stress test of the ALE hydrodynamics in MARBL, and running it at very high resolutions in 3D is important for studying the hydrodynamics of hypervelocity jet formation. There are still outstanding questions as to the cause of experimentally observed jet instabilities. The simulation ran on 96 GPUs (24 nodes) of the Lassen machine. To enhance the high-order mesh resolution near the hypervelocity copper jet, we employ the material-adaptive capabilities of the TMOP mesh optimization phase at the copper material with a 2:1 size ratio, as shown in Figure 16.
8 Conclusion
In this paper we presented a two-pronged approach, based on improved algorithms and abstractions, to refactoring MARBL, a next-generation code at LLNL, for performance and platform portability across commodity and advanced supercomputer architectures such as the GPU-based Sierra cluster at LLNL and the ARM-based Astra cluster at SNL. Recognizing changing trends in computing hardware, we began by replacing our matrix-based algorithms with matrix-free algorithms wherever possible. The matrix-free algorithms improved algorithmic complexity and reduced our memory footprint. To design a single-source code that can run on different platforms, we integrated the RAJA kernel abstraction layer, the Umpire memory manager, and MFEM’s accelerator capabilities. In our efforts we found that extensions and customization of the third-party libraries were crucial for developer productivity and application performance, as was having a proxy app with which to quickly iterate on different ideas.
To reach improved levels of performance on the MARBL code, we codesigned the RAJA Teams API. The RAJA Teams API provided a simplified interface for expressing and reasoning about hierarchical parallelism and portable access to GPU shared memory. Allocating and managing GPU device memory comes with increased complexity due to the high cost of memory allocations and limited GPU memory. By combining MFEM’s memory manager with Umpire, we were able to have MFEM objects quickly draw and release memory from memory pools. The use of memory pools enabled different physics algorithms of MARBL to share memory thereby reducing the total amount of permanent memory required across the whole simulation.
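The pooling pattern described above can be sketched with a toy fixed-size block pool (illustrative only; Umpire's pools are far more general and thread-aware): temporaries draw blocks from a free list and release them back, so repeated allocations reuse memory rather than paying the high cost of hitting the device allocator each time.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy block pool: allocate() prefers recycled blocks from the free
// list; only when the free list is empty does it grow the pool by
// acquiring new memory from the underlying allocator (here, new[]).
class BlockPool {
public:
    explicit BlockPool(std::size_t block_bytes) : bytes_(block_bytes) {}

    void* allocate() {
        if (!free_.empty()) {            // reuse a released block
            void* p = free_.back();
            free_.pop_back();
            return p;
        }
        owned_.emplace_back(new char[bytes_]);  // grow the pool
        return owned_.back().get();
    }

    void deallocate(void* p) { free_.push_back(p); }  // return to pool

    std::size_t blocks_owned() const { return owned_.size(); }

private:
    std::size_t bytes_;
    std::vector<std::unique_ptr<char[]>> owned_;  // pool-owned storage
    std::vector<void*> free_;                     // recyclable blocks
};
```

The same reuse is what lets different MARBL physics algorithms share temporary memory: each phase draws from, and returns to, the common pool instead of holding permanent allocations.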
We also demonstrated the code’s performance on several example problems and with large-scale node-to-node throughput, weak, and strong scaling studies across different supercomputer architectures. Our abstraction model, based on RAJA, Umpire and MFEM, enabled us to run simulations from the same codebase on GPU-based architectures (with MPI+CUDA), on CPU-based advanced architectures (with MPI+OpenMP), and on commodity platforms (with MPI alone).
In summary, by incorporating new algorithms and powerful abstraction layers we were able to achieve a high degree of portability and performance with the MARBL code.
Acknowledgments
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-829593. We would like to thank Veselin Dobrev for support in integrating MFEM’s GPU capabilities into MARBL, and the RAJA and Umpire teams for their feedback on software integration and extensions.