1. Introduction and Motivation
Both efficient algorithms and software performing well on modern computing systems are crucial to address current scientific and engineering problems. These tools are important for bridging the gap between theory and real-world data. Such problems often need to tackle field-scale data sets using parallel computers, parallel algorithms, and programming tools such as OpenMP [Dagum and Menon, 1998] and the Message Passing Interface (MPI) [Gropp et al., 1999a]
and cannot be solved on a standard laptop or desktop. For example, hydrologists and geophysicists need to work with field-scale reservoirs which could span tens of kilometers and evolve on time scales of hundreds of years. Moreover, such reservoir simulations involve complicated multiphase and multicomponent flows which require multiple complex equations to be solved accurately and efficiently. Atmospheric and climate modelers also require state-of-the-art techniques, as both data assimilation and parameter estimation need to be performed quickly in mesoscale and global-scale applications. The US Department of Energy has invested in the development of several portable and extensible scientific software packages like PETSc
[Balay et al., 2016, 2014] and PFLOTRAN [Lichtner et al., 2013] that can help address such important large-scale problems. The time spent developing parallel computational frameworks is amortized when application scientists employ the packages in their work. However, it is not always known whether the performance of a particular parallel computational framework or software will be satisfactory across a panoply of solvers and computing platforms. How can one really tell whether an algorithm is performing at its highest level? Is there room for improvement? Answering these questions in full is a Herculean task, but questions regarding the algorithmic and computational efficiency of scientific tools and libraries still need to be answered [Keyes et al., 2013]. Hence, we need performance models which enable us to synthesize performance data into an understandable framework. Performance models can include many metrics of importance such as total floating-point operations (FLOP), memory usage, inter/intra process/node communication, memory/cache bandwidth, and cache misses/hits. If not carefully optimized, some of the hardware resources can become unnecessary bottlenecks that result in costly and inefficient numerical simulations. Modern computer systems are quite complex, and their performance can be difficult to predict with good accuracy. Conducting large-scale simulations on state-of-the-art supercomputers may require hundreds to thousands of hours of compute time, so it is highly desirable to have a performance model that can predict how a particular parallel computational framework may perform. The application or domain scientist may use software that either is not made in house or is a “black-box” tool, and it would be too time consuming, or impossible if source code is unavailable, to dissect the code and analyze the design of the subroutines and data structures. It is therefore desirable to analyze these codes as a whole.
1.1. Review of previous works
We now briefly highlight some useful approaches and models one could take to analyze and perhaps improve the performance of any parallel computational framework. One of the simplest measures one can utilize is the STREAM memory-bandwidth benchmark [McCalpin, 1995]. This benchmark measures sustainable memory bandwidth on a single server and indicates the number of threads that saturates the memory bandwidth. Memory bandwidth is an important limitation to consider on modern machines [Wulf and McKee, 1995; McKee, 2004; Murphy, 2007].
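The essence of the STREAM Triad kernel can be sketched in a few lines. The following is a minimal NumPy illustration of our own (it is not part of the official benchmark), and its numbers will sit below those of the compiled, multithreaded STREAM code:

```python
import time

import numpy as np

def triad_bandwidth(n=20_000_000, s=3.0, repeats=5):
    """Estimate sustainable bandwidth (GB/s) for the triad a[i] = b[i] + s*c[i]."""
    b, c = np.random.rand(n), np.random.rand(n)
    a = np.empty_like(b)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.multiply(c, s, out=a)  # a = s*c, written in place to avoid temporaries
        np.add(a, b, out=a)       # a = b + s*c
        best = min(best, time.perf_counter() - t0)
    # STREAM's counting convention: three 8-byte transfers per element
    # (load b, load c, store a), even though the two-step update moves more.
    return 3 * 8 * n / best / 1e9

print(f"triad: {triad_bandwidth():.1f} GB/s")
```

Running this with increasing thread or process counts reveals the point at which the memory bandwidth saturates.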
The Roofline model [Williams et al., 2009; Lo et al., 2015]
captures peak achievable performance on a server, taking into account both CPU and memory-bandwidth capabilities, by introducing the Arithmetic Intensity (AI). The AI is simply the ratio of the total floating-point operations needed (total FLOP) to the Total Bytes Transferred (TBT). A higher AI indicates that the algorithm or computational framework is more computationally intensive and requires less bandwidth for a given amount of work. One is free to employ any cache model when determining the TBT metric for the roofline model. For example, scientists have developed a Sparse Matrix-Vector (SpMV) multiplication model
[Gropp et al., 1999b] which is based on a “perfect cache” (i.e., matrices and vectors are loaded and stored from memory exactly once). SpMV is an integral part of iterative solvers for solving PDEs. It has been shown in [Chang et al., 2017] that the SpMV “perfect cache” model can also be used to accurately predict and understand the hardware performance of optimization-based solvers for enforcing discrete maximum principles. In [May et al., 2014], the authors employ matrix-free iterative methods for the Stokes equations, which are needed for lithospheric dynamics applications. The authors manually count the TBT based on the source code. The advantage of matrix-free methods is that the sparse matrix, whose multiplication with a vector is memory-bandwidth limited, is never explicitly stored, thus bringing the computational framework's upper-bound limit of the roofline closer to the Theoretical Peak Performance (TPP) region. TBT can also be determined from memory-level traffic or cache misses. The same analysis can be carried out for manycore architectures, such as Nvidia GPUs and the Intel Xeon Phi “Knights Landing” (KNL), as in [Knepley and Terrel, 2013; Knepley et al., 2016]. For a more thorough analysis of performance, advanced software tools such as the HPCToolkit [Adhianto et al., 2010], OpenSpeedShop [Schulz et al., 2008], Scalasca [Geimer et al., 2010], and TAU [Shende and Malony, 2006] are used by scientific software developers and application scientists alike. These tools provide in-depth performance analyses of scientific codes and can also be used to debug the codes. Many of them also rely on PAPI [Mucci et al., 1999], which uses low-level hardware counters for important metrics like FLOPs, total CPU (central processing unit) cycles, and cache misses. These tools have proven to be extremely useful for computational scientists in all areas of computational physics and can provide a good understanding of the hardware performance of any computational framework for solving PDEs.
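To make the “perfect cache” SpMV model concrete, the following sketch estimates the AI of y = Ax for a CSR-stored matrix. The 8-byte value and 4-byte index sizes are our own assumptions for illustration, not prescribed by the cited model:

```python
def spmv_perfect_cache_ai(nrows, nnz, val_bytes=8, idx_bytes=4):
    """Arithmetic intensity of y = A*x (A square, CSR format) under a
    perfect-cache model: the matrix, x, and y each move through memory once."""
    flop = 2 * nnz                            # one multiply and one add per nonzero
    tbt = (nnz * (val_bytes + idx_bytes)      # matrix values and column indices
           + (nrows + 1) * idx_bytes          # row pointers
           + 2 * nrows * val_bytes)           # read x once, write y once
    return flop / tbt

# A 3D 7-point stencil has about 7 nonzeros per row; the resulting AI of
# roughly 0.13 FLOP/byte is far below what modern CPUs need to avoid being
# bandwidth bound.
print(spmv_perfect_cache_ai(10**6, 7 * 10**6))
```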
1.2. Main contributions
In this paper, we provide a simple and easy-to-use performance model that can be used in addition to the techniques and tools mentioned above. Our performance model, which we refer to as a performance spectrum, takes into account time-to-solution, AI based on cache misses, and equations solved per second. (We borrowed the terminology “performance spectrum” from Dr. Jed Brown, who used it in his presentation at the 2016 SIAM Parallel Processing conference held in Paris, France; conference website: http://www.siam.org/meetings/pp16/.) This model is applicable to any level of a scientific code, whether it be the entire computational framework or only particular phases or functions such as mesh generation, assembly of a matrix, or the solver step. It is important to note that this tool is not intended to replace any of the aforementioned performance tools or models but simply to augment one's ability to quickly understand and diagnose performance from the hardware, software, and algorithmic standpoints. The main contributions of this paper can be enumerated as follows:

We outline common issues pertaining to performance, ways to identify them, and methods to address them.

We present a model, called the performance spectrum, that provides an enhanced understanding of the performance and scalability of algorithms and software.

We demonstrate that the proposed model can be utilized on existing popular software packages and solvers.

We apply the model to a more complicated, nonlinear PDE and document the parallel performance of the computational framework across HPC machines.

We discuss some possible ways in which this performance spectrum model can be extended.
The rest of the paper is organized as follows. In Section 2, we outline some of the key performance issues one may come across when solving PDEs and how to address some of them. In Section 3, we propose a model, the performance spectrum, which captures three critical metrics useful for understanding performance and scalability. In Section 4, we demonstrate possible ways one could utilize the proposed model by systematically comparing commonly used finite element packages and solvers. In Section 5, we apply the model to the nonlinear hydrostatic ice sheet flow equations. In Section 6, we run the nonlinear hydrostatic ice sheet flow equations across multiple compute nodes and study the performance. Concluding remarks and possible extensions of this work are outlined in Section 7. All the notational conventions employed in this paper are introduced as needed.
2. Common Performance Issues
The performance of any scientific software or algorithm will depend on a myriad of factors. First and foremost, good performance depends on an efficient and practical implementation of the code. Application and domain scientists may not be interested in the intricate details of a code framework that they did not design, but they must still be cognizant of important computational issues that may inhibit performance dramatically. We now briefly highlight some common performance issues computational scientists may come across in their line of work:

Core/memory bindings: The simplest way to maximize parallel performance for MPI applications is to properly enforce MPI process and memory bindings. This is particularly important for memory-bandwidth-limited applications because, on most CPU architectures, the aggregate core bandwidth exceeds the CPU bandwidth to memory, and it is important to use the CPUs in a multi-CPU server in a balanced way. Furthermore, if multiple users share a compute node, performance metrics can vary greatly as both memory resources and certain levels of cache are shared by others. Appropriate mapping methodologies for binding ranks to cores are vital for complex hardware architectures as well as for complex topological node layouts. Consider the single dual-socket servers and their respective STREAM Triad benchmark results shown in Figure 1. Both the AMD and Intel processors possess two sockets where the physical cores are contiguously ordered. However, when the MPI processes are placed on alternating sockets, the achievable bandwidth is higher for a fixed number of cores because the memory systems on both CPUs are used. For multi-node performance, different binding techniques are required; memory references on a single node are several times faster than those on a remote node. Process allocation must be carefully done so that communication across networks is minimized.
Processor       | AMD Opteron 2354 | Intel Xeon E5-2680v2 | Intel Xeon E5-2695v2 | Intel Xeon E5-2698v3 | Intel Xeon Phi 7250
                | “Barcelona”      | “Ivybridge”          | “Ivybridge”          | “Haswell”            | “Knights Landing”
Clock rate      | 2.2 GHz          | 2.8 GHz              | 2.4 GHz              | 2.3 GHz              | 1.4 GHz
Year released   | 2008             | 2013                 | 2013                 | 2014                 | 2016
Sockets         | 2                | 2                    | 2                    | 2                    | 1
Cores/socket    | 4                | 10                   | 12                   | 16                   | 68
Threads/core    | 1                | 1                    | 2                    | 2                    | 4
L1 cache        | 8×64 KB, 2-way   | 20×32 KB, 8-way      | 24×64 KB, 8-way      | 32×64 KB, 8-way      | 68×64 KB, 8-way
L2 cache        | 8×512 KB, 16-way | 20×256 KB, 8-way     | 24×256 KB, 8-way     | 32×256 KB, 8-way     | 34×1 MB, 16-way
L3 cache        | 2×2 MB, 32-way   | 2×25 MB, 20-way      | 2×30 MB, 20-way      | 2×40 MB, 20-way      | –
Memory type     | DDR2 200 MHz     | DDR3 1600 MHz        | DDR3 1866 MHz        | DDR4 2133 MHz        | DDR4 2400 MHz, MCDRAM
Total memory    | 16 GB            | 64 GB                | 64 GB                | 128 GB               | 96 GB (DDR4), 16 GB (MCDRAM)
Memory channels | 4                | 8                    | 8                    | 8                    | 6 (DDR4), 8 (MCDRAM)
Compiler used   | GNU              | GNU                  | Cray                 | Cray                 | Cray
STREAM Triad    | 10.5 GB/s        | 64.5 GB/s            | 102 GB/s             | 116 GB/s             | 90 GB/s (DDR4), 480 GB/s (MCDRAM)
Table 1. Single node specifications for each of the HPC systems used in this study. Note that Intel's “Knights Landing” (KNL) processor has two different types of memory.
Hardware architecture: The performance of any benchmark or software depends on the hardware architecture. In this paper, we consider five different HPC systems with single node specifications listed in Table 1. It is evident from the STREAM Triad benchmark that different architectures have different levels of achievable memory bandwidth. Some of the processors are recent (as of the writing of this paper), like the Intel KNL processor, whereas others, like AMD's “Barcelona” and Intel's “Ivybridge” processors, are older. With the increasing complexity of processors and memory systems, achieving good performance of solvers and algorithms has become an area of active research. A computational framework may solve a PDE efficiently on a laptop or small cluster, but that does not mean it will perform efficiently on a supercomputer. Understanding basic computer architectural concepts such as pipelining, instruction-level parallelism, and cache policies may offer excellent guidelines on how to speed up computations by several orders of magnitude. For example, Intel's KNL processor has two 512-bit vector units per core and may need fine-grained parallelism to fully exploit the 68 cores per CPU. If a code is not properly vectorized to utilize the 136 vector units capable of 16 floating-point operations per cycle, or does not use the 16 GB of on-board MCDRAM, it is possible that the speedup on this system will not be fully realized and, worse yet, the code may be outperformed by processors that have faster cores. Also, languages such as Python, which are used in some sophisticated finite element simulation packages, depend on the file system I/O because the interpreter executes system calls to locate modules and may need to open hundreds of thousands of files before the actual computation can begin.
Designing and utilizing algorithms/languages/compilers that are compatible with recent state-of-the-art HPC architectures is paramount [Miller et al., 2013]; otherwise, the computational performance may be exceedingly poor.
Figure 2. Assembled sparse matrix with (a) default ordering and (b) optimized ordering, where red represents positive numbers, blue represents negative numbers, and cyan represents allocated but unused nonzero entries.
Domain decomposition: The global ordering and partitioning of the computational elements in a parallel computing environment, particularly for problems with unstructured grids, affect both spatial and temporal cache locality. Consider the assembled sparse matrices shown in Figure 2. If the nonzero data entries are not properly grouped together, the code will incur expensive cache misses and create little opportunity to use the data in a cache line and reuse data already in the cache. Consequently, this creates serial bottlenecks at the cache/memory levels. Several mesh/graph partitioners such as Chaco [Hendrickson and Leland, 1995], METIS/ParMETIS [Karypis and Kumar, 1999], and PT-SCOTCH [Chevalier and Pellegrini, 2008] are designed to optimize locality and balance the workload among MPI processes. Some graph partitioners use a simple model of communication in seeking to achieve load balance with minimum communication, while others use a more detailed communication model to better capture the minimum communication needed by using hypergraphs instead of regular graphs as the basis for partitioning. Understanding which type of partitioning to use for the PDE problem at hand (e.g., spectral partitioning, geometric partitioning, multilevel graph partitioning, etc.) can significantly reduce the amount of communication and lead to higher efficiency and a higher degree of concurrency.
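The effect of ordering on locality can be illustrated with a small SciPy sketch of our own (using the reverse Cuthill-McKee reordering rather than the graph partitioners named above): a 2D grid Laplacian is scrambled by a random permutation and then reordered, shrinking its matrix bandwidth back down so that nonzeros again fall near the diagonal:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def matrix_bandwidth(A):
    """Maximum |i - j| over the nonzero entries of A."""
    A = A.tocoo()
    return int(np.abs(A.row - A.col).max())

# 2D 5-point Laplacian on a 30x30 grid; its natural ordering has bandwidth 30
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

# scramble the unknown ordering, then recover locality with RCM
rng = np.random.default_rng(0)
p = rng.permutation(m * m)
B = A[p, :][:, p].tocsr()
q = reverse_cuthill_mckee(B, symmetric_mode=True)
C = B[q, :][:, q]

print(matrix_bandwidth(A), matrix_bandwidth(B), matrix_bandwidth(C))
```

The reordered bandwidth is far smaller than that of the scrambled matrix, which is exactly the grouping effect that reduces cache misses.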

Solver convergence: Arguably one of the most important performance factors to consider when solving PDEs is the convergence rate of the solver. Direct methods like Gaussian elimination [Grcar, 2011], as well as its sparse counterparts such as MUMPS [Amestoy et al., 2000] and SuperLU_DIST [Li and Demmel, 2003], can solve problems in parallel but may have huge memory requirements as the problem size is scaled up, due to fill-in during factorization. Scalable and efficient solvers typically rely on a novel combination of iterative solvers and preconditioners. The Krylov Subspace (KSP) and Scalable Nonlinear Equations Solvers (SNES) features in the PETSc library, coupled with robust preconditioners [Smith et al., 1996; Brandt and Livne, 2011], constitute a popular methodology for solving large and complex PDEs. Novel combinations and tuning of solver parameters provide powerful and robust frameworks that can accurately and quickly converge to a specified residual tolerance, even for complex coupled multiphysics problems [Castelletto et al., 2016; Brown et al., 2012; Brune et al., 2015]. Simple preconditioners such as Jacobi or Incomplete Lower-Upper factorization (ILU(0)) may be fast for smaller problems, but the computational cost will soar because the number of solver iterations needed with Jacobi or ILU(0) grows rapidly with problem size. Scaling up the problem under these choices of preconditioning will be extremely time consuming and may not even converge for larger or more complicated problems. Other more robust preconditioners, like the geometric and algebraic multigrid methods, might have a more expensive setup time for smaller problems but have been demonstrated to maintain relatively uniform convergence for larger problems, even those that are nonsymmetric or indefinite [Bramble et al., 1994; Adams et al., 2004].
These important performance issues should not be overlooked when analyzing the performance of a parallel computational framework. There are also several quick strategies for understanding and identifying bottlenecks on a specific HPC system. For example, it is well-known that SpMV is an operation that is sensitive to the memory bandwidth. These operations have very low AIs, which can manifest as a bottleneck at the memory level on a single node. A simple test one can perform is to run the SpMV operation in parallel; if it does not scale well on a single server in the strong sense, the memory bandwidth is clearly limiting the performance. One can confirm this by running some simple vector operations, like the vector sum and scalar multiplication, to see if they experience the same scaling issues. In addition, one can test the vector dot product operation in order to detect problems with the network interconnect or memory latency issues. The PETSc performance summary [Balay et al., 2016] provides comprehensive insight into the performance of many of these important operations, including load balancing. The summary also provides information on the functions consuming most of the time. However, not all scientific software has readily available performance summaries, so a performance model amenable to any code implementation is needed to help answer common performance questions.
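The quick SpMV and vector-operation tests described above can be scripted. Here is a serial NumPy/SciPy sketch (our stand-in for the parallel version described in the text) that reports effective bandwidth for an SpMV and a vector scaling, using the perfect-cache byte counts:

```python
import time

import numpy as np
import scipy.sparse as sp

def effective_gbs(bytes_moved, seconds):
    """Effective bandwidth in GB/s."""
    return bytes_moved / seconds / 1e9

n = 1_000_000
# pentadiagonal matrix as a stand-in for an assembled PDE operator
A = sp.diags([1.0, -2.0, 6.0, -2.0, 1.0], [-2, -1, 0, 1, 2],
             shape=(n, n), format="csr")
x = np.random.rand(n)
y = np.random.rand(n)

t0 = time.perf_counter(); A.dot(x); spmv_t = time.perf_counter() - t0
t0 = time.perf_counter(); y *= 2.0; scal_t = time.perf_counter() - t0

spmv_bytes = A.nnz * 12 + (n + 1) * 4 + 2 * n * 8  # perfect-cache CSR traffic
scal_bytes = 2 * n * 8                             # read y, write y
print(f"SpMV ~{effective_gbs(spmv_bytes, spmv_t):.1f} GB/s, "
      f"scale ~{effective_gbs(scal_bytes, scal_t):.1f} GB/s")
```

If both figures sit near the machine's STREAM number, and neither improves as more processes are added on the same node, the memory bandwidth is the limiter.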
3. Proposed Performance Spectrum
The general concept of the performance spectrum model is illustrated in Figure 3. This model is designed to simultaneously capture both the hardware/architectural exploitation and the algorithmic scalability of a particular parallel computational framework. First and foremost, we need the time-to-solution, since this is the metric of most importance to application scientists needing to execute large-scale simulations on state-of-the-art HPC systems. One may optionally document the total number of solver iterations needed for convergence. However, simply knowing the wall-clock time a computational framework needs to perform a task tells us little about its computational and algorithmic efficiency. In order to understand how fast (or slow) a simulation is, we need to introduce two more metrics.
Figure 3. Proposed performance spectrum that documents time, intensity, and rate. Intensity is defined as the arithmetic intensity (FLOP-to-TBT ratio) based on cache misses, and rate is defined as degrees-of-freedom solved per second.
3.1. Intensity
The second metric of interest is the intensity. Specifically, we focus on the AI. As described in [Williams et al., 2009], the AI of an algorithm or software is a measure that aids in estimating how efficiently the hardware resources and capabilities can be utilized. For the five machines listed in Table 1, it is well-known that the limiting factor of performance for many applications is the memory bandwidth. Thus, codes that have a high AI have a greater possibility of reusing data in cache and have lower memory-bandwidth demands. It should be noted, however, that performance depends on many factors, such as network latency and file system bandwidth, and the arithmetic intensity alone cannot be used to predict performance.
The general formula for the AI is defined as
(3.1)  $\mathrm{AI} = \dfrac{[\mathrm{Work}]}{[\mathrm{TBT}]}$
where [Work] is the total amount of computational effort, typically what one would refer to as FLOPs. The [TBT] metric is a measure of data movement between the core/CPU and memory. A cache model is needed in order to not only determine the TBT but also to understand what amount of useful bandwidth is sustained for a given cache line transfer. One can employ any cache model for this purpose, such as perfect cache, total number of load and store instructions at the core level, traffic at the memory level, or data cache misses. Different cache models are useful for interpreting different behavioral trends, and the choice of cache model depends on the application or research problem at hand. In this paper, we base [TBT] on the total number of cache misses and cache line size. The formula for obtaining the TBT for the L1, L2, and L3 cache levels is expressed as
(3.2)  $[\mathrm{TBT}]_{Lx} = (\text{cache line size in bytes}) \times (\text{number of } Lx \text{ cache misses})$
The simplest way to define [Work] is as the total number of floating-point operations, denoted FLOPs. Thus the AI based on Lx cache misses is formally written as
(3.3)  $\mathrm{AI}_{Lx} = \dfrac{[\mathrm{FLOPs}]}{(\text{cache line size in bytes}) \times (\text{number of } Lx \text{ cache misses})}$
If a solver or algorithm experiences a large number of cache misses at the last level, memory may impede performance.
Sometimes the exact TBT of a particular algorithm is not of interest. Instead, an application scientist may only care about a relative measure, i.e., whether the AI is higher or lower compared to another algorithm, a different implementation of the same algorithm, or a different processor. Thus, one may simply look at the ratio of FLOPs to cache misses. Equation (3.3) may be simplified to
(3.4)  $\mathrm{Intensity}_{Lx} = \dfrac{[\mathrm{FLOPs}]}{\text{number of } Lx \text{ cache misses}}$
Every machine listed in Table 1 has a cache line size of 64 bytes for all levels of cache. Different CPUs may have different line sizes, and hence a cache miss may imply different memory demands on different processor architectures. The remainder of the paper shall refer to the above formula for estimating the intensity metric.
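For concreteness, the intensity metric reduces to a one-line computation once FLOP and cache-miss counts are in hand (in practice those counts would come from hardware counters via a tool such as PAPI); a small sketch:

```python
def arithmetic_intensity(total_flop, cache_misses, line_size=64):
    """AI based on Lx cache misses and the cache line size, as in (3.3).
    Setting line_size=1 gives the relative form of (3.4)."""
    return total_flop / (line_size * cache_misses)

# e.g., 1e9 FLOP with 1e7 L3 misses on a 64-byte-line machine
print(arithmetic_intensity(1e9, 1e7))  # 1.5625 FLOP/byte
```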
Remark 1.
It should be noted that PAPI's methodology for counting FLOPs may be highly inaccurate for the “Ivybridge” systems listed in Table 1. The hardware counters only count the instructions issued and not the ones executed or retired. This is particularly problematic for iterative solvers that rely on SpMV operations because, as the codes spend time waiting for data to be available from memory, they will reissue the floating-point instructions multiple times. These reissues, coupled with the incomplete filling of a vector unit instruction, can lead to overcount factors of up to 10 times. For a more thorough discussion of the issue of overcounts, see [Weaver et al., 2013] and the references within. PAPI's FLOP counters are disabled on the “Haswell” and “KNL” processors due to the aforementioned issues.
Remark 2.
The AI can also be computed using other capable software tools and methodologies. For example, Intel's Software Development Emulator (SDE) can be used to obtain FLOP counts [Tal; Raman, 2015], and Intel® VTune™ can be used to obtain the TBT. This methodology is used in the Roofline model. Alternatively, one can approximate the FLOP count of a particular code by inserting counting mechanisms into the code. PETSc provides an interface and guidelines for manual FLOP counting, and thus FLOP counts for computational frameworks using it can be obtained through the performance summary output.
Remark 3.
The correlation between AI and speedup on a single node may not always hold true in a cluster sense (i.e., when scaling across communication networks). The mechanisms used for exchanging information between MPI processes are very different when the processes are on the same node as opposed to different nodes. An application scientist must be fully cognizant of not only the HPC processor specification but also the network topology as well as the interconnect bandwidth and latency.
3.2. Rate
Although the AI is useful for comparatively estimating the performance a particular parallel framework may attain, it does not necessarily aid in predicting the time-to-solution. Consequently, the AI can easily be “gamed” to appear high while the code consumes large amounts of wall-clock time. For example, small computationally intensive routines such as DGEMM [Dongarra et al., 1990] can be inserted to artificially inflate the computational demands and increase the AI. Other performance models, such as the Roofline model, would indicate that this is a favorable trend while ignoring the fact that more time is spent than necessary. This is also why the traditional FLOP rate metric, which can also easily be gamed, is not helpful either. Instead of measuring the total FLOPs executed per second, we measure the total degrees-of-freedom solved per second; hence the Rate metric needed to complete the performance spectrum is defined as
(3.5)  $\mathrm{Rate} = \dfrac{[\mathrm{DoF}]}{[\mathrm{Time}]}$
where $[\mathrm{DoF}]$ simply refers to the total number of degrees-of-freedom or discrete component-wise equations that need to be solved.
Definition 1 (Static-scaling).
Equation (3.5) is an integral component of what we refer to as static-scaling, where we increase the problem size but fix the concurrency. This is a complete reversal of the classical definition of strong-scaling, where we fix the problem size but increase the concurrency. Static-scaling plots time-to-solution versus the total degrees-of-freedom solved per second for a variety of problem sizes, so it also has characteristics similar to the classical definition of weak-scaling, where both problem size and concurrency are increased.
Figure 4 contains a pictorial description of a static-scaling plot and illustrates how to visually interpret the data points. A scalable algorithm is one for which the time-to-solution is linearly proportional to the problem size, so it is desirable to see a PDE solver maintain a constant rate metric for a wide range of problem sizes. The behavior of parallel computational frameworks for solving PDEs is not simple because 1) problems too small for a given MPI concurrency experience large communication-to-computation ratios (hence strong-scaling effects) and 2) large problems may have unfavorable memory accesses. The static-scaling plots are designed to capture both strong-scaling and weak-scaling characteristics and can give a good indication of the ideal range of problem sizes for a given MPI concurrency.
The tailing off to the right of the static-scaling plot has two potential causes. First, problem size affects how memory is allocated and accessed. Larger problem sizes may see an increase in memory contention as well as changes in the access pattern to main memory. Thus more time is spent waiting on data as opposed to performing calculations. However, the tailing off may also occur because solvers for complex PDEs or computational domains may not always be $\mathcal{O}(n)$. Suboptimal algorithmic convergence may maintain a consistent level of hardware utilization but require more iterations and FLOPs. To determine whether suboptimal algorithmic convergence plays a role in the deterioration of the static-scaling plot, equation (3.5) can be modified as
(3.6)  $\mathrm{Rate} = \dfrac{[\mathrm{DoF}] \times [\mathrm{Iterations}]}{[\mathrm{Time}]}$
This equation averages out increases in time due to an increase in iteration count. If a flat line is observed using this metric, then poor algorithmic scalability did have a negative impact on the static-scaling results.
Alternatively, if one is more interested in the performance gain for each MPI process, equation (3.5) can also be modified into
(3.7)  $\mathrm{Rate} = \dfrac{[\mathrm{DoF}]}{[\mathrm{Time}] \times [\text{number of MPI processes}]}$
This metric presents the average degrees-of-freedom solved per second by each MPI process or core utilized.
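Equations (3.5) through (3.7) are simple enough to collect into a few helper functions; the sketch below (with names and sample run data of our own choosing) shows how a static-scaling study would tabulate them:

```python
def rate(dofs, seconds):
    """Total degrees-of-freedom solved per second, as in (3.5)."""
    return dofs / seconds

def rate_per_iteration(dofs, iterations, seconds):
    """As in (3.6): multiplying by the iteration count averages out time
    growth caused by suboptimal algorithmic convergence."""
    return dofs * iterations / seconds

def rate_per_process(dofs, seconds, num_processes):
    """As in (3.7): average degrees-of-freedom solved per second per
    MPI process."""
    return dofs / (seconds * num_processes)

# a hypothetical static-scaling table: fixed concurrency, growing problem size
runs = [(1_000_000, 2.0, 12), (8_000_000, 17.0, 14), (64_000_000, 150.0, 18)]
for dofs, seconds, iters in runs:
    print(dofs, rate(dofs, seconds), rate_per_iteration(dofs, iters, seconds))
```

Plotting rate() against time-to-solution for each run gives the static-scaling plot; a flat rate_per_iteration() line would indicate that any tail-off in rate() is algorithmic rather than hardware related.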
3.3. Using the performance spectrum
The arithmetic intensity and static-scaling components of the spectrum offer a variety of strategies for interpreting the performance and scalability of any computational framework. Good performance is achieved when a computational framework attains a low time-to-solution, a high arithmetic intensity, and flat static-scaling lines. The theoretical peak rate of degrees-of-freedom solved per second could be unknown for a particular algorithm, but the intensity metric can help us understand whether the static-scaling lines are good by estimating how efficiently the framework uses the available hardware resources. We outline three possible ways one could use the performance spectrum model:

Hardware limitations: As mentioned in Section 2, the hardware configuration of the compute nodes plays a vital role in the performance spectrum because different systems have different core counts, frequencies, and memory architectures. Understanding how PDE solvers behave on different systems is vital for disseminating software to the scientific and HPC communities. The different cache sizes listed in Table 1 will be reflected in the intensity metric: the AI is likely to differ on different processors due to possible differences in cache sizes and cache policies. Furthermore, different processors have different clock frequencies, arithmetic capabilities, and memory bandwidths. Moreover, various GNU, Intel, and Cray compilers generate different executables that also depend on the optimization flags used. Compiled code also depends on the data structures used as well as the code constructs. A particular platform may be better suited for certain PDE applications. The performance spectrum model is useful for quickly visualizing and comparing the impact of platform characteristics, software, compiler options, and algorithms.

Software/solver implementation: There are several software packages suited for sophisticated finite element simulations, such as the C++ based DEAL.II package [Bangerth et al., 2007], the Python based Firedrake Project [Rathgeber et al., 2016], the Python/C++ based FEniCS Project [Alnæs et al., 2015], and the C++ based LibMesh [Kirk et al., 2006] and MOOSE [Gaston et al., 2009] projects. These scientific libraries all use PETSc's linear algebra backend, but they can also use other packages such as HYPRE [Falgout, 2006] and Trilinos/ML [Heroux et al., 2005]. How well can specific solvers or software packages solve the same boundary value problem? Algebraic multigrid solvers have various theoretical approaches and implementation strategies, so it is entirely possible that certain solver configurations are better suited for a particular hardware architecture or PDE. Multigrid solvers for optimization problems remain a difficult research topic but will be imperative for sustaining a high level of computational performance. Quick visual representations of the AI and equations solved per second can certainly guide programmers and scientists in the right direction when designing or implementing different software and solvers.

Numerical discretization: Finally, various flavors of numerical discretizations, such as the finite difference, finite element, and finite volume methods, not only have different orders of mathematical accuracy but also different numbers of discrete equations to solve for a given mesh. Consider the Continuous Galerkin (CG) and Discontinuous Galerkin (DG) finite element methods: clearly the DG method has more degrees-of-freedom, since each element has its own copy of a geometric node, but does that necessarily mean it is more time consuming? For example, if the CG and DG elements each take roughly the same time to attain a solution for the same computational domain, then the latter element clearly has a higher rate metric because it has more degrees-of-freedom for a given mesh, hence a bigger numerator in equation (3.5). This is important for computational scientists and mathematicians who want to compare the convergence rates of various numerical methods, particularly if refinement studies are involved. A cost-benefit analysis can be performed when comparing numerical accuracy versus computational cost, often quantified using a work-precision diagram [Knepley and Bardhan, 2015]. One could also compare the impact finite element discretizations have on different geometric elements (e.g., tetrahedra, hexahedra, wedges, etc.). The performance of any numerical method depends on the hardware limitations and software implementations, but this spectrum can be useful for comparing the different available discretizations and polynomial orders in sophisticated finite element simulation packages.
4. Demonstration of the Performance Spectrum
As proof-of-concept, we apply the proposed performance spectrum to study the computational performance of a couple of popular finite element packages when used to solve the steady-state diffusion equation. A series of demonstrations shall enrich our current understanding of how hardware limitations, software implementation, numerical discretization, and material properties can impact performance and scalability. We restrict our studies to the C++ implementation of the FEniCS Project and the Python implementation of the Firedrake Project, both of which leverage several scientific libraries and solvers such as PETSc, HYPRE, and Trilinos/ML. The GMRES iterative solver is used with various algebraic multigrid preconditioners and a fixed relative convergence tolerance. All numerical simulations are performed on a single AMD Opteron 2354 and a single Intel Xeon E5-2680v2 node as described in Table 1. In this section, the performance spectrum model is used only to assess the assembly and solve steps.
The steady-state diffusion equation gives rise to a second-order elliptic partial differential equation. To this end, let Ω denote the computational domain, and let ∂Ω denote its boundary. A spatial point is denoted by x. The unit outward normal to the boundary is denoted by n(x). The boundary is divided into two parts: Γ^D and Γ^N. The part of the boundary on which Dirichlet boundary conditions are prescribed is denoted by Γ^D, and the part of the boundary on which Neumann boundary conditions are prescribed is denoted by Γ^N. For mathematical well-posedness we assume that
(4.1) Γ^D ∪ Γ^N = ∂Ω and Γ^D ∩ Γ^N = ∅.
The corresponding boundary value problem takes the following form:
(4.2a) −∇·(D(x) ∇c(x)) = f(x) in Ω
(4.2b) c(x) = c^p(x) on Γ^D
(4.2c) −n(x)·D(x) ∇c(x) = q^p(x) on Γ^N


Table 2. Mesh information and error norms for the CG1 element.

size    Tetrahedra    Vertices     FEniCS error   Firedrake error
1/20    48,000        9,261        1.48E-02       2.96E-02
1/40    384,000       68,921       3.90E-03       7.77E-03
1/60    1,296,000     226,981      1.75E-03       3.51E-03
1/80    3,072,000     531,441      9.89E-04       1.99E-03
1/100   6,000,000     1,030,301    6.34E-04       1.28E-03
1/120   10,368,000    1,771,561    4.41E-04       8.88E-04
1/140   16,464,000    2,803,221    3.24E-04       6.52E-04
                      slope:       1.97           1.96
where c(x) is the scalar concentration field, D(x) is the diffusivity coefficient, f(x) is the volumetric source, c^p(x) is the prescribed concentration on the boundary Γ^D, and q^p(x) is the prescribed flux on the boundary Γ^N. Assuming homogeneous and isotropic diffusivity, we consider the following analytical solution and corresponding forcing function on a unit cube:
(4.3)  
(4.4) 
Homogeneous Dirichlet boundary conditions are applied on all faces, and the analytical solution is presented in Figure 5. The next few studies consider the following mesh sizes on a structured tetrahedral mesh: 1/20, 1/40, 1/60, 1/80, 1/100, 1/120, and 1/140. All mesh information and error norms for the FEniCS and Firedrake implementations of the continuous Galerkin (CG1) element are listed in Table 2.
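The convergence slopes reported at the bottom of Table 2 can be checked with a least-squares fit of log(error) against log(h); the error values below are transcribed directly from the table.

```python
import math

# Least-squares slope of log(error) versus log(h): the convergence
# rate reported in the last row of Table 2.  CG1 elements are expected
# to converge at second order for this smooth problem.
def convergence_slope(h, err):
    x = [math.log(v) for v in h]
    y = [math.log(v) for v in err]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

h = [1/20, 1/40, 1/60, 1/80, 1/100, 1/120, 1/140]
fenics = [1.48e-2, 3.90e-3, 1.75e-3, 9.89e-4, 6.34e-4, 4.41e-4, 3.24e-4]
firedrake = [2.96e-2, 7.77e-3, 3.51e-3, 1.99e-3, 1.28e-3, 8.88e-4, 6.52e-4]

print(round(convergence_slope(h, fenics), 2))     # 1.97
print(round(convergence_slope(h, firedrake), 2))  # 1.96
```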
4.1. Demo #1: AMD Opteron 2354 vs Intel Xeon E5-2680v2
We first compare the AI between a single Intel Xeon E5-2680v2 and a single AMD Opteron 2354 compute node for FEniCS's implementation of the CG1 element coupled with PETSc's algebraic multigrid preconditioner. The AI, as seen from Figure 6, gradually decreases with mesh refinement. Moreover, increasing the number of MPI processes also reduces the AI. The Intel processor has smaller L1 and L2 caches than the AMD processor, which explains why the former has lower AIs. It can be concluded that a higher AI on a different machine does not necessarily translate into better performance because clock rates and memory bandwidths differ. The fact that differences in AI do not directly relate to time-to-solution can also be seen in Figure 6.
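The observation that a higher AI need not mean a shorter time-to-solution can be made concrete with a roofline-style bound: attainable FLOP rate is capped by min(peak, AI × bandwidth). Note that this paper's AI counts FLOPs per cache miss rather than per byte, but the qualitative argument is the same. The two machines below are purely illustrative, not measurements of the Opteron or Xeon nodes.

```python
# Roofline-style bound on attainable FLOP rate: performance is capped
# by min(peak FLOP rate, AI * memory bandwidth).  The two "machines"
# below are illustrative, NOT measured values for the AMD/Intel nodes.

def attainable_gflops(ai, peak_gflops, bw_gbs):
    """AI in flops/byte, peak in GFLOP/s, bandwidth in GB/s."""
    return min(peak_gflops, ai * bw_gbs)

# Machine A: the code achieves a lower AI, but the node has high
# bandwidth and clock rate.  Machine B: higher AI, slower node.
perf_a = attainable_gflops(ai=0.20, peak_gflops=200.0, bw_gbs=100.0)
perf_b = attainable_gflops(ai=0.35, peak_gflops=75.0, bw_gbs=25.0)

print(perf_a, perf_b)  # 20.0 8.75 -- lower AI, yet the faster machine
```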
The static-scaling plot is shown in Figure 7. It is clear that the Intel processor is capable of solving more degrees-of-freedom per second than the AMD processor. Increasing the number of MPI processes improves the Rate metric, which is expected since time-to-solution is amortized. Employing the per-process Rate metric from equation (3.7), as seen in Figure 8, gives us better insight into the effect that adding more MPI processes to a single node has on static-scaling performance. We also note that when only one or two MPI processes are used, the degrees-of-freedom solved per second metric degrades as the problem size increases. We also observe that the curves for Intel reach higher apexes as more MPI processes are used and larger problems are solved. The curves "dipping" to the left indicate a degradation in parallel performance – those problems are very small (e.g., a size of 1/20 results in 9,261 degrees-of-freedom, so distribution among 16 MPI processes means each process solves roughly only 580 equations), so more of the execution time is spent on inter-process communication and latencies than on actual computation. Both rate metrics decrease with problem size on the AMD node, whereas the curves for the Intel node are relatively flat, suggesting that the FEniCS and PETSc combination is in fact algorithmically scalable for the problem at hand.
Figure 9 depicts the parallel efficiency on the two different nodes. The parallel performance for the smaller sizes is significantly worse due to the lack of computation available for a given MPI concurrency. It is interesting to note that the speedup on the AMD node is slightly greater than on the Intel node. Recalling from Figure 6 that the AI on the AMD node is larger, we can infer that a higher AI indicates a stronger likelihood of greater parallel speedup. This behavior is consistent with the strong-scaling results of the optimization-based solvers for the Chromium remediation problem in [Chang et al., 2017], where similar classes of AMD and Intel processors were used.
Remark 4.
We note here that we expect speedup on a single node using MPI processes to be comparable to that achievable by threaded implementations such as OpenMP or Intel TBB, since there are negligible system differences between threads and processes [Knepley et al., 2015]. This kind of behavior has been observed in careful studies of combined MPI+OpenMP finite element simulations [Turcksin et al., 2016].
4.2. Demo #2: FEniCS vs Firedrake
Next, we compare the FEniCS and Firedrake implementations of the CG1 element with 16 MPI processes on a single Intel Xeon E5-2680v2 node. The same steady-state diffusion equation is considered, but we now investigate how other multigrid solver packages like HYPRE and ML affect the performance.
The AIs in Figure 10 clearly depend on the software implementation, the solver used, and the problem size. The results in this figure suggest that the FEniCS and Firedrake packages have very similar implementations of the PETSc and HYPRE multigrid solvers. However, the AI for FEniCS's implementation of the ML solver deteriorates rapidly with problem size. Similar behavior is observed in the static-scaling plot of Figure 11, where the data points with the highest AI also have the highest rate at which equations are solved. Unlike the previous demonstration, where different hardware was compared, the AI and rate metrics here are strongly correlated with each other, and it is clear that FEniCS's current implementation of ML has some issues since its tailing off towards the right occurs before that of either the PETSc or HYPRE lines.
With these two graphs in mind, one may wonder why the tailing off occurs. Does it occur due to suboptimal algorithmic convergence (i.e., the iteration count increases with problem size), or do the data structures needed for important solver steps begin to fall out of cache? A scalable algorithm suggests that the number of solver iterations should not increase by much as the problem size increases, so if the GMRES iteration count grows significantly, it is possible that the rate metric will decrease. Figure 12 shows the number of GMRES iterations needed for every finite element package and solver, and it can be seen that the iteration counts do not increase by much. These plots must be interpreted carefully because, although the iteration plots may suggest algorithmic scalability, the degradation in AI with respect to problem size suggests that the current software and solver parameters are not configured to utilize the hardware efficiently. As shown in the previous demonstration, the AI is useful for predicting which algorithms will see greater speedups as the number of MPI processes is increased. Figure 13 compares the AI and parallel performance of Firedrake's three solver implementations. Regardless of which level of cache is used to determine the AI, HYPRE and ML have the lowest and highest AIs, respectively. Moreover, HYPRE and ML have the worst and best parallel speedups, respectively, which again supports the observation that the AI metric is useful for predicting which algorithms may achieve the greatest parallel speedup.
We note that the HYPRE solver has relatively poor performance, suggesting that the out-of-the-box parameters are unsuited to the problem at hand. One of the best ways to improve the AI and Rate metrics is simply to adjust some of the solver parameters. If, for example, we increase the strong threshold that governs coarsening, the performance improves dramatically, as Figure 14 shows. The AI and Rate metrics are now competitive with Firedrake's implementations of the PETSc and ML solvers, but it is important to realize that the GMRES iteration counts increased with problem size. Conversely, an algorithm whose iteration count is low and remains constant as the problem size increases does not necessarily have good performance and scalability. Neither the AI nor the rate metric tails off towards the right, suggesting that the optimized HYPRE solver is scalable despite some minor growth in the GMRES iteration count. As discussed in the previous demonstration, answers regarding the performance and scalability of various solvers and software will also depend on the hardware.
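As a sketch of how such a parameter adjustment might look when driving BoomerAMG through PETSc options (e.g., via Firedrake's solver_parameters dictionary): the option names below are real PETSc options, but the threshold value 0.75 is illustrative, since the exact value used in this study is not stated here. BoomerAMG's default strong threshold is 0.25, and larger values, which coarsen more aggressively, are often recommended for 3D problems.

```python
# PETSc-style solver options for GMRES + BoomerAMG, as they might be
# passed to a solve(..., solver_parameters=...) call in Firedrake.
# The strong threshold of 0.75 is illustrative (BoomerAMG defaults
# to 0.25); raising it coarsens more aggressively, typically lowering
# the cost per iteration in 3D at the price of a few extra iterations.
baseline = {
    "ksp_type": "gmres",
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
}

tuned = dict(baseline)
tuned["pc_hypre_boomeramg_strong_threshold"] = 0.75

print(tuned["pc_hypre_boomeramg_strong_threshold"])  # 0.75
```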
4.3. Demo #3: Continuous Galerkin vs Discontinuous Galerkin
Table 3. Total degrees-of-freedom for Firedrake's finite element discretizations.

size    CG1        CG2        DG1          DG2
1/20    9,261      68,921     192,000      480,000
1/40    68,921     531,441    1,536,000    3,840,000
1/60    226,981    1,771,561  5,184,000    12,960,000
1/80    531,441    4,173,281  12,288,000   30,720,000
1/100   1,030,301  8,120,601  —            —

Table 4. Error norms for Firedrake's finite element discretizations.

size    CG1       CG2       DG1       DG2
1/20    2.96E-02  3.81E-04  1.65E-02  2.16E-04
1/40    7.77E-03  3.79E-05  4.35E-03  2.26E-05
1/60    3.51E-03  1.06E-05  1.97E-03  6.47E-06
1/80    1.99E-03  4.44E-06  1.12E-03  2.72E-06
1/100   1.28E-03  2.25E-06  —         —
slope:  1.95      3.19      1.94      3.16
So far we have only considered the CG1 finite element. What happens if we employ another discretization such as the Discontinuous Galerkin (DG) method? Moreover, what happens if we increase the polynomial order and employ second-order CG (CG2) and second-order DG (DG2) elements? Various families of elements and their respective levels of refinement change both the size and the numerical accuracy of the solution, so it is desirable to understand both the costs and the benefits of these approaches on a particular mesh. Tables 3 and 4 contain the total degrees-of-freedom and error norms, respectively, of Firedrake's various finite element discretizations. The CG elements are studied up to size = 1/100 whereas the DG elements are studied up to size = 1/80. We again employ 16 MPI processes on a single Intel Xeon E5-2680v2 node, and all finite element discretizations in this demonstration are solved with optimized (i.e., increased strong threshold) HYPRE parameters.
Figure 15 contains the AI for the CG1, CG2, DG1, and DG2 elements. What we learn from these results is that increasing the polynomial order lowers the AI for the CG elements whereas it increases the AI for the DG elements. This may not always be the case, because different solvers and hardware architectures may be better tailored to different discretizations. Other finite element packages like the FEniCS or DEAL.II projects may give very different results. The Rate metric, as seen in Figure 16, depicts the rate at which each discretization solves its equations. Alternatively, one can compare the Rate metric with respect to the degrees-of-freedom, as seen in Figure 17. Although DG elements have more degrees-of-freedom for a given mesh discretization, the DG1 element has the highest Rate metric, suggesting that the optimized HYPRE solver parameters are especially suitable for DG1 elements. Unlike the FEniCS and ML combination from the previous demonstration, the DG2 discretization experiences significant degradation in the static-scaling plot yet maintains relatively consistent AIs. This raises the question of whether the tailing off towards the right is due to memory effects or suboptimal algorithmic convergence.
As previously observed from Figure 14, the optimized HYPRE parameters resulted in a slight increase in the GMRES iteration count for CG1 elements, and we notice similar trends for the other finite elements in Figure 18. If the iteration count increase is significant enough, it could negatively affect static-scaling. To determine whether this solver iteration growth stymied the rate at which equations are solved, we can employ the iteration-normalized Rate (i.e., degrees-of-freedom solved per second per solver iteration) from equation (3.6), as shown in Figure 19. In this particular demonstration, it makes no difference: we still observe degradation with respect to problem size, suggesting that memory bandwidth and cache behavior have an adverse effect on the simulation. Using more compute nodes may well ameliorate both the AI and rate metrics for the DG2 element, but it should again be cautioned that comparative studies of the performance of numerical methods and solvers depend strongly on both the code implementation and the nature of the computing platform.
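Written out, the two static-scaling quantities used in this section are the raw rate (degrees-of-freedom solved per second) and its iteration-normalized variant, which factors out growth in the Krylov iteration count. The run below is hypothetical, purely to show the arithmetic.

```python
# Static-scaling rate metrics: Rate is degrees-of-freedom solved per
# second; the iteration-normalized variant divides by the Krylov
# iteration count so that growth in solver iterations is factored
# out.  The numbers below are illustrative, not measured data.

def rate(dofs, seconds):
    return dofs / seconds

def rate_per_iteration(dofs, seconds, iterations):
    return dofs / (seconds * iterations)

# A hypothetical run: 1.536M unknowns solved in 4 s with 20 GMRES
# iterations.
print(rate(1_536_000, 4.0))                      # 384000.0 eq/s
print(rate_per_iteration(1_536_000, 4.0, 20))    # 19200.0 eq/s/iter
```

Comparing the two across a refinement study separates the two failure modes discussed above: if the normalized metric stays flat while the raw rate falls, iteration growth is to blame; if both fall, memory effects are implicated.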
4.4. Demo #4: Material properties
Table 5. Error norms using Firedrake's CG1 elements for various values of the heterogeneity/anisotropy constant in equation (4.5) (columns from left to right correspond to increasing values of the constant).

size    (increasing heterogeneity/anisotropy →)
1/20    1.48E-02  3.45E-02  4.83E-02  5.71E-02  5.86E-02
1/40    3.90E-03  9.31E-03  1.46E-02  1.90E-02  1.99E-02
1/60    1.75E-03  4.23E-03  6.84E-03  9.26E-03  9.76E-03
1/80    9.89E-04  2.40E-03  3.94E-03  5.43E-03  5.75E-03
1/100   6.34E-04  1.55E-03  2.55E-03  3.55E-03  3.77E-03
1/120   4.41E-04  1.07E-03  1.78E-03  2.49E-03  2.64E-03
1/140   3.24E-04  7.88E-04  1.31E-03  1.84E-03  1.96E-03
slope:  1.97      1.94      1.86      1.77      1.75
So far all three of our demonstrations have been conducted in a homogeneous domain. However, many scientific problems are heterogeneous in nature, which may complicate the physics of the governing equations and make them more expensive to solve numerically. In [Chang and Nakshatrala, 2017], it was shown that solving heterogeneous problems like chaotic flow resulted in suboptimal algorithmic convergence (i.e., the iteration counts grew with refinement), so our goal is to demonstrate how physical properties such as heterogeneity and anisotropy may skew how we interpret the performance. Let us now assume a heterogeneous and anisotropic diffusivity tensor that can be expressed as follows
(4.5) 
where the user-defined constant controls the level of heterogeneity and anisotropy present in the computational domain. By employing the same analytical solution as in equation (4.3), the various values of this constant give rise to new forcing functions. The error norms obtained using Firedrake's CG1 elements are shown in Table 5. Again, a single Intel Xeon E5-2680v2 compute node with 16 MPI processes is used for this study, and PETSc's multigrid solver is used to solve these problems.
Figure 20 depicts the AI, Rate, solver iterations, and iteration-normalized Rate metrics. The AI is unaffected by the heterogeneity constant, which suggests that there are no hardware or software implementation issues; only the Rate metric tails off as the constant is increased. We see that while the iteration growth is significant, the iteration-normalized Rate metric is still flat for this heterogeneous and anisotropic steady-state diffusion problem. Thus, the primary reason that the data points in the static-scaling plots decrease with problem size is the growth in solver iterations, and it has little to do with memory contention.
5. Case Study Part 1: Single Node
The previous section, which focused entirely on the steady-state diffusion equation, covered the basic ways one can utilize the proposed performance spectrum model to help justify, interpret, or diagnose the computational performance of an algorithm, numerical method, or solver on a particular compute node. In the next two sections, we demonstrate that the performance spectrum model is also useful for more complicated and nonlinear PDEs. We consider PETSc's toy hydrostatic ice sheet flow example, based on the work of [Brown et al., 2013], with geometric multigrid, and apply the performance spectrum to gain a better understanding of how certain HPC platforms scale.
5.1. Hydrostatic ice sheet flow equations
Consider a computational ice domain lying between a Lipschitz continuous bed and surface. The hydrostatic equations are obtained from the non-Newtonian Stokes equations under the assumption that the horizontal derivatives of the vertical velocity are small and negligible. Denoting the horizontal components of the velocity field by u = (u, v), where u and v are parallel to the x and y axes respectively, the governing equations for hydrostatic ice sheet flow are given by
(5.1a)  
(5.1b) 
where the nonlinear effective viscosity is expressed by
(5.2) 
where ice sheet models typically take . The hardness parameter is denoted by , the regularizing strain rate is defined by , and the second invariant is expressed by
(5.3) 
More information on the theoretical derivation of the above equations can be found in [Schoof, 2006; Schoof and Hindmarsh, 2010]. Equation (5.1) is subject to natural boundary conditions at the free surface and either no-slip or power-law slip conditions with friction parameter
(5.4) 
where the regularizing velocity, the “low-speed” reference friction, and the slip exponent are model parameters.
5.2. Problem setup
The hydrostatic ice sheet flow equation is discretized using hexahedron Q1 finite elements on a structured grid and Figure 21 contains the corresponding solution. Details concerning the theoretical derivation of the variational formulation as well as the parameters used for the boundary value problem can be found in [Brown et al., 2013]. Since this example problem is written entirely with PETSc routines and function calls, the [FLOPs] metric in equation (3.4) is determined using PETSc’s manual FLOP counts instead of hardware FLOP counters. This is particularly useful if a thorough comparative study on PETSc’s eclectic suite of linear algebra solvers for a particular PDE were to be conducted.
For this problem, we begin with an initial coarse grid and successively refine it until we reach the desired problem size and numerical accuracy. The “fine grids” produced from this element-wise refinement are solved using the geometric multigrid technique, whereas the initial coarse grid is solved using algebraic multigrid. The assembled Jacobian employs PETSc's block AIJ format (a blocked variant of the compressed sparse row format), where the horizontal velocity components are grouped per grid node. Since ice flow is tightly coupled in the vertical direction, the parallel domain decomposition is specially set up so that grid points in the vertical direction are never distributed and are always contiguous in memory. The initial grid size must be chosen carefully because the mesh partition happens at the coarsest level and may cause load-balancing issues if the initial grid is not large enough.
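The degree-of-freedom counts produced by this refinement scheme (reported in Table 6 below) are consistent with a simple rule. The sketch assumes two horizontal velocity components per grid node, horizontal refinement M → 2M, and vertical refinement P → 2P − 1; these are assumptions inferred from the reported counts, not statements taken from the PETSc example itself.

```python
# Degrees-of-freedom after r refinements of an M x M x P coarse grid
# for the hydrostatic ice sheet problem.  Assumed rule (inferred from
# the reported counts): two horizontal velocity components per node,
# horizontal directions refine as M -> 2M, vertical as P -> 2P - 1.

def hydrostatic_dofs(m, p, levels):
    for _ in range(levels):
        m, p = 2 * m, 2 * p - 1
    return 2 * m * m * p

for r in range(4):
    print(r, hydrostatic_dofs(40, 5, r))
# 0 16000
# 1 115200
# 2 870400
# 3 6758400
```

The same rule also reproduces the 21,495,808 degrees-of-freedom quoted in Section 6 for a 64×64×6 coarse grid after three refinements.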
5.3. Results
Table 6. Problem size and solver iterations for each level of refinement of the 40×40×5 coarse grid.

Levels of refinement   Degrees-of-freedom   SNES iterations   KSP iterations
0                      16,000               7                 39
1                      115,200              8                 45
2                      870,400              8                 44
3                      6,758,400            8                 44
First, we provide an initial 40×40×5 coarse grid and successively refine it up to three times. All five processors from Table 1 are studied, and the KNL processor is configured to use MCDRAM in flat mode. Each node has a different number of available cores, so in order to maintain relatively consistent mesh partitioning, the Ivybridge and KNL processors utilize only 16 and 64 cores, respectively. Table 6 presents the problem size as well as the number of total SNES and KSP iterations needed for each level of refinement. Figure 22 depicts the AI metrics with respect to the overall time-to-solution. Each data point has the same coarse grid size but a different level of grid refinement, ranging from 0 to 3. The Ivybridge and Haswell processors have similar AIs, which are significantly smaller than those of their KNL and AMD counterparts. It should be noted that GNU compilers were used on the AMD Opteron 2354 and Intel Xeon E5-2680v2 processors whereas the other three processors used Cray compilers, which could explain why the AIs between the two Ivybridge processors are slightly different. As with the hardware counter examples in the last section, the AIs are initially small for the coarser problems but eventually stabilize once a problem is sufficiently large. It is interesting to note that the AMD curve is consistently flat across all data points, suggesting that even the smaller problem sizes selected for this example have already approached the AMD processor's achievable peak of the computational/memory resource. On the other hand, the KNL's wider vector instruction sets and caches are not fully utilized for smaller problems, resulting in low AIs.
The static-scaling plot on each of these compute nodes is shown in Figure 23, and Figure 24 depicts static-scaling based on the per-process Rate from equation (3.7). Unsurprisingly, the AMD processor is outperformed by all of the Intel processors. Haswell and KNL have the best performance of all the systems studied, but we again notice that the KNL processor has poor metrics when the grid is small. Furthermore, KNL's performance per core is considerably lower, as seen from Figure 24. There are several reasons for this behavior. First, as we already noted from the AI results, the problem has to be sufficiently large in order to fully utilize the KNL vector instructions. Second, we used 64 of the 68 available cores on the KNL node, which is at least double the number of cores that the other systems have. The degrees-of-freedom per MPI process are significantly fewer, so it is possible that inter-process communication time affects the scaling results.
6. Case Study Part 2: Multiple Nodes
The results from every example in this paper thus far behoove us to investigate what happens when more than one compute node is needed to solve a PDE. Figures 9 and 11 from the previous section indicate that the AI metrics can be used to predict the strong-scaling potential on a single node. Our goal is now to investigate whether the correlation holds true across multiple nodes. To ensure that the problem is sufficiently large to distribute across several nodes, we consider an initial 64×64×6 coarse grid with three levels of refinement (21,495,808 degrees-of-freedom). The number of KSP and SNES iterations needed to solve the problem are 62 and 8, respectively.
Table 7. Strong-scaling results across multiple nodes.

        E5-2695v2 (Ivybridge)       E5-2698v3 (Haswell)         Xeon Phi 7250 (KNL)
Nodes   Cores  Time (s)  Eff. (%)   Cores  Time (s)  Eff. (%)   Cores  Time (s)  Eff. (%)
1       16     300       —          32     227       —          64     193       —
2       32     150       100        64     108       105        128    92.4      104
4       64     72.0      104        128    57.3      99.0       256    47.3      102
8       128    37.9      98.9       256    28.7      98.9       512    25.2      95.7
16      256    18.9      99.2       512    13.9      100        1024   15.1      79.9
32      512    9.65      97.2       1024   8.11      87.5       2048   10.6      56.9
64      1024   6.75      69.4       2048   4.62      76.8       4096   9.27      32.5
Table 8. Problem size and solver iterations for each level of refinement of the 128×128×12 coarse grid.

Levels of refinement   Degrees-of-freedom   SNES iterations   KSP iterations
1                      3,014,656            8                 85
2                      23,592,960           8                 85
3                      186,646,528          8                 85
4                      1,484,783,616        8                 85
5                      11,844,714,496       8                 85
Table 7 contains the strong-scaling results for the HPC systems containing the Intel Xeon E5-2695v2 (Ivybridge), Intel Xeon E5-2698v3 (Haswell), and Intel Xeon Phi 7250 (KNL) nodes. All three systems demonstrate near-perfect strong-scaling until 1024 cores are used (roughly 20k degrees-of-freedom per core). However, it is difficult to make direct performance comparisons because the systems employ different numbers of MPI processes per node, which affects the communication-to-computation ratio as well as the required data bandwidth between nodes. The only concrete conclusion that can be made is that the KNL system takes the least wall-clock time on a single compute node but is outperformed once the problem size per node shrinks. Figures 22 and 23 suggest that when the problem size on a KNL node is sufficiently small, parallel performance degrades drastically, which is exactly what the results of Table 7 portray.
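The efficiency column of Table 7 is simply the speedup relative to one node divided by the node count; the sketch below reproduces the Ivybridge column from the tabulated times.

```python
# Strong-scaling parallel efficiency as reported in Table 7:
# eff(n) = T(1 node) / (n * T(n nodes)) * 100.
def efficiency(t1, tn, n):
    return 100.0 * t1 / (n * tn)

# Ivybridge column of Table 7: node count -> wall-clock time (s).
ivybridge = {1: 300.0, 2: 150.0, 4: 72.0, 8: 37.9,
             16: 18.9, 32: 9.65, 64: 6.75}
for n, t in ivybridge.items():
    print(n, round(efficiency(ivybridge[1], t, n), 1))
# e.g. n=8 -> 98.9 and n=64 -> 69.4, matching the Eff. column
# (the table prints 100 and 104 to integer precision).
```

Efficiencies slightly above 100% at low node counts are plausible here: splitting the problem across nodes shrinks the per-node working set, improving cache behavior.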
6.1. Example #1: 1024 MPI processes
In this section, we consider what happens when we employ the same MPI concurrency on each system. This example models the performance when the same hydrostatic ice sheet flow problem is solved using 1024 MPI processes on different systems. We set this problem up by allocating 64 Ivybridge nodes, 32 Haswell nodes, and 16 KNL nodes. An even larger initial 128×128×12 coarse grid is selected, and we refine the problem 1–3 times. Table 8 presents the problem size as well as the number of nonlinear and linear solver iterations needed for every level of refinement. Figures 25 and 26 contain the intensity and rate metrics, respectively. The AI data points are either relatively flat or do not experience drastic changes upon mesh refinement. The static-scaling plot tells us that the Ivybridge system has the best performance as the problem gets larger. This behavior may seem to contradict the findings of the static-scaling plot in Figure 23, but it is important to realize that this PETSc application is limited by memory bandwidth and not by the TPP of the FLOP rate. The HPC system with Ivybridge processors has the best performance simply because it employs more compute nodes and thus has more aggregate memory bandwidth available.
6.2. Example #2: 256 compute nodes
The previous example is an elegant demonstration of why comparing HPC machines at equal MPI concurrency can produce misleading performance metrics, especially for computational frameworks that are limited by memory bandwidth. What happens if every system employs the same number of compute nodes? In this example, the HPC systems each allocate 256 compute nodes. Thus, Ivybridge uses 4096 MPI processes (16 out of 24 cores per node), Haswell uses 8192 MPI processes (32 out of 32 cores per node), and KNL uses 16384 MPI processes (64 out of 68 cores per node). We use the same initial coarse grid as in the previous example but now refine the problem 1–5 times. The AI in Figure 27 again indicates relative consistency for the finer problems, and we again observe that the AI metric drops significantly if a problem is not large enough. This trend corroborates the notion that the AI drop for small problems happens regardless of whether a single node or multiple nodes are used. The static-scaling plot shown in Figure 28 demonstrates that the Ivybridge processor does not beat out the Haswell processor. What is particularly interesting is that the performance of the KNL processor varies drastically with problem size. KNL cannot beat Ivybridge for small problems, but it beats both Ivybridge and Haswell when a problem is neither too small nor too large.
The performance spectrum model is useful for understanding performance characteristics across a wide variety of hardware architectures. Although the STREAM Triad measurements from Table 1 suggest that KNL should greatly outperform Ivybridge and Haswell for memory-bandwidth-dominated applications, the performance spectrum indicates that current and practical implementations of scientific software like PETSc v3.7.4 on KNL may be slow if the problem is dominated by main memory bandwidth. Different platforms require different implementation methodologies in order to maximize performance, so optimizing computational frameworks to fully leverage the power of the KNL processor is still an open research problem. Nonetheless, the performance spectrum model is useful for testing various implementations of PDE solvers and can be utilized to understand hardware architecture trends and algorithms of the future.
7. Concluding Remarks
In this paper, we have proposed a performance model, referred to as the performance spectrum, designed to simultaneously model both the hardware/architectural and algorithmic efficiencies of a variety of parallel PDE solvers. The techniques needed to approximate such efficiency metrics are 1) the arithmetic intensity, documenting the ratio of FLOPs to data cache misses, and 2) static-scaling, which scales up the problem while fixing the concurrency. This spectrum enabled us to visualize and enrich our current understanding of performance issues related to the hardware limitations, software/solver implementation, and numerical discretization of some popular and state-of-the-art finite element simulation packages and solvers. Moreover, it has been shown that this spectrum is also useful for understanding the performance and scalability of complex solvers and PDEs for nonlinear problems like hydrostatic ice sheet flow in a large-scale environment. Computational scientists have designed and are still designing the software and algorithms needed to answer many of today's pressing scientific problems, so we need not only to solve these problems accurately but also to solve them quickly. In order to understand how fast these solvers and software are, particularly ones that are either black-box or designed by others, we need a performance model, such as the proposed performance spectrum, to help answer questions regarding computational performance.
7.1. Potential extension of this work
The performance spectrum model has proven to be a robust and versatile tool for modeling the performance and scalability of several parallel finite element solvers and packages, but there are still plenty of research endeavors related to this work that need to be addressed. We briefly provide a list of potential future tasks and avenues of extension:

Most of the systems chosen for this study are from Intel and have similar performance characteristics, but processors from other vendors, such as IBM's POWER8 [Sinharoy et al., 2015] and ARM-based systems [Rajovic et al., 2014], may tell a different story. A logical extension of this work would be to carry the performance spectrum analysis over to GPUs. HPC architecture is constantly evolving, and simple benchmarks and measurements like STREAM Triad may not be sufficient for understanding the performance of complex solvers or algorithms on these state-of-the-art machines.

The intensity equations presented in Section 3 are relatively easy to incorporate into any code, but they only provide relative comparisons between various flavors of hardware and software. An improved methodology for documenting the [Work] and [TBT] metrics would certainly improve the predictive capabilities of the performance spectrum model. Other popular techniques, such as using Intel's SDE and VTune tools, can also be used to measure the AI on CPUs.

Certain PDEs, like the advection-diffusion equation, are notoriously difficult to solve, especially for high Péclet numbers. The performance spectrum can be useful for comparing not only various numerical discretizations (e.g., finite element vs finite volume methods) but also special solvers when physical properties such as velocity and diffusivity vary with length scale. Moreover, complex PDEs may also involve multiple physical components, thus requiring mixed formulations or hybridization techniques that may not be so trivial to solve or implement.

Finally, the performance spectrum is not restricted to solving PDEs. One can examine any computational scientific problem where performance and scalability is limited by problem size. Examples include but are not limited to: boundary integral equations, uncertainty quantification, parameter estimation, adaptive mesh refinement, and parallel domain decomposition.
Acknowledgments
The authors acknowledge Jed Brown (University of Colorado at Boulder) for his invaluable thoughts and ideas which inspired the creation of this paper. Compute time on the Maxwell and Opuntia systems is provided by the Center for Advanced Computing & Data Systems (CACDS) at the University of Houston, and compute time on the Edison and Cori systems is provided by the National Energy Research Scientific Computing Center (NERSC). MGK acknowledges partial support from the Rice Intel Parallel Computing Center and US DOE Contract DE-AC02-06CH11357. The opinions expressed in this paper are those of the authors and do not necessarily reflect those of the sponsors.
References
 Adams et al. [2004] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos. Ultrascalable Implicit Finite Element Analyses in Solid Mechanics with Over a Half a Billion Degrees of Freedom. In ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing, 2004. Gordon Bell Award.
 Adhianto et al. [2010] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010.
 Alnæs et al. [2015] M. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells. The FEniCS Project Version 1.5. Archive of Numerical Software, 3(100):9–23, 2015.
 Amestoy et al. [2000] P. R. Amestoy, I. S. Duff, J. L’Excellent, and J. Koster. MUMPS: A General Purpose Distributed Memory Sparse Solver. In International Workshop on Applied Parallel Computing, pages 121–130. Springer, 2000.
 Balay et al. [2014] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc Web Page, 2014. http://www.mcs.anl.gov/petsc.
 Balay et al. [2016] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc Users Manual. Technical Report ANL-95/11 – Revision 3.7, Argonne National Laboratory, 2016.
 Bangerth et al. [2007] W. Bangerth, R. Hartmann, and G. Kanschat. deal.II – A General-Purpose Object-Oriented Finite Element Library. ACM Trans. Math. Softw., 33(4):24/1–24/27, 2007.
 Bramble et al. [1994] J. H. Bramble, D. Y. Kwak, and J. E. Pasciak. Uniform Convergence of Multigrid V-Cycle Iterations for Indefinite and Nonsymmetric Problems. SIAM Journal on Numerical Analysis, 31(6):1746–1763, 1994.
 Brandt and Livne [2011] A. Brandt and O. Livne. Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, Revised Edition. SIAM, 2011.
 Brown et al. [2012] J. Brown, M. G. Knepley, D. A. May, L. C. McInnes, and B. F. Smith. Composable Linear Solvers for Multiphysics. In Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), pages 55–62. IEEE Computer Society, 2012. URL http://doi.ieeecomputersociety.org/10.1109/ISPDC.2012.16.
 Brown et al. [2013] J. Brown, B. Smith, and A. Ahmadia. Achieving Textbook Multigrid Efficiency for Hydrostatic Ice Sheet Flow. SIAM Journal on Scientific Computing, 35:B359–B375, 2013.
 Brune et al. [2015] P. R. Brune, M. G. Knepley, B. F. Smith, and X. Tu. Composing Scalable Nonlinear Algebraic Solvers. SIAM Review, 57(4):535–565, 2015. doi: http://dx.doi.org/10.1137/130936725. URL http://www.mcs.anl.gov/papers/P20100112.pdf.
 Castelletto et al. [2016] N. Castelletto, J. A. White, and M. Ferronato. Scalable Algorithms for Three-Field Mixed Finite Element Coupled Poromechanics. Journal of Computational Physics, 327:894–918, 2016.
 Chang and Nakshatrala [2017] J. Chang and K. B. Nakshatrala. Variational Inequality Approach to Enforce the Non-Negative Constraint for Advection-Diffusion Equations. Computer Methods in Applied Mechanics and Engineering, 320:287–334, 2017.
 Chang et al. [2017] J. Chang, S. Karra, and K. B. Nakshatrala. Large-scale Optimization-Based Non-Negative Computational Framework for Diffusion Equations: Parallel Implementation and Performance Studies. Journal of Scientific Computing, 70:243–271, 2017.
 Chevalier and Pellegrini [2008] C. Chevalier and F. Pellegrini. PT-Scotch: A Tool for Efficient Parallel Graph Ordering. Parallel Computing, 34(6):318–331, 2008.
 Dagum and Menon [1998] L. Dagum and R. Menon. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering, 5(1):46–55, 1998.
 Dongarra et al. [1990] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1):1–17, March 1990. ISSN 0098-3500.
 Falgout [2006] R. Falgout. HYPRE Users Manual. Technical Report Revision 2.0.0, Lawrence Livermore National Laboratory, 2006.
 Gaston et al. [2009] D. Gaston, C. Newman, G. Hansen, and D. Lebrun-Grandie. MOOSE: A Parallel Computational Framework for Coupled Systems of Nonlinear Equations. Nuclear Engineering and Design, 239(10):1768–1778, 2009.
 Geimer et al. [2010] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr. The Scalasca Performance Toolset Architecture. Concurrency and Computation: Practice and Experience, 22(6):702–719, 2010.
 Grcar [2011] J. F. Grcar. Mathematicians of Gaussian Elimination. Notices of the American Mathematical Society, 58(6):782–792, 2011. URL http://www.ams.org/notices/201106/rtx110600782p.pdf.
 Gropp et al. [1999a] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface, volume 1. MIT Press, 1999a.
 Gropp et al. [1999b] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Toward Realistic Performance Bounds for Implicit CFD Codes. In Proceedings of Parallel CFD ‘99, pages 233–240. Elsevier, 1999b.
 Hendrickson and Leland [1995] B. Hendrickson and R. W. Leland. A Multi-Level Algorithm For Partitioning Graphs. SC, 95(28), 1995.
 Heroux et al. [2005] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. An Overview of the Trilinos Project. ACM Trans. Math. Softw., 31(3):397–423, 2005. ISSN 0098-3500. doi: http://doi.acm.org/10.1145/1089014.1089021.
 Karypis and Kumar [1999] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20:359–392, 1999.
 Keyes et al. [2013] D. E. Keyes, L. C. McInnes, C. Woodward, W. Gropp, E. Myra, M. Pernice, J. Bell, J. Brown, A. Clo, J. Connors, E. Constantinescu, D. Estep, K. Evans, C. Farhat, A. Hakim, G. Hammond, G. Hansen, J. Hill, T. Isaac, X. Jiao, K. Jordan, D. Kaushik, E. Kaxiras, A. Koniges, K. Lee, A. Lott, Q. Lu, J. Magerlein, R. Maxwell, M. McCourt, M. Mehl, R. Pawlowski, A. P. Randles, D. Reynolds, B. Rivière, U. Rüde, T. Scheibe, J. Shadid, B. Sheehan, M. Shephard, A. Siegel, B. Smith, X. Tang, C. Wilson, and B. Wohlmuth. Multiphysics Simulations. The International Journal of High Performance Computing Applications, 27(1):4–83, 2013.
 Kirk et al. [2006] B. S. Kirk, J. W. Peterson, R. H. Stogner, and G. F. Carey. LibMesh: A C++ Library for Parallel Adaptive Mesh Refinement/Coarsening Simulations. Engineering with Computers, 22(3–4):237–254, 2006. http://dx.doi.org/10.1007/s00366-006-0049-3.
 Knepley and Bardhan [2015] M. G. Knepley and J. P. Bardhan. Work/Precision Tradeoffs in Continuum Models of Biomolecular Electrostatics. In Proceedings of ASME 2015 International Mechanical Engineering Congress & Exposition, volume 9, page V009T12A04, 2015. doi: 10.1115/IMECE2015-53579.
 Knepley and Terrel [2013] M. G. Knepley and A. R. Terrel. Finite Element Integration on GPUs. ACM Transactions on Mathematical Software, 39(2), 2013. URL http://arxiv.org/abs/1103.0066.
 Knepley et al. [2016] M. G. Knepley, K. Rupp, and A. R. Terrel. Finite Element Integration with Quadrature on the GPU, 2016. URL https://arxiv.org/abs/1607.04245.
 Knepley et al. [2015] Matthew G. Knepley, Jed Brown, Lois Curfman McInnes, Barry Smith, Karl Rupp, and Mark Adams. Exascale computing without threads. 2015. URL http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_ThreadsBarry_Smith.pdf. Whitepaper for the DOE High Performance Computing Operational Review (HPCOR) on Scientific Software Architecture for Portability and Performance.
 Li and Demmel [2003] X. S. Li and J. W. Demmel. SuperLU_DIST: A Scalable DistributedMemory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Transactions on Mathematical Software (TOMS), 29(2):110–140, 2003.
 Lichtner et al. [2013] P. C. Lichtner, G. E. Hammond, C. Lu, S. Karra, G. Bisht, B. Andre, R. T. Mills, and J. Kumar. PFLOTRAN Web Page, 2013. http://www.pflotran.org.
 Lo et al. [2015] Y. J. Lo, S. Williams, B. V. Straalen, T. J. Ligocki, M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker. Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 8966:129–148, 2015.
 May et al. [2014] D. A. May, J. Brown, and L. Le Pourhiet. pTatin3D: High-Performance Methods for Long-Term Lithospheric Dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ‘14, pages 274–284. IEEE Press, 2014.
 McCalpin [1995] J. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers, 1995. https://www.cs.virginia.edu/stream/.
 McKee [2004] S. A. McKee. Reflections on the Memory Wall. In Proceedings of the 1st conference on Computing frontiers, page 162. ACM, 2004.
 Miller et al. [2013] C. T. Miller, C. N. Dawson, M. W. Farthing, T. Y. Hou, J. Huang, C. E. Kees, C. T. Kelley, and H. P. Langtangen. Numerical Simulation of Water Resources Problems: Models, Methods, and Trends. Advances in Water Resources, 51:405–437, 2013.
 Mucci et al. [1999] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A Portable Interface to Hardware Performance Counters. In Proceedings of the Department of Defense HPCMP Users Group Conference, volume 710, 1999.
 Murphy [2007] R. Murphy. On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance. In 2007 IEEE 10th International Symposium on Workload Characterization (IISWC), pages 35–43. IEEE, 2007.
 Rajovic et al. [2014] N. Rajovic, A. Rico, N. Puzovic, C. Adeniyi-Jones, and A. Ramirez. Tibidabo: Making the Case for an ARM-based HPC System. Future Generation Computer Systems, 36:322–334, 2014.
 Raman [2015] K. Raman. Calculating “FLOP” using Intel Software Development Emulator (Intel SDE), 2015. https://software.intel.com/en-us/articles/intel-software-development-emulator.
 Rathgeber et al. [2016] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G. T. Bercea, G. R. Markall, and P. H. J. Kelly. Firedrake: Automating the Finite Element Method by Composing Abstractions. ACM TOMS, 43:24:1–24:27, 2016.
 Schoof [2006] C. Schoof. A Variational Approach to Ice Stream Flow. Journal of Fluid Mechanics, 556:227–251, 2006.

 Schoof and Hindmarsh [2010] C. Schoof and R. Hindmarsh. Thin-Film Flows with Wall Slip: An Asymptotic Analysis of Higher Order Glacier Flow Models. Quarterly Journal of Mechanics and Applied Mathematics, 63:73–114, 2010.
 Schulz et al. [2008] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cranford. Open|SpeedShop: An Open Source Infrastructure for Parallel Performance Analysis. Scientific Programming, 16(2–3):105–121, 2008.
 Shende and Malony [2006] S. S. Shende and A. D. Malony. The TAU Parallel Performance System. The International Journal of High Performance Computing Applications, 20(2):287–311, 2006.
 Sinharoy et al. [2015] B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen, B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky, J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis, and K. M. Fernsler. IBM POWER8 Processor Core Microarchitecture. IBM Journal of Research and Development, 59(1):2:1–2:21, Jan 2015. ISSN 0018-8646.
 Smith et al. [1996] B. F. Smith, P. Bjørstad, and W. D. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996. URL http://www.mcs.anl.gov/~bsmith/ddbook.html.
 Tal [n.d.] A. Tal. Intel Software Development Emulator. https://software.intel.com/en-us/articles/intel-software-development-emulator.
 Turcksin et al. [2016] Bruno Turcksin, Martin Kronbichler, and Wolfgang Bangerth. WorkStream – A Design Pattern for Multicore-Enabled Finite Element Computations. ACM Transactions on Mathematical Software (TOMS), 43(1):2, 2016.
 Weaver et al. [2013] V. M. Weaver, D. Terpstra, and S. Moore. Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 215–224, April 2013.
 Williams et al. [2009] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, 2009.
 Wulf and McKee [1995] W. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.