A performance spectrum for parallel computational frameworks that solve PDEs

05/10/2017
by J. Chang, et al.
University of Houston

Important computational physics problems are often large-scale in nature, and it is highly desirable to have robust and high-performing computational frameworks that can quickly address these problems. However, it is no trivial task to determine whether a computational framework is performing efficiently or is scalable. The aim of this paper is to present various strategies for better understanding the performance of any parallel computational framework for solving PDEs. Important performance issues that negatively impact time-to-solution are discussed, and we propose a performance spectrum analysis that can enhance one's understanding of these critical performance issues. As proof of concept, we examine commonly used finite element simulation packages and software and apply the performance spectrum to quickly analyze their performance and scalability across various hardware platforms, software implementations, and numerical discretizations. It is shown that the proposed performance spectrum is a versatile performance model that is not only extendable to more complex PDEs, such as the hydrostatic ice sheet flow equations, but also useful for understanding hardware performance in a massively parallel computing environment. Potential applications and future extensions of this work are also discussed.


1. Introduction and Motivation

Both efficient algorithms and software that perform well on modern computing systems are crucial for addressing current scientific and engineering problems. These tools are important for bridging the gap between theory and real-world data. Such problems often involve field-scale data sets that cannot be handled on a standard laptop or desktop, and instead require parallel computers, parallel algorithms, and programming tools such as OpenMP [Dagum and Menon, 1998] and the Message Passing Interface (MPI) [Gropp et al., 1999a]. For example, hydrologists and geophysicists need to work with field-scale reservoirs that can span tens of kilometers and evolve on time scales of hundreds of years. Moreover, such reservoir simulations involve complicated multi-phase and multi-component flows that require multiple complex equations to be solved accurately and efficiently. Atmospheric and climate modelers also require state-of-the-art techniques, as both data assimilation and parameter estimation need to be performed quickly for meso-scale and global-scale applications. The US Department of Energy has invested in the development of several portable and extensible scientific software packages, like PETSc [Balay et al., 2016, 2014] and PFLOTRAN [Lichtner et al., 2013], that can help address such important large-scale problems. The time spent developing parallel computational frameworks is amortized when application scientists employ these packages in their work.

However, it is not always known whether the performance of a particular parallel computational framework or software will be satisfactory across a panoply of solvers and computing platforms. How can one really tell whether an algorithm is performing at its highest level? Is there room for improvement? Answering these questions in full is a Herculean task, but questions regarding the algorithmic and computational efficiency of scientific tools and libraries still need to be answered [Keyes et al., 2013]. Hence, we need performance models that enable us to synthesize performance data into an understandable framework. Performance models can include many metrics of importance, such as total floating-point operations (FLOP), memory usage, inter/intra process/node communication, memory/cache bandwidth, and cache misses/hits. If not carefully optimized, some of these hardware resources can become unnecessary bottlenecks that result in costly and inefficient numerical simulations. Modern computer systems are quite complex, and their performance can be difficult to predict with good accuracy. Conducting large-scale simulations on state-of-the-art supercomputers may require hundreds to thousands of hours of compute time, so it is highly desirable to have a performance model that can predict how a particular parallel computational framework may perform. The application or domain scientist may use software that either was not written in house or is a "black-box" tool, and it would be too time consuming, or impossible if source code is unavailable, to dissect the code and analyze the design of its subroutines and data structures. It is therefore desirable to analyze these codes as a whole.

1.1. Review of previous works

We now briefly highlight some useful approaches and models one could take to analyze, and perhaps improve, the performance of any parallel computational framework. One of the simplest measures one can utilize is the STREAM memory-bandwidth benchmark [McCalpin, 1995]. This benchmark measures sustainable memory bandwidth on a single server and indicates the number of threads that saturates the memory bandwidth. Memory bandwidth is an important limitation to consider on modern machines [Wulf and McKee, 1995; McKee, 2004; Murphy, 2007].

The Roofline model [Williams et al., 2009; Lo et al., 2015] captures peak achievable performance on a server, taking into account both CPU and memory-bandwidth capabilities, by introducing the Arithmetic Intensity (AI). The AI is simply the ratio of the total floating-point operations needed (total FLOP) to the Total Bytes Transferred (TBT). Higher AIs indicate that the algorithm or computational framework is more computationally intensive and requires less bandwidth for a given amount of work. One is free to employ any cache model when determining the TBT metric for the Roofline model. For example, a Sparse Matrix-Vector (SpMV) multiplication model [Gropp et al., 1999b] has been developed that is based on a "perfect cache" (i.e., matrices and vectors are loaded and stored only once from memory). SpMV is an integral part of iterative solvers for PDEs. It has been shown in [Chang et al., 2017] that the SpMV "perfect cache" model can also be used to accurately predict and understand the hardware performance of optimization-based solvers for enforcing discrete maximum principles. In [May et al., 2014], the authors employ matrix-free iterative methods for the Stokes equations arising in lithospheric dynamics applications, and manually count the TBT based on the source code. The advantage of matrix-free methods is that the sparse matrix, whose multiplication is memory-bandwidth limited, is never explicitly stored, thus bringing the computational framework's upper bound in the Roofline closer to the Theoretical Peak Performance (TPP) region. TBT can also be determined from memory-level traffic or cache misses. Similar analyses have been carried out for many-core architectures, such as NVIDIA GPUs and the Intel Xeon Phi "Knights Landing" (KNL), in [Knepley and Terrel, 2013; Knepley et al., 2016].
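To make the preceding discussion concrete, the following minimal sketch (not taken from any of the cited works) estimates the perfect-cache AI of a CSR sparse matrix-vector product in the spirit of the model of [Gropp et al., 1999b]; the 8-byte values, 4-byte indices, square matrix, and example matrix dimensions are all illustrative assumptions.

```python
# Hypothetical sketch of the "perfect cache" AI estimate for y = A*x with A stored in CSR:
# every matrix and vector entry is assumed to be loaded/stored from memory exactly once.

def spmv_perfect_cache_ai(nrows, nnz, value_bytes=8, index_bytes=4):
    """Return (FLOPs, total bytes transferred, arithmetic intensity) for one CSR SpMV."""
    flops = 2 * nnz                                   # one multiply and one add per nonzero
    bytes_matrix = nnz * (value_bytes + index_bytes)  # matrix values + column indices
    bytes_rowptr = (nrows + 1) * index_bytes          # CSR row pointers
    bytes_vectors = 2 * nrows * value_bytes           # load x once, store y once (square matrix)
    tbt = bytes_matrix + bytes_rowptr + bytes_vectors
    return flops, tbt, flops / tbt

# Example: a matrix resembling a 3D 7-point stencil with one million rows.
flops, tbt, ai = spmv_perfect_cache_ai(nrows=1_000_000, nnz=7_000_000)
print(f"FLOPs = {flops:.2e}, TBT = {tbt:.2e} bytes, AI = {ai:.3f} FLOP/byte")
```

The resulting AI is on the order of 0.1 FLOP/byte, which is why SpMV-dominated solvers sit far below the theoretical peak performance region of the roofline and are instead limited by memory bandwidth.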

For a more thorough analysis of performance, advanced software tools such as HPCToolkit [Adhianto et al., 2010], OpenSpeedShop [Schulz et al., 2008], Scalasca [Geimer et al., 2010], and TAU [Shende and Malony, 2006] are used by scientific software developers and application scientists alike. These tools provide in-depth performance analyses of scientific codes and can also be used to debug them. Many of them also rely on PAPI [Mucci et al., 1999], which uses low-level hardware counters for important metrics like FLOPs, total CPU (central processing unit) cycles, and cache misses. These tools have proven extremely useful for computational scientists in all areas of computational physics and can provide a good understanding of the hardware performance of any computational framework for solving PDEs.

1.2. Main contributions

In this paper, we provide a simple and easy-to-use performance model that can be used in addition to the techniques and tools mentioned above. Our performance model, which we refer to as a performance spectrum (we borrow the term "performance spectrum" from Dr. Jed Brown, who used it in his presentation at the 2016 SIAM Conference on Parallel Processing for Scientific Computing held in Paris, France; conference website: http://www.siam.org/meetings/pp16/), takes into account time-to-solution, AI based on cache misses, and equations solved per second. This model is applicable to any level of a scientific code, whether it be the entire computational framework or only particular phases or functions such as mesh generation, assembly of a matrix, or the solver step. It is important to note that this tool is not intended to replace any of the aforementioned performance tools or models but simply to augment one's ability to quickly understand and diagnose performance from the hardware, software, and algorithmic standpoints. The main contributions of this paper can be enumerated as follows:

  1. We outline common issues pertaining to performance, ways to identify them, and methods to address them.

  2. We present a model called performance spectrum that provides an enhanced understanding of the performance and scalability of algorithms and software.

  3. We demonstrate that the proposed model can be utilized on existing popular software packages and solvers.

  4. We apply the model to a more complicated and nonlinear PDE and document the parallel performance of the computational framework across HPC machines.

  5. We discuss some possible ways in which this performance spectrum model can be extended.

The rest of the paper is organized as follows. In Section 2, we outline some of the key performance issues one may come across when solving PDEs and how to address some of them. In Section 3, we propose a model, performance spectrum, which captures three critical metrics useful for understanding performance and scalability. In Section 4, we demonstrate possible ways one could utilize the proposed model by systematically comparing commonly used finite element packages and solvers. In Section 5, we extend the model to simulate nonlinear hydrostatic ice sheet flow equations. In Section 6, we run the nonlinear hydrostatic ice sheet flow equations across multiple compute nodes and study the performance. Concluding remarks and possible extensions of this work are outlined in Section 7. All the notational conventions employed in this paper are introduced as needed.

2. Common Performance Issues

The performance of any scientific software or algorithm depends on a myriad of factors. First and foremost, good performance depends on an efficient and practical implementation of the code. Application and domain scientists may not be interested in the intricate details of a code framework they did not design, but they must still be cognizant of important computational issues that may inhibit performance dramatically. We now briefly highlight some common performance issues computational scientists may come across in their line of work:

Figure 1. An overview of the STREAM measurement on two different compute nodes. The mapping of MPI bindings has a significant impact on the achievable memory bandwidth.
  • Core/memory bindings: The simplest way to maximize parallel performance for MPI applications is to properly enforce MPI process and memory bindings. This is particularly important for memory-bandwidth-limited applications because, on most CPU architectures, the aggregate core bandwidth exceeds the CPU bandwidth to memory, and it is important to use the CPUs in a multi-CPU server in a balanced way. Furthermore, if multiple users share a compute node, performance metrics can vary greatly because both memory resources and certain levels of cache are shared. Appropriate methodologies for mapping and binding ranks to cores are vital for complex hardware architectures as well as for complex topological node layouts. Consider the single dual-socket servers and their respective STREAM Triad benchmark results shown in Figure 1. Both the AMD and Intel processors possess two sockets on which the physical cores are contiguously ordered. When the MPI processes are placed on alternating sockets, the achievable bandwidth for a fixed number of cores is higher because the memory systems of both CPUs are used. For multi-node performance, different binding techniques are required, since memory references on the local node are several times faster than references to a remote node, and process allocation must be done carefully so that communication across the network is minimized.

    Processor         | AMD Opteron 2354 | Intel Xeon E5-2680v2 | Intel Xeon E5-2695v2 | Intel Xeon E5-2698v3 | Intel Xeon Phi 7250
    Microarchitecture | "Barcelona"      | "Ivybridge"          | "Ivybridge"          | "Haswell"            | "Knights Landing"
    Clock rate        | 2.2 GHz          | 2.8 GHz              | 2.4 GHz              | 2.3 GHz              | 1.4 GHz
    Year released     | 2008             | 2013                 | 2013                 | 2014                 | 2016
    Sockets           | 2                | 2                    | 2                    | 2                    | 1
    Cores/socket      | 4                | 10                   | 12                   | 16                   | 68
    Threads/core      | 1                | 1                    | 2                    | 2                    | 4
    L1 cache          | 8x64 KB, 2-way   | 20x32 KB, 8-way      | 24x64 KB, 8-way      | 32x64 KB, 8-way      | 68x64 KB, 8-way
    L2 cache          | 8x512 KB, 16-way | 20x256 KB, 8-way     | 24x256 KB, 8-way     | 32x256 KB, 8-way     | 34x1 MB, 16-way
    L3 cache          | 2x2 MB, 32-way   | 2x25 MB, 20-way      | 2x30 MB, 20-way      | 2x40 MB, 20-way      | -
    Memory type       | DDR2-200 MHz     | DDR3-1600 MHz        | DDR3-1866 MHz        | DDR4-2133 MHz        | DDR4-2400 MHz, MCDRAM
    Total memory      | 16 GB            | 64 GB                | 64 GB                | 128 GB               | 96 GB (DDR4), 16 GB (MCDRAM)
    Memory channels   | 4                | 8                    | 8                    | 8                    | 6 (DDR4), 8 (MCDRAM)
    Compiler used     | GNU              | GNU                  | Cray                 | Cray                 | Cray
    STREAM Triad      | 10.5 GB/s        | 64.5 GB/s            | 102 GB/s             | 116 GB/s             | 90 GB/s (DDR4), 480 GB/s (MCDRAM)
    Table 1. Single node specifications for each of the HPC systems used in this study. Cache entries are listed as (number of caches) x (size per cache) with the set associativity. Note that Intel's "Knights Landing" (KNL) processor has two different types of memory.
  • Hardware architecture: The performance of any benchmark or software depends on the hardware architecture. In this paper, we consider five different HPC systems whose single-node specifications are listed in Table 1. It is evident from the STREAM Triad benchmark that different architectures achieve different levels of memory bandwidth. Some of the processors are recent (as of the writing of this paper), like the Intel KNL processor, whereas others, like AMD's "Barcelona" and Intel's "Ivybridge" processors, are older. With the increasing complexity of processors and memory systems, achieving good performance of solvers and algorithms has become an area of active research. A computational framework may solve a PDE efficiently on a laptop or small cluster, but that does not mean it will perform efficiently on a supercomputer. Understanding basic computer architectural concepts such as pipelining, instruction-level parallelism, and cache policies may offer excellent guidelines on how to speed up computations by several orders of magnitude. For example, Intel's KNL processor has two 512-bit vector units per core and may need fine-grained parallelism to fully exploit the 68 cores per CPU. If a code is not properly vectorized to utilize the 136 vector units, each capable of 16 floating-point operations per cycle, or the 16 GB of on-package MCDRAM, the potential speedup on this system may not be fully realized, and the code may even be outperformed on processors with faster cores. Also, languages such as Python, which are used in some sophisticated finite element simulation packages, depend heavily on file system I/O because the interpreter executes system calls to locate modules and may need to open hundreds of thousands of files before the actual computation can begin. Designing and utilizing algorithms/languages/compilers that are compatible with recent state-of-the-art HPC architectures is paramount [Miller et al., 2013]; otherwise the computational performance may be exceedingly poor.

    Figure 2. Assembled sparse matrix with (a) default ordering and (b) optimized ordering. Red represents positive numbers, blue represents negative numbers, and cyan represents allocated but unused nonzero entries.
  • Domain decomposition: The global ordering and partitioning of the computational elements in a parallel computing environment, particularly for problems with unstructured grids, affect both spatial and temporal cache locality. Consider the assembled sparse matrices shown in Figure 2. If the nonzero entries are not properly grouped together, the code will incur expensive cache misses and have little opportunity to use all of the data in a cache line or to reuse data already in cache. Consequently, this creates serial bottlenecks at the cache/memory levels. Several mesh/graph partitioners, such as Chaco [Hendrickson and Leland, 1995], METIS/ParMETIS [Karypis and Kumar, 1999], and PT-SCOTCH [Chevalier and Pellegrini, 2008], are designed to optimize locality and balance the workload among MPI processes. Some graph partitioners use a simple model of communication in seeking to achieve load balance with minimum communication, while others better capture the minimum communication needed by using hypergraphs instead of regular graphs as the basis for partitioning. Understanding which type of partitioning to use for the PDE problem at hand (e.g., spectral partitioning, geometric partitioning, multilevel graph partitioning, etc.) can significantly reduce the amount of communication and lead to higher efficiency and a greater degree of concurrency.

  • Solver convergence: Arguably one of the most important performance factors to consider when solving PDEs is the convergence rate of the solver. Direct methods like Gaussian elimination [Grcar, 2011], as well as sparse direct solvers such as MUMPS [Amestoy et al., 2000] and SuperLU_DIST [Li and Demmel, 2003], can solve problems in parallel but may have huge memory requirements as the problem size is scaled up, due to fill-in during factorization. Scalable and efficient solvers typically rely on novel combinations of iterative solvers and preconditioners. The Krylov Subspace (KSP) and Scalable Nonlinear Equations Solvers (SNES) components of the PETSc library, coupled with robust preconditioners [Smith et al., 1996; Brandt and Livne, 2011], constitute a popular methodology for solving large and complex PDEs. Novel combinations and tuning of solver parameters provide powerful and robust frameworks that can accurately and quickly converge to a specified residual tolerance, even for complex coupled multi-physics problems [Castelletto et al., 2016; Brown et al., 2012; Brune et al., 2015]. Simple preconditioners such as Jacobi or incomplete lower-upper factorization with zero fill (ILU(0)) may be fast for smaller problems, but the computational cost will soar because the number of solver iterations they need grows rapidly with problem size. Scaling up the problem under these choices of preconditioning will be extremely time consuming and may not even converge for larger or more complicated problems. More robust preconditioners, like geometric and algebraic multigrid methods, may have a more expensive setup time for smaller problems but have been demonstrated to maintain relatively uniform convergence for larger problems, even those that are nonsymmetric and indefinite [Bramble et al., 1994; Adams et al., 2004].

These important performance issues should not be overlooked when analyzing the performance of a parallel computational framework. There are also several quick strategies for understanding and identifying bottlenecks on a specific HPC system. For example, it is well-known that SpMV is an operation that is sensitive to memory bandwidth. These operations have very low AIs, which can manifest as a bottleneck at the memory level on a single node. A simple test one can perform is to run the SpMV operation in parallel: if it does not scale well on a single server in the strong sense, the memory bandwidth is clearly limiting the performance. One can confirm this by running simple vector operations, like the vector sum and scalar multiplication, to see whether they experience the same scaling issues. In addition, one can test the vector dot product operation in order to detect problems with the network interconnect or memory latency, as shown in the sketch below. The PETSc performance summary [Balay et al., 2016] provides comprehensive insight into the performance of many of these important operations, including load balancing, and also reports which functions consume most of the time. However, not all scientific software has readily available performance summaries, so a performance model amenable to any code implementation is needed to help answer common performance questions.
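As a concrete illustration of such a quick probe, the sketch below (our own illustrative code, not part of any package discussed here) times a bandwidth-bound vector update and a latency-sensitive dot product with petsc4py and mpi4py; the vector length and repetition counts are arbitrary placeholders.

```python
# Hypothetical single-node bandwidth/latency probe: run with e.g. "mpiexec -n 4 python vec_probe.py".
from mpi4py import MPI
from petsc4py import PETSc

comm = MPI.COMM_WORLD
N = 20_000_000                      # global vector length, chosen to be much larger than cache

x = PETSc.Vec().createMPI(N)        # distributed PETSc vectors
y = x.duplicate()
x.set(1.0)
y.set(2.0)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(50):                 # streaming, memory-bandwidth-bound kernel: y <- y + 0.5*x
    y.axpy(0.5, x)
t_axpy = MPI.Wtime() - t0

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(50):                 # global reduction: sensitive to latency and the interconnect
    x.dot(y)
t_dot = MPI.Wtime() - t0

if comm.rank == 0:
    print(f"{comm.size} MPI processes: AXPY {t_axpy:.3f} s, dot {t_dot:.3f} s")
```

If the AXPY timings stop improving as processes are added on a single node while the dot product still scales, memory bandwidth is the likely bottleneck; if the dot product stagnates across nodes, the interconnect or latency deserves attention.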

3. Proposed Performance Spectrum

The general concept of the performance spectrum model is illustrated by Figure 3. This model is designed to simultaneously capture both the hardware/architectural exploitation as well as the algorithmic scalability of a particular parallel computational framework. First and foremost, we need the time-to-solution since this is the metric of most importance to application scientists needing to execute large-scale simulations on state-of-the-art HPC systems. One may optionally document the total number of solver iterations needed for convergence. However, simply knowing the wall-clock time a computational framework needs to perform a task tells us little about the computational and algorithmic efficiency. In order to understand how fast (or slow) a simulation is, we need to introduce two more metrics.

Figure 3. Proposed performance spectrum documenting time, intensity, and rate. Intensity is defined as the arithmetic intensity (FLOP-to-TBT ratio) based on cache misses, and rate is defined as degrees-of-freedom solved per second.

3.1. Intensity

The second metric of interest is the intensity; specifically, we focus on the AI. As described in [Williams et al., 2009], the AI of an algorithm or software is a measure that helps estimate how efficiently the hardware resources and capabilities can be utilized. For the five machines listed in Table 1, it is well-known that the limiting factor of performance for many applications is the memory bandwidth. Codes with a high AI therefore have a greater opportunity to reuse data in cache and place lower demands on memory bandwidth. It should be noted, however, that performance depends on many factors, such as network latency and file system bandwidth, and the arithmetic intensity alone cannot be used to predict performance.

The general formula for the AI is defined as

(3.1)    AI = [Work] / [TBT],

where [Work] is the total amount of computational effort, typically what one would refer to as FLOPs. The [TBT] metric is a measure of data movement between the core/CPU and memory. A cache model is needed in order to not only determine the TBT but also to understand what amount of useful bandwidth is sustained for a given cache line transfer. One can employ any cache model for this purpose, such as perfect cache, total number of load and store instructions at the core level, traffic at the memory level, or data cache misses. Different cache models are useful for interpreting different behavioral trends, and the choice of cache model depends on the application or research problem at hand. In this paper, we base [TBT] on the total number of cache misses and cache line size. The formula for obtaining the TBT for the L1, L2, and L3 cache levels is expressed as

(3.2)    [TBT]_Lx = [number of Lx cache misses] × [cache line size in bytes],    x = 1, 2, 3.

The simplest way to define [Work] is as the total number of floating-point operations, denoted FLOPs. Thus the AI based on Lx cache misses is formally written as

(3.3)    AI_Lx = [FLOPs] / ([number of Lx cache misses] × [cache line size in bytes]).

If a solver or algorithm experiences a large number of cache misses at the last level, memory may impede performance.

Sometimes the exact TBT of a particular algorithm is not of interest. Instead, an application scientist may only care about a relative measure, i.e., whether the AI is higher or lower compared to another algorithm, a different implementation of the same algorithm, or a different processor. Thus, one may simply look at the ratio of FLOPs to cache misses, and equation (3.3) may be simplified to

(3.4)    Intensity_Lx = [FLOPs] / [number of Lx cache misses].

Every machine listed in Table 1 has a cache line size of 64 bytes for all levels of cache. Different CPUs may have different line sizes and hence a cache miss may imply different memory demands on different processor architectures. The remainder of the paper shall refer to the above formula for estimating the intensity metric.
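The following small sketch (illustrative only) evaluates equations (3.2)-(3.4) for a single code phase; the FLOP and cache-miss values are placeholders standing in for counters gathered with PAPI or, as discussed below, PETSc's manual FLOP logs.

```python
# Hypothetical evaluation of the cache-miss-based intensity metrics.
CACHE_LINE_BYTES = 64   # every machine in Table 1 uses 64-byte cache lines

def intensity_metrics(flops, lx_cache_misses, line_bytes=CACHE_LINE_BYTES):
    """Return ([TBT]_Lx, AI_Lx, FLOPs per miss) as in equations (3.2)-(3.4)."""
    tbt = lx_cache_misses * line_bytes       # equation (3.2)
    ai = flops / tbt                         # equation (3.3)
    flop_per_miss = flops / lx_cache_misses  # equation (3.4), the relative measure used hereafter
    return tbt, ai, flop_per_miss

# Placeholder measurements for one solve phase.
flops, l1_misses = 3.2e11, 6.0e9
tbt, ai_l1, ratio = intensity_metrics(flops, l1_misses)
print(f"TBT = {tbt:.2e} bytes, AI_L1 = {ai_l1:.3f} FLOP/byte, FLOPs/miss = {ratio:.1f}")
```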

Remark 1.

It should be noted that PAPI's methodology for counting FLOPs may be highly inaccurate for the "Ivybridge" systems listed in Table 1. The hardware counters only count the instructions issued and not the ones executed or retired. This is particularly problematic for iterative solvers that rely on SpMV operations: as the code spends time waiting for data to arrive from memory, floating-point instructions are reissued multiple times. These reissues, coupled with the incomplete filling of vector unit instructions, can lead to overcount factors of up to 10 times. For a more thorough discussion on the issue of overcounts, see [Weaver et al., 2013] and the references within. PAPI's FLOP counters are disabled on the "Haswell" and KNL processors due to the aforementioned issues.

Remark 2.

The AI can also be obtained using other capable software tools and methodologies. For example, Intel's Software Development Emulator (SDE) can be used to obtain FLOP counts [Tal, ; Raman, 2015] and Intel VTune can be used to obtain the TBT; this methodology is commonly used with the Roofline model. Alternatively, one can approximate the FLOP count of a particular code by inserting counting mechanisms into the code. PETSc provides an interface and guidelines for manual FLOP counting, and thus FLOP counts for computational frameworks built on it can be obtained through its performance summary output.

Remark 3.

The correlation between AI and speedup on a single node may not always hold in a cluster setting (i.e., when scaling across communication networks). The mechanisms used for exchanging information between MPI processes are very different when the processes are on the same node as opposed to different nodes. An application scientist must be fully cognizant not only of the HPC processor specification but also of the network topology as well as the interconnect bandwidth and latency.

3.2. Rate

Although the AI is useful for comparatively estimating the performance a particular parallel framework may attain, it does not necessarily aid in predicting time-to-solution. Consequently, the AI can easily be "gamed" to appear high while the code still consumes large amounts of wall-clock time. For example, small computationally intensive routines such as DGEMM [Dongarra et al., 1990] can be inserted to artificially inflate the computational demands and increase the AI. Other performance models, such as the Roofline model, would indicate that this is a favorable trend while ignoring the fact that more time is spent than necessary. This is also why the traditional FLOP rate metric, which can likewise be gamed, is not helpful either. Instead of measuring the total FLOPs executed per second, we measure the total degrees-of-freedom solved per second. Hence, the Rate metric needed to complete the performance spectrum is defined as

(3.5)    Rate = [Degrees-of-freedom] / [Time],

where [Degrees-of-freedom] refers to the total number of degrees-of-freedom or discrete component-wise equations that need to be solved, and [Time] is the time-to-solution.

Definition 1 (Static-scaling).

Equation (3.5) is an integral component of what we refer to as static-scaling, in which we increase the problem size but fix the concurrency. This is a complete reversal of the classical definition of strong-scaling, where we fix the problem size but increase the concurrency. Static-scaling plots time-to-solution versus the total degrees-of-freedom solved per second for a variety of problem sizes, so it also has characteristics similar to the classical definition of weak-scaling, where both problem size and concurrency are increased.

Figure 4. Static-scaling plot. By fixing the MPI concurrency and increasing the problem size, the rate axis gives the degrees-of-freedom solved per second. Algorithmic efficiency is achieved when a flat line is observed as the problem is scaled up.

Figure 4 contains a pictorial description of a static-scaling plot and illustrates how to visually interpret the data points. A scalable algorithm is one in which the time-to-solution is linearly proportional to the total number of degrees-of-freedom, so it is desirable to see a PDE solver maintain a constant rate metric over a wide range of problem sizes. The behavior of parallel computational frameworks for solving PDEs is not simple because 1) problems too small for a given MPI concurrency experience large communication-to-computation ratios (hence strong-scaling effects) and 2) large problems may have unfavorable memory access patterns. Static-scaling plots are designed to capture both strong-scaling and weak-scaling characteristics and give a good indication of the ideal range of problem sizes for a given MPI concurrency.

The tailing off to the right of a static-scaling plot has two potential causes. First, problem size affects how memory is allocated and accessed. Larger problem sizes may see an increase in memory contention and changes in the access pattern to main memory, so more time is spent waiting on data as opposed to performing calculations. Second, solvers for complex PDEs or computational domains may not always scale optimally with problem size (i.e., O(n) in the number of unknowns). Suboptimal algorithmic convergence may maintain a consistent level of hardware utilization but require more iterations and FLOPs. To determine whether suboptimal algorithmic convergence plays a role in the deterioration of the static-scaling plot, equation (3.5) can be modified as

(3.6)    Rate_iteration = [Degrees-of-freedom] / ([Time] × [Solver iterations]).

This equation averages out increases in time due to growth in the iteration count. If a flat line is observed using this metric, then suboptimal algorithmic convergence is indeed responsible for the degradation seen in the original static-scaling results.

Alternatively, if one is more interested in the performance gain for each MPI process, equation (3.5) can also be modified into

(3.7)    Rate_process = [Degrees-of-freedom] / ([Time] × [MPI processes]).

This metric presents the average number of degrees-of-freedom solved per second by each MPI process or core utilized.
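The sketch below (with made-up timing data) shows how the three Rate variants of equations (3.5)-(3.7) are computed and how a static-scaling plot is assembled: the MPI concurrency is held fixed, the problem size grows, and the rate is plotted against time-to-solution.

```python
# Hypothetical static-scaling post-processing; the (DoF, time, iterations) tuples are placeholders.
import matplotlib.pyplot as plt

nprocs = 16
runs = [  # (degrees-of-freedom, wall-clock seconds, solver iterations) at fixed concurrency
    (68_921, 0.4, 12), (531_441, 2.9, 13), (1_771_561, 9.8, 14), (4_173_281, 26.0, 16),
]

times = [t for (_, t, _) in runs]
rate = [dof / t for (dof, t, _) in runs]                        # equation (3.5)
rate_per_iter = [dof / (t * it) for (dof, t, it) in runs]       # equation (3.6)
rate_per_proc = [dof / (t * nprocs) for (dof, t, _) in runs]    # equation (3.7)

plt.loglog(times, rate, "o-", label="Rate (eq. 3.5)")
plt.loglog(times, rate_per_iter, "^-.", label="Rate per iteration (eq. 3.6)")
plt.loglog(times, rate_per_proc, "s--", label="Rate per process (eq. 3.7)")
plt.xlabel("Time-to-solution [s]")
plt.ylabel("Degrees-of-freedom solved per second")
plt.legend()
plt.savefig("static_scaling.png")   # flat curves indicate good algorithmic scalability
```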

3.3. Using the performance spectrum

The arithmetic intensity and static-scaling components of the spectrum offer a variety of strategies for interpreting the performance and scalability of any computational framework. Good performance is achieved when a computational framework attains low time-to-solution, high arithmetic intensity, and flat static-scaling lines. The theoretical peak rate of degrees-of-freedom solved per second may be unknown for a particular algorithm, but the intensity metric can help us judge whether the static-scaling lines are good by estimating how efficiently the available hardware resources are being used. We outline three possible ways one could use the performance spectrum model:

  1. Hardware limitations: As mentioned in Section 2, the hardware configuration of the compute nodes plays a vital role in the performance spectrum because different systems have different core counts, frequencies, and memory architectures. Understanding how PDE solvers behave on different systems is vital for disseminating software to the scientific and HPC communities. The different cache sizes listed in Table 1 will be reflected in equation (3.4): the AI is likely to differ across processors due to differences in cache sizes and cache policies. Furthermore, different processors have different clock frequencies, arithmetic capabilities, and memory bandwidths. Moreover, the various GNU, Intel, and Cray compilers generate different executables, which also depend on the optimization flags used, the data structures employed, and the code constructs. A particular platform may therefore be better suited for certain PDE applications. The performance spectrum model is useful for quickly visualizing and comparing the impact of platform characteristics, software, compiler options, and algorithms.

  2. Software/solver implementation: There are several software packages suited for sophisticated finite element simulations, such as the C++ based DEAL.II package [Bangerth et al., 2007], the Python based Firedrake Project [Rathgeber et al., 2016], the Python/C++ based FEniCS Project [Alnæs et al., 2015], and the C++ based LibMesh [Kirk et al., 2006] and MOOSE [Gaston et al., 2009] projects. These scientific libraries all use PETSc's linear algebra backend, but they can also use other packages such as HYPRE [Falgout, 2006] and Trilinos/ML [Heroux et al., 2005]. How well can specific solvers or software packages solve the same boundary value problem? Algebraic multigrid solvers follow various theoretical approaches and implementation strategies, so it is entirely possible that certain solver configurations are better suited for a particular hardware architecture or PDE. Multigrid solvers for optimization problems remain a difficult research topic but will be imperative for sustaining a high level of computational performance. Quick visual representations of the AI and equations solved per second can certainly guide programmers and scientists in the right direction when designing or implementing different software and solvers.

  3. Numerical discretization: Finally, various flavors of numerical discretizations, such as the finite difference, finite element, and finite volume methods, not only have different orders of mathematical accuracy but also different numbers of discrete equations to solve for a given mesh. Consider the Continuous Galerkin (CG) and Discontinuous Galerkin (DG) finite element methods: the DG method clearly has more degrees-of-freedom, since each element has its own copy of a geometric node, but does that necessarily mean it is more time consuming? For example, if the CG and DG elements each take roughly the same time to attain a solution for the same computational domain, then the DG element has a higher rate metric because it has more degrees-of-freedom for a given h-size, hence a bigger numerator in equation (3.5). This is important for computational scientists and mathematicians who want to compare the convergence rates of various numerical methods, particularly if h-refinement studies are involved. A cost-benefit analysis can be performed when comparing numerical accuracy versus computational cost, often quantified using a work-precision diagram [Knepley and Bardhan, 2015]. One could also compare the impact finite element discretizations have on different geometric elements (e.g., tetrahedra, hexahedra, wedges, etc.). The performance of any numerical method depends on the hardware limitations and software implementations, but this spectrum can be useful for comparing the different discretizations and polynomial orders available in sophisticated finite element simulation packages.

4. Demonstration of the Performance Spectrum

As proof of concept, we apply the proposed performance spectrum to study the computational performance of a couple of popular finite element packages when used to solve the steady-state diffusion equation. A series of demonstrations shall enrich our current understanding of how hardware limitations, software implementation, numerical discretization, and material properties can impact performance and scalability. We restrict our studies to the C++ implementation of the FEniCS Project and the Python implementation of the Firedrake Project, both of which leverage several scientific libraries such as PETSc, HYPRE, and Trilinos/ML. The GMRES iterative solver is used with various algebraic multigrid preconditioners and a fixed relative convergence tolerance. All numerical simulations are performed on a single AMD Opteron 2354 node and a single Intel Xeon E5-2680v2 node as described in Table 1. In this section, the performance spectrum model is used only to assess the assembly and solve steps.

The steady-state diffusion equation gives rise to a second-order elliptic partial differential equation. To this end, let Ω denote the computational domain, and let ∂Ω denote its boundary. A spatial point is denoted by x. The unit outward normal to the boundary is denoted by n̂(x). The boundary is divided into two parts: the part on which Dirichlet boundary conditions are prescribed is denoted by Γ^D, and the part on which Neumann boundary conditions are prescribed is denoted by Γ^N. For mathematical well-posedness we assume that

(4.1)    Γ^D ∪ Γ^N = ∂Ω  and  Γ^D ∩ Γ^N = ∅.

The corresponding boundary value problem takes the following form:

(4.2a)    -div[D(x) grad[c(x)]] = f(x)    in Ω,
(4.2b)    c(x) = c^p(x)    on Γ^D,
(4.2c)    -n̂(x) · D(x) grad[c(x)] = q^p(x)    on Γ^N,
Figure 5. (a) Analytical solution of the steady-state diffusion example and (b) the corresponding mesh skeleton of the structured grid containing tetrahedra.
h-size | Tetrahedra | Vertices  | FEniCS error | Firedrake error
1/20   | 48,000     | 9,261     | 1.48E-02     | 2.96E-02
1/40   | 384,000    | 68,921    | 3.90E-03     | 7.77E-03
1/60   | 1,296,000  | 226,981   | 1.75E-03     | 3.51E-03
1/80   | 3,072,000  | 531,441   | 9.89E-04     | 1.99E-03
1/100  | 6,000,000  | 1,030,301 | 6.34E-04     | 1.28E-03
1/120  | 10,368,000 | 1,771,561 | 4.41E-04     | 8.88E-04
1/140  | 16,464,000 | 2,803,221 | 3.24E-04     | 6.52E-04
       |            |           | slope: 1.97  | slope: 1.96
Table 2. Mesh discretization and CG1 error norms with respect to h-refinement.

where c(x) is the scalar concentration field, D(x) is the diffusivity coefficient, f(x) is the volumetric source, c^p(x) is the prescribed concentration on the boundary Γ^D, and q^p(x) is the prescribed flux on the boundary Γ^N. Assuming a homogeneous and isotropic diffusivity, we consider the following analytical solution and corresponding forcing function on a unit cube:

(4.3)
(4.4)

Homogeneous Dirichlet boundary conditions are applied on all faces, and the analytical solution for c(x) is presented in Figure 5. The next few studies consider the following h-sizes on a structured tetrahedral mesh: 1/20, 1/40, 1/60, 1/80, 1/100, 1/120, and 1/140. All mesh information and error norms for the FEniCS and Firedrake implementations of the continuous Galerkin (CG1) element are listed in Table 2.
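For readers unfamiliar with these packages, the following Firedrake sketch sets up the kind of CG1 solve benchmarked in this section; it is not the authors' benchmark script, and the mesh resolution, forcing function, and solver tolerance are illustrative placeholders rather than the exact values used to produce Table 2.

```python
# Hypothetical Firedrake setup: CG1 steady-state diffusion on a unit cube, GMRES + PETSc AMG (GAMG).
import math
from firedrake import *

mesh = UnitCubeMesh(20, 20, 20)                 # structured tetrahedral mesh, h-size = 1/20
V = FunctionSpace(mesh, "CG", 1)                # continuous Galerkin, first order (CG1)

u, v = TrialFunction(V), TestFunction(V)
x, y, z = SpatialCoordinate(mesh)
f = Function(V).interpolate(sin(math.pi * x) * sin(math.pi * y) * sin(math.pi * z))  # placeholder source

a = inner(grad(u), grad(v)) * dx                # homogeneous, isotropic diffusivity
L = f * v * dx
bc = DirichletBC(V, 0.0, "on_boundary")         # homogeneous Dirichlet conditions on all faces

uh = Function(V, name="concentration")
solve(a == L, uh, bcs=[bc],
      solver_parameters={"ksp_type": "gmres",   # Krylov solver used throughout this section
                         "ksp_rtol": 1e-7,      # placeholder relative tolerance
                         "pc_type": "gamg"})    # PETSc's algebraic multigrid preconditioner
```

Timing the assembly and solve phases of such a script, together with hardware counters, provides the data needed to populate the intensity and rate axes of the spectrum.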

4.1. Demo #1: AMD Opteron 2354 vs Intel Xeon E5-2680v2

Figure 6. Demo #1: L1 arithmetic intensity for the FEniCS finite element package with PETSc's algebraic multigrid solver on a single Opteron 2354 and E5-2680v2 compute node.
Figure 7. Demo #1: Static-scaling for the FEniCS finite element package with PETSc's algebraic multigrid solver on a single Opteron 2354 and E5-2680v2 compute node.
Figure 8. Demo #1: Static-scaling per MPI process for the FEniCS finite element package with PETSc's algebraic multigrid solver on a single Opteron 2354 and E5-2680v2 compute node.
Figure 9. Demo #1: Strong-scaling efficiency for the FEniCS finite element package with PETSc's algebraic multigrid solver on a single Opteron 2354 and E5-2680v2 compute node.

We first compare the AI on a single Intel Xeon E5-2680v2 node and a single AMD Opteron 2354 node for FEniCS's implementation of the CG1 element coupled with PETSc's algebraic multigrid preconditioner. The AI, as seen in Figure 6, gradually decreases with mesh refinement. Moreover, increasing the number of MPI processes also reduces the AI. The Intel processor has smaller L1 and L2 caches than the AMD processor, which explains why the former has lower AIs. It can be concluded that a higher AI on a different machine does not necessarily translate to better performance, because clock rates and memory bandwidths differ. The fact that differences in AI do not directly relate to time-to-solution can also be seen in Figure 6.

The static-scaling plot is shown in Figure 7. It is clear that the Intel processor is capable of solving more degrees-of-freedom per second than the AMD processor. Increasing the number of MPI processes improves the Rate metric, which is expected since the time-to-solution is reduced. Employing the per-process Rate from equation (3.7), as seen in Figure 8, gives better insight into the effect that adding more MPI processes on a single node has on static-scaling performance. We also note that when only one or two MPI processes are used, the degrees-of-freedom solved per second degrade as the problem size increases. The line plots for Intel reach higher apexes as more MPI processes are used and larger problems are solved. The curves "dipping" to the left indicate a degradation in parallel performance: these problems are very small (e.g., an h-size of 1/20 results in 9,261 degrees-of-freedom, which distributed among 16 MPI processes means each process solves only roughly 580 equations), so more of the execution time is spent on interprocess communication and latencies than on actual computation. Both the Rate and per-process Rate lines decrease with problem size on the AMD node, whereas the line plots for the Intel node are relatively flat, suggesting that the FEniCS and PETSc combination is in fact algorithmically scalable for the problem at hand.

Figure 9 depicts the parallel efficiency on the two different nodes. The parallel performance for the smaller h-sizes is significantly worse due to the lack of computation available for a given MPI concurrency. It is interesting to note that the speedup on the AMD node is slightly greater than on the Intel node. Recalling from Figure 6 that the AI on the AMD node is larger, we can infer that higher AIs indicate a stronger likelihood of experiencing greater parallel speedup. This behavior is consistent with the strong-scaling results of the optimization-based solvers for the Chromium remediation problem in [Chang et al., 2017], where similar classes of AMD and Intel processors were studied.

Remark 4.

We note here that we expect speedup on a single node using MPI processes to be comparable to that achievable by threaded implementations such as OpenMP or Intel TBB, since there are negligible system differences between threads and processes [Knepley et al., 2015]. This kind of behavior has been observed in careful studies of combined MPI+OpenMP finite element simulations [Turcksin et al., 2016].

4.2. Demo #2: FEniCS vs Firedrake

Figure 10. Demo #2: L1 arithmetic intensities for the FEniCS and Firedrake finite element packages with various solver packages on a single E5-2680v2 node with 16 MPI processes.
Figure 11. Demo #2: Static-scaling for the FEniCS and Firedrake finite element packages with various solver packages on a single E5-2680v2 node with 16 MPI processes.
Figure 12. Demo #2: Number of GMRES iterations required for the FEniCS and Firedrake finite element packages with various solver packages on a single E5-2680v2 node with 16 MPI processes.
Figure 13. Demo #2: Correlation between the L1/L2/L3 arithmetic intensities and strong-scaling efficiency on a single E5-2680v2 node for up to 16 MPI processes when h-size = 1/140.
Figure 14. Demo #2: Comparison between HYPRE's default solver parameters and HYPRE's optimized solver parameters through the Firedrake package on a single E5-2680v2 node with 16 MPI processes.

Next, we compare the FEniCS and Firedrake implementations of the CG1 element with 16 MPI processes on a single Intel Xeon E5-2680v2 node. The same steady-state diffusion equation is considered, but we now investigate how other multigrid solver packages like HYPRE and ML affect the performance.

The AIs in Figure 10 clearly depend on the software implementation, the solver used, and the problem size. The results in this figure suggest that the FEniCS and Firedrake packages drive the PETSc and HYPRE multigrid solvers in very similar ways. However, the AI for FEniCS's implementation of the ML solver deteriorates rapidly with problem size. Similar behavior is observed in the static-scaling plot of Figure 11, where the data points with the highest AI also have the highest rate at which equations are solved. Unlike the previous demonstration, where different hardware platforms were compared, the AI and rate metrics are here strongly correlated with each other, and it is clear that FEniCS's current implementation of the ML solver has some issues, since its tailing off towards the right occurs before either the PETSc or HYPRE lines do.

With these two graphs in mind, one may wonder why the tailing off occurs. Does it occur due to suboptimal algorithmic convergence (i.e., the iteration count increases with problem size), or do the data structures needed for important solver steps begin to fall out of cache? A scalable algorithm suggests that the number of solver iterations should not increase by much as the problem size increases, so if the GMRES iteration count grows significantly, it is possible that the rate metric will decrease. Figure 12 shows the number of GMRES iterations needed for every finite element package and solver, and it can be seen that the iteration counts do not increase by much. These plots must be interpreted carefully: although the iteration plots suggest algorithmic scalability, the degradation of AI with problem size suggests that the current software and solver parameters are not configured to utilize the hardware efficiently. As shown in the previous demonstration, the AI is useful for predicting which algorithms will see greater speedups as the number of MPI processes is increased. Figure 13 compares the AI and parallel performance of Firedrake's three solver configurations. Regardless of which level of cache is used to determine the AI, HYPRE and ML have the lowest and highest AIs, respectively. Moreover, HYPRE and ML have the worst and best parallel speedups, respectively, which again supports the observation that the AI metric is useful for predicting which algorithms may achieve the greatest parallel speedup.

We note that the HYPRE solver has relatively poor performance here, suggesting that its out-of-the-box parameters are unfit for the problem at hand. One of the best ways to improve the AI and Rate metrics is simply to adjust some of the solver parameters. If, for example, we increase the strong threshold used for coarsening, the performance improves dramatically, as Figure 14 shows. The AI and Rate metrics become competitive with Firedrake's PETSc and ML configurations, but it is important to realize that the GMRES iteration counts now increase with problem size. An algorithm whose iteration count is low, or remains constant as the problem size increases, does not necessarily deliver good performance and scalability. Neither the AI nor the rate metric tails off towards the right, suggesting that the optimized HYPRE solver is scalable despite some minor growth in the GMRES iteration count. As discussed in the previous demonstration, answers regarding the performance and scalability of various solvers and software will also depend on the hardware.
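As an illustration of the kind of parameter adjustment described above, the dictionary below shows how a raised BoomerAMG strong threshold can be passed to Firedrake through PETSc options; the option names are standard PETSc/HYPRE options, but the threshold value of 0.75 is illustrative and not necessarily the value used by the authors.

```python
# Hypothetical tuned HYPRE BoomerAMG configuration for the solve() call shown earlier.
tuned_hypre = {
    "ksp_type": "gmres",
    "ksp_rtol": 1e-7,                              # placeholder tolerance
    "pc_type": "hypre",
    "pc_hypre_type": "boomeramg",
    "pc_hypre_boomeramg_strong_threshold": 0.75,   # default is 0.25; values of 0.5 or higher are often advised for 3D
}
# solve(a == L, uh, bcs=[bc], solver_parameters=tuned_hypre)
```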

4.3. Demo #3: Continuous Galerkin vs Discontinuous Galerkin

h-size | CG1       | CG2       | DG1        | DG2
1/20   | 9,261     | 68,921    | 192,000    | 480,000
1/40   | 68,921    | 531,441   | 1,536,000  | 3,840,000
1/60   | 226,981   | 1,771,561 | 5,184,000  | 12,960,000
1/80   | 531,441   | 4,173,281 | 12,288,000 | 30,720,000
1/100  | 1,030,301 | 8,120,601 | -          | -
Table 3. Demo #3: Degrees-of-freedom with respect to h-refinement. In this study we do not consider h-size = 1/100 for the DG1 or DG2 elements.
h-size | CG1      | CG2      | DG1      | DG2
1/20   | 2.96E-02 | 3.81E-04 | 1.65E-02 | 2.16E-04
1/40   | 7.77E-03 | 3.79E-05 | 4.35E-03 | 2.26E-05
1/60   | 3.51E-03 | 1.06E-05 | 1.97E-03 | 6.47E-06
1/80   | 1.99E-03 | 4.44E-06 | 1.12E-03 | 2.72E-06
1/100  | 1.28E-03 | 2.25E-06 | -        | -
slope: | 1.95     | 3.19     | 1.94     | 3.16
Table 4. Demo #3: Error norms with respect to h-refinement for various finite elements provided through the Firedrake package. In this study we do not consider h-size = 1/100 for the DG1 or DG2 elements.
Figure 15. Demo #3: L1/L2/L3 arithmetic intensities of Firedrake's various finite element formulations on a single E5-2680v2 node with 16 MPI processes.
Figure 16. Demo #3: Static-scaling for Firedrake's various finite element formulations on a single E5-2680v2 node with 16 MPI processes.
Figure 17. Demo #3: Degrees-of-freedom vs degrees-of-freedom solved per second for Firedrake's various finite element formulations on a single E5-2680v2 node with 16 MPI processes.
Figure 18. Demo #3: Solver iterations needed for Firedrake's various finite element formulations on a single E5-2680v2 node with 16 MPI processes.
Figure 19. Demo #3: Static-scaling per solver iteration for Firedrake's various finite element formulations on a single E5-2680v2 node with 16 MPI processes.

So far we have only considered the CG1 finite element. What happens if we employ another discretization, such as the Discontinuous Galerkin (DG) method? Moreover, what happens if we increase the polynomial order and employ second-order CG (CG2) and second-order DG (DG2) elements? Various families of elements and their respective levels of h-refinement change both the size and the numerical accuracy of the solution, so it is desirable to understand both the costs and the benefits of these choices on a particular mesh (see the sketch after this paragraph for how the discretizations are selected). Tables 3 and 4 contain the total degrees-of-freedom and error norms, respectively, for Firedrake's various finite element discretizations. The CG elements are studied up to h-size = 1/100, whereas the DG elements are studied up to h-size = 1/80. We again employ 16 MPI processes on a single Intel Xeon E5-2680v2 node, and all finite element discretizations in this demonstration are solved with the optimized (i.e., increased strong threshold for coarsening) HYPRE parameters.
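The sketch below (illustrative, not the authors' code) shows how the four discretizations compared in this demonstration are selected in Firedrake by changing only the element family and polynomial degree; note that an actual DG discretization of the diffusion equation also requires an interior-penalty variational form, which is not shown here.

```python
# Hypothetical selection of the CG/DG function spaces compared in Demo #3.
from firedrake import UnitCubeMesh, FunctionSpace

mesh = UnitCubeMesh(20, 20, 20)                 # h-size = 1/20
V_cg1 = FunctionSpace(mesh, "CG", 1)            # continuous Galerkin, degree 1
V_cg2 = FunctionSpace(mesh, "CG", 2)            # continuous Galerkin, degree 2
V_dg1 = FunctionSpace(mesh, "DG", 1)            # discontinuous Galerkin, degree 1
V_dg2 = FunctionSpace(mesh, "DG", 2)            # discontinuous Galerkin, degree 2

# Global degrees-of-freedom for each space (cf. the first row of Table 3).
print(V_cg1.dim(), V_cg2.dim(), V_dg1.dim(), V_dg2.dim())
```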

Figure 15 contains the AI for the CG1, CG2, DG1, and DG2 elements. What we learn from these results is that increasing the polynomial order lowers the AI for the CG elements, whereas the AI increases for the DG elements. This may not always be the case, because different solvers and different hardware architectures may be better tailored to different discretizations; other finite element packages like the FEniCS or DEAL.II projects may give very different results. The Rate metric, as seen in Figure 16, depicts the rate at which each discretization solves its equations. Alternatively, one could also compare the Rate metric with respect to the degrees-of-freedom, as seen in Figure 17. Although DG elements have more degrees-of-freedom for a given mesh, the DG1 element has the highest Rate metric, suggesting that the optimized HYPRE solver parameters are especially suitable for DG1 elements. Unlike the FEniCS and ML combination from the previous demonstration, the DG2 discretization experiences significant degradation in the static-scaling plot yet maintains relatively consistent AIs. This raises the question of whether the tailing off towards the right is due to memory effects or suboptimal algorithmic convergence.

As previously observed in Figure 14, the optimized HYPRE parameters resulted in a slight increase in GMRES iteration count for CG1 elements, and we notice similar trends for the other finite elements in Figure 18. If the iteration count increase is significant enough, it can negatively affect static-scaling. To determine whether this solver iteration growth stymied the rate at which equations are solved, we can employ the per-iteration Rate (i.e., degrees-of-freedom solved per second per solver iteration) from equation (3.6), as shown in Figure 19. In this particular demonstration, it makes no difference: we still observe degradation with respect to problem size, suggesting that memory bandwidth and cache behavior have an adverse effect on the simulation. Using more compute nodes may certainly ameliorate both the AI and rate metrics for the DG2 element, but it should again be cautioned that comparative studies on the performance of numerical methods and solvers strongly depend on both the code implementation and the nature of the computing platform.

4.4. Demo #4: Material properties

h-size
1/20   | 1.48E-02 | 3.45E-02 | 4.83E-02 | 5.71E-02 | 5.86E-02
1/40   | 3.90E-03 | 9.31E-03 | 1.46E-02 | 1.90E-02 | 1.99E-02
1/60   | 1.75E-03 | 4.23E-03 | 6.84E-03 | 9.26E-03 | 9.76E-03
1/80   | 9.89E-04 | 2.40E-03 | 3.94E-03 | 5.43E-03 | 5.75E-03
1/100  | 6.34E-04 | 1.55E-03 | 2.55E-03 | 3.55E-03 | 3.77E-03
1/120  | 4.41E-04 | 1.07E-03 | 1.78E-03 | 2.49E-03 | 2.64E-03
1/140  | 3.24E-04 | 7.88E-04 | 1.31E-03 | 1.84E-03 | 1.96E-03
slope: | 1.97     | 1.94     | 1.86     | 1.77     | 1.75
Table 5. Demo #4: Error norms with respect to h-refinement for various values of the heterogeneity parameter in equation (4.5) when using the Firedrake implementation of the CG1 element.
Figure 20. Demo #4: Performance spectrum for various values of the heterogeneity parameter in equation (4.5) on a single E5-2680v2 node with 16 MPI processes.

So far, all three of our demonstrations have been conducted in a homogeneous domain. However, many scientific problems are heterogeneous in nature, which can complicate the physics of the governing equations and make them more expensive to solve numerically. In [Chang and Nakshatrala, 2017], it was shown that solving heterogeneous problems like chaotic flow results in suboptimal algorithmic convergence (i.e., the iteration counts grow with h-refinement), so our goal here is to demonstrate how physical properties such as heterogeneity and anisotropy may skew how we interpret the performance. Let us now assume that we have a heterogeneous and anisotropic diffusivity tensor that can be expressed as follows

(4.5)

where the user-defined constant controls the level of heterogeneity and anisotropy present in the computational domain. By employing the same analytical solution as in equation (4.3), the various values of this constant give rise to new forcing functions. The corresponding error norms for Firedrake's CG1 elements are shown in Table 5. Again, a single Intel Xeon E5-2680v2 compute node with 16 MPI processes is used for this study, and PETSc's algebraic multigrid solver is used to solve these problems.

Figure 20 depicts the AI, Rate, solver iteration, and per-iteration Rate metrics. The AI is not affected by the heterogeneity parameter, which suggests that there are no hardware or software implementation issues, only that the Rate metric tails off as the parameter is increased. We see that, while the iteration growth is significant, the per-iteration Rate metric is still flat for this heterogeneous and anisotropic steady-state diffusion problem. Thus, the primary reason that the data points in the static-scaling plots decrease with problem size has little to do with memory contention.

5. Case Study Part 1: Single Node

The previous section, which focused entirely on the steady-state diffusion equation, covered the basic ways one can utilize the proposed performance spectrum model to help justify, interpret, or diagnose the computational performance of any algorithm, numerical method, or solver for a particular compute node. In these next two sections, we demonstrate that this performance spectrum model is also useful for more complicated and nonlinear PDEs. We consider PETSc’s toy hydrostatic ice sheet flow example, based on the work of [Brown et al., 2013], with geometric multigrid and apply the performance spectrum to give us a better understanding of how certain HPC platforms scale.

5.1. Hydrostatic ice sheet flow equations

Consider a computational ice domain lying between a Lipschitz continuous bed and surface. The hydrostatic equations are obtained from the non-Newtonian Stokes equations under the assumption that the horizontal derivatives of the velocity in the vertical z-direction are small and negligible. Denoting the horizontal components of the velocity field by u = (u, v), where u and v are parallel to the x and y axes respectively, the governing equations for hydrostatic ice sheet flow are given by

(5.1a)
(5.1b)

where η is the nonlinear effective viscosity, expressed by

(5.2)

where ice sheet models typically take a standard value for the power-law exponent. The hardness parameter and the regularizing strain rate also enter this definition, and the second invariant is expressed by

(5.3)

More information on the theoretical derivation of the above equations can be found in [Schoof, 2006; Schoof and Hindmarsh, 2010]. Equation (5.1) is subject to natural boundary conditions at the free surface and to either no-slip or power-law slip conditions with a friction parameter:

(5.4)

where the slip law involves a regularizing velocity, a “low-speed” reference friction, and a slip exponent.

5.2. Problem setup

The hydrostatic ice sheet flow equations are discretized using hexahedral Q1 finite elements on a structured grid, and Figure 21 shows the corresponding solution. Details concerning the theoretical derivation of the variational formulation, as well as the parameters used for the boundary value problem, can be found in [Brown et al., 2013]. Since this example problem is written entirely with PETSc routines and function calls, the [FLOPs] metric in equation (3.4) is determined using PETSc’s manual FLOP counts instead of hardware FLOP counters. This is particularly useful for conducting a thorough comparative study of PETSc’s suite of linear algebra solvers on a particular PDE.

For this problem, we begin with an initial coarse grid and successively refine it until we reach the desired problem size and numerical accuracy. The “fine grids” produced from this element-wise refinement are solved using the geometric multigrid technique, whereas the initial coarse grid is solved using algebraic multigrid. The assembled Jacobian employs PETSc’s block AIJ format (a blocked variant of the compressed sparse row format), where the two horizontal velocity components are grouped per grid node. Since ice flow is tightly coupled in the vertical direction, the parallel domain decomposition is set up so that grid points in the vertical direction are never distributed across processes and are always contiguous in memory. The initial grid size must be chosen carefully because the mesh partitioning happens at the coarsest level and may cause load-balancing issues if the initial grid is not large enough.
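
As a rough check on the grid sizes reported below, the degrees-of-freedom in Tables 6 and 8 are consistent with a simple counting rule: two horizontal velocity unknowns per grid node, horizontal node counts doubling on refinement, and the vertical count refined as m → 2m − 1. The sketch below encodes this rule; it is an inference from the reported numbers, not code taken from the example itself.

```python
def hydrostatic_dofs(mx, my, mz, levels):
    """Degrees of freedom for an mx x my x mz Q1 grid after 'levels' refinements.

    Assumes the horizontal node counts double on refinement while the vertical
    count goes from mz to 2*mz - 1, with two velocity unknowns per grid node.
    """
    for _ in range(levels):
        mx, my, mz = 2 * mx, 2 * my, 2 * mz - 1
    return 2 * mx * my * mz

# Reproduces Table 6 for the initial 40x40x5 coarse grid:
print([hydrostatic_dofs(40, 40, 5, r) for r in range(4)])
# [16000, 115200, 870400, 6758400]
```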

5.3. Results

Figure 21. Numerical solution of the velocity vector field for the hydrostatic ice sheet flow example.
Levels of refinement Degrees-of-freedom SNES iterations KSP iterations
0 16,000 7 39
1 115,200 8 45
2 870,400 8 44
3 6,758,400 8 44
Table 6. Hydrostatic ice sheet flow for a single node: mesh information and number of solver iterations needed for an initial 40×40×5 coarse grid. KSP and SNES iteration counts may vary depending on the number of MPI processes used.
Figure 22. Hydrostatic ice sheet flow single node: L1 arithmetic intensity, based on PETSc’s manual FLOP counts and L1 cache misses. Note that the two Ivybridge and KNL nodes are only partially saturated.
Figure 23. Hydrostatic ice sheet flow single node: Static-scaling. Note that the two Ivybridge and KNL nodes are only partially saturated.
Figure 24. Hydrostatic ice sheet flow single node: Static-scaling per MPI process. Note that the two Ivybridge and KNL nodes are only partially saturated.

First, we provide an initial 40×40×5 coarse grid and successively refine this grid up to 3 times. All five processors from Table 1 are studied, and the KNL processor is configured to use MCDRAM in flat mode. Each node has a different number of available cores, so in order to maintain relatively consistent mesh partitioning, the Ivybridge and KNL processors only utilize 16 and 64 cores, respectively. Table 6 presents the problem size as well as the total number of SNES and KSP iterations needed for each level of refinement. Figure 22 depicts the AI metrics with respect to the overall time-to-solution. Each data point has the same coarse grid size but a different level of grid refinement, ranging from 0 to 3. The Ivybridge and Haswell processors have similar AIs, which are significantly smaller than those of their KNL and AMD counterparts. It should be noted that GNU compilers were used to compile the problem on the AMD Opteron 2354 and Intel Xeon E5-2680v2 processors, whereas the other three processors used Cray compilers, which could explain why the AIs of the two Ivybridge processors are slightly different. As with the hardware counter examples in the last section, the AIs are initially small for the coarser problems but eventually stabilize once a problem is sufficiently large. It is interesting to note that the AMD processor is consistently flat for all data points, suggesting that even the smaller problem sizes selected for this example have already approached the achievable peak of that processor’s computational and memory resources. On the other hand, the KNL’s wider vector instruction sets and caches are not fully utilized for smaller problems, resulting in low AIs.

The static-scaling plot for each of these compute nodes is shown in Figure 23, and Figure 24 depicts static-scaling based on the Rate from (3.7). Unsurprisingly, the AMD processor is outperformed by all of the Intel processors. Haswell and KNL have the best performance of all the systems studied, but we again notice that the KNL processor has poor metrics when the grid is small. Furthermore, KNL’s performance per core is considerably lower, as seen in Figure 24. There are several reasons for this behavior. First, as already noted from the AI results, the problem has to be sufficiently large in order to fully utilize the KNL vector instructions. Second, we used 64 of the 68 available cores on the KNL node, which is at least double the number of cores used on the other systems. The number of degrees-of-freedom per MPI process is therefore significantly smaller (for the finest grid, roughly 105,600 per process on KNL versus about 422,400 on the 16-process nodes), so it is possible that interprocess communication time affects the scaling results.

6. Case Study Part 2: Multiple Nodes

The results from the examples presented thus far motivate an investigation of what happens when more than one compute node is needed to solve a PDE. Figures 9 and 11 indicated that the AI metrics can be used to predict the strong-scaling potential on a single node; our goal is now to investigate whether that correlation holds across multiple nodes. To ensure that the problem is sufficiently large to distribute over several nodes, we consider an initial 64×64×6 coarse grid with three levels of refinement (21,495,808 degrees-of-freedom). The number of KSP and SNES iterations needed to solve the problem are 62 and 8, respectively.
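
The counting sketch from Section 5.2 reproduces this figure as well; this is a consistency check on the inferred refinement rule rather than a statement about the example's internals.

```python
# 64x64x6 coarse grid with three levels of refinement (Table 7 setup)
assert hydrostatic_dofs(64, 64, 6, 3) == 21_495_808
```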

Nodes E5-2695v2 (Ivybridge) E5-2698v3 (Haswell) Xeon Phi 7250 (KNL)
Cores Time (s) Eff. (%) Cores Time (s) Eff. (%) Cores Time (s) Eff. (%)
1 16 300 - 32 227 - 64 193 -
2 32 150 100 64 108 105 128 92.4 104
4 64 72.0 104 128 57.3 99.0 256 47.3 102
8 128 37.9 98.9 256 28.7 98.9 512 25.2 95.7
16 256 18.9 99.2 512 13.9 100 1024 15.1 79.9
32 512 9.65 97.2 1024 8.11 87.5 2048 10.6 56.9
64 1024 6.75 69.4 2048 4.62 76.8 4096 9.27 32.5
Table 7. Hydrostatic ice sheet flow strong-scaling for an initial 64×64×6 coarse grid.
Levels of refinement Degrees-of-freedom SNES iterations KSP iterations
1 3,014,656 8 85
2 23,592,960 8 85
3 186,646,528 8 85
4 1,484,783,616 8 85
5 11,844,714,496 8 85
Table 8. Hydrostatic ice sheet flow for multiple nodes: mesh information and number of solver iterations needed for an initial 128×128×12 coarse grid. KSP and SNES iteration counts may vary depending on the number of MPI processes used.

Table 7 contains the strong-scaling results for the HPC systems containing the Intel Xeon E5-2695v2 (Ivybridge), Intel Xeon E5-2698v3 (Haswell), and Intel Xeon Phi 7250 (KNL) nodes. All three systems demonstrate near-perfect strong-scaling until 1024 cores are used (roughly 20,000 degrees-of-freedom per core). However, it is difficult to make direct performance comparisons because the systems employ different numbers of MPI processes per node, which affects the communication-to-computation ratio as well as the required data bandwidth between nodes. The only concrete conclusion that can be made is that the KNL system takes the least wall-clock time on a single compute node but is outperformed once the problem size per node shrinks. Figures 22 and 23 suggest that when the problem size on a KNL node is sufficiently small, parallel performance degrades drastically, which is exactly what the results in Table 7 show.
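
The efficiencies reported in Table 7 are consistent with normalizing against each system's single-node run; the short post-processing sketch below, which only manipulates the measured wall-clock times, reproduces the Ivybridge column under that assumption.

```python
def strong_scaling_efficiency(nodes, times):
    """Percent parallel efficiency relative to the smallest node count."""
    n0, t0 = nodes[0], times[0]
    return [100.0 * (n0 * t0) / (n * t) for n, t in zip(nodes, times)]

# Ivybridge column of Table 7
nodes = [1, 2, 4, 8, 16, 32, 64]
times = [300, 150, 72.0, 37.9, 18.9, 9.65, 6.75]
print([round(e, 1) for e in strong_scaling_efficiency(nodes, times)])
# [100.0, 100.0, 104.2, 98.9, 99.2, 97.2, 69.4]
```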

6.1. Example #1: 1024 MPI processes

Figure 25. Hydrostatic ice sheet flow multiple nodes: AI when the systems all employ 1024 cores (64 Ivybridge nodes, 32 Haswell nodes, and 16 KNL nodes). Grid sizes ranging from 3 million to 186 million degrees-of-freedom are considered.
Figure 26. Hydrostatic ice sheet flow multiple nodes: Static-scaling when the systems all employ 1024 MPI processes (64 Ivybridge nodes, 32 Haswell nodes, and 16 KNL nodes). Three levels of refinement are considered.
Figure 27. Hydrostatic ice sheet flow multiple nodes: AI when the systems all employ 256 nodes (4096, 8192, and 16384 MPI processes for Ivybridge, Haswell, and KNL respectively). Five levels of refinement are considered.

In this section, we consider what happens when the systems employ the same MPI concurrency. This example models the performance when the same hydrostatic ice sheet flow problem is solved using 1024 MPI processes on different systems, which we set up by allocating 64 Ivybridge nodes, 32 Haswell nodes, and 16 KNL nodes. An even larger initial 128×128×12 coarse grid is selected, and we refine the problem 1–3 times. Table 8 presents the problem size as well as the number of nonlinear and linear solver iterations needed for every level of refinement. Figures 25 and 26 contain the intensity and rate metrics, respectively. The AI data points are either relatively flat or do not experience drastic changes upon mesh refinement. The static-scaling plot tells us that the Ivybridge system has the best performance as the problem gets larger. This behavior may seem to contradict the findings of the static-scaling plot in Figure 23, but it is important to realize that this PETSc application is limited by memory bandwidth and not by the TPP for the FLOP rate. The HPC system with Ivybridge processors has the best performance simply because it employs more compute nodes and thus has more aggregate memory bandwidth available.

6.2. Example #2: 256 compute nodes

Figure 28. Hydrostatic ice sheet flow multiple nodes: Static-scaling when the systems all employ 256 nodes (4096, 8192, and 16384 MPI processes for Ivybridge, Haswell, and KNL respectively). Five levels of refinement are considered.

The previous example is an elegant demonstration of why comparing HPC machines based on equal MPI concurrency can produce misleading performance metrics, especially for computational frameworks that are limited by memory bandwidth. What happens if every system instead employs the same number of compute nodes? In this example, each HPC system allocates 256 compute nodes: Ivybridge uses 4096 MPI processes (16 of 24 cores per node), Haswell uses 8192 MPI processes (32 of 32 cores per node), and KNL uses 16384 MPI processes (64 of 68 cores per node). We use the same initial coarse grid as in the previous example but now refine the problem 1–5 times. The AI metrics in Figure 27 again indicate relative consistency for the finer problems, and we again observe that the AI metric drops significantly if a problem is not large enough. This corroborates the notion that the drop in AI for small problems happens regardless of whether a single node or multiple nodes are used. The static-scaling plot shown in Figure 28 demonstrates that the Ivybridge system no longer beats the Haswell system. What is particularly interesting is that the performance of the KNL system varies drastically with problem size: KNL cannot match Ivybridge for small problems, but it beats both Ivybridge and Haswell when the problem is neither too small nor too large.

The performance spectrum model is useful for understanding performance characteristics across a wide variety of hardware architectures. Although the STREAM Triad measurements from Table 1 suggest that KNL should greatly outperform Ivybridge and Haswell for memory-bandwidth-dominated applications, the performance spectrum indicates that current, practical implementations of scientific software like PETSc v3.7.4 may run slowly on KNL when the problem is dominated by main-memory bandwidth. Different platforms require different implementation methodologies in order to maximize performance, so optimizing computational frameworks to fully leverage the KNL processor remains an open research problem. Nonetheless, the performance spectrum model is useful for testing various implementations of PDE solvers and can be utilized to understand the hardware architecture trends and algorithms of the future.

7. Concluding Remarks

In this paper, we have proposed a performance model, referred to as the performance spectrum, designed to simultaneously capture both the hardware/architectural and algorithmic efficiencies of a variety of parallel PDE solvers. The techniques used to approximate these efficiency metrics are 1) the arithmetic intensity, which documents the ratio of FLOPs to data cache misses, and 2) static-scaling, which scales up the problem size while fixing the parallel concurrency. This spectrum enabled us to visualize and enrich our understanding of performance issues related to hardware limitations, software and solver implementation, and numerical discretization for some popular, state-of-the-art finite element simulation packages and solvers. Moreover, it has been shown that this spectrum is also useful for understanding the performance and scalability of complex solvers and nonlinear PDEs, such as hydrostatic ice sheet flow, in a large-scale environment. Computational scientists have designed, and are still designing, the software and algorithms needed to answer many of today’s pressing scientific problems, and these problems must be solved not only accurately but also quickly. To understand how fast such solvers and software are, particularly ones that are black-box or designed by others, we need a performance model, such as the proposed performance spectrum, to help answer questions regarding computational performance.

7.1. Potential extension of this work

The performance spectrum model has proven to be a robust and versatile tool for modeling the performance and scalability of several parallel finite element solvers and packages, but several related research endeavors remain to be addressed. We briefly provide a list of potential future tasks and avenues of extension:

  1. Most of the systems chosen for this study are from Intel and have similar performance characteristics, but processors from other vendors, such as IBM’s POWER8 [Sinharoy et al., 2015] or ARM-based systems [Rajovic et al., 2014], may tell a different story. A logical next step would be to extend this performance spectrum analysis to GPUs. HPC architecture is constantly evolving, and simple benchmarks and measurements like STREAM Triad may not be sufficient for understanding the performance of complex solvers or algorithms on these state-of-the-art machines.

  2. The intensity equations presented in Section 3 are relatively easy to incorporate into any code, but they only provide relative comparisons between various flavors of hardware and software. An improved methodology for documenting the [Work] and [TBT] metrics would certainly improve the predictive capabilities of the performance spectrum model. Other popular tools, such as Intel’s SDE and VTune, can also be used to measure the AI on CPUs.

  3. Certain PDEs, like the advection-diffusion equation, are notoriously difficult to solve, especially at high Péclet numbers. The performance spectrum can be used to compare not only various numerical discretizations (e.g., finite element versus finite volume methods) but also specialized solvers when physical properties such as velocity and diffusivity vary with the length scale. Moreover, complex PDEs may involve multiple physical components, thus requiring mixed formulations or hybridization techniques that are not trivial to solve or implement.

  4. Finally, the performance spectrum is not restricted to solving PDEs. One can examine any computational scientific problem where performance and scalability are limited by problem size. Examples include, but are not limited to, boundary integral equations, uncertainty quantification, parameter estimation, adaptive mesh refinement, and parallel domain decomposition.

Acknowledgments

The authors acknowledge Jed Brown (University of Colorado at Boulder) for his invaluable thoughts and ideas, which inspired the creation of this paper. Compute time on the Maxwell and Opuntia systems was provided by the Center for Advanced Computing & Data Systems (CACDS) at the University of Houston, and compute time on the Edison and Cori systems was provided by the National Energy Research Scientific Computing Center (NERSC). MGK acknowledges partial support from the Rice Intel Parallel Computing Center and US DOE Contract DE-AC02-06CH11357. The opinions expressed in this paper are those of the authors and do not necessarily reflect those of the sponsors.

References

  • Adams et al. [2004] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos. Ultrascalable Implicit Finite Element Analyses in Solid Mechanics with Over a Half a Billion Degrees of Freedom. In ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing, 2004. Gordon Bell Award.
  • Adhianto et al. [2010] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010.
  • Alnæs et al. [2015] M. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells. The FEniCS Project Version 1.5. Archive of Numerical Software, 3(100):9–23, 2015.
  • Amestoy et al. [2000] P. R. Amestoy, I. S. Duff, J. L’Excellent, and J. Koster. MUMPS: A General Purpose Distributed Memory Sparse Solver. In International Workshop on Applied Parallel Computing, pages 121–130. Springer, 2000.
  • Balay et al. [2014] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc Web Page, 2014. http://www.mcs.anl.gov/petsc.
  • Balay et al. [2016] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, 2016.
  • Bangerth et al. [2007] W. Bangerth, R. Hartmann, and G. Kanschat. DEAL.II – A General Purpose Object Oriented Finite Element Library. ACM Trans. Math. Softw., 33(4):24/1–24/27, 2007.
  • Bramble et al. [1994] J. H. Bramble, D. Y. Kwak, and J. E. Pasciak. Uniform Convergence of Multigrid V-Cycle Iterations for Indefinite and Nonsymmetric Problems. SIAM Journal on Numerical Analysis, 31(6):1746–1763, 1994.
  • Brandt and Livne [2011] A. Brandt and O. Livne. Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, Revised Edition. SIAM, 2011.
  • Brown et al. [2012] J. Brown, M. G. Knepley, D. A. May, L. C. McInnes, and B. F. Smith. Composable Linear Solvers for Multiphysics. In Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), pages 55–62. IEEE Computer Society, 2012. URL http://doi.ieeecomputersociety.org/10.1109/ISPDC.2012.16.
  • Brown et al. [2013] J. Brown, B. Smith, and A. Ahmadia. Achieving Textbook Multigrid Efficiency for Hydrostatic Ice Sheet Flow. SIAM Journal on Scientific Computing, 35:B359–B375, 2013.
  • Brune et al. [2015] P. R. Brune, M. G. Knepley, B. F. Smith, and X. Tu. Composing Scalable Nonlinear Algebraic Solvers. SIAM Review, 57(4):535–565, 2015. doi: http://dx.doi.org/10.1137/130936725. URL http://www.mcs.anl.gov/papers/P2010-0112.pdf.
  • Castelletto et al. [2016] N. Castelletto, J. A. White, and M. Ferronato. Scalable Algorithms for Three-Field Mixed Finite Element Coupled Poromechanics. Journal of Computational Physics, 327:894 – 918, 2016.
  • Chang and Nakshatrala [2017] J. Chang and K. B. Nakshatrala. Variational Inequality Approach to Enforce the Non-Negative Constraint for Advection-Diffusion Equations. Computer Methods in Applied Mechanics and Engineering, 320:287–334, 2017.
  • Chang et al. [2017] J. Chang, S. Karra, and K. B. Nakshatrala. Large-scale Optimization-Based Non-Negative Computational Framework for Diffusion Equations: Parallel Implementation and Performance Studies. Journal of Scientific Computing, 70:243–271, 2017.
  • Chevalier and Pellegrini [2008] C. Chevalier and F. Pellegrini. PT-Scotch: A Tool for Efficient Parallel Graph Ordering. Parallel computing, 34(6):318–331, 2008.
  • Dagum and Menon [1998] L. Dagum and R. Menon. OpenMP: An Industry Standard API for Shared-Memory Programming. IEEE computational science and engineering, 5(1):46–55, 1998.
  • Dongarra et al. [1990] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1):1–17, March 1990. ISSN 0098-3500.
  • Falgout [2006] R. Falgout. HYPRE users manual. Technical Report Revision 2.0.0, Lawrence Livermore National Laboratory, 2006.
  • Gaston et al. [2009] D. Gaston, C. Newman, G. Hansen, and D. Lebrun-Grandie. MOOSE: A Parallel Computational Framework for Coupled Systems of Nonlinear Equations. Nuclear Engineering and Design, 239(10):1768–1778, 2009.
  • Geimer et al. [2010] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr. The Scalasca Performance Toolset Architecture. Concurrency and Computation: Practice and Experience, 22(6):702–719, 2010.
  • Grcar [2011] J. F. Grcar. Mathematicians of Gaussian Elimination. Notices of the American Mathematical Society, 58(6):782–792, 2011. URL http://www.ams.org/notices/201106/rtx110600782p.pdf.
  • Gropp et al. [1999a] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface, volume 1. MIT Press, 1999a.
  • Gropp et al. [1999b] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Toward Realistic Performance Bounds for Implicit CFD Codes. In Proceedings of Parallel CFD ‘99, pages 233–240. Elsevier, 1999b.
  • Hendrickson and Leland [1995] B. Hendrickson and R. W. Leland. A Multi-Level Algorithm For Partitioning Graphs. SC, 95(28), 1995.
  • Heroux et al. [2005] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. An Overview of the Trilinos project. ACM Trans. Math. Softw., 31(3):397–423, 2005. ISSN 0098-3500. doi: http://doi.acm.org/10.1145/1089014.1089021.
  • Karypis and Kumar [1999] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20:359–392, 1999.
  • Keyes et al. [2013] D. E. Keyes, L. C. McInnes, C. Woodward, W. Gropp, E. Myra, M. Pernice, J. Bell, J. Brown, A. Clo, J. Connors, E. Constantinescu, D. Estep, K. Evans, C. Farhat, A. Hakim, G. Hammond, G. Hansen, J. Hill, T. Isaac, X. Jiao, K. Jordan, D. Kaushik, E. Kaxiras, A. Koniges, K. Lee, A. Lott, Q. Lu, J. Magerlein, R. Maxwell, M. McCourt, M. Mehl, R. Pawlowski, A. P. Randles, D. Reynolds, B. Rivière, U. Rüde, T. Scheibe, J. Shadid, B. Sheehan, M. Shephard, A. Siegel, B. Smith, X. Tang, C. Wilson, and B. Wohlmuth. Multiphysics Simulations. The International Journal of High Performance Computing Applications, 27(1):4–83, 2013.
  • Kirk et al. [2006] B. S. Kirk, J. W. Peterson, R. H. Stogner, and G. F. Carey. LibMesh: A C++ Library for Parallel Adaptive Mesh Refinement/Coarsening Simulations. Engineering with Computers, 22(3–4):237–254, 2006. http://dx.doi.org/10.1007/s00366-006-0049-3.
  • Knepley and Bardhan [2015] M. G. Knepley and J. P. Bardhan. Work/Precision Tradeoffs in Continuum Models of Biomolecular Electrostatics. In Proceedings of ASME 2015 International Mechanical Engineering Congress & Exposition, volume 9, page V009T12A04, 2015. doi: 10.1115/IMECE2015-53579.
  • Knepley and Terrel [2013] M. G. Knepley and A. R. Terrel. Finite Element Integration on GPUs. ACM Transactions on Mathematical Software, 39(2), 2013. URL http://arxiv.org/abs/1103.0066.
  • Knepley et al. [2016] M. G. Knepley, K. Rupp, and A. R. Terrel. Finite Element Integration with Quadrature on the GPU, 2016. URL https://arxiv.org/abs/1607.04245.
  • Knepley et al. [2015] Matthew G. Knepley, Jed Brown, Lois Curfman McInnes, Barry Smith, Karl Rupp, and Mark Adams. Exascale computing without threads. 2015. URL http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_Threads-Barry_Smith.pdf. Whitepaper for the DOE High Performance Computing Operational Review (HPCOR) on Scientific Software Architecture for Portability and Performance.
  • Li and Demmel [2003] X. S. Li and J. W. Demmel. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Transactions on Mathematical Software (TOMS), 29(2):110–140, 2003.
  • Lichtner et al. [2013] P. C. Lichtner, G. E. Hammond, C. Lu, S. Karra, G. Bisht, B. Andre, R. T. Mills, and J. Kumar. PFLOTRAN Web Page, 2013. http://www.pflotran.org.
  • Lo et al. [2015] Y. J. Lo, S. Williams, B. V. Straalen, T. J. Ligocki., M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker. Roofline: An Insightful Visual Performance Model for Multicore Architectures. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 8966:129–148, 2015.
  • May et al. [2014] D. A. May, J. Brown, and L. Le Pourhiet. pTatin3D: High-Performance Methods for Long-Term Lithospheric Dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ‘14, pages 274–284. IEEE Press, 2014.
  • McCalpin [1995] J. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers, 1995. https://www.cs.virginia.edu/stream/.
  • McKee [2004] S. A. McKee. Reflections on the Memory Wall. In Proceedings of the 1st conference on Computing frontiers, page 162. ACM, 2004.
  • Miller et al. [2013] C. T. Miller, C. N. Dawson, M. W. Farthing, T. Y. Hou, J. Huang, C. E. Kees, C. T. Kelley, and H. P. Langtangen. Numerical Simulation of Water Resources Problems: Models, Methods, and Trends. Advances in Water Resources, 51:405–437, 2013.
  • Mucci et al. [1999] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A Portable Interface to Hardware Performance Counters. In Proceedings of the department of defense HPCMP users group conference, volume 710, 1999.
  • Murphy [2007] R. Murphy. On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance. In Workload Characterization, 2007. IISWC 2007. IEEE 10th International Symposium on, pages 35–43. IEEE, 2007.
  • Rajovic et al. [2014] N. Rajovic, A. Rico, N. Puzovic, C. Adeniyi-Jones, and A. Ramirez. Tibidabo: Making the Case for an ARM-based HPC System. Future Generation Computer Systems, 36:322–334, 2014.
  • Raman [2015] K. Raman. Calculating “FLOP” using Intel Software Development Emulator (Intel SDE), 2015. https://software.intel.com/en-us/articles/intel-software-development-emulator.
  • Rathgeber et al. [2016] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G. T. Bercea, G. R. Markall, and P. H. J. Kelly. Firedrake: Automating the Finite Element Method by Composing Abstractions. ACM TOMS, 43:24:1–24:27, 2016.
  • Schoof [2006] C. Schoof. A Variational Approach to Ice Stream Flow. Journal of Fluid Mechanics, 556:227–251, 2006.
  • Schoof and Hindmarsh [2010] C. Schoof and R. Hindmarsh. Thin-Film Flows with Wall Slip: An Asymptotic Analysis of Higher Order Glacier Flow Models. Quarterly Journal of Mechanics and Applied Mathematics, 63:73–114, 2010.
  • Schulz et al. [2008] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cranford. Open|SpeedShop: An Open Source Infrastructure for Parallel Performance Analysis. Scientific Programming, 16(2-3):105–121, 2008.
  • Shende and Malony [2006] S. S. Shende and A. D. Malony. The TAU Parallel Performance System. The International Journal of High Performance Computing Applications, 20(2):287–311, 2006.
  • Sinharoy et al. [2015] B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen, B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky, J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis, and K. M. Fernsler. IBM POWER8 Processor Core Microarchitecture. IBM Journal of Research and Development, 59(1):2:1–2:21, Jan 2015. ISSN 0018-8646.
  • Smith et al. [1996] B. F. Smith, P. Bjørstad, and W. D. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996. URL http://www.mcs.anl.gov/~bsmith/ddbook.html.
  • Tal [n.d.] A. Tal. Intel Software Development Emulator. https://software.intel.com/en-us/articles/intel-software-development-emulator.
  • Turcksin et al. [2016] Bruno Turcksin, Martin Kronbichler, and Wolfgang Bangerth. Workstream–a design pattern for multicore-enabled finite element computations. ACM Transactions on Mathematical Software (TOMS), 43(1):2, 2016.
  • Weaver et al. [2013] V. M. Weaver, D. Terpstra, and S. Moore. Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 215–224, April 2013.
  • Williams et al. [2009] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 54:65–76, 2009.
  • Wulf and McKee [1995] W. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.