Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?

10/22/2018 ∙ by Jens Domke, et al.

Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view. In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL deviate architecturally at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic. Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision with little-to-no performance implications. With Moore's law coming to an end, our results partially reinforce the view taken by modern industry (e.g. upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units.




I Introduction

It is becoming increasingly clear that the road forward in High-Performance Computing (HPC) is one full of obstacles. With the ending of Dennard’s scaling [1] and the ending of Moore’s law [2], there is today an ever-increasing need to reconsider how we allocate the silicon to various functional units in modern many-core processors. Amongst those decisions is how we distribute the hardware support for various levels of compute-precision.

Historically, most of the compute silicon has been allocated to double-precision (64-bit) compute. Nowadays – in processors such as the forthcoming A64FX [3] and Nvidia Volta [4] – the trend, mostly driven by market/AI demands, is to replace some of the double-precision units with lower-precision units. Lower-precision units occupy less area (up to 3x going from double- to single-precision Fused-Multiply-Accumulate [5]), leading to more on-chip resources (more instruction-level parallelism), potentially lowered energy consumption, and a definite decrease in external memory bandwidth pressure (i.e., more values per unit bandwidth). The gains – up to four times over their DP variants with little loss in accuracy [6] – are attractive and clear, but what is the impact on performance (if any) on existing HPC applications? What performance impact can HPC users expect when migrating their code to future processors with a different distribution in floating-point precision support? Finally, how can we empirically quantify this impact on performance using existing processors in an apples-to-apples comparison on real-life use cases without relying on tedious, slow, and potentially inaccurate simulators?

The Intel Xeon Phi was supposed to be the high end of many-core processor technology for nearly a decade (Knights Ferry was announced in 2010), and has changed drastically since its first release. The latest (and also last) two revisions – the Knights Landing and Knights Mill – are of particular importance since they arguably reflect two different ways of thinking. Knights Landing has relatively large support for double-precision (64-bit) computations, and follows a more traditional school of thought. The Knights Mill follows a different direction, which is the replacement of double-precision compute units with lower-precision (single-precision, half-precision, and integer) compute capabilities.

In the present paper, we quantify and analyze the performance and compute bottlenecks of Intel’s Knights Landing [7] and Mill architectures [8] – two processors with identical micro-architecture where the main difference is in the relative allocation of double-precision units. We stress both processors with numerous realistic benchmarks from both the Exascale Computing Project (ECP) proxy applications [9] and RIKEN-CCS Fiber Miniapp Suite [10] – benchmarks used in HPC system acquisition. Through an extensive (and robust) performance measurement process (which we also open-source), we empirically show the architecture’s relative weaknesses. In short, the contributions of the present paper are:

  1. An empirical performance evaluation of the Knights Landing and Mill family of processors – both proxies for previous and future architectural trends – with respect to benchmarks derived from realistic HPC workloads,

  2. An in-depth analysis of results, including identification of bottlenecks for the different application/architecture combinations, and

  3. An open-source compilation of our evaluation methodology, including our collected raw data.

II Architectures, Environment, and Applications

Our research objective is to evaluate the impact of migrating from an architecture with a (relatively) high amount of double-precision compute to an architecture with less. By a high amount of double-precision compute we mean architectures whose Floating-Point Unit (FPU) has most of its silicon dedicated to 64-bit IEEE-754 floating-point operations, and by less double-precision compute we mean architectures that replace those same double-precision FPUs with lower – potentially hybrid – precision units.

To understand and explore the intersection of architectures with high-amount of double-precision and those with hybrid-precision, there is a need to find a processor whose architecture is unchanged with the sole exception of its floating-point unit to silicon distribution. Only one modern processor family allows for such an apples-to-apples comparison: the Xeon Phi family of processors.

II-A Hardware & Software Environment

Feature | KNL | KNM | Broadwell-EP
CPU Model | 7210F | 7295 | 2x E5-2650v4
#Cores (HT) | 64 (4x) | 72 (4x) | 24 (2x)
Base Frequency | 1.3 GHz | 1.5 GHz | 2.2 GHz
Max Turbo Freq. | 1.4 GHz | 1.6 GHz | 2.9 GHz
CPU Mode | Quadrant | Quadrant | N/A
TDP | 230 W | 320 W | 210 W
DRAM Size | 96 GiB | 96 GiB | 256 GiB
DRAM Triad BW | 71 GB/s | 88 GB/s | 122 GB/s
MCDRAM Size | 16 GiB | 16 GiB | N/A
MCDRAM Triad BW | 439 GB/s | 430 GB/s | N/A
MCDRAM Mode | Cache | Cache | N/A
LLC Size | 32 MiB | 36 MiB | 60 MiB
Inst. Set Extension | AVX-512 | AVX-512 | AVX2
FP32 Peak Perf. | 5,324 Gflop/s | 13,824 Gflop/s | 1,382 Gflop/s
FP64 Peak Perf. | 2,662 Gflop/s | 1,728 Gflop/s | 691 Gflop/s
TABLE I: Detailed compute node hardware information; Differences between Knights Landing & Mill highlighted in bold; Shown bandwidth (BW) measured with BabelStream (see Sec.II-B); Numbers for dual-socket reference system accumulated

Intel’s Knights Landing (KNL) and Knights Mill (KNM) are the latest incarnations of a long line of architectures in Intel’s accelerator family. Both processors consist of a large number of processor cores (64 and 72, respectively), interconnected in a mesh (prior to KNL: a ring). Each core has a private L1 cache and a slice of the distributed L2 cache. Caches are kept coherent through the directory-based MESIF protocol. Both processors come with two types of external memory: MCDRAM (or, Hybrid Memory Cube) and Double-Data Rate-synchronous (DDR4) memory. Unique to the Xeon Phi processors is that the MCDRAM memory can be configured in one of three modes of operation: it is either (1) directly addressable in the global memory address space (memory-mapped), called flat mode, or it (2) acts as last-level cache before the DDR, called cache mode. Finally, the third mode (hybrid mode [11]) is a combination of the properties from the first two modes.

There are several policies governing where data is homed. A common high-performance configuration [12], which is also the one we used in our study, is the quadrant mode. Quadrant mode means that the physical cores are divided into four logical parts, where each logical part is assigned two memory controllers; each logical group is treated as a unique Non-Uniform Memory-Access (NUMA) node, allowing the operating system to perform data-locality optimizations. Table I surveys and contrasts the processors against each other, where the main differences are highlighted. The main architectural difference, whose impact we seek to empirically quantify, is the Floating-Point Unit (FPU). In KNL, this unit features two 512-bit wide vector units (AVX), together capable of executing 32 double-precision or 64 single-precision operations per cycle, totaling 2.6 Tflop/s of double- and 5.3 Tflop/s of single-precision performance, respectively, across all 64 processing cores. In KNM, however, the FPU is redesigned to replace one 512-bit vector unit with two Vector Neural Network Instruction (VNNI) units. Those units, although specializing in hybrid-precision FMA, can execute single-precision vector instructions, but have no support for double-precision compute. Thus, in total, the KNM can execute up to 1.7 Tflop/s of double-precision or 13.8 Tflop/s of single-precision computations. In summary, the KNM has 2.59x more single-precision compute, while the KNL has 1.54x more double-precision compute.
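The peak numbers quoted above follow from cores × clock × operations per cycle. The short sketch below reproduces them from Table I. The per-cycle figures for KNL are those stated in the text; the KNM per-cycle figures (16 double-precision and 128 single-precision operations per core) are back-derived from the peak values in Table I and are an assumption of this example, not a number given by the authors.

```python
# Peak Gflop/s = cores x base clock (GHz) x floating-point operations per cycle per core.
# KNL per-cycle values come from the text; the KNM values are back-derived from
# Table I and are an assumption of this sketch.

def peak_gflops(cores: int, ghz: float, flop_per_cycle: int) -> float:
    """Theoretical peak in Gflop/s for one processor."""
    return cores * ghz * flop_per_cycle

knl_fp64 = peak_gflops(64, 1.3, 32)
knl_fp32 = peak_gflops(64, 1.3, 64)
knm_fp64 = peak_gflops(72, 1.5, 16)    # assumed 16 FP64 operations per cycle
knm_fp32 = peak_gflops(72, 1.5, 128)   # assumed 128 FP32 operations per cycle

print(f"KNL: FP64 {knl_fp64:.1f} Gflop/s, FP32 {knl_fp32:.1f} Gflop/s")
print(f"KNM: FP64 {knm_fp64:.1f} Gflop/s, FP32 {knm_fp32:.1f} Gflop/s")
print(f"KNM/KNL single-precision ratio: {knm_fp32 / knl_fp32:.3f}")
print(f"KNL/KNM double-precision ratio: {knl_fp64 / knm_fp64:.3f}")
```

The two ratios correspond to the 2.59x and 1.54x factors quoted in the summary sentence, up to rounding.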

While both the KNL and KNM are functionally and architecturally similar, there are some noteworthy differences. First, the operating frequency of the processors varies: the KNL operates at a frequency of 1.3 GHz (and up to 1.5 GHz in Turbo mode), while KNM operates at 1.5 GHz (1.6 GHz turbo). Hence, KNM executes 15% more cycles per second than KNL. Furthermore, although the cores of KNM and KNL are similar (except the FPU), the number of cores is different: KNL has 64 cores while KNM has 72 cores. Both processors are manufactured in 14 nm technology. Finally, the amount of on-chip last-level cache differs between the two processors, where KNM has a 4 MiB advantage over KNL.

Additionally, for verification reasons, we include a modern dual-socket Xeon-based compute node in our evaluation. Despite being vastly different from the Xeon Phi systems, our Xeon Broadwell-EP (BDW) general-purpose processor is used to cross-check metrics, such as: execution time and performance (Xeon Phi should perform better), frequency-scaling experiments (BDW has more frequency domains), and performance counters (BDW exposes more performance counters). Aside from those differences mentioned above (and highlighted in Table I), the setup between the Xeon Phi nodes (and BDW node) is identical, including the same operating system, software stack, and solid state disk.

For the operating system (OS) and software environment, we use equivalent setups across our three compute nodes. The OS is a fresh installation of CentOS 7 (minimal) with Linux kernel version 3.10.0-862, which has the latest versions of the Meltdown and Spectre patches enabled. During our experiments, we limit potential OS noise by disabling all remote storage (Network File System in our case) and allowing only a single user on the system. Most of our applications are compiled with Intel’s Parallel Studio XE (version 2018; update 3) compilers, and we install the latest versions of Intel TensorFlow and Intel MKL-DNN for the deep learning proxy application, since our assumption is that Intel’s software stack allows for the highest utilization of their hardware. Exceptions to this compiler selection are listed in the subsequent Section II. We use Intel MPI from the Parallel Studio XE suite to execute our measurements.

II-B Benchmark Applications

Over the years, the HPC community developed many benchmarks representing real workloads or for testing the capabilities of a system – primarily for comparisons across architectures but also for system procurement purposes. The so-called Exascale Computing Project (ECP) proxy applications [9] and RIKEN AICS’ Fiber Miniapp Suite [10], which we will focus on for this study, are just two examples representing modern HPC workloads. Those benchmarks are designed to evaluate single-node and small-scale test installations, and hence are adequate for our study.

II-B1 The ECP Proxy-Apps

The ECP suite (release v1.0) consists of 12 proxy applications primarily written in C (5x), FORTRAN (3x), C++ (3x), and Python (1x), listed hereafter.

Algebraic multi-grid (AMG)

solver of the hypre library is a parallel solver for unstructured grids [13] arising from fluid dynamics problems. We choose problem 1 for our tests, which applies a 27-point stencil on a 3-D linear system.

Candle (Cndl)

is a deep learning benchmark suite to tackle various problems in cancer research [14]. We select benchmark 1 of pilot 1 (P1B1), which builds an autoencoder from a sample of gene expression data to improve the prediction of drug responses.

Co-designed Molecular Dynamics (CoMD)

serves as the reference implementation for ExMatEx [15] to facilitate co-design for (and evaluation of) classical molecular dynamics algorithms. We are using the included strong-scaling example to calculate the inter-atomic potential for 256,000 atoms.

LAGrangian High-Order Solver – Laghos (LAGO)

computes compressible gas dynamics through an unstructured high-order finite element method [16]. The input for our study is the simulation of a 2-dimensional Sedov blast wave with default settings as documented for the Laghos proxy-app.


MACSio (MxIO)

is a synthetic Multi-purpose, Application-Centric, Scalable I/O proxy designed to closely mimic realistic I/O workloads of HPC applications [17]. Our input causes MACSio to write a total of 433.8 MB to disk.


MiniAMR (MAMR)

is an adaptive mesh refinement proxy application of the Mantevo project [18] which applies a stencil computation on a 3-dimensional space, in our case a sphere moving diagonally through a cubic medium.

MiniFE (MiFE)

is a reference implementation of an implicit finite elements solver [18] for scientific methods resulting in unstructured 3-dimensional grids. For our study, we use 128×128×128 input dimensions for the grid.

MiniTri (MTri)

is able to apply different graph detection algorithms for a given graph, such as community detection or dense subgraph detection [19]. As input for the triangle detection and approximation of the graph’s largest clique, we download BCSSTK30 from the MatrixMarket [20].

Nekbone (NekB)

is a proxy for the Nek5000 application [21], and uses the conjugate gradient method for solving the standard Poisson equation for computational fluid dynamics problems. We enabled the multi-grid preconditioner, and for strong-scaling, see Section III-B, we fixed the elements per process and polynomial order to one number, respectively.

SW4lite (SW4L)

is a proxy for the computational kernels used in the seismic modelling software, called SW4 [22], and we use the pointsource example, which calculates the wave propagation emitted from a single point in a half-space.

Swfft (Fft)

represents the compute kernel of the HACC cosmology application [23] for N-body simulations. The 3-D fast Fourier transformation of SWFFT emulates one performance-critical part of HACC’s Poisson solver. In our tests, we perform 32 repetitions on a 128×128×128 grid.

XSBench (XSBn)

is the proxy for the Monte Carlo calculations used by a neutron particle transport simulator for a Hoogenboom-Martin nuclear reactor [24]. We simulate a large reactor model represented by a unionized grid with cross-section lookups per particle.

II-B2 RIKEN Mini-Apps

In comparison to the modernized ECP proxy-apps, RIKEN’s eight mini-apps are written in FORTRAN (4x), C (2x), and a mix of FORTRAN/C/C++ (2x).

FrontFlow/blue (FFB)

uses the finite element method to solve the incompressible Navier-Stokes equation for thermo-fluid analysis [25]. We simulate the 3-D cavity flow in a rectangular space discretized into 50×50×50 cubes.

Frontflow/violet Cartesian (FFVC)

falls into the same problem class as FFB; the difference is that FFVC uses the finite volume method (FVM) [26]. Here, we calculate the 3-D cavity flow in a 144×144×144 cuboid.

Modylas (Mdyl)

makes use of the fast multipole method for long-range force evaluations in molecular dynamics simulations [27]. Our input is the wat222 example which distributes 156,240 atoms over a 16×16×16 cell domain.

many-variable Variational Monte Carlo (mVMC) method

implemented by this mini-app is used to simulate quantum lattice models for studying the physics of condensed matter [28]. We use mVMC’s included strong-scaling test, but downsize it (1/3 lattice dimensions and 1/4 of samples).

Nonhydrostatic ICosahedral Atmospheric Model (NICM)

is a proxy of NICAM, which computes mesoscale convective cloud systems based on FVM for icosahedral grids [29]. We run Jablonowski’s baroclinic wave test (gl05rl00z40pe10), but reduce the simulated days from 11 to 1.

Next-Gen Sequencing Analyzer (NGSA)

is a mini-app of a genome analyzer and a set of alignment tools designed to facilitate cancer research by detecting genetic mutations in human DNA [30]. For our experiments, we rely on pre-generated pseudo-genome data (ngsa-dummy).

NTChem (NTCh)

implements a computational kernel of the NTChem software framework for quantum chemistry calculations of molecular electronic structures, i.e., the solver for the second-order Møller-Plesset perturbation theory [31]. We select the H2O test case for our study.

Quantum ChromoDynamics (QCD)

mini-app solves the lattice QCD problem in a 4-D lattice (3-D plus time), represented by a sparse coefficient matrix, to investigate the interaction between quarks [32]. We evaluate QCD with the Class 2 input for a lattice discretization.

ECP | Scientific/Engineering Domain | Compute Pattern | Language
AMG | Physics and Bioscience | Stencil | C
CANDLE | Bioscience | Dense matrix | Python
CoMD | Material Science/Engineering | N-body | C
Laghos | Physics | Irregular | C++
miniAMR | Geoscience/Earthscience | Stencil | C
miniFE | Physics | Irregular | C++
miniTRI | Math/Computer Science | Irregular | C++
Nekbone | Math/Computer Science | Sparse matrix | Fortran
SW4lite | Geoscience/Earthscience | Stencil | C
SWFFT | Physics | FFT | C/Fortran
XSBench | Physics | Irregular | C

RIKEN | Scientific/Engineering Domain | Compute Pattern | Language
FFB | Engineering (Mechanics, CFD) | Stencil | Fortran
FFVC | Engineering (Mechanics, CFD) | Stencil | C++/Fortran
mVMC | Physics | Dense matrix | C
NICAM | Geoscience/Earthscience | Stencil | Fortran
NGSA | Bioscience | Irregular | C
MODYLAS | Physics and Chemistry | N-body | Fortran
NTChem | Chemistry | Dense matrix | Fortran
QCD | Lattice QCD | Stencil | Fortran/C
TABLE II: Application Categorization, Compute Patterns, and main Programming Languages used; MACSio, HPL, HPCG, and BabelStream Benchmarks omitted

II-B3 Reference Benchmarks

In addition to those 20 applications, we use the compute intensive HPL [33] benchmark, and HPCG [34] and stream (both memory intensive) to evaluate the baseline of the investigated architectures.

High Performance Linpack (HPL)

solves a dense system of linear equations to demonstrate the double-precision compute capabilities of a (HPC) system [35]. Our problem size is 64,512. For both HPL and HPCG (see below), we employ highly tuned versions shipped with Intel’s Parallel Studio XE suite with appropriate parameters for our systems.

High Performance Conjugate Gradients (HPCG)

applies a conjugate gradient solver to a system of linear equations (sparse matrix), with the intent to demonstrate the system’s memory subsystem and network limits. We choose 360×360×360 as global problem dimensions for HPCG.

BabelStream (BABL)

is one of many available “stream” benchmarks supporting evaluations of the memory subsystem for CPUs and accelerators [36]. We will use 2 GiB and 14 GiB input vectors, see Section IV-C for details.

We provide a compressed overview of the ECP and RIKEN’s proxy applications in Table II. In this table, each application is categorized by its scientific domain, as well as the primary workload/kernel classification, for which we use the classifiers employed by Hashimoto et al. [37]. Both the scientific domain and the kernel classification will be important for our subsequent analysis in Sections IV and V.

III Methodology

In this section, we present our rigorous benchmarking approach for investigating the characteristics of each architecture and extracting the necessary information for our study.

III-A Benchmark Setup and Configuration Selection

Due to the fact that the benchmarks, listed in Section II-B, are firstly realistic proxies of the original applications [38] and secondly are used in the procurement process, we can confidently assume that these benchmarks are well tuned and come with appropriate compiler options for a variety of compilers. Hence, we refrain from both manual code optimization and alterations of the compiler options. The only modifications we perform are:

  • Enabling interprocedural optimization (-ipo) and compilation for the highest instruction set available (-xHost), with two exceptions: (a) AMG is compiled with -xCORE-AVX2 to avoid arithmetic errors; (b) NGSA’s BWA tool is compiled with GNU gcc to avoid segfaults,

  • Patching a segmentation fault in MACSio (after our reporting, the developers patched the upstream version), and

  • Injecting our measurement source code, see Section III-B.

With respect to the measurement runs, we follow this five-step approach for each benchmark:

  0. Install, patch, and compile the benchmark, see above,

  1. Select appropriate inputs/parameters/seeds for execution,

  2. Determine “best” parallelism: #processes and #threads,

  3. Execute a performance, a profiling, and a frequency run,

  4. Analyze the results (go to 0. if anomalies are detected).

We further elaborate on those steps hereafter.

For the input selection we have to balance between multiple constraints and choose based on: Which recommended inputs are listed by the benchmark developers? How long does the benchmark run (our aim is 1 sec–10 min due to the large sample size we have to cover)? Does it occupy a realistic amount of main memory (e.g., avoid cache-only executions)? Are the results repeatable (randomness/seeds)? We optimize for the metrics reported by the benchmark (e.g., select the input with the highest Gflop/s rate). Furthermore, one of the most important considerations while selecting the right inputs is strong-scaling. We require strong-scaling properties of the benchmark for two reasons: the results collected in Step (2) need to be comparable, and even more importantly, the results of Step (3) must be comparable between different architectures, since we may have to use different numbers of MPI processes for KNL and KNM (and our BDW reference architecture) due to their difference in core counts. The only exception is MiniAMR, for which we are unable to find a strong-scaling input configuration and instead optimized for the reported Gflop/s of the benchmark. Accordingly, we then choose the same number of MPI processes on our KNL and KNM compute nodes for MiniAMR.

In Step (2), we evaluate numerous combinations of MPI processes and OpenMP threads for each benchmark, including combinations which over-/undersubscribe the CPU cores, and test each combination with three runs to minimize the potential for outliers due to system noise. For all subsequent measurements, we select the number of processes and threads based on the “best” (w.r.t. time-to-solution of the solver) combination among these tested versions, see Table IV for details. We are not applying specific tuning options to the Intel MPI library, except for using Intel’s recommended settings for HPCG with respect to thread affinity and MPI_Allreduce implementation. The reason is that our pretests (with a subset of the benchmarks) with non-default parameters for Intel MPI consistently resulted in longer time-to-solution.
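The selection in Step (2) can be sketched as a small search over measured configurations. The example below is a minimal illustration under stated assumptions: the timing values are hypothetical, and taking the minimum of the three runs per configuration is our assumption, since the text does not say how the three runs are aggregated.

```python
def best_configuration(timings):
    """Return the (processes, threads) pair with the lowest time-to-solution.

    `timings` maps (processes, threads) to a list of kernel runtimes in
    seconds, one entry per repeated run. Each configuration is represented
    by its fastest run (an assumption of this sketch).
    """
    fastest_per_config = {cfg: min(runs) for cfg, runs in timings.items()}
    return min(fastest_per_config, key=fastest_per_config.get)

# Hypothetical measurements on a 64-core node, three runs per configuration.
timings = {
    (64, 1): [12.4, 12.1, 12.6],
    (16, 4): [10.9, 11.3, 10.7],
    (32, 4): [11.8, 11.5, 12.0],   # more threads than physical cores
    (8, 4): [14.2, 14.0, 14.5],    # fewer threads than physical cores
}

best = best_configuration(timings)
print("best (processes, threads):", best)
```

Using the minimum rather than the mean makes the choice insensitive to a single run disturbed by system noise, which matches the stated motivation for repeating each combination.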

For Step (3), we run each benchmark ten times to identify the fastest time-to-solution for the (compute) kernel of the benchmark. Additionally, for the profiling runs, we execute the benchmark once for each of the profiling tools and/or metrics (in case the tool is used for multiple metrics), see Section III-B for details. Finally, we perform frequency scaling experiments for each benchmark, where we throttle the CPU frequency to all the available lower CPU states below the maximum CPU frequency we use for the performance runs, and record the lowest kernel time-to-solution among ten trials per frequency. The reason and results of the frequency scaling test will be further explained in Section IV-D. One may argue for more than ten runs per benchmark to find the optimal time-to-solution, however, given the prediction interval theory and our deterministic benchmarks executed on a single node, it is unlikely to obtain a much faster run and we confirmed that the fastest 50% of executions per benchmark only vary by 3.9% on average. The collected metrics, see the following section, will be analyzed in Section IV in detail.
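To make the run-to-run variation argument concrete, the following sketch picks the fastest of ten runs and measures how far apart the fastest half of the runs lie. The runtimes are hypothetical, and the spread definition (slowest over fastest within the fastest 50%, minus one) is our assumption; the text does not spell out how its 3.9% figure is computed.

```python
def fastest_half_spread(runtimes):
    """Relative spread (slowest / fastest - 1) among the fastest half of the runs."""
    fastest_half = sorted(runtimes)[: len(runtimes) // 2]
    return fastest_half[-1] / fastest_half[0] - 1.0

# Hypothetical kernel runtimes in seconds for ten runs of one benchmark.
runs = [20.4, 19.8, 20.1, 21.7, 19.9, 20.0, 22.3, 20.6, 19.7, 20.2]

best_time = min(runs)                # value reported as time-to-solution
spread = fastest_half_spread(runs)   # variation among the five fastest runs

print(f"fastest run: {best_time} s")
print(f"spread of fastest 50%: {spread:.1%}")
```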

III-B Metrics and Measurement Tools

To study and analyze the floating point requirements by applications, it is not only important to evaluate an established metric (floating point operations per second), but also other metrics, such as memory throughput, cache utilization, or speedup with increased CPU frequency. The detailed list of metrics (and derived metrics) and the methodology and tools we use to collect these metrics will be explained hereafter.

One observation is that the amount of time spent on initializing and post-processing within each proxy application can be relatively high (e.g., HPCG spends only 11% and 30% of its time in the solver part on BDW and Phi, respectively) and is usually not consistent with the real workloads, e.g., one can reduce the epochs for performance evaluation purposes in CANDLE but not the input data pre-processing to execute those epochs. These mismatches in the kernel-to-[pre|post]processing ratio require us to extract all metrics only for the (computational) kernel of the benchmark. Hence, we identify and inject profiling instructions around the kernels to start or pause the collection of raw metric data by the analysis tools. This code injection is exemplified in PseudoCode 1. Therefore, unless otherwise stated in this Section or subsequent sections, all presented data will be based exclusively on the kernel portion of each benchmark.

    #define START_ASSAY {measure time; toggle on [PCM|SDE|VTune]}
    #define STOP_ASSAY {measure time; toggle off [PCM|SDE|VTune]}
    Function main is
        STOP_ASSAY
        Initialize benchmark
        foreach solver loop do
            START_ASSAY
            Call benchmark solver/kernel
            STOP_ASSAY
            Post-processing
        Verify benchmark result
        START_ASSAY

PseudoCode 1: Injecting analysis instructions
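The injection pattern of PseudoCode 1 can be mimicked in a few lines of Python. This is only a sketch of the timing half of the idea: the `Assay` class, its `kernel()` region, and the stand-in workload are hypothetical names introduced for illustration, and the toggling of PCM, SDE, or VTune that the actual macros perform is not modeled.

```python
import time
from contextlib import contextmanager

class Assay:
    """Accumulates wall-clock time only while a kernel region is active."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock           # injectable clock, handy for testing
        self.kernel_seconds = 0.0

    @contextmanager
    def kernel(self):
        start = self.clock()                              # START_ASSAY
        try:
            yield
        finally:
            self.kernel_seconds += self.clock() - start   # STOP_ASSAY

def run_benchmark(assay, iterations=3):
    data = list(range(1000))          # initialization: not measured
    total = 0
    for _ in range(iterations):
        with assay.kernel():
            total += sum(x * x for x in data)   # stand-in for the solver
        total %= 1_000_003            # post-processing: not measured
    return total

assay = Assay()
result = run_benchmark(assay)
print(f"kernel-only time: {assay.kernel_seconds:.6f} s")
```

Wrapping the region in a context manager guarantees that measurement stops even if the kernel raises, which plays the role of the explicit STOP_ASSAY after each solver call.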

For reasons of tool stability, attention to detail/accuracy, and overlap with our needs, we settle on the use of the MPI API for runtime measurements, alongside Intel’s Processor Counter Monitor (PCM) [39], Intel’s Software Development Emulator (SDE) [40], and Intel’s VTune Amplifier [41] (to avoid persistent compute node crashes, we had to disable VTune’s built-in sampling driver and instead rely on Linux’ perf tool). Furthermore, as auxiliary tools we rely on RRZE’s Likwid [42] for frequency scaling (our Linux kernel version required us to disable the default Intel P-State driver to have full access to the fine-grained frequency scaling) and LLNL’s msr-safe [43] for allowing access to CPU model-specific registers. An overview of (raw) metrics which we extract with these tools for the benchmarks, listed in Section II-B, is shown in Table III. Furthermore, derived metrics, such as Gflop/s, will be explained on-demand in Section IV.

Raw Metric | Method/Tools
Runtime [s] | MPI_Wtime()
#{FP / integer operations} | SDE
#{Branch operations} | SDE
Memory throughput [B/s] | PCM (pcm-memory.x)
#{L2/LLC cache hits/misses} | PCM (pcm.x)
Consumed Power [Watt] | PCM (pcm-power.x)
SIMD instructions per cycle | perf + VTune (‘hpc-performance’)
Memory/Back-end boundedness | perf + VTune (‘memory-access’)
TABLE III: Summary of metrics and method/tool to collect these metrics

IV Evaluation

The following subsections will primarily focus on visualizing and analyzing the key metrics we collect for each proxy- and mini-app, such as Gflop/s. The significance of our findings with respect to future software, CPU, and HPC system design will then be discussed in the next Section V. Analyzing the instruction mix, flop/s, or memory throughput (see Sections IV-A, IV-B, and IV-C) in an isolated fashion is not a good indication of the system’s bottlenecks, and hence, especially when reasoning about FPU requirements, we also have to understand the applications’ compute-boundedness, which we evaluate in Section IV-D. Table III summarizes the primary metrics and method/tool to collect these metrics. Table IV includes additional metrics.

IV-A Integer vs. Single-Precision FP vs. Double-Precision FP


Fig. 1: Ratio of Integer vs. single-precision FP vs. double-precision FP per proxy-app as counted by Intel’s SDE; Per application: Left bar = BDW, middle bar = KNL, right bar = KNM; Missing bars for CANDLE due to SDE crashes on Xeon Phi; Proxy-app abbreviations acc. to Section II-B

The breakdown of the total number of integer and single/double-precision floating point (FP) operations, as depicted in Figure 1, shows two rather unexpected trends. First, the number of proxy-apps relying on 32-bit FP instructions is four out of 22, which is surprisingly low, and furthermore, only one of them utilizes both 32-bit and 64-bit FP instructions. Minor variances in integer to FP ratio between the architectures can likely be explained by the difference in AVX vector length, quality of compiler optimization for each CPU, and execution/parallelization approach. The second unexpected trend is the imbalance of integer to FP operations, i.e., 16 of 22 applications issue at least 50% integer operations. However, one has to keep in mind that Intel SDE output includes AVX vector instructions for integers, where the granularity can be as low as 1-bit per operand (cf. 4 or 8 byte per FP operand). Hence, the total integer operations count might be slightly inflated. Lastly, the results for HPCG show a big discrepancy between BDW and KNL/KNM. While the total FP operations count is similar, the binary for KNL/KNM issues far more integer operations, see Table IV for details, and we are unaware of the reason.

IV-B Floating-Point Operation/s and Time-to-Solution


Fig. 2: Relative floating-point performance (FP32 and FP64 Gflop/s accumulated) of KNL/KNM in comparison to dual-socket Broadwell-EP (see KEYrel, left y-axis) and Absolute achieved Gflop/s w.r.t dominant FP operations (cf. Fig. 1) in comparison to theoretical peak performance listed in Tab. I (see KEYabs, right y-axis); Due to missing SDE data for CANDLE, we assume the total number of FP operations is equivalent to BDW and divide by CANDLE’s time-to-solution; Filtered proxy-apps with negligible FP operations: MxIO, MTri, and NGSA; Filtered out MiniAMR because of the strong-scaling issue described in Section III-A; Proxy-app abbreviations acc. to Section II-B

Figure 2 shows the relative performance improvement of KNL/KNM over the dual-socket BDW node and the absolute achieved Gflop/s on each processor. It is important to note that all proxy-/mini-apps, with the exception of HPL, have less than 21.5% (BDW), 10.5% (KNL), and 15.1% (KNM) FP efficiency. That these presumably optimized applications still achieve such low FP efficiency implies that the availability of FP units has limited relevance. The figure shows that the majority of codes have comparable performance on KNM versus KNL. Notable mentions are: a) CANDLE, which benefits from VNNI units in mixed precision, b) MiFE, NekB, and XSBn, which improve probably due to the increased core count and KNM’s higher CPU frequency, and c) some memory-bound applications (i.e., AMG, HPCG, and MTri), which get slower, supposedly due to the difference in peak throughput demonstrated in Figure 3 in addition to the increased core count causing higher competition for bandwidth.
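As a rough illustration of the two quantities plotted in Figure 2, the sketch below computes FP efficiency (achieved Gflop/s divided by theoretical peak) and performance relative to a baseline. All function names and input numbers are assumptions chosen for the example, not values from Tab. I or Table IV.

```python
def achieved_gflops(fp_ops, seconds):
    """Achieved Gflop/s from a total FP operation count and the time-to-solution."""
    return fp_ops / seconds / 1e9


def fp_efficiency(fp_ops, seconds, peak_gflops):
    """Fraction of the theoretical peak that the application actually reaches."""
    return achieved_gflops(fp_ops, seconds) / peak_gflops


def relative_performance(gflops, baseline_gflops):
    """Performance relative to a baseline system (e.g., a reference node)."""
    return gflops / baseline_gflops


# Hypothetical inputs: 2e12 FP operations in 10 s on a 2000 Gflop/s machine.
eff = fp_efficiency(fp_ops=2.0e12, seconds=10.0, peak_gflops=2000.0)
print(f"FP efficiency: {eff:.1%}")
```

An efficiency well below 100% by itself does not identify the bottleneck; it only shows that peak FP throughput is not the limiting resource for that run.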

IV-C Memory Throughput of (MC-)DRAM


Fig. 3: Memory throughput (only DRAM for BDW, DRAM+MCDRAM for Phi) per proxy-app; Dotted lines indicate Triad stream bandwidth (flat mode, cf. Tab. I); BabelStream for 2 GiB (BABL2) and 14 GiB (BABL14) vector length added (measured in cache mode); Proxy-app labels acc. to Section II-B

For the memory throughput measurements, shown in Figure 3, we use Intel’s PCM tool to analyze DRAM and MCDRAM throughput. Our measurements with BabelStream are included as well to demonstrate the maximum achievable bandwidth (see the horizontal lines for MCDRAM in flat mode), which is lower when the MCDRAM is used in cache mode. We still achieve 86% on KNL and 75% on KNM when the vectors fit into MCDRAM, but drop to slightly higher than DRAM throughput (due to minor prefetching benefits) when the vectors do not fit (see BABL14 for 14 GiB vectors). This throughput advantage of the MCDRAM translates into a performance boost for six proxy-apps (AMG, MAMR, MiFE, NekB, XSBn, and QCD), which heavily utilize the available bandwidth (see Figure 3) and which are memory-bound on our reference system. This can easily be verified by comparing the time-to-solution for the kernels listed in Table IV. Only HPCG cannot benefit from the higher bandwidth and, despite showing 2x throughput, the runtime drops by more than 10%, indicating a memory-latency issue of HPCG on KNL/KNM, which is one of the design goals for the benchmark [34].
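To make the bandwidth figures more concrete, the following sketch shows how a Triad-style bandwidth number is commonly derived: the kernel a[i] = b[i] + s * c[i] reads two arrays and writes one, so three arrays' worth of bytes are counted per pass. The array size, timing, and reference bandwidth below are hypothetical, and the helper names are not part of BabelStream or PCM.

```python
def triad_bytes_moved(n_elements, bytes_per_element=8):
    """Bytes counted for one Triad pass: two arrays read, one array written."""
    return 3 * n_elements * bytes_per_element


def bandwidth_gbs(n_elements, seconds, bytes_per_element=8):
    """Sustained bandwidth in GB/s (10^9 bytes per second) for one Triad pass."""
    return triad_bytes_moved(n_elements, bytes_per_element) / seconds / 1e9


def fraction_of_reference(measured_gbs, reference_gbs):
    """Share of a reference bandwidth (e.g., a flat-mode measurement) that is reached."""
    return measured_gbs / reference_gbs


# Hypothetical example: 2**28 doubles per array (2 GiB each), one pass in 20 ms.
n = 2**28
bw = bandwidth_gbs(n, seconds=0.02)
print(f"{bw:.1f} GB/s, {fraction_of_reference(bw, 400.0):.0%} of a 400 GB/s reference")
```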

IV-D Frequency Scaling to Identify Compute-Boundedness

Fig. 4: Speedup obtained through increased CPU frequency (w.r.t baseline frequency of 1.0 GHz on KNL/KNM and 1.2 GHz on BDW); Top plot: KNL, middle plot: KNM, bottom plot: BDW; Theoretical peak (ThPeak): furthest right bar; Labels/abbreviations of proxy-apps according to Section II-B and ’TB’ = Turbo Boost is assumed to be 100 MHz across all cores


Fig. 5: Annual HPC site/system utilization by domain; Labels acc. to Table II: geo = Geo-/Earthscience, chm = Chemistry, phy = Physics, qcd = Lattice QCD, mat = Material Science/Engineering, eng = Engineering (Mechanics, CFD), mcs = Math/Computer Science, bio = Bioscience, oth = Other

For this test, we disable turbo boost and throttle the core frequency to identify each application’s dependency on ALU/FPU performance, but keep the uncore at maximum frequency, since throttling it would negatively affect the memory subsystem. The speedup (w.r.t time-to-solution) of each proxy-app shown in Figure 4 is relative to the lowest CPU frequency on each architecture, and we include our performance results (cf. Section III-A) with max. frequency plus enabled turbo boost.
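The reasoning behind this test can be sketched as follows: if time-to-solution shrinks in proportion to the core frequency, the application behaves as compute-bound; if it barely changes, something else (memory bandwidth, latency, I/O) dominates. The `frequency_sensitivity` score below is an illustrative metric introduced here, not one defined in the paper, and the timings are made up.

```python
def speedup(t_base, t_fast):
    """Speedup in time-to-solution relative to the run at the lowest frequency."""
    return t_base / t_fast


def frequency_sensitivity(t_base, t_fast, f_base_ghz, f_fast_ghz):
    """Fraction of the frequency increase that shows up as speedup.

    About 1.0: speedup tracks the frequency ratio (compute-bound behavior).
    About 0.0: no benefit from a faster core (memory-, latency-, or I/O-bound).
    """
    freq_gain = f_fast_ghz / f_base_ghz - 1.0
    if freq_gain <= 0.0:
        raise ValueError("f_fast_ghz must exceed f_base_ghz")
    return (speedup(t_base, t_fast) - 1.0) / freq_gain


# Hypothetical timings: 100 s at 1.0 GHz versus 80 s at 1.3 GHz.
s = frequency_sensitivity(t_base=100.0, t_fast=80.0, f_base_ghz=1.0, f_fast_ghz=1.3)
print(f"speedup {speedup(100.0, 80.0):.2f}x, sensitivity {s:.2f}")
```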

While the benefit from enabled turbo boost on BDW is nearly invisible (except for MTri), the proxy-apps clearly reduce their time-to-solution on KNL and KNM when the CPUs are allowed to turbo. Overall, the benchmarks seem to be less memory-bound and more compute-bound when moving to Xeon Phi, which is especially salient for AMG and MiniFE. This indicates a clear benefit from the much bigger/faster MCDRAM used as last-level cache and a more balanced (w.r.t bandwidth to flop/s ratio) architecture. However, the limited speedup for HPL on KNL clearly shows the CPU’s abundance of FP64 units. Here, the successor, Knights Mill, shows a better balance. Another interesting observation is the inverse behavior of AMG and HPCG on our tested architectures. Both benchmarks are supposed to be memory-bound, but the absence of any sign of scalability with frequency for HPCG on Xeon Phi strengthens our hypothesis from Section IV-C that HPCG is primarily memory-latency bound.

For the I/O portions of an application, Figure 4 reveals another observation, i.e., MACSio’s write speed scales with increased frequency. Since MACSio performs only single-digit GIop/s and negligible flop/s, increasing the CPU’s compute capabilities cannot explain the shown speedup. Hence, our theory is: MACSio (and I/O in general) is bound by the Linux kernel, whose performance depends on CPU frequency. Guérout et al. report similar findings [44], and we see equivalent behavior with a micro-benchmark (with Unix’s dd command).

IV-E Remaining Metrics

To disseminate the remaining results from our experiments, we attached Table IV to this paper, which can be utilized for further analysis and which contains some interesting data points. For example, the power measurements for CANDLE, which are just slightly higher than when running MACSio, indicate that Intel’s MKL-DNN (used underneath to compute on the FP16 VNNI units) does not fully utilize the CPU’s potential. Furthermore, the L2 hit rate on both Xeon Phi processors is considerably higher than on our reference hardware, which indicates improvements in the hardware prefetcher and is presumably a direct effect of the high-bandwidth MCDRAM in cache mode.

V Discussion and Implications

While the previous section focuses on the collected data and comparisons between the three architectures, this section summarizes the relevant points to consider from our study, which should be taken into account when moving forward.

V-A Performance Metrics

The de facto performance metric reported in HPC is flop/s. Reporting flop/s is not limited to applications that are compute-bound. Benchmarks that are designed to resemble realistic workloads, e.g., the memory-bound HPCG benchmark, typically report performance in flop/s. The proxy-/mini-apps in this study typically report flop/s as well, even though only six of the 20 proxy-/mini-apps we analyze appear to be compute-bound (including NGSA, which is bound by ALUs, not FPUs). We argue that converging on the reporting of relevant metrics would shift the focus of the community to be less flop/s-centered.

V-B Considerations for HPC Utilization by Scientific Domain

This paper highlights the diminishing relevance of flop/s when considering the actual requirements of representative proxy-apps. The relevance of flop/s on a given supercomputer diminishes further when considering the node-hours spent yearly on different scientific domains at supercomputing facilities. Figure 5 summarizes the breakdown of node-hours by scientific domain for different supercomputing facilities (based on the yearly reports of these facilities). For instance, by simply mapping the scientific domains in Figure 5 to representative proxies, ANL’s ALCF and R-CCS’s K-computer would achieve 14% and 11%, respectively, of the peak flop/s when projected over the annual node-hours. It is worth mentioning that the relevance of flop/s varies even more widely for supercomputers dedicated to specific workloads. For instance, a supercomputer dedicated mainly to weather forecasting, e.g., the 18 Pflop/s system recently installed at Japan’s Meteorological Agency [45], should give minimal relevance to flop/s: such workloads are typically memory-bound, and the proxy representing this workload on that supercomputer achieves 6% of the peak flop/s. On the other hand, a supercomputer dedicated to AI/ML such as ABCI, the world’s 5th-fastest supercomputer as of June 2018, would put high emphasis on flop/s, since deep learning workloads rely heavily on dense matrix multiplication operations.
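One plausible way to carry out such a projection is a node-hour-weighted average of the FP efficiency of each domain's representative proxy. The sketch below assumes exactly that; the weighting scheme, the domain shares, and the per-proxy efficiencies are illustrative assumptions and do not reproduce the data behind Figure 5 or the 14% and 11% figures.

```python
def projected_peak_fraction(node_hour_share, proxy_efficiency):
    """Node-hour-weighted average of per-domain FP efficiency.

    node_hour_share: domain -> share of annual node-hours (need not sum to 1).
    proxy_efficiency: domain -> fraction of peak flop/s reached by its proxy-app.
    """
    total_share = sum(node_hour_share.values())
    if total_share <= 0:
        raise ValueError("node-hour shares must be positive")
    weighted = sum(
        share * proxy_efficiency[domain]
        for domain, share in node_hour_share.items()
    )
    return weighted / total_share


# Hypothetical site: domain labels follow Fig. 5, all numbers are made up.
shares = {"phy": 0.5, "geo": 0.3, "bio": 0.2}
efficiency = {"phy": 0.20, "geo": 0.06, "bio": 0.10}
print(f"projected share of peak flop/s: {projected_peak_fraction(shares, efficiency):.1%}")
```

A site dominated by one low-efficiency domain pulls the projection toward that domain's efficiency, which is the effect the weather-forecasting example describes.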

V-C Memory-bound Applications

As demonstrated in Figure 2, the performance of memory-bound applications is mostly unaffected by the available peak flop/s. Accordingly, investment in data-centric architectures and programming models should take priority over paying a premium for flop/s-centric systems. In one motivating instance, during the investigation that NASA Ames Research Center conducted to plan the upgrade of the Pleiades supercomputer in 2016 [46], the study concluded that the performance gain from upgrading to Intel Haswell processors was insignificant in comparison to using the older Ivy Bridge-based processors (the newer processor offered double the peak flop/s at almost the same memory bandwidth). Hence, the choice was to do only a partial upgrade to Haswell processors.

V-D Compute-bound Applications

Investing more in data-centric architectures to accommodate memory-bound applications can have a negative impact on the remaining minority of applications: compute-bound applications. Considering the market trends that are already pushing away from dedicating the majority of chip area to FP64 units, it is likely that libraries with compute-bound code (e.g., BLAS) would support mixed precision or emulation by lower-precision FPUs. The remaining applications that do not rely on external libraries might suffer a performance hit.

VI Related Work

Apart from RIKEN’s mini-apps and the ECP proxy-apps, which we use for our study, numerous benchmark suites based on proxy applications are available from other HPC centers and institutes [47, 48, 49, 50, 51, 52]. Overall, these suites show a partial overlap, either directly (i.e., same benchmark) or indirectly (same scientific domain), and they were used, for example, to analyze message passing characteristics [53] or to assess how predictable full application performance is based on proxy-app measurements [54]. Hence, our systematic approach and published framework can be transferred to these alternative benchmarks for complementary studies, and our included raw data can be investigated further w.r.t metrics which were outside the scope of our study.

Furthermore, the HPC community has already started to analyze relevant workloads with respect to arithmetic intensity or memory and other potential bottlenecks for some proxy-apps [38, 55, 56] and individual applications [57, 58, 59]. These studies reveal, similar to our results, that most realistic HPC codes are not compute-bound and achieve very low computational efficiency, which in demonstrated cases affected procurement decisions [46]. However, to the best of our knowledge, we are the first to present a broad study across a wide spectrum of HPC workloads which aims at characterizing bottlenecks and aims specifically at identifying floating-point unit/precision requirements for modern architectures.

VII Conclusion

We compared two architecturally similar processors that have different double-precision silicon budgets. By studying a large number of HPC proxy applications, we found no significant performance difference between these two processors, despite one having more double-precision compute than the other. Our study points toward a growing need to revisit and re-think architecture design decisions in high-performance computing, especially with respect to precision. Do we really need the amount of double-precision compute that modern processors offer? Our results on the Intel Xeon Phi twins point toward a ’No’, and we hope that this work inspires other researchers to also challenge the floating-point to silicon distribution for the available and future general-purpose processors, graphical processors, or accelerators in HPC systems.


  • [1] R. H. Dennard et al., “Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.
  • [2] G. E. Moore, “Lithography and the Future of Moore’s Law,” in Integrated Circuit Metrology, Inspection, and Process Control IX, vol. 2439.   International Society for Optics and Photonics, 1995, pp. 2–18.
  • [3] T. Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” 2018. [Online]. Available:
  • [4] J. Choquette et al., “Volta: Performance and Programmability,” IEEE Micro, vol. 38, no. 2, pp. 42–52, Mar. 2018.
  • [5] J. Pu et al., “FPMax: a 106gflops/W at 217gflops/mm2 Single-Precision FPU, and a 43.7 GFLOPS/W at 74.6 GFLOPS/mm2 Double-Precision FPU, in 28nm UTBB FDSOI,” 2016. [Online]. Available:
  • [6] A. Haidar et al., “Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18, Dallas, Texas, Nov. 2018, accepted at SC ’18.
  • [7] A. Sodani et al., “Knights Landing: Second-Generation Intel Xeon Phi Product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, Mar. 2016.
  • [8] D. Bradford et al., “KNIGHTS MILL: New Intel Processor for Machine Learning,” 2017. [Online]. Available:
  • [9] “ECP Proxy Apps Suite,” 2018. [Online]. Available:
  • [10] RIKEN AICS, “Fiber Miniapp Suite,” 2015. [Online]. Available:
  • [11] A. Heinecke et al., “High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing),” in International Conference on High Performance Computing, ser. ISC ’16.   Springer, 2016, pp. 343–362.
  • [12] N. A. Gawande et al., “Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2017, pp. 399–408.
  • [13] J. Park et al., “High-performance Algebraic Multigrid Solver Optimized for Multi-core Based Distributed Parallel Systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’15.   Austin, TX, USA: ACM, 2015, pp. 54:1–54:12.
  • [14] J. Wozniak et al., “CANDLE/Supervisor: A Workflow Framework for Machine Learning Applied to Cancer Research,” BMC Bioinformatics, 2018.
  • [15] J. Mohd-Yusof et al., “Co-design for molecular dynamics: An exascale proxy application,” Los Alamos National Laboratory, Tech. Rep. LA-UR 13-20839, 2013. [Online]. Available:
  • [16] V. Dobrev et al., “High-Order Curvilinear Finite Element Methods for Lagrangian Hydrodynamics,” SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. B606–B641, 2012.
  • [17] J. Dickson et al., “Replicating HPC I/O Workloads with Proxy Applications,” in Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, ser. PDSW-DISCS ’16.   Piscataway, NJ, USA: IEEE Press, 2016, pp. 13–18.
  • [18] M. A. Heroux et al., “Improving Performance via Mini-applications,” Sandia National Laboratories, Tech. Rep. SAND2009-5574, 2009.
  • [19] M. M. Wolf et al., “A task-based linear algebra Building Blocks approach for scalable graph analytics,” in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1–6.
  • [20] R. F. Boisvert et al., “Matrix Market: A Web Resource for Test Matrix Collections,” in Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement.   London, UK: Chapman & Hall, Ltd., 1997, pp. 125–137. [Online]. Available:
  • [21] Argonne National Laboratory, “NEK5000.” [Online]. Available:
  • [22] N. A. Petersson and B. Sjögreen, “User’s guide to SW4, version 2.0,” Lawrence Livermore National Laboratory, Tech. Rep. LLNL-SM-741439, 2017, (Source code available from \tt
  • [23] S. Habib et al., “HACC: Extreme Scaling and Performance Across Diverse Architectures,” Commun. ACM, vol. 60, no. 1, pp. 97–104, Dec. 2016.
  • [24] J. R. Tramm et al., “XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis,” in PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto, 2014.
  • [25] Y. GUO et al., “Basic Features of the Fluid Dynamics Simulation Software “FrontFlow/Blue”,” SEISAN KENKYU, vol. 58, no. 1, pp. 11–15, 2006.
  • [26] K. Ono et al., “FFV-C package.” [Online]. Available:
  • [27] Y. Andoh et al., “MODYLAS: A Highly Parallelized General-Purpose Molecular Dynamics Simulation Program for Large-Scale Systems with Long-Range Forces Calculated by Fast Multipole Method (FMM) and Highly Scalable Fine-Grained New Parallel Processing Algorithms,” Journal of Chemical Theory and Computation, vol. 9, no. 7, pp. 3201–3209, 2013.
  • [28] T. Misawa et al., “mVMC–Open-source software for many-variable variational Monte Carlo method,” Computer Physics Communications, 2018. [Online]. Available:
  • [29] H. Tomita and M. Satoh, “A new dynamical framework of nonhydrostatic global model using the icosahedral grid,” Fluid Dynamics Research, vol. 34, no. 6, pp. 357–400, 2004. [Online]. Available:
  • [30] RIKEN CSRP, “Grand Challenge Application Project for Life Science,” 2013. [Online]. Available:
  • [31] T. Nakajima et al., “NTChem: A High-Performance Software Package for Quantum Molecular Simulation,” International Journal of Quantum Chemistry, vol. 115, no. 5, pp. 349–359, Dec. 2014.
  • [32] T. Boku et al., “Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer,” Proceedings, 30th International Symposium on Lattice Field Theory (Lattice 2012): Cairns, Australia, June 24-29, 2012, vol. LATTICE2012, p. 188, 2012.
  • [33] J. Dongarra, “The LINPACK Benchmark: An Explanation,” in Proceedings of the 1st International Conference on Supercomputing.   London, UK: Springer-Verlag, 1988, pp. 456–474. [Online]. Available:
  • [34] J. Dongarra et al., “A new metric for ranking high-performance computing systems,” National Science Review, vol. 3, no. 1, pp. 30–35, 2016.
  • [35] E. Strohmaier et al., “TOP500,” Jun. 2018. [Online]. Available:
  • [36] T. Deakin et al., “GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models,” in High Performance Computing, M. Taufer et al., Eds.   Cham: Springer International Publishing, 2016, pp. 489–507.
  • [37] M. Hashimoto et al., “An Empirical Study of Computation-Intensive Loops for Identifying and Classifying Loop Kernels: Full Research Paper,” in Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ser. ICPE ’17.   New York, NY, USA: ACM, 2017, pp. 361–372. [Online]. Available:
  • [38] O. Aaziz et al., “A Methodology for Characterizing the Correspondence Between Real and Proxy Applications,” in 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, Sep. 2018.
  • [39] T. Willhalm et al., “Intel Performance Counter Monitor - A better way to measure CPU utilization,” Jan. 2017. [Online]. Available:
  • [40] K. Raman, “Calculating “FLOP” using Intel Software Development Emulator (Intel SDE),” Mar. 2015. [Online]. Available:
  • [41] S. Sobhee, “Intel VTune Amplifier Release Notes and New Features,” Sep. 2018. [Online]. Available:
  • [42] J. Treibig et al., “LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments,” in Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego, CA, 2010.
  • [43] S. Walker and M. McFadden, “Best Practices for Scalable Power Measurement and Control,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016, pp. 1122–1131.
  • [44] T. Guérout et al., “Energy-aware simulation with DVFS,” Simulation Modelling Practice and Theory, vol. 39, pp. 76–91, 2013.
  • [45] Japan Meteorological Agency (JMA), “JMA begins operation of its 10th-generation supercomputer system,” Jun. 2018. [Online]. Available:
  • [46] S. Saini et al., “Performance Evaluation of an Intel Haswell and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications,” in 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016, pp. 1196–1203.
  • [47] PRACE, “Unified European Applications Benchmark Suite,” Oct. 2016. [Online]. Available:
  • [48] “Mantevo Suite.” [Online]. Available:
  • [49] NERSC, “Characterization of the DOE Mini-apps.” [Online]. Available:
  • [50] LLNL, “LLNL ASC Proxy Apps.” [Online]. Available:
  • [51] ——, “CORAL Benchmark Codes.” [Online]. Available:
  • [52] SPEC, “SPEC HPG: HPG Benchmark Suites.” [Online]. Available:
  • [53] B. Klenk and H. Fröning, “An Overview of MPI Characteristics of Exascale Proxy Applications,” in High Performance Computing: 32nd International Conference, ISC High Performance 2017, ser. ISC ’17, Frankfurt, Germany, Jun. 2017, pp. 217–236.
  • [54] R. F. Barrett et al., “Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications,” Journal of Parallel and Distributed Computing, vol. 75, pp. 107–122, 2015.
  • [55] K. Asifuzzaman et al., “Report on the HPC application bottlenecks,” ExaNoDe, Tech Report ExaNoDe Deliverable D2.5, 2017. [Online]. Available:
  • [56] T. Koskela et al., “A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization,” in High Performance Computing: 33rd International Conference, ISC High Performance 2018, ser. ISC ’18, Frankfurt, Germany, Jun. 2018, pp. 226–245.
  • [57] M. Culpo, “Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters,” PRACE, Tech Report, Aug. 2012. [Online]. Available:
  • [58] J. R. Tramm and A. R. Siegel, “Memory Bottlenecks and Memory Contention in Multi-Core Monte Carlo Transport Codes,” Annals of Nuclear Energy, vol. 82, pp. 195–202, 2015.
  • [59] K. Kumahata et al., “Kernel Performance Improvement for the FEM-based Fluid Analysis Code on the K Computer,” Procedia Computer Science, vol. 18, pp. 2496–2499, 2013.