I Introduction
It is becoming increasingly clear that the road forward in High-Performance Computing (HPC) is one full of obstacles. With the end of Dennard scaling [1] and the end of Moore’s law [2], there is today an ever-increasing need to reconsider how we allocate silicon to the various functional units of modern many-core processors. Amongst those decisions is how we distribute the hardware support for various levels of compute precision.
Historically, most of the compute silicon has been allocated to double-precision (64-bit) compute. Nowadays – in processors such as the forthcoming A64FX [3] and Nvidia Volta [4] – the trend, mostly driven by market/AI demands, is to replace some of the double-precision units with lower-precision units. Lower-precision units occupy less area (up to 3x less when going from double- to single-precision Fused-Multiply-Accumulate [5]), leading to more on-chip resources (more instruction-level parallelism), potentially lower energy consumption, and a definitive decrease in external memory bandwidth pressure (i.e., more values per unit of bandwidth). The gains – up to four times over their DP variants with little loss in accuracy [6] – are attractive and clear, but what is the impact on performance (if any) for existing HPC applications? What performance impact can HPC users expect when migrating their code to future processors with a different distribution of floating-point precision support? Finally, how can we empirically quantify this impact on performance using existing processors, in an apples-to-apples comparison on real-life use cases, without relying on tedious, slow, and potentially inaccurate simulators?
The Intel Xeon Phi was positioned as the high end of many-core processor technology for nearly a decade (Knights Ferry was announced in 2010), and it has changed drastically since its first release. The latest (and also last) two revisions – Knights Landing and Knights Mill – are of particular importance since they arguably reflect two different ways of thinking. Knights Landing has relatively large support for double-precision (64-bit) computations and follows the more traditional school of thought. Knights Mill follows a different direction: the replacement of double-precision compute units with lower-precision (single-precision, half-precision, and integer) compute capabilities.
In the present paper, we quantify and analyze the performance and compute bottlenecks of Intel’s Knights Landing [7] and Knights Mill [8] architectures – two processors with identical micro-architecture where the main difference is the relative allocation of double-precision units. We stress both processors with numerous realistic benchmarks from both the Exascale Computing Project (ECP) proxy applications [9] and the RIKEN-CCS Fiber Miniapp Suite [10] – benchmarks used in HPC system acquisition. Through an extensive (and robust) performance measurement process (which we also open-source), we empirically show the architectures’ relative weaknesses. In short, the contributions of the present paper are:
- An empirical performance evaluation of the Knights Landing and Knights Mill family of processors – both proxies for previous and future architectural trends – with respect to benchmarks derived from realistic HPC workloads,
- An in-depth analysis of the results, including identification of bottlenecks for the different application/architecture combinations, and
- An open-source compilation of our evaluation methodology, including our collected raw data.
II Architectures, Environment, and Applications
Our research objective is to evaluate the impact of migrating from an architecture with (relatively) high amount of double-precision compute to an architecture with less. By high amount of double-precision compute we mean architectures whose Floating-Point Unit (FPU) has most of its silicon dedicated to 64-bit IEEE-754 floating-point operations, and by less double-precision compute we mean architectures that replace those same double-precision FPUs with lower – potentially hybrid – precision units.
To understand and explore the intersection of architectures with a high amount of double-precision and those with hybrid precision, there is a need to find a processor whose architecture is unchanged with the sole exception of its floating-point-unit-to-silicon distribution. Only one modern processor family allows for such an apples-to-apples comparison: the Xeon Phi family of processors.
II-A Hardware & Software Environment
TABLE I: Hardware overview of the three evaluated compute nodes.

Feature | KNL | KNM | Broadwell-EP
---|---|---|---
CPU Model | 7210F | 7295 | 2x E5-2650v4
#{Cores} (HT) | 64 (4x) | 72 (4x) | 24 (2x)
Base Frequency | 1.3 GHz | 1.5 GHz | 2.2 GHz
Max Turbo Freq. | 1.4 GHz | 1.6 GHz | 2.9 GHz
CPU Mode | Quadrant | Quadrant | N/A
TDP | 230 W | 320 W | 210 W
DRAM Size | 96 GiB | 96 GiB | 256 GiB
DRAM Triad BW | 71 GB/s | 88 GB/s | 122 GB/s
MCDRAM Size | 16 GiB | 16 GiB | N/A
MCDRAM Triad BW | 439 GB/s | 430 GB/s | N/A
MCDRAM Mode | Cache | Cache | N/A
LLC Size | 32 MiB | 36 MiB | 60 MiB
Inst. Set Extension | AVX-512 | AVX-512 | AVX2
FP32 Peak Perf. | 5,324 Gflop/s | 13,824 Gflop/s | 1,382 Gflop/s
FP64 Peak Perf. | 2,662 Gflop/s | 1,728 Gflop/s | 691 Gflop/s
Intel’s Knights Landing (KNL) and Knights Mill (KNM) are the latest incarnations of a long line of architectures in Intel’s accelerator family. Both processors consist of a large number of processor cores (64 and 72, respectively), interconnected by a mesh (prior to KNL: a ring interconnect). Each core has a private L1 cache and a slice of the distributed L2 cache. Caches are kept coherent through the directory-based MESIF protocol. Both processors come with two types of external memory: MCDRAM (based on the Hybrid Memory Cube design) and Double Data Rate (DDR4) memory. Unique to the Xeon Phi processors is that the MCDRAM can be configured into one of three modes of operation: it is either (1) directly addressable in the global memory address space (memory-mapped), called flat mode, or (2) it acts as a last-level cache in front of the DDR4, called cache mode. Finally, the third mode (hybrid mode [11]) is a combination of the properties of the first two modes.
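For illustration, in flat mode an application can explicitly place bandwidth-critical arrays into MCDRAM, e.g., via the memkind library’s hbwmalloc interface. The following minimal sketch is our own example (not part of the benchmarks); in the cache mode used throughout this study, MCDRAM caching is transparent and no such source changes are needed:

```c
/* Illustrative sketch (not used in our study): explicit MCDRAM allocation in
   flat mode via the memkind library's hbwmalloc interface. In cache mode,
   MCDRAM acts as a transparent last-level cache and no source changes apply. */
#include <hbwmalloc.h>   /* hbw_check_available(), hbw_malloc(), hbw_free() */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1UL << 27;          /* 128 Mi doubles = 1 GiB */
    double *a;

    if (hbw_check_available() != 0) {    /* is MCDRAM exposed as HBW memory? */
        fprintf(stderr, "no high-bandwidth memory found (not in flat mode?)\n");
        return EXIT_FAILURE;
    }

    a = hbw_malloc(n * sizeof(double));  /* allocate from the MCDRAM NUMA node(s) */
    if (a == NULL)
        return EXIT_FAILURE;

    for (size_t i = 0; i < n; ++i)       /* touched pages now reside in MCDRAM */
        a[i] = (double)i;

    hbw_free(a);
    return EXIT_SUCCESS;
}
```

Such a program would be linked against the memkind library (e.g., cc -O2 example.c -lmemkind).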
There are several policies governing where data is homed. A common high-performance configuration [12], which is also the one we use in our study, is the quadrant mode. In quadrant mode, the physical cores are divided into four logical parts, where each logical part is assigned two memory controllers; each logical group is treated as a unique Non-Uniform Memory Access (NUMA) node, allowing the operating system to perform data-locality optimizations. Table I surveys and contrasts the processors against each other, with the main differences highlighted. The main architectural difference – which is also the difference whose impact we seek to empirically quantify – is the Floating-Point Unit (FPU). In KNL, this unit features two 512-bit wide vector units (AVX-512), together capable of executing 32 double-precision or 64 single-precision operations per cycle, totaling 2.6 Tflop/s of double- and 5.3 Tflop/s of single-precision performance, respectively, across all 64 processing cores. In KNM, however, the FPU is redesigned to replace one 512-bit vector unit with two Vector Neural Network Instruction (VNNI) units. Those units, although specializing in hybrid-precision FMA, can execute single-precision vector instructions, but have no support for double-precision compute. Thus, in total, the KNM can execute up to 1.7 Tflop/s of double-precision or 13.8 Tflop/s of single-precision computations. In summary, the KNM has 2.59x more single-precision compute, while the KNL has 1.54x more double-precision compute.
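As a back-of-the-envelope check (our own derivation, assuming the base frequencies of Table I and the per-core flop/cycle numbers implied above), the peak rates follow from cores × frequency × flop/cycle:

\begin{align*}
\text{KNL, FP64:}\ & 64 \times 1.3~\text{GHz} \times 32~\text{flop/cycle} \approx 2{,}662~\text{Gflop/s}\\
\text{KNL, FP32:}\ & 64 \times 1.3~\text{GHz} \times 64~\text{flop/cycle} \approx 5{,}325~\text{Gflop/s}\\
\text{KNM, FP64:}\ & 72 \times 1.5~\text{GHz} \times 16~\text{flop/cycle} = 1{,}728~\text{Gflop/s}\\
\text{KNM, FP32:}\ & 72 \times 1.5~\text{GHz} \times 128~\text{flop/cycle} = 13{,}824~\text{Gflop/s}
\end{align*}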
While both the KNL and KNM are functionally and architecturally similar, there are some noteworthy differences. First, the operating frequencies of the processors differ: the KNL operates at 1.3 GHz (and up to 1.5 GHz in turbo mode), while the KNM operates at 1.5 GHz (1.6 GHz turbo). Hence, the KNM executes roughly 15% more cycles per second than the KNL. Furthermore, although the cores of KNM and KNL are similar (except for the FPU), the core counts differ: KNL has 64 cores while KNM has 72 cores. Both processors are manufactured in 14 nm technology. Finally, the amount of on-chip last-level cache differs between the two processors, where KNM has a 4 MiB advantage over KNL.
Additionally, for verification reasons, we include a modern dual-socket Xeon-based compute node in our evaluation. Despite being vastly different from the Xeon Phi systems, our Xeon Broadwell-EP (BDW) general-purpose processor is used to cross-check metrics, such as execution time and performance (the Xeon Phi should perform better), frequency-scaling experiments (BDW has more frequency domains), and performance counters (BDW exposes more performance counters). Aside from the differences mentioned above (and highlighted in Table I), the setup of the Xeon Phi nodes (and the BDW node) is identical, including the same operating system, software stack, and solid-state disk.
For the operating system (OS) and software environment, we use equivalent setups across our three compute nodes. The OS is a fresh installation of CentOS 7 (minimal) with Linux kernel version 3.10.0-862, which has the latest versions of the Meltdown and Spectre patches enabled. During our experiments, we limit potential OS noise by disabling all remote storage (Network File System in our case) and allowing only a single user on the system. Most of our applications are compiled with Intel’s Parallel Studio XE (version 2018; update 3) compilers, and we install the latest versions of Intel TensorFlow and Intel MKL-DNN for the deep learning proxy application, since our assumption is that Intel’s software stack allows for the highest utilization of their hardware. Exceptions to this compiler selection are listed in the subsequent Section III. We use Intel MPI from the Parallel Studio XE suite to execute our measurements.

II-B Benchmark Applications
Over the years, the HPC community has developed many benchmarks representing real workloads or testing the capabilities of a system – primarily for comparisons across architectures, but also for system procurement purposes. The Exascale Computing Project (ECP) proxy applications [9] and RIKEN AICS’ Fiber Miniapp Suite [10], which we focus on for this study, are just two examples representing modern HPC workloads. These benchmarks are designed to evaluate single-node and small-scale test installations, and hence are adequate for our study.
II-B1 The ECP Proxy-Apps
The ECP suite (release v1.0) consists of 12 proxy applications primarily written in C (5x), FORTRAN (3x), C++ (3x), and Python (1x), listed hereafter.
Algebraic multi-grid (AMG)
solver of the hypre library is a parallel solver for unstructured grids [13] arising from fluid dynamics problems. We choose problem 1 for our tests, which applies a 27-point stencil on a 3-D linear system.
Candle (Cndl)
is a deep learning benchmark suite to tackle various problems in cancer research [14]. We select benchmark 1 of pilot 1 (P1B1), which builds an autoencoder from a sample of gene expression data to improve the prediction of drug responses.
Co-designed Molecular Dynamics (CoMD)
serves as the reference implementation for ExMatEx [15] to facilitate co-design for (and evaluation of) classical molecular dynamics algorithms. We are using the included strong-scaling example to calculate the inter-atomic potential for 256,000 atoms.
LAGrangian High-Order Solver – Laghos (LAGO)
computes compressible gas dynamics through an unstructured high-order finite element method [16]. The input for our study is the simulation of a 2-dimensional Sedov blast wave with default settings as documented for the Laghos proxy-app.
MACSio (MxIO)
is a synthetic Multi-purpose, Application-Centric, Scalable I/O proxy designed to closely mimic realistic I/O workloads of HPC applications [17]. Our input causes MACSio to write a total of 433.8 MB to disk.
MiniAMR (MAMR)
is an adaptive mesh refinement proxy application of the Mantevo project [18] which applies a stencil computation on a 3-dimensional space, in our case a sphere moving diagonally through a cubic medium.
MiniFE (MiFE)
is a reference implementation of an implicit finite element solver [18] for scientific problems resulting in unstructured 3-dimensional grids. For our study, we use 128×128×128 input dimensions for the grid.
MiniTri (MTri)
is a proxy for (irregular) data analytics workloads based on triangle counting in graphs [19]. Our input is a sparse matrix from the Matrix Market collection [20].
Nekbone (NekB)
is a proxy for the Nek5000 application [21], and uses the conjugate gradient method for solving the standard Poisson equation in computational fluid dynamics problems. We enabled the multi-grid preconditioner, and for strong scaling (see Section III-B) we fixed the number of elements per process and the polynomial order, respectively.
SW4lite (SW4L)
is a proxy for the computational kernels used in the seismic modelling software, called SW4 [22], and we use the pointsource example, which calculates the wave propagation emitted from a single point in a half-space.
SWFFT (FFT)
represents the compute kernel of the HACC cosmology application [23] for N-body simulations. The 3-D fast Fourier transformation of SWFFT emulates one performance-critical part of HACC’s Poisson solver. In our tests, we perform 32 repetitions on a 128×128×128 grid.
XSBench (XSBn)
is the proxy for the Monte Carlo calculations used by a neutron particle transport simulator for a Hoogenboom-Martin nuclear reactor [24]. We simulate a large reactor model represented by a unionized grid with cross-section lookups per particle.
II-B2 RIKEN Mini-Apps
In comparison to the modernized ECP proxy-apps, RIKEN’s eight mini-apps are written in FORTRAN (4x), C (2x), and a mix of FORTRAN/C/C++ (2x).
FrontFlow/blue (FFB)
uses the finite element method to solve the incompressible Navier-Stokes equation for thermo-fluid analysis [25]. We simulate the 3-D cavity flow in a rectangular space discretized into 50×50×50 cubes.
Frontflow/violet Cartesian (FFVC)
falls into the same problem class as FFB, however the difference is that FFVC uses the finite volume method (FVM) [26]. Here, we calculate the 3-D cavity flow in a 144×144×144 cuboid.
Modylas (Mdyl)
makes use of the fast multipole method for long-range force evaluations in molecular dynamics simulations [27]. Our input is the wat222 example which distributes 156,240 atoms over a 16×16×16 cell domain.
many-variable Variational Monte Carlo (mVMC) method
implemented by this mini-app is used to simulate quantum lattice models for studying the physics of condensed matter [28]. We use mVMC’s included strong-scaling test, but downsize it (1/3 lattice dimensions and 1/4 of samples).
Nonhydrostatic ICosahedral Atmospheric Model (NICM)
is a proxy of NICAM, which computes mesoscale convective cloud systems based on FVM for icosahedral grids [29]. We run Jablonowski’s baroclinic wave test (gl05rl00z40pe10), but reduce the simulated days from 11 to 1.
Next-Gen Sequencing Analyzer (NGSA)
is a mini-app of a genome analyzer and a set of alignment tools designed to facilitate cancer research by detecting genetic mutations in human DNA [30]. For our experiments, we rely on pre-generated pseudo-genome data (ngsa-dummy).
NTChem (NTCh)
implements a computational kernel of the NTChem software framework for quantum chemistry calculations of molecular electronic structures, i.e., the solver for the second-order Møller-Plesset perturbation theory [31]. We select the H2O test case for our study.
Quantum ChromoDynamics (QCD)
mini-app solves the lattice QCD problem in a 4-D lattice (3-D plus time), represented by a sparse coefficient matrix, to investigate the interaction between quarks [32]. We evaluate QCD with the Class 2 input for a lattice discretization.
TABLE II: Classification of the ECP proxy-apps and RIKEN mini-apps by scientific/engineering domain, compute pattern, and programming language.

ECP | Scientific/Engineering Domain | Compute Pattern | Language
---|---|---|---
AMG | Physics and Bioscience | Stencil | C
CANDLE | Bioscience | Dense matrix | Python
CoMD | Material Science/Engineering | N-body | C
Laghos | Physics | Irregular | C++
miniAMR | Geoscience/Earthscience | Stencil | C
miniFE | Physics | Irregular | C++
miniTRI | Math/Computer Science | Irregular | C++
Nekbone | Math/Computer Science | Sparse matrix | Fortran
SW4lite | Geoscience/Earthscience | Stencil | C
SWFFT | Physics | FFT | C/Fortran
XSBench | Physics | Irregular | C

RIKEN | Scientific/Engineering Domain | Compute Pattern | Language
---|---|---|---
FFB | Engineering (Mechanics, CFD) | Stencil | Fortran
FFVC | Engineering (Mechanics, CFD) | Stencil | C++/Fortran
mVMC | Physics | Dense matrix | C
NICAM | Geoscience/Earthscience | Stencil | Fortran
NGSA | Bioscience | Irregular | C
MODYLAS | Physics and Chemistry | N-body | Fortran
NTChem | Chemistry | Dense matrix | Fortran
QCD | Lattice QCD | Stencil | Fortran/C
II-B3 Reference Benchmarks
In addition to those 20 applications, we use the compute-intensive HPL [33] benchmark, as well as HPCG [34] and a stream benchmark (both memory-intensive), to evaluate the baseline of the investigated architectures.
High Performance Linpack (HPL)
solves a dense system of linear equations to demonstrate the double-precision compute capabilities of an (HPC) system [35]. Our problem size is 64,512 (see the memory-footprint estimate after this list). For both HPL and HPCG (see below), we employ the highly tuned versions shipped with Intel’s Parallel Studio XE suite with appropriate parameters for our systems.
High Performance Conjugate Gradients (HPCG)
applies a conjugate gradient solver to a system of linear equations (a sparse matrix), with the intent to demonstrate the system’s memory subsystem and network limits. We choose 360×360×360 as the global problem dimensions for HPCG.
BabelStream (BABL)
is one of many available “stream” benchmarks supporting evaluations of the memory subsystem for CPUs and accelerators [36]. We will use 2 GiB and 14 GiB input vectors, see Section IV-C for details.
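Regarding the HPL problem size above (our own back-of-the-envelope estimate), N = 64,512 corresponds to a double-precision matrix footprint of

\[ N^2 \times 8\,\text{B} = 64{,}512^2 \times 8\,\text{B} \approx 33.3\,\text{GB} \approx 31\,\text{GiB}, \]

which fits into the 96 GiB of DDR4 on the Xeon Phi nodes but clearly exceeds the 16 GiB of MCDRAM.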
We provide a compressed overview of the ECP and RIKEN proxy applications in Table II. In this table, each application is categorized by its scientific domain as well as by its primary workload/kernel classification, for which we use the classifiers employed by Hashimoto et al. [37]. Both the scientific domain and the kernel classification will be important for our subsequent analysis in Sections IV and V.

III Methodology
In this section, we present our rigorous benchmarking approach for investigating the characteristics of each architecture and extracting the necessary information for our study.
III-A Benchmark Setup and Configuration Selection
Since the benchmarks listed in Section II-B are, firstly, realistic proxies of the original applications [38] and, secondly, used in procurement processes, we can confidently assume that these benchmarks are well tuned and come with appropriate compiler options for a variety of compilers. Hence, we refrain from both manual code optimization and alterations of the compiler options. The only modifications we perform are:
- Enabling interprocedural optimization (-ipo) and compilation for the highest instruction set available (-xHost); exceptions: (a) AMG is compiled with -xCORE-AVX2 to avoid arithmetic errors, and (b) NGSA’s BWA tool is compiled with GNU gcc to avoid segfaults,
- Patching a segmentation fault in MACSio (after our reporting, the developers patched the upstream version), and
- Injecting our measurement source code, see Section III-B.
With respect to the measurement runs, we follow this five-step approach for each benchmark:
(0) Install, patch, and compile the benchmark, see above,
(1) Select appropriate inputs/parameters/seeds for execution,
(2) Determine “best” parallelism: #processes and #threads,
(3) Execute a performance, a profiling, and a frequency run,
(4) Analyze the results (go to Step (0) if anomalies are detected),
and we will further elaborate on those steps hereafter.
For the input selection, we have to balance multiple constraints and choose based on: Which recommended inputs are listed by the benchmark developers? How long does the benchmark run? (Our aim is 1 s–10 min due to the large sample size we have to cover.) Does it occupy a realistic amount of main memory (e.g., avoiding cache-only executions)? Are the results repeatable (randomness/seeds)? We optimize for the metrics reported by the benchmark (e.g., select the input with the highest Gflop/s rate). Furthermore, one of the most important considerations while selecting the right inputs is strong scaling. We require strong-scaling properties of the benchmark for two reasons: the results collected in Step (2) need to be comparable, and, even more importantly, the results of Step (3) must be comparable between different architectures, since we may have to use different numbers of MPI processes for KNL and KNM (and our BDW reference architecture) due to their difference in core counts. The only exception is MiniAMR, for which we are unable to find a strong-scaling input configuration; instead, we optimized for the reported Gflop/s of the benchmark and, accordingly, choose the same number of MPI processes on our KNL and KNM compute nodes for MiniAMR.
In Step (2), we evaluate numerous combinations of MPI processes and OpenMP threads for each benchmark, including combinations which over-/undersubscribe the CPU cores, and test each combination with three runs to minimize the potential for outliers due to system noise. For all subsequent measurements, we select the number of processes and threads based on the “best” (w.r.t. time-to-solution of the solver) combination among these tested versions, see Table IV for details. We are not applying specific tuning options to the Intel MPI library, except for using Intel’s recommended settings for HPCG with respect to thread affinity and MPI_Allreduce implementation. The reason is that our pretests (with a subset of the benchmarks) with non-default parameters for Intel MPI consistently resulted in longer time-to-solution.

For Step (3), we run each benchmark ten times to identify the fastest time-to-solution for the (compute) kernel of the benchmark. Additionally, for the profiling runs, we execute the benchmark once for each of the profiling tools and/or metrics (in case the tool is used for multiple metrics), see Section III-B for details. Finally, we perform frequency-scaling experiments for each benchmark, where we throttle the CPU frequency to all available CPU states below the maximum frequency used for the performance runs, and record the lowest kernel time-to-solution among ten trials per frequency. The reason for and results of the frequency-scaling test will be further explained in Section IV-D. One may argue for more than ten runs per benchmark to find the optimal time-to-solution; however, given prediction-interval theory and our deterministic benchmarks executed on a single node, it is unlikely to obtain a much faster run, and we confirmed that the fastest 50% of executions per benchmark only vary by 3.9% on average. The collected metrics, see the following section, will be analyzed in detail in Section IV.
III-B Metrics and Measurement Tools
To study and analyze the floating-point requirements of applications, it is important to evaluate not only the established metric (floating-point operations per second), but also other metrics, such as memory throughput, cache utilization, or speedup with increased CPU frequency. The detailed list of metrics (and derived metrics), as well as the methodology and tools we use to collect them, is explained hereafter.
One observation is that the amount of time spent on initialization and post-processing within each proxy application can be relatively high (e.g., HPCG spends only 11% and 30% of its time in the solver part on BDW and Phi, respectively) and is usually not representative of the real workloads; e.g., one can reduce the number of epochs for performance-evaluation purposes in CANDLE, but not the input data pre-processing required to execute those epochs. These mismatches in the kernel-to-[pre/post]processing ratio require us to extract all metrics only for the (computational) kernel of each benchmark. Hence, we identify and inject profiling instructions around the kernels to start or pause the collection of raw metric data by the analysis tools. This code injection is exemplified in PseudoCode 1. Therefore, unless otherwise stated in this or subsequent sections, all presented data is based exclusively on the kernel portion of each benchmark.
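A minimal sketch of this injection pattern is shown below (our own illustration of what PseudoCode 1 describes, assuming an MPI-based benchmark; the start/pause hooks are placeholders for the tool-specific commands, e.g., SDE SSC marks or VTune’s resume/pause API):

```c
/* Illustrative sketch: wrap only the computational kernel with measurement
   markers so that runtime and counter collection exclude initialization and
   post-processing. The hook bodies are placeholders for tool-specific calls. */
#include <mpi.h>
#include <stdio.h>

static double t_start, t_kernel;

static void profiling_start(void) {
    /* placeholder: resume collection, e.g., __SSC_MARK(0x111) for SDE
       or __itt_resume() for VTune */
    t_start = MPI_Wtime();
}

static void profiling_stop(void) {
    t_kernel += MPI_Wtime() - t_start;
    /* placeholder: pause collection, e.g., __SSC_MARK(0x222) / __itt_pause() */
}

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... benchmark initialization and input loading (not measured) ... */

    profiling_start();
    /* ... computational kernel of the proxy-/mini-app (measured) ... */
    profiling_stop();

    /* ... post-processing and output (not measured) ... */

    if (rank == 0)
        printf("kernel time-to-solution: %.3f s\n", t_kernel);
    MPI_Finalize();
    return 0;
}
```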
For tool-stability reasons, attention to detail/accuracy, and overlap with our needs, we settled on the use of the MPI API for runtime measurements, alongside Intel’s Processor Counter Monitor (PCM) [39], Intel’s Software Development Emulator (SDE) [40], and Intel’s VTune Amplifier [41] (to avoid persistent compute-node crashes, we had to disable VTune’s built-in sampling driver and instead rely on Linux’ perf tool). Furthermore, as auxiliary tools, we rely on RRZE’s Likwid [42] for frequency scaling (our Linux kernel version required us to disable the default Intel P-State driver to gain full access to fine-grained frequency scaling) and LLNL’s msr-safe [43] to allow access to CPU model-specific registers. An overview of the (raw) metrics which we extract with these tools for the benchmarks listed in Section II-B is shown in Table III. Furthermore, derived metrics, such as Gflop/s, will be explained on demand in Section IV.
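As an illustration of one such derived metric (our own sketch; the exact definitions are stated where needed in Section IV), the achieved floating-point rate and FP efficiency combine the SDE operation counts with the kernel runtime:

\[ \text{Gflop/s} = \frac{\#\{\text{FP operations}\}}{\text{kernel runtime [s]} \cdot 10^{9}}, \qquad \text{FP efficiency} = \frac{\text{achieved Gflop/s}}{\text{peak Gflop/s (Table I)}}. \]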
TABLE III: Raw metrics and the methods/tools used to collect them.

Raw Metric | Method/Tools
---|---
Runtime [s] | MPI_Wtime()
#{FP / integer operations} | SDE
#{Branch operations} | SDE
Memory throughput [B/s] | PCM (pcm-memory.x)
#{L2/LLC cache hits/misses} | PCM (pcm.x)
Consumed Power [Watt] | PCM (pcm-power.x)
SIMD instructions per cycle | perf + VTune (‘hpc-performance’)
Memory/Back-end boundedness | perf + VTune (‘memory-access’)
IV Evaluation
The following subsections primarily focus on visualizing and analyzing the key metrics we collect for each proxy- and mini-app, such as Gflop/s. The significance of our findings with respect to future software, CPU, and HPC system design will then be discussed in Section V. Analyzing the instruction mix, flop/s, or memory throughput (see Sections IV-A, IV-B, and IV-C) in an isolated fashion is not a good indication of the system’s bottlenecks; hence, especially when reasoning about FPU requirements, we also have to understand the applications’ compute-boundedness, which we evaluate in Section IV-D. Table III summarizes the primary metrics and the methods/tools to collect them; Table IV includes additional metrics.
IV-A Integer vs. Single-Precision FP vs. Double-Precision FP
[Fig. 1: Breakdown of the total number of integer and single/double-precision floating-point operations per benchmark and architecture.]
The breakdown of the total number of integer and single/double-precision floating-point (FP) operations, as depicted in Figure 1, shows two rather unexpected trends. First, the number of proxy-apps relying on 32-bit FP instructions is four out of 22, which is surprisingly low; furthermore, only one of them utilizes both 32-bit and 64-bit FP instructions. Minor variances in the integer-to-FP ratio between the architectures can likely be explained by the difference in AVX vector length, the quality of compiler optimization for each CPU, and the execution/parallelization approach. The second unexpected trend is the imbalance of integer to FP operations, i.e., 16 of 22 applications issue at least 50% integer operations. However, one has to keep in mind that the Intel SDE output includes AVX vector instructions for integers, where the granularity can be as low as 1 bit per operand (cf. 4 or 8 byte per FP operand). Hence, the total integer operation count might be slightly inflated. Lastly, the results for HPCG show a big discrepancy between BDW and KNL/KNM. While the total FP operation count is similar, the binary for KNL/KNM issues far more integer operations, see Table IV for details, and we are unaware of the reason.

IV-B Floating-Point Operation/s and Time-to-Solution
[Fig. 2: Relative performance improvement of KNL/KNM over the dual-socket BDW node and absolute achieved Gflop/s per benchmark.]
Figure 2 shows the relative performance improvement of KNL/KNM over the dual-socket BDW node and the absolute achieved Gflop/s on each processor. It is important to note that all proxy-/mini-apps, with the exception of HPL, achieve less than 21.5% (BDW), 10.5% (KNL), and 15.1% (KNM) FP efficiency. The fact that these applications are presumably optimized and still achieve such low FP efficiency implies a limited relevance of FP-unit availability. The figure shows that the majority of codes have comparable performance on KNM versus KNL. Notable mentions are: a) CANDLE, which benefits from the VNNI units in mixed precision; b) MiFE, NekB, and XSBn, which improve probably due to the increased core count and KNM’s higher CPU frequency; and c) some memory-bound applications (i.e., AMG, HPCG, and MTri), which get slower supposedly due to the difference in peak throughput demonstrated in Figure 3, in addition to the increased core count causing higher competition for bandwidth.

IV-C Memory Throughput of (MC-)DRAM
[Fig. 3: DRAM and MCDRAM memory throughput per benchmark; horizontal lines mark the maximum achievable (BabelStream) bandwidth.]
For the memory throughput measurements, shown in Figure 3, we use Intel’s PCM tool to analyze DRAM and MCDRAM throughput. Our measurements with BabelStream are included as well to demonstrate the maximum achievable bandwidth (see the horizontal lines for MCDRAM in flat mode), which is lower when the MCDRAM is used in cache mode. We still achieve 86% of it on KNL and 75% on KNM when the vectors fit into MCDRAM, but drop to only slightly above DRAM throughput (due to minor prefetching benefits) when the vectors do not fit (see BABL14 for 14 GiB vectors). This throughput advantage of the MCDRAM translates into a performance boost for six proxy-apps (AMG, MAMR, MiFE, NekB, XSBn, and QCD) which heavily utilize the available bandwidth, see Figure 3, and which are memory-bound on our reference system. This can easily be verified by comparing the time-to-solution for the kernels listed in Table IV. Only HPCG cannot benefit from the higher bandwidth: despite showing 2x throughput, the runtime drops by more than 10%, indicating a memory-latency issue of HPCG on KNL/KNM, which is one of the design goals for the benchmark [34].
IV-D Frequency Scaling to Identify Compute-Boundedness
[Fig. 4: Speedup of each benchmark under core-frequency scaling on KNL, KNM, and BDW, relative to the lowest available CPU frequency.]
[Fig. 5: Breakdown of node-hours by scientific domain at various supercomputing facilities (referenced in Section V-B).]
For this test, we disable turbo boost and throttle the core frequency, but keep the uncore at maximum frequency (lowering it would otherwise negatively affect the memory subsystem), to identify each application’s dependency on ALU/FPU performance. The speedup (w.r.t. time-to-solution) shown in Figure 4 for each proxy-app is relative to the lowest CPU frequency on each architecture, and we include our performance results (cf. Section III-A) with maximum frequency plus enabled turbo boost.
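As a rough interpretation aid (our own simplification, ignoring mixed compute/memory behavior and uncore effects), the speedup of a kernel when raising the core frequency from $f_{\min}$ to $f$ is bounded by

\[ S(f) = \frac{t(f_{\min})}{t(f)} \approx \begin{cases} f/f_{\min} & \text{purely compute-bound,}\\ 1 & \text{purely memory-bound,} \end{cases} \]

so a curve tracking $f/f_{\min}$ indicates compute-boundedness, while a flat curve indicates that the kernel is limited by the (uncore-driven) memory subsystem.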
While the benefit from enabled turbo boost on BDW is nearly invisible (except for MTri), the proxy-apps clearly reduce their time-to-solution on KNL and KNM when the CPUs are allowed to turbo. Overall, the benchmarks seem to be less memory-bound and more compute-bound when moving to the Xeon Phi, especially salient for AMG and MiniFE, indicating a clear benefit from the much bigger/faster MCDRAM used as last-level cache and a more balanced (w.r.t. bandwidth-to-flop/s ratio) architecture. However, the limited speedup for HPL on KNL clearly shows this CPU’s abundance of FP64 units; here, the successor, Knights Mill, shows a better balance. Another interesting observation is the inverse behavior of AMG and HPCG on our tested architectures. Both benchmarks are supposed to be memory-bound, but the absence of any sign of scalability with frequency on the Xeon Phi strengthens our hypothesis from Section IV-C that HPCG is primarily memory-latency bound.
For the I/O portions of an application, Figure 4 reveals another observation: MACSio’s write speed scales with increased frequency. Since MACSio performs only single-digit Giop/s and negligible flop/s, increasing the CPU’s compute capabilities cannot explain the observed speedup. Hence, our theory is that MACSio (and I/O in general) is bound by the Linux kernel, whose performance depends on the CPU frequency. Guérout et al. report similar findings [44], and we see equivalent behavior with a micro-benchmark (using Unix’s dd command).
IV-E Remaining Metrics
To disseminate the remaining results from our experiments, we attach Table IV to this paper, which can be utilized for further analysis and which contains some interesting data points. For example, the power measurements for CANDLE, which are just slightly higher than when running MACSio, indicate that Intel’s MKL-DNN (used underneath to compute on the FP16 VNNI units) does not fully utilize the CPU’s potential. Furthermore, the L2 hit rates on both Xeon Phi are considerably higher than on our reference hardware, indicating improvements in the hardware prefetcher, and are presumably a direct effect of the high-bandwidth MCDRAM in cache mode.
V Discussion and Implications
While the previous section focuses on the collected data and comparisons between the three architectures, this section summarizes the relevant points to consider from our study, which should be taken into account when moving forward.
V-A Performance Metrics
The de facto performance metric reported in HPC is flop/s. Reporting flop/s is not limited to applications that are compute-bound: benchmarks designed to resemble realistic workloads, e.g., the memory-bound HPCG benchmark, typically report performance in flop/s. The proxy-/mini-apps in this study likewise typically report flop/s, despite only six of the 20 proxy-/mini-apps we analyze appearing to be compute-bound (including NGSA, which is bound by ALUs, not FPUs). We argue that convening on reporting relevant metrics would shift the focus of the community to be less flop/s-centered.
V-B Considerations for HPC Utilization by Scientific Domain
This paper highlights the diminishing relevance of flop/s when considering the actual requirements of representative proxy-apps. The relevance of flop/s on a given supercomputer diminishes further when considering the node-hours spent yearly on different scientific domains at supercomputing facilities. Figure 5 summarizes the breakdown of node-hours by scientific domain for different supercomputing facilities (based on the yearly reports of the mentioned facilities). For instance, by simply mapping the scientific domains in Figure 5 to representative proxies, ANL’s ALCF and R-CCS’s K computer would be achieving 14% and 11%, respectively, of the peak flop/s when projecting over the annual node-hours. It is worth mentioning that the relevance of flop/s is even more of an issue for supercomputers dedicated to specific workloads, where it can vary widely. For instance, a supercomputer dedicated mainly to weather forecasting, e.g., the 18 Pflop/s system recently installed at the Japan Meteorological Agency [45], should give minimal relevance to flop/s, since the proxy representing this workload achieves only 6% of the peak flop/s (those workloads are typically memory-bound). On the other hand, a supercomputer dedicated to AI/ML, such as ABCI, the world’s 5th-fastest supercomputer as of June 2018, would put high emphasis on flop/s, since deep learning workloads rely heavily on dense matrix multiplication operations.
V-C Memory-bound Applications
As demonstrated in Figure 2, the performance of memory-bound applications is mostly unaffected by the peak flop/s available. Accordingly, investment in data-centric architectures and programming models should take priority over paying a premium for flop/s-centric systems. In one motivating instance, during the investigation that NASA Ames Research Center conducted to identify the planned upgrade of the Pleiades supercomputer in 2016 [46], the study concluded that the performance gain from upgrading to Intel Haswell processors was insignificant in comparison to using the older Ivy Bridge-based processors (the newer processor offered double the peak flop/s at almost the same memory bandwidth). Hence, the choice was to only do a partial upgrade to Haswell processors.
V-D Compute-bound Applications
Investing more in data-centric architectures to accommodate memory-bound applications can have a negative impact on the remaining minority of applications: compute-bound applications. Considering the market trends that are already pushing away from dedicating the majority of chip area to FP64 units, it is likely that libraries with compute-bound code (e.g., BLAS) will support mixed precision or emulation via lower-precision FPUs. The remaining applications that do not rely on external libraries might suffer a performance hit.
VI Related Work
Apart from RIKEN’s mini-apps and the ECP proxy-apps, which we use for our study, numerous benchmark suites based on proxy applications from other HPC centers and institutes are available [47, 48, 49, 50, 51, 52]. Overall, those suites show a partial overlap, either directly (i.e., same benchmark) or indirectly (same scientific domain), and have, for example, been used to analyze message-passing characteristics [53] or to assess how predictable full-application performance is based on proxy-app measurements [54]. Hence, our systematic approach and published framework (https://gitlab.com/domke/PAstudy) can be transferred to these alternative benchmarks for complementary studies, and our included raw data can be investigated further w.r.t. metrics which were outside the scope of our study.
Furthermore, the HPC community has already started to analyze relevant workloads with respect to arithmetic intensity or memory and other potential bottlenecks for some proxy-apps [38, 55, 56] and individual applications [57, 58, 59], revealing similar results to ours that most realistic HPC codes are not compute-bound and achieve very low computational efficiency, which in demonstrated cases affected procurement decisions [46]. However, to the best of our knowledge, we are the first to present a broad study across a wide spectrum of HPC workloads which aims at characterizing bottlenecks and aims specifically at identifying floating-point unit/precision requirements for modern architectures.
VII Conclusion
We compared two architecturally similar processors that have different double-precision silicon budgets. By studying a large number of HPC proxy applications, we found no significant performance difference between these two processors, despite one having more double-precision compute than the other. Our study points toward a growing need to re-iterate and re-think architecture design decisions in high-performance computing, especially with respect to precision. Do we really need the amount of double-precision compute that modern processors offer? Our results on the Intel Xeon Phi twins point towards a ’No’, and we hope that this work inspires other researchers to also challenge the floating-point-to-silicon distribution of available and future general-purpose processors, graphics processors, and accelerators in HPC systems.
References
- [1] R. H. Dennard et al., “Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.
- [2] G. E. Moore, “Lithography and the Future of Moore’s Law,” in Integrated Circuit Metrology, Inspection, and Process Control IX, vol. 2439. International Society for Optics and Photonics, 1995, pp. 2–18.
- [3] T. Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” 2018. [Online]. Available: http://www.fujitsu.com/jp/Images/20180821hotchips30.pdf
- [4] J. Choquette et al., “Volta: Performance and Programmability,” IEEE Micro, vol. 38, no. 2, pp. 42–52, Mar. 2018.
- [5] J. Pu et al., “FPMax: a 106gflops/W at 217gflops/mm2 Single-Precision FPU, and a 43.7 GFLOPS/W at 74.6 GFLOPS/mm2 Double-Precision FPU, in 28nm UTBB FDSOI,” 2016. [Online]. Available: http://arxiv.org/abs/1606.07852
- [6] A. Haidar et al., “Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18, Dallas, Texas, Nov. 2018.
- [7] A. Sodani et al., “Knights Landing: Second-Generation Intel Xeon Phi Product,” IEEE Micro, vol. 36, no. 2, pp. 34–46, Mar. 2016.
- [8] D. Bradford et al., “KNIGHTS MILL: New Intel Processor for Machine Learning,” 2017. [Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.21-Monday-Pub/HC29.21.40-Processors-Pub/HC29.21.421-Knights-Mill-Bradford-Intel-APPROVED.pdf
- [9] “ECP Proxy Apps Suite,” 2018. [Online]. Available: https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/
- [10] RIKEN AICS, “Fiber Miniapp Suite,” 2015. [Online]. Available: https://fiber-miniapp.github.io/
- [11] A. Heinecke et al., “High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing),” in International Conference on High Performance Computing, ser. ISC ’16. Springer, 2016, pp. 343–362.
- [12] N. A. Gawande et al., “Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2017, pp. 399–408.
- [13] J. Park et al., “High-performance Algebraic Multigrid Solver Optimized for Multi-core Based Distributed Parallel Systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’15. Austin, TX, USA: ACM, 2015, pp. 54:1–54:12.
- [14] J. Wozniak et al., “CANDLE/Supervisor: A Workflow Framework for Machine Learning Applied to Cancer Research,” BMC Bioinformatics, 2018.
- [15] J. Mohd-Yusof et al., “Co-design for molecular dynamics: An exascale proxy application,” Los Alamos National Laboratory, Tech. Rep. LA-UR 13-20839, 2013. [Online]. Available: http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf
- [16] V. Dobrev et al., “High-Order Curvilinear Finite Element Methods for Lagrangian Hydrodynamics,” SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. B606–B641, 2012.
- [17] J. Dickson et al., “Replicating HPC I/O Workloads with Proxy Applications,” in Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, ser. PDSW-DISCS ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 13–18.
- [18] M. A. Heroux et al., “Improving Performance via Mini-applications,” Sandia National Laboratories, Tech. Rep. SAND2009-5574, 2009.
- [19] M. M. Wolf et al., “A task-based linear algebra Building Blocks approach for scalable graph analytics,” in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1–6.
- [20] R. F. Boisvert et al., “Matrix Market: A Web Resource for Test Matrix Collections,” in Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement. London, UK, UK: Chapman & Hall, Ltd., 1997, pp. 125–137. [Online]. Available: http://dl.acm.org/citation.cfm?id=265834.265854
- [21] Argonne National Laboratory, “NEK5000.” [Online]. Available: http://nek5000.mcs.anl.gov
- [22] N. A. Petersson and B. Sjögreen, “User’s guide to SW4, version 2.0,” Lawrence Livermore National Laboratory, Tech. Rep. LLNL-SM-741439, 2017 (source code available from geodynamics.org/cig).
- [23] S. Habib et al., “HACC: Extreme Scaling and Performance Across Diverse Architectures,” Commun. ACM, vol. 60, no. 1, pp. 97–104, Dec. 2016.
- [24] J. R. Tramm et al., “XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis,” in PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto, 2014.
- [25] Y. GUO et al., “Basic Features of the Fluid Dynamics Simulation Software “FrontFlow/Blue”,” SEISAN KENKYU, vol. 58, no. 1, pp. 11–15, 2006.
- [26] K. Ono et al., “FFV-C package.” [Online]. Available: http://avr-aics-riken.github.io/ffvc_package/
- [27] Y. Andoh et al., “MODYLAS: A Highly Parallelized General-Purpose Molecular Dynamics Simulation Program for Large-Scale Systems with Long-Range Forces Calculated by Fast Multipole Method (FMM) and Highly Scalable Fine-Grained New Parallel Processing Algorithms,” Journal of Chemical Theory and Computation, vol. 9, no. 7, pp. 3201–3209, 2013.
- [28] T. Misawa et al., “mVMC–Open-source software for many-variable variational Monte Carlo method,” Computer Physics Communications, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0010465518303102
- [29] H. Tomita and M. Satoh, “A new dynamical framework of nonhydrostatic global model using the icosahedral grid,” Fluid Dynamics Research, vol. 34, no. 6, pp. 357–400, 2004. [Online]. Available: http://stacks.iop.org/1873-7005/34/i=6/a=A03
- [30] RIKEN CSRP, “Grand Challenge Application Project for Life Science,” 2013. [Online]. Available: http://www.csrp.riken.jp/application_d_e.html#D2
- [31] T. Nakajima et al., “NTChem: A High-Performance Software Package for Quantum Molecular Simulation,” International Journal of Quantum Chemistry, vol. 115, no. 5, pp. 349–359, Dec. 2014.
- [32] T. Boku et al., “Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer,” Proceedings, 30th International Symposium on Lattice Field Theory (Lattice 2012): Cairns, Australia, June 24-29, 2012, vol. LATTICE2012, p. 188, 2012.
- [33] J. Dongarra, “The LINPACK Benchmark: An Explanation,” in Proceedings of the 1st International Conference on Supercomputing. London, UK, UK: Springer-Verlag, 1988, pp. 456–474. [Online]. Available: http://dl.acm.org/citation.cfm?id=647970.742568
- [34] J. Dongarra et al., “A new metric for ranking high-performance computing systems,” National Science Review, vol. 3, no. 1, pp. 30–35, 2016.
- [35] E. Strohmaier et al., “TOP500,” Jun. 2018. [Online]. Available: http://www.top500.org/
- [36] T. Deakin et al., “GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models,” in High Performance Computing, M. Taufer et al., Eds. Cham: Springer International Publishing, 2016, pp. 489–507.
- [37] M. Hashimoto et al., “An Empirical Study of Computation-Intensive Loops for Identifying and Classifying Loop Kernels: Full Research Paper,” in Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ser. ICPE ’17. New York, NY, USA: ACM, 2017, pp. 361–372. [Online]. Available: http://doi.acm.org/10.1145/3030207.3030217
- [38] O. Aaziz et al., “A Methodology for Characterizing the Correspondence Between Real and Proxy Applications,” in 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, Sep. 2018.
- [39] T. Willhalm et al., “Intel Performance Counter Monitor - A better way to measure CPU utilization,” Jan. 2017. [Online]. Available: https://software.intel.com/en-us/articles/intel-performance-counter-monitor
- [40] K. Raman, “Calculating “FLOP” using Intel Software Development Emulator (Intel SDE),” Mar. 2015. [Online]. Available: https://software.intel.com/en-us/articles/calculating-flop-using-intel-software-development-emulator-intel-sde
- [41] S. Sobhee, “Intel VTune Amplifier Release Notes and New Features,” Sep. 2018. [Online]. Available: https://software.intel.com/en-us/articles/intel-vtune-amplifier-release-notes
- [42] J. Treibig et al., “LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments,” in Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego, CA, 2010.
- [43] S. Walker and M. McFadden, “Best Practices for Scalable Power Measurement and Control,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016, pp. 1122–1131.
- [44] T. Guérout et al., “Energy-aware simulation with DVFS,” Simulation Modelling Practice and Theory, vol. 39, pp. 76–91, 2013.
- [45] Japan Meteorological Agency (JMA), “JMA begins operation of its 10th-generation supercomputer system,” Jun. 2018. [Online]. Available: https://www.jma.go.jp/jma/en/News/JMA_Super_Computer_upgrade2018.html
- [46] S. Saini et al., “Performance Evaluation of an Intel Haswell and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications,” in 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016, pp. 1196–1203.
- [47] PRACE, “Unified European Applications Benchmark Suite,” Oct. 2016. [Online]. Available: http://www.prace-ri.eu/ueabs/
- [48] “Mantevo Suite.” [Online]. Available: https://mantevo.org/packages/
- [49] NERSC, “Characterization of the DOE Mini-apps.” [Online]. Available: https://portal.nersc.gov/project/CAL/designforward.htm
- [50] LLNL, “LLNL ASC Proxy Apps.” [Online]. Available: https://computation.llnl.gov/projects/co-design/proxy-apps
- [51] ——, “CORAL Benchmark Codes.” [Online]. Available: https://asc.llnl.gov/CORAL-benchmarks/
- [52] SPEC, “SPEC HPG: HPG Benchmark Suites.” [Online]. Available: https://www.spec.org/hpg/
- [53] B. Klenk and H. Fröning, “An Overview of MPI Characteristics of Exascale Proxy Applications,” in High Performance Computing: 32nd International Conference, ISC High Performance 2017, ser. ISC ’17, Frankfurt, Germany, Jun. 2017, pp. 217–236.
- [54] R. F. Barrett et al., “Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications,” Journal of Parallel and Distributed Computing, vol. 75, pp. 107–122, 2015.
- [55] K. Asifuzzaman et al., “Report on the HPC application bottlenecks,” ExaNoDe, Tech Report ExaNoDe Deliverable D2.5, 2017. [Online]. Available: http://exanode.eu/wp-content/uploads/2017/04/D2.5.pdf
- [56] T. Koskela et al., “A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization,” in High Performance Computing: 33nd International Conference, ISC High Performance 2018, ser. ISC ’18, Frankfurt, Germany, Jun. 2018, pp. 226–245.
- [57] M. Culpo, “Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters,” PRACE, Tech Report, Aug. 2012. [Online]. Available: https://doi.org/10.5281/zenodo.807482
- [58] J. R. Tramm and A. R. Siegel, “Memory Bottlenecks and Memory Contention in Multi-Core Monte Carlo Transport Codes,” Annals of Nuclear Energy, vol. 82, pp. 195–202, 2015.
- [59] K. Kumahata et al., “Kernel Performance Improvement for the FEM-based Fluid Analysis Code on the K Computer,” Procedia Computer Science, vol. 18, pp. 2496–2499, 2013.