Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU

In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, was integrated by ATOS/Bull and is powered by Marvell's latest CPU, the ThunderX2. This is the same CPU that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500 list, in November 2018. Our study ranges from micro-benchmarks up to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph 500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has, on average, 25% lower performance per node, somewhat compensated by its 30% higher memory bandwidth. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.


1 Introduction

The Arm architecture is gaining significant traction in the race to Exascale. Several international collaborations, including the Japanese Post-K, the European Mont-Blanc, and the UK's GW4/EPSRC, have announced the adoption of Arm technology as a viable option for high-end production high-performance computing (HPC) systems. In November 2018, for the first time, an Arm-based system was ranked in the Top500 list: the Astra supercomputer, powered by Marvell's (formerly Cavium) ThunderX2 processor, integrated by HPE and installed at Sandia National Laboratories (US). For more than six years, research projects have evaluated Arm-based systems for parallel and scientific computing in collaboration with industry, advocating the higher efficiency of this technology derived from the mobile and embedded market.

The computational requirements of scientific and industrial applications are increasing. As a consequence, the HPC market is also growing steadily at double-digit rates, according to one of the latest reports by Hyperion Research (https://www.hpcwire.com/2018/06/27/at-isc18-hyperion-reports). This market growth goes hand in hand with the appearance of a jungle of new technologies and architectures. From an economic standpoint, this will help to diversify the market and level prices. From a technical point of view, however, it is not always clear what the gains are, in terms of performance and energy consumption, of new technologies entering the data centers. The most prominent example of this phenomenon is the adoption of GP-GPUs in HPC: even if the benefit of using graphical accelerators was shown in the early 2000s, it is only now, after more than ten years, that data centers consider GP-GPUs a well-established HPC technology. A very similar dynamic is happening with the adoption of Arm CPUs.

In this uncertain scenario, we performed our study to help HPC application scientists understand the real implications of using this new technology. We targeted the latest Arm-powered CPU by Marvell, which offers close to state-of-the-art performance. The ThunderX2 processor powers the Dibona cluster, the system we have used for our evaluation. For comparison, we have also considered Intel Skylake CPUs, which are available on some nodes of Dibona as well as in the state-of-the-art Tier-0 supercomputer MareNostrum4. Our analysis follows a bottom-up approach: we start from micro-benchmarking the cluster, then move to a higher-level evaluation using HPC applications, first on a single node and finally scaling up to a thousand cores. We selected a set of production HPC workloads from those considered in the European project Mont-Blanc 3, which proved to be scalable on different state-of-the-art HPC architectures, and we measured their performance at scale as well as their energy footprint. We also include the Graph 500 benchmark as a representative of an emerging class of irregular workloads. By doing so, we de facto explore different architectural configurations of CPUs and show that, while the ISA does not seem to make a big difference in the final overall performance, the micro-architectural choices (e.g., the size of the SIMD units or the organization of the memory hierarchy) and the software configurations (e.g., compilers) are vital factors for delivering a powerful and efficient modern HPC system.

The main contributions of this paper are:

  1. We provide a comparative analysis of the Arm-based ThunderX2 cluster Dibona and the Intel Skylake-based Tier-0 supercomputer MareNostrum4. We show that the software tools for the ThunderX2 are mature and on the same level as those available for Skylake, and that the performance of a ThunderX2 node is, on average, 25% lower than that of a Skylake node.

  2. We analyze the power drain of the Marvell ThunderX2 processor under three HPC production codes and a benchmark, comparing it with that of state-of-the-art HPC technology. We show that the energy-to-solution of the two node technologies under study is equivalent.

  3. We study the behavior at scale of these codes on the ThunderX2 cluster and on the x86-based Tier-0 supercomputer MareNostrum4, concluding that the scalability trends of both technologies are comparable.

The rest of the document is structured as follows: In Section 2, we introduce the context of our evaluation, and in Sections 3 and 4, we detail the evaluation methodology and the hardware features of the HPC clusters used in the following sections. Section 5 is dedicated to introducing the HPC applications and their characterization when running on Dibona. Section 6 is reserved for energy measurements and comparisons. In Section 7, we report the results of our tests at scale on Dibona and on the Tier-0 supercomputer MareNostrum4, and Section 8 includes performance projections based on these results. We conclude with Section 9, where we summarize our evaluation experience.

2 Related Work

Several papers have been published with preliminary analyses of benchmarks and performance projections of Arm-based systems-on-chip (SoCs) coming from the mobile and embedded market rajovic2014tibidabo; rajovic2013supercomputing; rajovic2016mont; oyarzun2018efficient; cesini2018infn. More recently, tests on Arm-based server SoCs have also appeared in the literature calore2018advanced; puzovic2016quantifying. The most relevant and recent work focusing on the evaluation of the ThunderX2 processor is that of McIntosh-Smith et al. mcintosh2018performance. They evaluate Isambard, a high-end Tier-2 system developed by Cray in the framework of the GW4 alliance gw4. While they provide an extensive single-node evaluation, we complement their contribution by evaluating HPC applications at scale.

For our evaluation, we chose three HPC production codes: Alya vazquez2016alya, a finite elements code handling multi-physics simulations developed at the Barcelona Supercomputing Center, already studied on different architectures in garcia2018computational; garcia2019runtime; a Lattice-Boltzmann code LBC gracia2012lbctasks using the BGK approximation for the collision term bhatnagar1954model, which, even if it is not a full-production code, mimics the typical behavior of Lattice Boltzmann simulations used both for advanced fluid dynamics studies biferale2011second and architectural evaluation mantovani2013exploiting; calore2019optimization; Tangaroa Tangaroa; reinhardt2017asyncSPH, a smoothed particle hydrodynamics code, whose purpose is to simulate fluids in a way suitable for computer animation. We also included in our study the Graph 500 benchmark murphy2010introducing as a representative of an emerging irregular workflow that not only benefits from the pure floating-point performance of a system, but also stresses the memory and the network with irregular access patterns. In addition to the several studies of this benchmark, we recall the one of Checconi et al. in checconi2013massive, where the authors study the Graph 500 benchmark on a successful HPC architecture, the IBM Blue Gene/Q CPU.

The use of Arm technology in HPC has also been driven by energy-efficiency arguments. Like several others, we address them by comparing the performance and the energy-to-solution of our four test cases on different architectures. Radulovic et al. radulovic2018mainstream provide an extensive study of emerging HPC architectures, including their performance and efficiency. However, they focus mostly on benchmarks and kernels, while we evaluate complex codes used for the production of scientific results (e.g., Alya). Jarus et al. jarus2013performance also study the performance and efficiency of HPC CPUs by different manufacturers. They focused on HPL, providing for each CPU an extrapolation of its ranking in the Green500. As already mentioned, our contribution focuses less on benchmarks and more on real applications. It is also worth mentioning that the CPU technology we are evaluating is targeted at the data-center market, not borrowed from the embedded world. D'Agostino et al. d2019soc as well as McIntosh-Smith et al. mcintosh2018performance also analyze the cost efficiency of emerging technologies in HPC. We prefer to leave the variable of price out of our analysis, since it involves a negotiation process beyond our control and, in our opinion, would make the comparison less relevant.

3 Hardware Description

In this section, we present the technical specifications of the systems used for our analysis. These are the new HPC Arm-based cluster, called Dibona, mainly built around the ThunderX2 processor, and for comparison, MareNostrum4, an Intel-based supercomputer, which represents a well-known baseline HPC architecture. Table 1 shows a summary of the hardware configurations of the clusters used as a reference in our study.

Dibona-TX2 Dibona-X86 MareNostrum4
Core architecture Armv8 Intel x86 Intel x86
CPU name ThunderX2 Skylake Platinum Skylake Platinum
Frequency [GHz] 2.0 2.1 2.1
Sockets/node 2 2 2
Core/node 64 56 48
Memory/node [GB] 256 256 96
Memory tech. DDR4-2666 DDR4-3200 DDR4-3200
Memory channels 8 6 6
Num. of nodes 40 3 3456
Interconnection Infiniband EDR Infiniband EDR Intel OmniPath
System integrator ATOS/Bull ATOS/Bull Lenovo
Table 1: Summary of the hardware configuration of the platforms

3.1 The MareNostrum4 Supercomputer

MareNostrum4 is a Tier-0 supercomputer in production at the Barcelona Supercomputing Center (BSC) in Barcelona, Spain. It has a total of 3456 compute nodes, each of which includes two Intel Xeon Platinum 8160 processors with 24 Skylake cores clocked at 2.1 GHz and 6 DDR4-3200 memory channels (https://www.bsc.es/user-support/mn4.php). Each node is equipped with 96 GB of DDR4-3200. The interconnection network is 100 Gbit/s Intel Omni-Path (OPA). MareNostrum4 runs a Linux 4.4.12 kernel, and it uses SLURM 17.11.7 as a workload manager. On this supercomputer, we perform the scalability study presented in Section 7.

3.2 The Dibona Cluster

The Dibona cluster is the primary outcome of the European project Mont-Blanc 3. It was designed and integrated by ATOS/Bull and evaluated by the Mont-Blanc 3 partners between September 2018 and February 2019 mb3DeliverableAppicationsDibona. Thanks to the ATOS/Bull Sequana HPC infrastructure, it seamlessly integrates 40 Arm-based compute nodes together with three x86-based compute nodes. Each Arm-based compute node is powered by two Marvell ThunderX2 CN9980 processors, each with 32 Armv8 cores at 2.0 GHz, 32 MB of L3 cache, and 8 DDR4-2666 memory channels. The x86 nodes are powered by two Intel Xeon Platinum 8176 processors, each with 28 Skylake cores running at 2.1 GHz and 6 DDR4-2666 memory channels. The total amount of RAM installed is 256 GB per compute node.

Compute nodes are interconnected with a fat-tree network, with a pruning factor of 12 at level 1, implemented with Mellanox IB EDR-100 switches. A separate 1 GbE network is employed for the management of the cluster and a network file system (NFS). Dibona runs the Linux 4.14.0 kernel, and it uses SLURM 17.02.11, patched by ATOS/Bull, as its job scheduler.

3.3 Mapping of our study on hardware resources

Having MareNostrum4 deployed at the Barcelona Supercomputing Center allows us to use it as a reference machine to be compared with Dibona for benchmarking (as described in Sections 4 and 5) and as a platform to scale applications at thousands of cores (as presented in Section 7).

Moreover, the fact that Dibona houses Armv8 and x86 compute nodes allowed us to carry out fair energy comparisons in Section 6 with nodes on the same system, differing mainly in their CPU architecture, and offering an identical power monitoring infrastructure.

In Table 2, we map the studies performed in this manuscript onto the hardware platforms available for this evaluation.

Hardware characterization (Section 4): STREAM, FPU-kernel, Roofline, OSU; platforms: Dibona-TX2 and MareNostrum4
Single-node performance (Section 5): Alya, LBC, Tangaroa, Graph 500; platforms: Dibona-TX2 and MareNostrum4
Energy measurements (Section 6): Alya, LBC, Tangaroa, Graph 500; platforms: Dibona-TX2 and Dibona-X86
Scalability (Section 7): Alya, LBC, Tangaroa, Graph 500; platforms: Dibona-TX2 and MareNostrum4
Table 2: Summary of the study performed in this manuscript

In the rest of the document, we refer to the partition of Dibona powered by Arm ThunderX2 CPUs as Dibona-TX2 (shortened to DBN-TX2 when required); artwork related to this system is displayed in red. The partition of Dibona powered by x86 Skylake CPUs is named Dibona-X86 (shortened to DBN-X86 when required); artwork related to x86-based CPUs (both Dibona-X86 and MareNostrum4) is displayed in blue.

4 Hardware Characterization

This section is dedicated to the micro-benchmarking of Dibona-TX2, the Arm-based platform selected for our study. We use MareNostrum4 as the state-of-the-art system for comparison.

4.1 Dibona-TX2 Memory Subsystem

Here we evaluate the memory bandwidth using STREAM mccalpin1995memory, a simple synthetic benchmark to measure sustainable memory bandwidth. Our study analyses a ThunderX2 node from the Dibona-TX2 cluster, compared side by side with a Skylake node from MareNostrum4. STREAM kernels iterate through data arrays of double-precision floating-point elements (8 bytes) with a size fixed at compile time. The size of each array must be greater than the maximum between ten million elements and four times the size of the sum of all the last-level caches.

That is, N_min = max(10^7, 4 · S_LLC / 8), where N_min is the minimum number of elements of each array, and S_LLC is the size of the last-level cache in bytes. Table 3 shows a brief overview of the memory subsystem of each socket, including the minimum value of N_min and the compiler flavor and version used. The reader should remember that each node has two sockets.

Dibona-TX2 MareNostrum4
Architecture Arm ThunderX2 x86 Skylake
L1 cache size 32 kB 64 kB
L2 cache size 256 kB 256 kB
L3 cache size 32 MB 33 MB
Main mem. tech. DDR4-2666 DDR4-3200
# of channels 8 6
Peak bandwidth 170.64 GB/s 153.60 GB/s
Minimum array size N_min 16777216 17301504
Compiler Arm 18.3 Intel 17.0.4
Table 3: Memory subsystem overview
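
For reference, and assuming the cache sizes in Table 3 are expressed in binary megabytes, the minimum array sizes listed there follow directly from the rule above (both values exceed the 10^7-element floor):

$$N_{\min}^{\mathrm{DBN\text{-}TX2}} = \frac{4 \cdot 32 \cdot 2^{20}}{8} = 16\,777\,216, \qquad N_{\min}^{\mathrm{MN4}} = \frac{4 \cdot 33 \cdot 2^{20}}{8} = 17\,301\,504.$$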

We run the benchmark by fixing the problem size to the minimum valid value of N_min for each platform and increasing the number of OpenMP threads. We report the results of the Triad function as a representative kernel, since the rest of the kernels show similar behavior. Threads are pinned to cores using OMP_PROC_BIND=true, distributing the threads evenly across both sockets and minimizing the number of threads accessing the same L2 cache. Figure 1 shows the achieved bandwidth. The x-axis represents the number of OpenMP threads, growing up to the number of cores in each node, and the y-axis indicates the maximum bandwidth achieved throughout 200 executions of the kernel. The figure also includes two horizontal lines representing the theoretical peak bandwidth of each processor. Please note that the DDR technology is different: the ThunderX2 CPU housed in the Dibona-TX2 cluster uses DDR4-2666, with a theoretical peak of 21.33 GB/s per channel, while the Skylake CPU of MareNostrum4 uses DDR4-3200, with a theoretical peak of 25.60 GB/s per channel.
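
For context, the Triad kernel that produces these numbers is essentially the loop below (a minimal sketch, not the full STREAM benchmark, which also handles timing, repetitions, and result verification); the thread count and pinning are controlled through OMP_NUM_THREADS and OMP_PROC_BIND as described above.

```c
#include <stddef.h>

/* Minimal sketch of the STREAM Triad kernel: a[i] = b[i] + q * c[i].
 * STREAM_N must respect the minimum array size discussed above
 * (16777216 elements on Dibona-TX2, 17301504 on MareNostrum4). */
#define STREAM_N 16777216

static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

void triad(double q)
{
    #pragma omp parallel for
    for (size_t i = 0; i < STREAM_N; ++i)
        a[i] = b[i] + q * c[i];
}
```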

Figure 1: STREAM Triad Best bandwidth achieved over number of OpenMP threads in Dibona and MareNostrum4 nodes, thread binding: Interleaved

The figure shows that ThunderX2 reaches 228.62 GB/s (67% of the peak) in the Triad kernel when running with 64 OpenMP threads (i.e., one full node), while MareNostrum4 obtains 171.89 GB/s (56% of the peak) with 48 OpenMP threads. This fact shows that, although the ThunderX2 has slower memory, it outperforms the Skylake thanks to the extra memory channels.

4.2 Floating-Point Throughput

We designed a micro-kernel to measure the peak floating-point throughput of a given CPU. We call this code FPU_kernel. It contains exclusively fused-multiply-accumulate assembly instructions with no data dependencies between them. The kernel has four versions distinguishing between i) scalar and vector instructions and ii) single and double precision. The Armv8 ISA has floating-point instructions that accept single- and double-precision registers as operands. In this case, the kernel uses the instruction FMADD. In Dibona-TX2, the Armv8 cores of the ThunderX2 integrate the NEON vector extension. With 128 bit registers, it can fit either two double-precision or four single-precision data elements per register. The kernel uses the NEON vector instruction FMLA. Although the x86 ISA has floating-point instructions that run on the FPU, the more recent SIMD instructions of the AVX512 vector extension are recommended. Thus, the compiler automatically translates a * b + c to VFMADD132SS or VFMADD132SD for single- and double-precision instructions, respectively. We implemented the SIMD version of the kernel using the AVX512 instruction VFMADD132PS for single precision and the VFMADD132PD for double precision. This means that the scalar version of the code in the x86 architecture will use vector instructions with the same behavior as scalar floating-point instructions. The theoretical peak of the vector unit can be computed as the product of i) the vector size in elements (e.g., four single-precision elements in NEON); ii) the number of instructions issued per cycle; iii) the frequency of the processor; iv) the number of floating-point operations made by the instruction (e.g., fused-multiply-accumulate does two floating-point operations).
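
To illustrate the structure of such a micro-kernel, the sketch below issues double-precision NEON FMLA operations from C intrinsics. This is a hedged reconstruction, not the authors' FPU_kernel (which uses assembly instructions directly), and more independent accumulators than shown here may be needed to fully cover the FMA latency.

```c
#include <arm_neon.h>

/* Hedged sketch of a NEON double-precision FMA throughput probe.
 * Several independent accumulators keep the FMA pipelines busy;
 * each vfmaq_f64 performs two fused multiply-adds (128-bit vectors). */
double fma_probe(long iters)
{
    float64x2_t acc0 = vdupq_n_f64(1.0), acc1 = vdupq_n_f64(1.0);
    float64x2_t acc2 = vdupq_n_f64(1.0), acc3 = vdupq_n_f64(1.0);
    const float64x2_t b = vdupq_n_f64(1.0000001);
    const float64x2_t c = vdupq_n_f64(1.0e-7);

    for (long i = 0; i < iters; ++i) {
        acc0 = vfmaq_f64(acc0, b, c);   /* acc0 += b * c */
        acc1 = vfmaq_f64(acc1, b, c);
        acc2 = vfmaq_f64(acc2, b, c);
        acc3 = vfmaq_f64(acc3, b, c);
    }
    /* Combine and return the result so the compiler cannot drop the loop. */
    float64x2_t s = vaddq_f64(vaddq_f64(acc0, acc1), vaddq_f64(acc2, acc3));
    return vgetq_lane_f64(s, 0) + vgetq_lane_f64(s, 1);
}
```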

Dibona-TX2 MareNostrum4
Architecture Arm ThunderX2 x86 Skylake
Vector extension NEON AVX512
Instruction FMLA VFMADD
Precision Single Double Single Double
Vec. Length 4 2 16 8
Issue/cycle 2 2 2 2
Freq. [GHz] 2.00 2.00 2.10 2.10
Flop/Inst 2 2 2 2
Peak [GFlop/s] 32.00 16.00 134.40 67.20
Table 4: Theoretical peak performance of one NEON and one AVX512 vector units in Dibona-TX2 and MareNostrum4 nodes

Table 4 lists these parameters and the theoretical peak for the vector extensions available in Dibona-TX2 and MareNostrum4 nodes, both with single- and double-precision vector operations.

Figure 2: Sustained performance in one core of the four versions of the FPU_Kernel in Dibona-TX2 and MareNostrum4 nodes (see theoretical peak performance in Table 4)

Figure 2 shows the results measured on both machines. It can be seen that the scalar floating-point units of both architectures have similar performance, and that the slightly better performance of MareNostrum4 is solely due to its 5% higher CPU frequency. In contrast, the AVX512 vector unit of the Skylake housed in MareNostrum4 outperforms the NEON unit of the ThunderX2 processor in Dibona-TX2 by a wide margin in both single and double precision (the ratio of the theoretical peaks in Table 4 is 4.2x). The fact that the vector registers of the Skylake are four times larger than those of the ThunderX2, together with the higher frequency, accounts for this performance difference.

4.3 Roofline Model

The roofline model allows us to visualize the hardware limitations of different computational systems and characterize the workload of computational kernels ofenbeck2014applying. The model plots the sustained performance P, measured in (Giga) floating-point operations per second, as a function of the arithmetic intensity I. The arithmetic intensity is defined as I = W/Q, where W is the number of floating-point operations executed by the application (i.e., its computational workload), while Q is the number of bytes that the application exchanges with the main memory (i.e., the application dataset).

P(I) = min(P_peak, B_peak × I)    (1)

Equation 1 shows the theoretical formulation of the roofline model. P_peak and B_peak are, respectively, the peak floating-point performance (expressed in Flop/s) and the peak bandwidth to/from main memory (expressed in Byte/s). Using P_peak as measured with the FPU_kernel in Section 4.2 and B_peak as benchmarked in Section 4.1, we can, therefore, compute P(I) and plot it in Figure 3, where we present the roofline model of a Dibona-TX2 node and a MareNostrum4 node.
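
Equation 1 translates directly into code. The helper below is a generic sketch; the peak values passed to it are placeholders to be replaced with the measurements of Sections 4.1 and 4.2, not figures from this study.

```c
#include <stdio.h>

/* Attainable performance P(I) = min(P_peak, B_peak * I), cf. Equation 1.
 * p_peak in GFlop/s, b_peak in GB/s, intensity in Flop/Byte. */
static double roofline(double p_peak, double b_peak, double intensity)
{
    double mem_bound = b_peak * intensity;
    return mem_bound < p_peak ? mem_bound : p_peak;
}

int main(void)
{
    /* Illustrative values only, not measured figures. */
    double p_peak = 100.0, b_peak = 200.0;
    for (double i = 0.0625; i <= 4.0; i *= 2.0)
        printf("I = %6.4f Flop/B -> P = %7.2f GFlop/s\n",
               i, roofline(p_peak, b_peak, i));
    return 0;
}
```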

An engineer knowing the arithmetic intensity and the computational performance of their application or kernel can, therefore, identify the headroom available for optimization towards the maximum performance delivered by the system.

Figure 3: Roofline of Dibona-TX2 (red) and MareNostrum4 (blue)

Figure 3 quickly reveals the different architectural configurations of the two types of node and allows the exploration of a slightly different configuration space. The red area is indeed a configuration space enabled by the 8-channel memory sub-system of the ThunderX2 in Dibona-TX2, not reachable with current Skylake architectures that are only offering six memory channels. In contrast, the area highlighted in blue is the space where highly vectorizable codes can take advantage of the wider SIMD units on the Skylake (AVX512) compared to the narrower Armv8 NEON extension.

4.4 Interconnection Network

To this point, we have characterized different architectural aspects of the ThunderX2 and Skylake processors, but multi-node scaling experiments will also be affected by the interconnection network. Therefore, this section evaluates the network performance of the Dibona-TX2 cluster, which uses a Mellanox InfiniBand EDR (IB) interconnect. Naturally, we compare it with that of the MareNostrum4 supercomputer, which uses an Intel Omni-Path (OPA) interconnect. Figure 4 shows the achieved throughput, as reported by the OSU benchmarks liu2003performance, as a function of the message size of the communication. Each point represents the average over multiple repetitions of the communication.

Figure 4: Bandwidth between two processes in different nodes

Both networks approach the theoretical peak as the message size increases, reaching roughly 95% of it. OPA consistently achieves a better bandwidth than IB for message sizes over 256 KiB. The difference in bandwidth is also very noticeable at message sizes around 4 KiB and 8 KiB, where OPA almost doubles IB. The measured bandwidth of OSU in Dibona-TX2 stalls around 8 and 16 KiB but then shoots up to 10 GB/s for larger message sizes. This behavior is consistent across multiple pairs of nodes and between executions. We verified that this behavior disappears if we measure the bandwidth with the ib_read_bw tool by Mellanox. As this tool exchanges data using the raw network protocol, we can only conclude that the "valley" appearing in Figure 4 is caused by the OpenMPI configuration deployed by ATOS/Bull on the Dibona cluster at the moment of the tests.

Figure 5: Weak links in the Dibona-TX2 network, message size: 4 KiB, axes correspond to nodes, bandwidth [MB/s] in color code

We repeated the tests for multiple pairs of nodes to determine if there were systematic weak links on Dibona-TX2. Figure 5 shows a map where the axes represent the pair of nodes, and each cell is color-coded to indicate the bandwidth. We present the measurements for message sizes of 4 KiB. There is a recurring pattern along the diagonal where pairs of nodes have higher bandwidth. This fact is due to the network topology, as these pairs of nodes are connected to the same level-1 switch. Nodes that are topologically farther apart reach 10 % less bandwidth than neighboring nodes.

5 Application Characterization

In this section, we describe the three scientific applications and the benchmark we used to evaluate Dibona-TX2. We provide a short description and a computational characterization of the codes, together with a performance evaluation at a small scale (one or two compute nodes). The applications were selected from the set of production HPC workloads that were analyzed in the European project Mont-Blanc 3. They proved to be scalable on different state-of-the-art HPC architectures and represent a range of real scientific HPC applications. The lattice Boltzmann code (LBC) operates on a stencil-like access pattern, Tangaroa is a particle tracking code with a behavior similar to sparse-matrix applications, and Alya is a complex finite-element code that comprises several solvers with different characteristics. All of them are parallelized with MPI using two-sided primitives.

We also evaluate the Graph 500 benchmark with the goal of including an irregular code representative of emerging HPC workloads that are becoming increasingly important for diverse fields, such as social networks, biology, intelligence, and e-commerce. We invite the reader interested in other benchmark evaluations to complement our work with, e.g., that of McIntosh-Smith et al. mcintosh2018performance and D. Ruiz et al. HPCS19_hpcg. We also highlight the work of G. Ramirez-Gargallo et al. HPML19_tensorflow for the reader interested in the performance of Dibona-TX2 under artificial intelligence workloads.

For each code, we present a brief description of its purpose, internal structure, and the chosen dataset. We include a small-scale evaluation of the applications and the benchmark on Dibona-TX2 and MareNostrum4 nodes. We also detail the performance of the different compiler solutions available on both clusters: the generic GNU suite and the vendor-specific alternatives (the LLVM-based Arm HPC Compiler and the Intel suite). Although, in several cases, the complexity of the codes does not allow fine-grained benchmarking, we try to provide quantitative observations of computational features to help understand the behavior of the applications at scale.

Unless otherwise noted, we use two metrics to evaluate the scalability: the elapsed time, which identifies the fastest option, and the parallel efficiency, which helps to understand how well the code scales on a given system. In this section, the efficiency has been computed as E(n) = t_1 / (n × t_n), where t_1 is the execution time when running with one core, and t_n is the execution time when running with n cores.

5.1 Alya

Alya vazquez2016alya is a high-performance computational mechanics code developed at the Barcelona Supercomputing Center. Alya can solve different physics, including incompressible/compressible turbulent flows, solid mechanics, chemistry, particle transport, heat transfer, and electrical propagation. It is part of the Unified European Applications Benchmark Suite (UEABS) of PRACE, a set of twelve relevant codes together with their data sets, which can realistically be run on large systems. Thus Alya complies with the highest standards in HPC.

Figure 6: Alya – Trace with highlight of the main computational phases

In this work, we simulate an incompressible turbulent flow and the transport of Lagrangian particles with Alya. In particular, we simulate the airflow through the human respiratory system and the transport of particles during rapid inhalation calmet2016large.

In Figure 6, we can see a trace of one time step of the respiratory simulation with Alya. Time is represented on the x-axis, and the different MPI processes are on the y-axis. The color indicates the phase being executed, and white corresponds to MPI communication. The matrix assembly, algebraic solver, and subgrid scale phases correspond to the computation of the fluid (the velocity of the air), and the particles phase corresponds to the calculation of the transport of particles.

In the trace, we highlight the most time-consuming phases. In Table 5, we quantify the percentage of the total time spent in each one.

Phase % Time I
Matrix assembly 40.84% 0.09
Solver1 16.13% 0.03
Solver2 4.20% 0.12
Subgrid scale 21.43% 0.07
Particles 3.37% 0.05
Table 5: Alya – Percentage of the total execution time and arithmetic intensity for different phases of the respiratory simulation executed with 96 MPI processes

The rightmost column of Table 5 shows the arithmetic intensity for each phase (see Section 4.3). We measured the computational load W as the number of double-precision floating-point operations reported by the CPU counters. We also determined the data set Q = 8 · (L + S), where L and S are the number of load and store instructions accessing double-precision data values (8 bytes), measured by the hardware counters of the CPU. We show in Table 5 the values of I, the arithmetic intensities of the different phases of Alya, ranging from roughly one floating-point operation per 32 bytes transferred to one floating-point operation per 8 bytes transferred across the two solvers.
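
As an illustration of how such per-phase counts could be gathered, the sketch below uses PAPI preset counters. It is an assumption-laden example (the presets PAPI_DP_OPS, PAPI_LD_INS, and PAPI_SR_INS may not be available or exact on every CPU) and is not the authors' measurement code.

```c
#include <stdio.h>
#include <papi.h>

/* Hedged sketch: derive the arithmetic intensity I = W / Q from
 * hardware counters, assuming the PAPI presets below are available. */
int main(void)
{
    int events[3] = {PAPI_DP_OPS, PAPI_LD_INS, PAPI_SR_INS};
    long long v[3];
    int set = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&set);
    PAPI_add_events(set, events, 3);

    PAPI_start(set);
    /* ... region of interest, e.g., one solver phase ... */
    PAPI_stop(set, v);

    double W = (double)v[0];                 /* double-precision operations  */
    double Q = 8.0 * (double)(v[1] + v[2]);  /* bytes moved by loads/stores  */
    printf("I = %.3f Flop/Byte\n", W / Q);
    return 0;
}
```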

5.1.1 Experimental setup

We performed a small scale comparison of Alya executing on one and two nodes of Dibona-TX2 and MareNostrum4. The idea is to provide a small-scale evaluation and compare at the same time hardware platforms and compiler solutions, as summarized in Table 6.

Platform Compiler MPI version
Dibona-TX2 GNU 7.2.1 OpenMPI 2.0.2.11
Dibona-TX2 Arm 18.4.2 OpenMPI 2.0.2.14
MareNostrum4 GNU 7.2.1 OpenMPI 2.0.2.10
MareNostrum4 Intel 2018.1 OpenMPI 2.0.2.10
Table 6: Alya – Software environment used on different clusters

Production simulations can run for a large number of time steps. The results presented in this evaluation have been obtained by averaging 10 time steps, like the one shown in Figure 6. The statistical variability of the measurements is below 1%, so we chose not to clutter the plots with error bars.

The mesh used in our experiments is a subject-specific geometry extending from the face to the seventh branch generation of the bronchopulmonary tree. The mesh is hybrid, with 17.7 million elements, including prisms, tetrahedra, and pyramids. We partition the mesh with METIS parMETIS to distribute the elements as homogeneously as possible across the different MPI processes. We inject 400,000 particles during the first time step of the simulation through the nasal orifice.

5.1.2 Node-to-node comparison

Figure 7: Alya – Performance running on one and two nodes of Dibona-TX2 and MareNostrum4

In Figure 7, we report the average elapsed time of a time step when running with different compilers on both node types of the Dibona-TX2 and MareNostrum4. Comparing both platforms with each compiler family, it can be seen that MareNostrum4 shows a 30% improvement over Dibona-TX2 regardless of the compiler. Furthermore, for each platform, vendor-specific compilers deliver better performance: a 34% average improvement on Armv8 ThunderX2, and 37% on x86 Skylake. In all cases, the efficiency when going from one to two nodes is between 90% and 96%.

5.2 Lbc

LBC is a Lattice Boltzmann code written in Fortran used for advanced fluid dynamics studies, using the BGK approximation for the collision term. The version used in this work is pure MPI with two-sided communication only. We have benchmarked the code on single nodes of Dibona-TX2 and MareNostrum4. Multi-node scalability is presented in Section 7.

5.2.1 Experimental setup

The following details regarding the experimental setup are common to all LBC experiments and performance data reported below. On both architectures, we used the GNU suite to compile the code and OpenMPI for communication across nodes. In addition, we used the compilers provided by the processor vendors, i.e., Arm and Intel, respectively, to evaluate the impact of the compiler on the performance of the code. Details of the software stack can be found in Table 7.

Platform Compiler MPI version
Dibona-TX2 GNU 7.2.1 OpenMPI 2.0.2.14
Dibona-TX2 Arm 19.1 OpenMPI 2.0.2.14
MareNostrum4 GNU 7.2.0 OpenMPI 3.1.1
MareNostrum4 Intel 2018.1 OpenMPI 3.1.1
Table 7: LBC – Software environment used on clusters

Each data point in this section's figures corresponds to the average over at least 30 time measurements. To get representative results, at most 10 runs were done within the same job, and jobs were distributed over two or more days. We also calculated the standard deviation of each sample as an error estimate. However, we do not plot error bars, as they are usually small and would only crowd the figures. In cases where errors are substantial, we mention this fact in the text.

In the Lattice Boltzmann community, the underlying grid cells are often referred to as lattice elements. The usual metric for performance is the number of lattice updates per time interval, measured in units of MLUP/s (mega lattice updates per second, i.e., 10^6 lattice updates per second). This metric is reported by the application at the end of the run. Note that LBC disregards the initialization phase and other overheads when reporting performance.
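
For reference, assuming a domain of $n_x \times n_y \times n_z$ lattice elements advanced for $n_t$ time steps in a wall-clock time $t$ (in seconds), this metric can be written as

$$\mathrm{MLUP/s} = \frac{n_x \, n_y \, n_z \cdot n_t}{10^{6} \, t}.$$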

For comparison purposes, the problem size of all experiments is chosen such that the number of lattice elements per core is the same on both platforms. The problem size assigned to each node is therefore this per-core size multiplied by the number of cores per node. The 3-dimensional domain decomposition defines how the MPI ranks are laid out within each node. The memory requirement is roughly proportional to the total number of lattice elements (including one ghost cell at each boundary); each lattice element stores 41 double-precision floating-point numbers. Table 8 shows the domain configuration for each cluster.
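
As a rough estimate (an assumption-based sketch, not the authors' exact accounting), if each of the $N_{\mathrm{ranks}}$ MPI ranks on a node holds a subdomain of $s_x \times s_y \times s_z$ lattice elements padded by one ghost cell on every side, the memory footprint per node is approximately

$$M_{\mathrm{node}} \approx N_{\mathrm{ranks}} \cdot (s_x + 2)(s_y + 2)(s_z + 2) \cdot 41 \cdot 8 \ \mathrm{Bytes}.$$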

Dibona-TX2 MareNostrum4
Cores per node, 64 48
Domain size per node,
Domain decomposition,
Memory requirements
Compiler GNU Arm GNU Intel
Performance
Table 8: Node-to-node comparison for LBC on Dibona-TX2 and MareNostrum4 nodes

The code was run for 10 time steps without any intermediate output. This proved sufficiently large for accurate time measurements.

5.2.2 Node-to-node comparison

To compare the computing capabilities of the ThunderX2 and Skylake processors, we executed the pure-MPI version of LBC on single nodes of Dibona-TX2 and MareNostrum4.

Figure 8: LBC – Strong scaling on Dibona-TX2 and MareNostrum4, with execution time as a function of number of cores
Figure 9: LBC – Strong scaling on Dibona-TX2 and MareNostrum4, with parallel efficiency as a function of number of cores

We assume that the MPI communication within a single node will have little impact on the performance of the code. Single-core runs were disregarded, as one core is not sufficient to saturate the available memory bandwidth on either system, and such performance measurements would overestimate and obfuscate the real application performance in production runs.

The performance of single-node runs is reported in Table 8 in terms of the application-specific metric. In addition, we performed a strong scaling experiment on a single node of Dibona-TX2 and MareNostrum4. MPI ranks have been placed to spread evenly across NUMA domains to best utilize the available memory bandwidth. The execution time is illustrated in Figure 8, while the efficiency is shown in Figure 9.

With the GNU compiler, the performance on the Dibona-TX2 node is 4% lower than on the MareNostrum4 node, even though it has roughly 33% more cores per compute node (64 vs. 48). This indicates that, proportionally, Skylake cores outperform ThunderX2 cores.

We also note that the performance of the code heavily depends on the maturity of the compiler technology used. On Dibona-TX2, the performance with the Arm compiler is roughly 25% lower than the performance obtained with the GNU compiler (see Table 8). We can only explain this performance degradation with the maturity of the Arm compiler technology, since we have seen that vendor-specific compilers can improve the performance of other applications (see, e.g., Alya in Section 5.1.2). On MareNostrum4, the performance with the GNU compiler and with the Intel compiler is practically identical.

All versions of the code show an initial super-scalar behavior up to roughly 8 cores. This is because the memory size of the working set per core decreases with an increasing core count so cache reuse increases. However, this effect is countered, as the effective memory bandwidth per core saturates at larger core counts.

While both kinds of nodes have a similar core count, the ThunderX2 architecture on Dibona-TX2 retains higher strong scaling efficiency. Again, we see slight differences amongst compilers, in particular, the Intel compiler produces code that scales better at larger core counts of MareNostrum4 nodes than the GNU compiler. For the Arm compiler, we observe high efficiencies even up to the full core count. This, however, is an artifact of the low performance of this particular version: due to the low compute speed, there is never sufficient pressure on the memory systems to feel the limited effective bandwidth.

Overall, however, our results suggest that the ThunderX2 memory system of the Dibona-TX2 cluster is relatively better suited for memory bound workloads such as LBC.

5.3 Tangaroa

Tangaroa is a C++ application that simulates fluid dynamics in a way suitable for computer animation Tangaroa; reinhardt2017asyncSPH. Its results are not meant to be physically correct but only visually plausible. The simulation progresses by analyzing the behavior of a large number of particles in discrete time steps. The 3D space that contains the particles is partitioned to allow parallel execution of the simulation. Particle positions, velocities, and other properties are calculated with single-precision floating-point arithmetic (4 Bytes).

5.3.1 Experimental setup

The data extracted from the Tangaroa executions considers the actual simulation of the particles as the region of interest; job setup and data allocation times are not taken into account. Tangaroa tries to hide communication as much as possible; therefore, computation is very intense in this region. Single-node experiments were made on dedicated nodes, meaning that intra-node MPI communication was not perturbed by traffic from other applications. Moreover, since execution times range from tens of minutes to hours, the perturbation from the OS is considered negligible. Each data point shown is the average of five independent executions. In all experiments, the I/O operations were disabled to avoid interference from different network file system technologies and thus improve the accuracy of the measurements.

The dataset represents a fairly even distribution of around 12 million particles in a box-shaped region. The internal representation of these particles is at least 1.5 GB; there are other data structures whose size is not directly proportional to the number of particles. The size of the dataset is enough to justify using several hundred processes, although a larger one would allow better scalability.

Platform Compiler MPI version
Dibona-TX2 GNU 7.2.1 OpenMPI 3.1.2
Dibona-TX2 Arm 19.1 OpenMPI 3.1.2
MareNostrum4 GNU 7.2.0 OpenMPI 3.1.1
MareNostrum4 Intel 2017.4 IntelMPI 2017.3
Table 9: Software environment used on clusters for Tangaroa

Table 9 reports the software configuration used for the evaluation of Tangaroa.

5.3.2 Node-to-node comparison

The first experiment compares the performance of full nodes, meaning that the same problem is divided into all the cores in each node. The domain containing the particles must be adequately divided between the cores. To accommodate the number of cores in one node of MareNostrum4, we had to choose an irregular domain decomposition, which nevertheless led to a reasonably even distribution of particles per core. Experiments with regular domain decompositions did not improve the load balance.

Dibona-TX2 MareNostrum4
Cores/node 64 48
Domain decomposition
Particles/core
Compiler GNU Arm GNU Intel
Simulation time [s] 101.16 93.77 55.84 61.20
IPC 1.64 1.66 2.62 2.47
Table 10: Node-to-node comparison for Tangaroa on Dibona-TX2 and MareNostrum4 nodes

Table 10 summarizes the results of this experiment. The first two columns of the table show that, with the Arm compiler (Clang), the application is 7% faster than with GCC. Since we observed that the number of SIMD instructions executed by the Clang version is higher than in the GCC one, while the Instructions per Clock-cycle (IPC) is practically the same, we can conclude that the higher performance of the Arm compiler is due to a better use of the vector units. With the Skylake node of MareNostrum4, however, it is GCC that gives the best performance; through an improvement of the IPC and a higher number of SIMD instructions, it achieves an 8% reduction of the execution time compared to ICC. Comparing both platforms, it can be seen that, although the Skylake node has 25% fewer cores than the ThunderX2 node, it delivers a 31% better performance. This indicates that the computing power of each core in the ThunderX2 node is substantially lower, and that the increased core count is not enough to overcome this limitation.

Figure 10: Tangaroa – Strong scaling on Dibona-TX2 and MareNostrum4, with execution time as a function of number of cores

The second experiment is a strong scaling test contained within a single node. Process pinning was set to interleaved to maximize memory bandwidth utilization. The results are shown in Figures 10 and 11, where it can be seen that both machines diverge from the ideal scaling. This is because each process of Tangaroa must communicate the particles close to the subdomain border to the neighboring processes; thus, as the number of processes increases, so does the amount of data that must be transmitted. This has a more noticeable effect on the scalability of the MareNostrum4 node, since it has two fewer memory channels than the ThunderX2 node.

Figure 11: Tangaroa – Parallel efficiency on Dibona-TX2 and MareNostrum4: the baseline for the efficiency is the time of a single core execution

Comparing the parallel efficiency of Tangaroa in Figure 11 with that of LBC in Figure 9, we can also see that the higher pressure that LBC puts on main memory results in a lower parallel efficiency when using all cores of the compute node.

5.4 Graph 500

Graph 500 is a benchmark used to rank supercomputers on a large-scale graph search problem murphy2010introducing. It is written in C, and we evaluated the MPI-parallel version of its breadth-first search (BFS) kernel. Each graph of N vertices is associated with a scale s, where N = 2^s. The scale s is the only input parameter that needs to be provided to the benchmark. The Graph 500 benchmark generates an internal representation of a graph with the number of vertices supplied as input. With this graph, it performs 64 breadth-first searches (BFSs) from randomly generated keys. The numbers presented are the average duration of the 64 BFSs performed and their error, as reported by the benchmark. While the rules of the benchmark allow one to optimize this internal representation as well as the BFS implementation, we left them "as is" in the reference version of the Graph 500 benchmark (v3.0.0).
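
For readers unfamiliar with the kernel being timed, the sketch below shows a minimal serial level-synchronous BFS over a CSR graph. It only illustrates the irregular access pattern that characterizes the benchmark and is not the Graph 500 MPI reference implementation.

```c
#include <stdlib.h>

/* Minimal serial level-synchronous BFS over a CSR graph (illustration only). */
typedef struct {
    long n;          /* number of vertices              */
    long *row;       /* row offsets, length n + 1       */
    long *col;       /* column indices (edge endpoints) */
} csr_graph;

void bfs(const csr_graph *g, long root, long *parent)
{
    long *frontier = malloc(g->n * sizeof(long));
    long *next = malloc(g->n * sizeof(long));
    for (long v = 0; v < g->n; ++v) parent[v] = -1;
    parent[root] = root;
    frontier[0] = root;
    long fsize = 1;
    while (fsize > 0) {                    /* one iteration per BFS level */
        long nsize = 0;
        for (long i = 0; i < fsize; ++i) {
            long u = frontier[i];
            for (long e = g->row[u]; e < g->row[u + 1]; ++e) {
                long v = g->col[e];
                if (parent[v] == -1) {     /* first visit: record parent  */
                    parent[v] = u;
                    next[nsize++] = v;
                }
            }
        }
        long *tmp = frontier; frontier = next; next = tmp;
        fsize = nsize;
    }
    free(frontier);
    free(next);
}
```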

5.4.1 Experimental setup

The Graph 500 benchmark used is optimized to run with a number of MPI processes that is a power of 2. Nevertheless, it offers the possibility of compiling it without this optimization, allowing it to run with a number of processes that is not a power of 2. In our case, the number of cores per node of MareNostrum4 is not a power of 2; for this reason, we decided to evaluate both options.

In all plots of this section, we report the average time spent performing 64 BFSs, ignoring the setup and validation phases of the benchmark. We selected as input the scale s = 24, corresponding to N = 2^24 vertices, which is the largest scale that fits in the memory of a single compute node of both clusters and allows a reasonable duration of the simulation.

As for the rest of the applications, we studied the effects of different compilers. We summarize the software configurations in Table 11.

Platform Compiler MPI version
Dibona-TX2 GNU 8.2.0 OpenMPI 3.1.2
Dibona-TX2 Arm 19.1 OpenMPI 3.1.2
MareNostrum4 GNU 8.1.0 Intel MPI 2018.4
MareNostrum4 Intel 2019.4 Intel MPI 2018.4
Table 11: Graph 500 – Software environment on different clusters

5.4.2 Node-to-node comparison

To compare the computing and communication capabilities of the ThunderX2 and Skylake processors under an irregular workload, we executed the MPI-only version of the Graph 500 benchmark on a single node of Dibona-TX2 and MareNostrum4.

Figure 12: Graph 500– Strong scaling on Dibona-TX2 and MareNostrum4 with power of 2 ranks, with execution time as a function of number of cores

Figure 12 shows the average time of a BFS varying the number of MPI processes, when the benchmark is compiled with the optimization for a power-of-2 number of MPI processes. Notice that, in this scenario, we are not able to use all the cores of one node of MareNostrum4. We can see that the performance obtained by the different compilers within the same architecture is very similar in all cases. Notice also that, in a core-to-core comparison, Dibona-TX2 is slower than MareNostrum4, both when using the GNU compiler suite and when comparing vendor-specific compilers. Nevertheless, when using the full node, Dibona-TX2 is faster than MareNostrum4, both with the GNU compiler suite and with the vendor-specific compilers.

Figure 13: Graph 500– Strong scaling on Dibona-TX2 and MareNostrum4 with non-power of 2 ranks, with execution time as a function of number of cores

In Figure 13, we depict the same metric when the number of MPI processes is not a power of two. The performance difference between compilers is below 5.5% in all cases on Dibona-TX2 and below 3.5% on MareNostrum4. In a core-to-core comparison, Dibona-TX2 is slower than MareNostrum4, both when compiling with the GNU compiler suite and when using the vendor-specific compilers. When using the full node, Dibona-TX2 is also slower with both compiler suites.

As expected, using a binary optimized for a power-of-2 number of MPI processes brings a non-negligible performance gain. We measured a benefit between 25% and 40% on Dibona-TX2 with both compilers, and a measurable benefit on MareNostrum4 as well: comparing full-node runs on MareNostrum4, the power-of-2 version with 32 cores is faster than the non-power-of-2 version with 48 cores, with both the GNU and the Intel compilers.

Figure 14: Graph 500– Strong scaling on Dibona-TX2 and MareNostrum4 with power of 2 ranks, with parallel efficiency as a function of number of cores

In Figure 14, we can see the parallel efficiency obtained in MareNostrum4 and Dibona-TX2 when using different compilers and running with a power-of-2 number of ranks. The parallel efficiency in both architectures follows a similar trend: it drops when using 32 cores, probably due to memory bandwidth limitations. This also explains why the drop in MareNostrum4 is larger than in Dibona-TX2, since Dibona-TX2 offers a higher memory bandwidth.

Figure 15: Graph 500– Strong scaling on Dibona-TX2 and MareNostrum4 with non-power of 2 ranks, with parallel efficiency as a function of number of cores

In Figure 15, we plot the same data for a non-power-of-2 number of MPI ranks. In this case, we observe that the use of the non-optimized code affects Dibona-TX2 more than MareNostrum4. Nevertheless, we observe the same drop in parallel efficiency when using more than 32 cores, again with a higher impact on MareNostrum4.

6 Energy Considerations

In this section, we report energy measurements intending to compare the energy efficiency of state-of-the-art HPC architectures based on Arm and x86 CPUs. To this aim, we measured the energy consumption of our workloads on Dibona’s ThunderX2 (Dibona-TX2) and Skylake (Dibona-X86) nodes. We do not perform an energy/power comparison between Dibona and MareNostrum4 because the power monitoring infrastructure of the two systems is significantly different, and the measurements would be neither homogeneous nor comparable. In Dibona, the power monitoring infrastructure allows us to homogeneously measure the power consumption of the whole motherboard of both types of compute nodes (Dibona-TX2 and Dibona-X86). However, in MareNostrum4, the power monitoring is based on the on-chip RAPL counters, which do not take into account the static power drain of the boards and the on-board components.

Our method has the following unique strengths: i) we employ complex production codes instead of benchmarks; ii) since the system integrator of the Dibona system (ATOS/Bull) is the same for the nodes powered by Arm and x86, these share a common power monitoring infrastructure, and the location of the energy consumption sensors within the system is the same, so we can ensure fair measurements; iii) as we employ a production system, we can collect energy figures as users, testing the actual accessibility of the energy information and providing data gathered on a final production system rather than on a specialized test bench.

The system also offered the possibility of measuring the Energy To Solution (E2S) of complete runs, as well as of portions of code, with no impact on the performance of the monitored application. We therefore consider the energy-to-solution as the relevant metric for our study in this section. We also present the Energy Delay Product (EDP) (i.e., the product of the time to perform a run and its energy consumption) as a metric combining performance and power drain over time. Despite its convenience, EDP, like E2S, does not allow a direct comparison across different applications, since it depends on the execution time. The reader interested in a metric related to energy efficiency (independent of the execution time) should take a closer look at LBC, which offers the MLUP/s and MLUP/J metrics defined in Section 5.2. A summary of the energy measurements for all applications is given in Table 12. Note that these energy measurements relate to the whole node, including the CPU, the memory, the network interfaces, the I/O devices, and most of the motherboard. It is, therefore, not possible to make strong statements about the relative energy efficiency of the underlying processor architecture alone. Unless otherwise noted, we use the same experimental setup as described in the previous section.

Dibona-TX2 Dibona-X86
Compiler GNU Arm GNU Intel
Alya
Simulation time [s] 347.40 223.81 236.13 151.76
E2S [kJ] 90.17 63.44 101.12 69.24
EDP [kJs] 31325.1 14198.5 23877.5 10507.9
LBC
Simulation time [s] 251.64 333.99 205.87 208.53
Performance [MLUP/s] 266.7 200.9 285.2 281.6
E2S [kJ] 82.20 107.09 95.61 97.89
Energy eff. [MLUP/J] 0.82 0.63 0.61 0.60
EDP [kJs] 20681.5 35767.4 19683.5 20413.2
Tangaroa
Simulation time [s] 101.16 93.77 55.84 61.20
E2S [kJ] 27.38 25.17 24.78 27.69
EDP [kJs] 2769.76 2360.19 1383.72 1694.62
Graph 500 (pow of 2)
Simulation time [s] 39.84 37.34 53.80 48.53
E2S [kJ] 12.00 11.29 17.69 16.00
EDP [kJs] 477.98 421.37 951.48 776.30
Graph 500 (generic)
Simulation time [s] 54.79 52.61 49.33 46.46
E2S [kJ] 15.66 14.95 21.20 19.95
EDP [kJs] 857.80 786.60 1045.85 926.76
Table 12: Summary of energy measurements in one node
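
As a consistency check of how the EDP values in Table 12 relate to the other rows, EDP is simply the product of the simulation time and the E2S of the same run; for Alya on Dibona-TX2 compiled with GNU, for example,

$$\mathrm{EDP} = t \cdot \mathrm{E2S} = 347.40\ \mathrm{s} \times 90.17\ \mathrm{kJ} \approx 31\,325\ \mathrm{kJ\,s},$$

which matches the corresponding table entry.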

6.1 Alya

For Alya, we report in Table 13 the E2S in kJ of 10 time steps of the respiratory simulation introduced in Section 5.1. Each column indicates the energy consumption when running a binary generated either with the GNU compiler suite or with the vendor-specific alternatives, the Arm HPC Compiler for Dibona-TX2 and the Intel suite for the x86-based Dibona-X86 nodes. The color code indicates in green the lowest E2S and in red the highest E2S.

Table 13: Alya: Energy to solution (in kJ) comparison among different compilers/architectures

As highlighted when analyzing the performance in Section 5.1.2, the vendor-specific compilers allow one to solve the same problem with less energy. It is interesting to see that the compute nodes of Dibona-TX2 obtain the lowest E2S, although they are not faster than the Dibona-X86 ones. This observation is valid for both generic and vendor-specific compilers: ThunderX2-based nodes are more energy-efficient than Skylake-based ones when comparing the same kind of compiler. In Table 14, the reader can compare the elapsed time in seconds for the same Alya case, with the same color code used in Table 13 applied to the execution time (green = faster; red = slower). Analyzing the performance at scale, it is interesting to note that running the same scientific simulation with the Arm HPC Compiler is 10% faster than using GFortran, while its overall energy consumption is 30% lower than with the GNU compiler suite.

Table 14: Alya: Execution time (in s) comparison among different compilers/architectures

As described in Section 5.1, Alya is a complex scientific code composed of several phases with different computational characteristics (see Table 5 for details). Focusing on one node, Table 12 shows that, even if the Skylake processors are faster than the ThunderX2, the energy consumption is slightly smaller on the Dibona-TX2 nodes. Because of this, the Dibona-TX2 nodes end up being more energy-efficient. It is very likely that, from the energy point of view, the simulation we studied takes more advantage of the 30% higher memory bandwidth offered by the architectural choice implemented by Marvell in the ThunderX2 than of the wider SIMD unit provided by the x86 Skylake CPU.

Moreover, it is worth mentioning that this advantage of Skylake is lost if we compare the execution with the GNU compiler on Dibona-X86 nodes with the Dibona-TX2 nodes using the Arm compiler. The time-to-solution is very similar, but the energy consumption and the efficiency are better in the Dibona-TX2 nodes.

6.2 Lbc

LBC has a relatively long initialization phase, which is disregarded in the application’s performance measurement. For simplicity, we chose to read out the energy consumption counters only at the beginning and the end of the application. To decrease the impact of the initialization phase on the energy reading, we increased the number of time steps from 10 to 500 for these measurements. We expected that the initialization phase accounts for less than a few percent of the energy consumption, which has been confirmed by varying the number of time steps. Note that the error reported below is the statistical standard deviation of the measurement sample and does not include this initialization bias.

In Table 12, we have chosen to report energy efficiency in terms of lattice updates (i.e., work done) per consumed energy unit. For the GNU compiler, the energy efficiency is 0.82 MLUP/J on the Dibona-TX2 node, while for Dibona-X86 it is 0.61 MLUP/J. The results show that, for this particular code, the energy efficiency (in terms of the domain-specific metric MLUP/J) of Dibona-TX2 is roughly 35% higher than that of Dibona-X86.

In addition to the GNU compiler, we also used the compilers of the respective processor vendors, i.e., Arm and Intel. On ThunderX2, the Arm compiler produces code that is noticeably less performant and less energy-efficient than that of the GNU compiler. On Skylake, on the other hand, the Intel compiler yields practically the same performance and energy efficiency as the GNU compiler.

6.3 Tangaroa

As with the execution times, the energy measurements of Tangaroa are constrained to the simulation phase. They are summarized in Table 12, where it can be seen that, on each platform, the energy consumption is almost proportional to the execution time. This is a consequence of the effort Tangaroa makes to hide communication delays and achieve a sustained IPC; therefore, the compilers that deliver the best time also deliver the lowest energy consumption. Between the two platforms, the table shows that, although the single-node simulation time is 67% longer on Dibona-TX2, the energy consumed in the region of interest is only 1.5% higher. This is reflected in the efficiency values, making the Dibona-X86 nodes more efficient than the Dibona-TX2 ones.

6.4 Graph 500

For the Graph 500 benchmark, we report the energy figures (E2S and EDP) in Table 12 when performing 64 BFS operations with scale 24, for both configurations: running with a power-of-2 number of MPI processes and running with a generic number of MPI processes. As highlighted for Alya, also in the Graph 500 benchmark the vendor-specific compilers deliver higher performance and consume less energy for a given problem.

The reader should also note the higher performance and higher energy efficiency of Dibona-TX2 when running with a power-of-2 number of MPI processes. Since the Dibona-TX2 nodes house a number of cores that is a power of 2, we can use all the cores of the node, while on Dibona-X86 we run 32 processes per node, leaving 24 cores idle. Still, even taking this into account, the Dibona-TX2 nodes are also 30% more energy-efficient when running with a non-power-of-2 number of MPI processes.
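As a reminder of how the two energy metrics relate, the sketch below integrates a node power trace into an energy-to-solution value and derives the energy-delay product from it. The sampling interval, power level, and duration are made-up values used only to illustrate the definitions.

    def energy_metrics(power_samples_w, dt_s):
        # E2S: integral of power over time, approximated as sum(P) * dt (joules).
        e2s = sum(power_samples_w) * dt_s
        elapsed = len(power_samples_w) * dt_s
        # EDP: energy weighted by the execution time, penalizing slow-but-frugal runs.
        return e2s, e2s * elapsed

    # Hypothetical 1 Hz power trace: a node drawing a constant 350 W for 180 s.
    samples = [350.0] * 180
    e2s, edp = energy_metrics(samples, dt_s=1.0)
    print(f"E2S = {e2s/1e3:.1f} kJ, EDP = {edp/1e6:.2f} MJ*s")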

7 Scalability

The evaluation of the Dibona-TX2 platform has been performed so far on a small scale. However, Arm-based compute nodes are being considered as building blocks of large systems to progress towards Exascale computing. Therefore, we analyze in this section our set of applications in a multi-node context, as we think that a study at the scale of a thousand cores reveals valuable insights for extrapolating performance for larger systems.

In this section, the speedup sample points coincide with full nodes instead of cores. The efficiency is then computed as $\eta(n) = t_1 / (n \cdot t_n)$, where $t_1$ is the execution time when running with one node and $t_n$ is the execution time when running with $n$ nodes.
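As a minimal illustration (the helper below is ours, not part of any of the applications), the node-level speedup and efficiency follow directly from the measured execution times:

    def parallel_efficiency(t1, tn, n):
        # Speedup with respect to one node, divided by the number of nodes.
        return (t1 / tn) / n

    # Made-up timings: 100 s on one node, 8 s on 16 nodes -> about 78% efficiency.
    print(f"{parallel_efficiency(t1=100.0, tn=8.0, n=16):.1%}")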

The reader should note that we perform our scalability tests scaling up the number of compute nodes disregarding the number of cores housed in each node. The reason for that is twofold: on the one hand, we have already presented in Sections 4 and 5 the performance of a single core/node of the Dibona-TX2 cluster; on the other hand, we aim at providing insightful information to domain scientists and HPC facility managers interested in the Dibona-TX2 technology, being well aware that the basic unit when acquiring/deploying a cluster is a single compute node.

Since Dibona-X86 has only three Skylake nodes, we have resorted to MareNostrum4 to execute the applications with many nodes. However, the software stack used in the Skylake experiments is still Intel 2018. Unless otherwise noted, we use the same experimental setup, as described in the previous sections.

7.1 Alya

Figure 16: Alya – Strong scalability of the execution time on Dibona-TX2 and MareNostrum4 using different compilers

Figure 16 shows the execution time per time step in seconds for both node types, again with the generic and vendor software stacks mentioned above. We can observe that, in all cases, the vendor-specific compiler outperforms the generic one. The best-performing configuration is the Dibona-TX2 cluster with the Arm HPC compiler, closely followed, within less than 10%, by the Intel version on MareNostrum4. We can also see that the two versions using the generic GNU compiler perform worse. Up to four nodes, their performance is very similar, but beyond four nodes the performance of the GNU version on MareNostrum4 drops. We do not have a clear explanation for this, but we suspect that the different cache micro-architectures of the ThunderX2 and the Skylake CPU, combined with the workload distribution performed by Alya, can produce this kind of behavior with the GNU toolchain at a mid-to-large number of MPI processes.

Figure 17: Alya – Parallel efficiency of Alya on Dibona-TX2 and MareNostrum4 using different compilers

In Figure 17, we show the parallel efficiency obtained by each run. The GNU compiler on MareNostrum4 appears to be the least efficient configuration of our test cases, reaching an efficiency of 24% when running on 32 nodes of MareNostrum4. The performance drop when passing from four to eight nodes seems to be the root cause of this poor scalability. On Dibona-TX2, the run with the Arm compiler is the one obtaining the worst parallel efficiency, although Figure 16 shows that it is also the fastest: this is a common effect when each efficiency is computed against its own single-node baseline, since a faster baseline penalizes the relative efficiency.

Moreover, we can see that the parallel efficiency of the Arm compiler version on Dibona-TX2 is slightly worse than the one obtained by the Intel compiler on MareNostrum4. This can be an effect of the configuration of the software layers that leverage the underlying physical network: we have seen in Section 4.4 that the Dibona-TX2 network still suffers from some configuration glitches.

Comparing the runs with the GNU compiler on both platforms, we observe again that the Dibona-TX2 cluster obtains better performance, which also results in better parallel efficiency.

7.2 Lbc

The code LBC is intended to be used for large problems, and at the time of data acquisition only 16 ThunderX2 nodes were regularly available on Dibona-TX2. To study scalability across nodes, we therefore decided to perform a weak-scaling experiment; strong scaling would be expected to become necessary only at larger scales. We compare results obtained on Dibona-TX2 and MareNostrum4.

Figure 18: LBC – Weak scaling on Dibona-TX2 and MareNostrum4

For the weak scaling experiment, we have kept the problem size per node equal to the single-node runs presented above (see Table 8). We performed at least 30 executions of LBC on each machine. To get a representative result, at most 10 runs were done within the same job, and jobs were distributed over two days, in most cases. Again, we use the application-specific metric as a proxy for performance. Unless stated otherwise, the statistical variation of measurements is at most on MareNostrum4. For Dibona-TX2, it is between and .

Figure 19: LBC – Efficiency with respect to the baseline of one node as a function of nodes on Dibona-TX2 and MareNostrum4

The results of a weak scaling experiment on Dibona-TX2 and MareNostrum4 are illustrated in Figure 18, where we plot the weak scaling time, and Figure 19, where we represent the efficiency with respect to a single node.

When increasing the number of nodes from 1 to 16, the scaling efficiencies on Dibona and MareNostrum4 drop steadily. However, the slope is slightly larger for Dibona than for MareNostrum4. At 16 nodes, Dibona ends up at around scaling efficiency, while MareNostrum4 achieves . However, this difference of is only marginally significant when taking into account the measurement error.

The dip at 4 nodes for MareNostrum4 is due to a series of closely spaced runs that show substantially higher execution times. Closer inspection of the logs revealed an increased communication time, presumably due to a particularly high intermittent network load. Disregarding these runs moves the data point right on top of a gradually sloping line.

7.3 Tangaroa

A set of strong-scaling experiments with Tangaroa was carried out using GCC as the compiler on both platforms, Dibona-TX2 and MareNostrum4. The execution times of these experiments are shown in Figure 20, while the parallel efficiency is depicted in Figure 21. The single-node base times used for the efficiency plot appear in Table 10.

Figure 20: Tangaroa – Strong scalability of the execution time on Dibona-TX2 and MareNostrum4
Figure 21: Tangaroa – Parallel efficiency on Dibona-TX2 and MareNostrum4

It is apparent that the scalability is far better on Dibona-TX2. Note that the efficiency of the 16-node execution on MareNostrum4 is close to 50%, despite its nodes and network being superior to those of Dibona-TX2. The reason behind this effect is twofold. First, the size of the problem is not large enough for such a large number of processes, leading to computation bursts in the range of 100 ms. Second, since MareNostrum4 is a production machine, network traffic is significantly higher than on Dibona-TX2. As a consequence, the transmission delays are of the same order of magnitude as the computation bursts, and Tangaroa is less capable of hiding them.

7.4 Graph 500

For the scalability study with Graph 500, we use a size , corresponding to a number of vertices . All the numbers reported are the duration of one BFS operation, averaged over 64 measurements, as reported by the application.

Figure 22: Graph 500– Strong scalability of the execution time on Dibona-TX2 and MareNostrum4, power of 2

In Figure 22, we can see the average elapsed time to perform one BFS when running with a number of MPI ranks that is a power of 2. We can observe that up to 8 nodes Dibona-TX2 is faster than MareNostrum4. In both cases, the vendor-specific compiler delivers the best performance, but the gain with respect to GNU is below 10% in all cases. Although Dibona outperforms MareNostrum4 overall, the distance between them decreases as the number of nodes increases, with both becoming equal at 8 nodes.

Figure 23: Graph 500– Parallel efficiency on Dibona-TX2 and MareNostrum4, power of 2

Figure 23 shows the parallel efficiency when running with a binary optimized for a power-of-2 number of MPI ranks. We can see that all the parallel efficiencies drop when going from 1 to 2 nodes. This confirms that Graph 500 is network-intensive, since the parallel efficiency drops as soon as communications leave the node.

Figure 24: Graph 500– Strong scalability of the execution time on Dibona-TX2 and MareNostrum4, not power of 2

In Figure 24, we show the execution time of one BFS when using a binary compiled to run with a non-power-of-2 number of MPI processes. In this case, MareNostrum4 outperforms Dibona. This difference in performance can be explained because, as we have seen in the previous sections, the compilers on MareNostrum4 can extract more performance from the non-optimized version of the code. It is important to notice that, in this case, both clusters can use all the cores of each compute node.

Figure 25: Graph 500– Parallel efficiency on Dibona-TX2 and MareNostrum4, not power of 2

Figure 25 depicts the parallel efficiency of the same executions. We can see that the big drop in performance when communicating outside a compute node is still present, meaning that, in this version too, the communications are an important factor.

8 Scalability Projections

8.1 Graph 500 Communication Study

To further analyze the behavior at scale of the Graph 500 benchmark, we measured $T_{MPI}$, the time spent in MPI calls, and $T_{BFS}$, the time spent performing the local BFS operations. The goal is to study how $T_{MPI}$ evolves when adding more processes and to extrapolate its impact at scale.

Due to the irregular nature of the benchmark, we refined our model by splitting $T_{MPI}$ into $T_{comm}$, the time during which actual communications happen, and $T_{LB}$, the time lost due to load imbalance across the different MPI processes. The total execution time can therefore be expressed as

$T_{TOT} = T_{BFS} + T_{comm} + T_{LB}$    (2)

The additive nature of Equation 2 allows us to study the ratio of each contribution to the total time:

$1 = \frac{T_{BFS}}{T_{TOT}} + \frac{T_{comm}}{T_{TOT}} + \frac{T_{LB}}{T_{TOT}}$    (3)
Figure 26: Graph 500– MPI time on Dibona-TX2 and MareNostrum4, power of 2, vendor compilers

In Figure 26, the points represent the percentage of time measured on Dibona-TX2 and MareNostrum4 when running Graph 500 with scale 27 in the power-of-2 MPI configuration. Since we are interested in scalability, we studied the percentage of time spent in communication (corresponding to $T_{comm}/T_{TOT}$) and the percentage of time lost due to load imbalance (corresponding to $T_{LB}/T_{TOT}$). The solid lines are fits of these two contributions to $T_{TOT}$: we interpolate the load-balance data with a linear function $f(p) = a \cdot p + b$, where $p$ is the number of MPI processes and $a$ and $b$ are the two parameters to fit, while the rate of time spent in communication is modeled as a constant function $g(p) = c$. In Table 15, we present the parameters resulting from the fits.
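A minimal sketch of this fitting procedure is given below. The process counts and time fractions are placeholders rather than the measured percentages of Figure 26, and the "critical point" is taken here, for illustration, as the process count at which the fitted load-imbalance fraction alone would reach 100%.

    import numpy as np

    # Hypothetical load-imbalance fractions T_LB / T_TOT at power-of-2 process counts.
    procs = np.array([64.0, 128.0, 256.0, 512.0, 1024.0])
    lb_frac = np.array([0.18, 0.22, 0.27, 0.33, 0.40])

    # Linear model f(p) = a * p + b for the load-imbalance fraction.
    a, b = np.polyfit(procs, lb_frac, deg=1)

    # Constant model g(p) = c for the communication fraction (mean of the samples).
    comm_frac = np.array([0.30, 0.29, 0.31, 0.30, 0.30])
    c = comm_frac.mean()

    # Projected process count at which load imbalance alone would consume all the time.
    p_critical = (1.0 - b) / a
    print(f"a = {a:.2e}, b = {b:.3f}, c = {c:.3f}, critical point ~ {p_critical:.0f} processes")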

Dibona-TX2 MareNostrum4
a = a=
b = b=
c = c=
Table 15: Summary of the $a$, $b$, and $c$ parameters for modeling $T_{MPI}$ in Graph 500

The first observation is related to the trends visible in Figure 26. The factor that limits the scalability is the load balance across MPI processes that, in our projections, should reach a critical point already with compute nodes on Dibona-TX2 and with nodes in MareNostrum4.

The second remark is related to the percentage of time spent in performing the actual communication. This contribution seems to be 30% higher on MareNostrum4 than on Dibona-TX2. This is caused by the different technologies used to interconnect the nodes on the clusters and the network congestion of MareNostrum4 (which is an operational production cluster).

8.2 Speedup Projections

In this section, we use the scalability data presented in Section 7 for each application to project the behavior at scale of the two clusters under study. The goal is to clarify the behavior at scale of Dibona under HPC workloads.

For projecting the scalability, we study the speedup with respect to the performance of a single node of each cluster. In the case of strong scalability, we use Amdahl's law, so the speedup as a function of the number of processes $n$ is computed as

$S(n) = \frac{1}{(1 - a) + \frac{a}{n} + b \cdot n}$    (4)

In the case of weak scalability, we use Gustafson's law, so the speedup as a function of the number of processes $n$ is computed as

$S(n) = (1 - a) + a \cdot n$    (5)

In both equations, $a$ is a parameter expressing the fraction of time that takes advantage of the parallelization. For the strong scaling, we also take into account a parameter $b$, which represents an approximation of the parallelization overhead.

For Alya, Tangaroa, and Graph 500, we fit the strong-scalability data presented in Section 7 with Equation 4, while for LBC we project the behavior at scale using Equation 5. This allows us to find the values of $a$ and $b$. In Table 16, we report the values of these parameters for each application when running on Dibona-TX2 and MareNostrum4 using the GNU compiler suite and the proprietary compilers.
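The sketch below shows how such a fit can be performed with off-the-shelf tools, using the functional forms of Equations 4 and 5; the node counts and speedup values are placeholders, not our measurements.

    import numpy as np
    from scipy.optimize import curve_fit

    def amdahl(n, a, b):
        # Strong-scaling model (Equation 4): serial fraction 1-a, parallel fraction a, overhead b*n.
        return 1.0 / ((1.0 - a) + a / n + b * n)

    def gustafson(n, a):
        # Weak-scaling model (Equation 5), used analogously for the LBC data.
        return (1.0 - a) + a * n

    # Placeholder strong-scaling measurements: node counts and observed speedups.
    nodes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    speedup = np.array([1.0, 1.9, 3.6, 6.5, 10.8])

    (a, b), _ = curve_fit(amdahl, nodes, speedup, p0=(0.95, 1e-3), bounds=([0.0, 0.0], [1.0, 1.0]))
    projected = amdahl(np.array([32.0, 64.0, 128.0]), a, b)
    print(f"a = {a:.3f}, b = {b:.2e}, projected speedup at 32/64/128 nodes: {np.round(projected, 1)}")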

Dibona-TX2 MareNostrum4
Compiler GNU Arm GNU Intel
Alya a =
b =
LBC a = N/A N/A
Tangaroa a = N/A N/A
b = N/A N/A
Graph 500 a =
b =
Table 16: Summary of the $a$ and $b$ parameters for the projection of scalability
Figure 27: Alya – Speedup projection of Dibona-TX2 and MareNostrum4 using different compilers
Figure 28: LBC – Speedup projection of Dibona-TX2 and MareNostrum4 using different compilers
Figure 29: Tangaroa – Speedup projection of Dibona-TX2 and MareNostrum4 using different compilers
Figure 30: Graph 500 (power of 2) – Speedup projection of Dibona-TX2 and MareNostrum4 using different compilers

In Figures 27, 28, 29, and 30, we present the projections at scale for the four applications. Points represent real measured data, while solid lines are projections obtained by fitting the measurements with Equation 4 for Alya, Tangaroa, and Graph 500, and with Equation 5 for LBC.

Looking at the strong scalability, we expect the slope of the scalability curve to become increasingly flat for all applications. This is an inherent consequence of splitting the workload across an increasing number of computational units. With our study, however, we demonstrate that Dibona-TX2 reaches a scalability similar to MareNostrum4, even when projecting our measurements to a larger scale. This means that, despite the micro-architectural differences, the Dibona-TX2 technology can be considered as suitable as that of MareNostrum4 for large HPC clusters.

9 Conclusions

In this paper, we analyzed the performance of the latest Arm-based CPU targeting the HPC market, the Marvell (former Cavium) ThunderX2. These processors were available to us in the Dibona cluster, developed within the European project Mont-Blanc 3. For comparison, we performed the same experiments on state-of-the-art x86 Skylake processors in the same machine, as well as on the Tier-0 supercomputer MareNostrum4.

9.1 CPU and system architecture

We begin the analysis focusing on pure micro-benchmarking. Perhaps the most salient feature of the ThunderX2 processor is the number of cores it integrates: with 32 cores per socket, a compute node of the Dibona-TX2 cluster has 33% more cores than a Skylake-based node in MareNostrum4, our baseline machine, although their clock frequency is 5% lower than that of the Skylake cores running at 2.1 GHz.

Attending to the memory subsystem, the memory installed on the Dibona-TX2 nodes is DDR4-2666, which is 17% slower than the DDR4-3200 coupled with the Skylake CPUs in MareNostrum4. However, our experiments show that the ThunderX2, with its eight memory channels per socket, gives a 25% higher memory bandwidth than the Skylake, which only offers six channels.
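A back-of-the-envelope check of these figures, assuming standard 64-bit (8-byte) DDR4 channels, is sketched below; it yields the theoretical per-socket peaks, which differ from the measured sustained bandwidth quoted above but show the direction and rough size of the gap.

    def peak_bw_gbs(channels, transfer_mts):
        # Theoretical per-socket peak: channels * transfers per second * 8 bytes per transfer.
        return channels * transfer_mts * 8 / 1000.0  # GB/s

    tx2 = peak_bw_gbs(channels=8, transfer_mts=2666)  # ThunderX2: 8 x DDR4-2666
    skl = peak_bw_gbs(channels=6, transfer_mts=3200)  # Skylake:   6 x DDR4-3200
    print(f"ThunderX2 {tx2:.1f} GB/s vs Skylake {skl:.1f} GB/s ({tx2 / skl - 1:+.0%})")

Under these assumptions, the theoretical peak of the ThunderX2 socket is roughly 11% higher; the larger measured advantage reported above also reflects how much of that peak each system sustains in practice.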

The raw floating-point performance has been studied with a synthetic benchmark in Section 4.2. Figure 2 shows that both processors have scalar floating-point units (FPUs) of similar characteristics, and the slight difference in scalar floating-point performance can be explained by the different clock frequencies. The performance of the Intel AVX512 SIMD unit is four times higher than that of the NEON SIMD unit found in the ThunderX2 CPU. Again, this is expected, as the AVX512 unit is four times wider than the 128-bit NEON SIMD unit.
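The factor of four can be recovered from a simple per-core peak calculation, assuming (as a working hypothesis, not a figure taken from this study) two FMA-capable SIMD pipelines per core on both CPUs and 64-bit double-precision lanes:

    def dp_flop_per_cycle(simd_bits, fma_pipes):
        # Vector lanes (64-bit DP elements) * 2 FLOP per fused multiply-add * pipelines per core.
        lanes = simd_bits // 64
        return lanes * 2 * fma_pipes

    neon = dp_flop_per_cycle(simd_bits=128, fma_pipes=2)    # ThunderX2 NEON  ->  8 FLOP/cycle
    avx512 = dp_flop_per_cycle(simd_bits=512, fma_pipes=2)  # Skylake AVX512  -> 32 FLOP/cycle
    print(f"NEON: {neon}, AVX512: {avx512}, ratio: {avx512 // neon}x")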

Summarizing, the ThunderX2 processors have more cores and memory channels than the Skylake, but a slower clock, slower memory, and substantially narrower SIMD units.

9.2 Software stack maturity

An important aspect when considering new architectures is the maturity of the software stack. This includes not only the compiler but also the libraries enabling parallel programming, such as MPI. Our experiments show that, in general, the software stack provided by Arm has a high maturity level, comparable to that of the Intel tools and libraries. On the other side, the Arm support offered by the open-source stack is on par with the Intel one, although both usually deliver lower performance than the vendor-provided stacks. This confirms the trend highlighted by F. Banchelli et al. in IsARMSoftware.

9.3 HPC applications performance and scalability

While in the first part of the paper we focus on pure micro-benchmarking, in the second part we analyze the performance of three complex scientific applications at scale: Alya, LBC, and Tangaroa. Alya is the most complex code of the set; it is a multi-physics solver based on finite-element methods. LBC is an implementation of the lattice Boltzmann method for solving fluid dynamics, with a stencil-like data access pattern. Tangaroa is a particle tracking code with single-precision arithmetic.

We also complement our study with the Graph 500 benchmark, a representative of emerging workloads that are increasingly relevant in modern data centers.

The four codes were executed within a single node of each machine. The performance of the Dibona-TX2 nodes proved to be lower than that of the Skylake-based compute nodes for LBC, Alya, and Tangaroa, which means that the extra cores and memory channels of the ThunderX2 CPU do not entirely overcome the limitations of its SIMD units. However, further experiments within a single node showed that the applications scale better on Dibona-TX2 than on MareNostrum4, which can be attributed to the two extra memory channels and to a different cache micro-architecture.

Concerning scalability across nodes, the strong-scaling experiment with Tangaroa on Dibona-TX2 gave scaling efficiencies substantially higher than on MareNostrum4. Note that, as expected from a strong-scaling experiment, the per-process problem size becomes progressively smaller and the relative importance of the MPI communication time increases, which explains the gradual decrease in scaling efficiency. Contrary to Dibona-TX2, however, MareNostrum4 is a production cluster with a large-diameter network and a high number of users active at any time. Network traffic generated by other users therefore has a much stronger effect on the MPI communication time on MareNostrum4 than on Dibona-TX2, which explains its worse scaling. For Alya, and even more so for LBC, the MPI communication is not as performance-critical as for Tangaroa, and the scaling efficiencies observed on the two platforms are within 10% of each other for Alya and almost identical for LBC.

Our experiments with a synthetic MPI bandwidth benchmark (see Section 4.4), however, show that at the time of our data acquisition the OpenMPI installation on Dibona-TX2 was misconfigured and reached a lower bandwidth for intermediate message sizes than expected from raw low-level network benchmarks. This indicates that the InfiniBand interconnect of Dibona-TX2 is at the level of state-of-the-art HPC systems, even if the tuning of the OpenMPI stack leveraging the network is not yet optimal.

Another important conclusion resulting from the study of the Graph 500 benchmark is that, for irregular workloads, the most important limiting factor at scale is the load balance across MPI processes rather than the communication overhead. This is clearly visible in Figure 26.

9.4 Energy considerations

Knowing the importance that energy efficiency has for modern HPC systems, we also studied the energy consumption of the three applications. Since the Dibona cluster offers a small number of Skylake nodes (called Dibona-X86) with energy-measuring capabilities equivalent to those of the Dibona-TX2 nodes, our measurements allow for a fair comparison of both platforms.

As expected, our energy measurements are strongly affected by the performance of the applications. Therefore, the maturity of the software stack has a substantial impact on the energy consumption, as vendor-specific tools deliver not only better performance but also a lower energy-to-solution. In this sense, we consider the example of Alya a clear success story for the energy efficiency and the scalability of Arm systems on a complex production code: we showed that the same simulation could be carried out on Dibona-TX2 saving energy at the cost of a somewhat longer execution time compared to the Dibona-X86 nodes. The tests with Graph 500 highlight another success story for the energy efficiency of Dibona-TX2, which also takes advantage of the fact that its node configuration offers a power-of-2 number of cores. In general, we have observed that the energy-to-solution for the different applications is approximately the same on both platforms. However, if we consider the energy-delay product, the better performance of Dibona-X86 also results in better efficiency when compared to Dibona-TX2.

9.5 Lessons learned

Considering the priorities of HPC-facility managers, we can conclude that ThunderX2 is a great leap forward for Arm architectures in this field, although its energy efficiency still lags behind that of the Intel Skylake architecture for pure HPC workloads.

Thinking about the needs of domain scientists and HPC developers, the ThunderX2 processor is a platform worth considering, as the maturity of the different software stacks is comparable to the state-of-the-art options available on Intel platforms. The programmability and stability of the Dibona-TX2 nodes have proven to be on par with those of other production systems.

For computer architects, this study also offers some interesting insights. For instance, it highlights the importance of memory channels, which can compensate for a slower clock and slower memory. It also points out the demand for vector operations from current scientific applications, which means that wider SIMD units would bring a significant advantage.

Globally, we think that the architectural design point explored with the ThunderX2 CPU is extremely relevant in the search for a path towards Exascale. In our view, it shows that, for complex applications such as Alya, the performance penalties introduced by a narrower SIMD unit can be compensated by a higher memory bandwidth; more generally, it allows programmers and integrators to explore an innovative architectural design point that is able to deliver decent performance within a competitive power envelope. We expect the situation to change soon, when the Scalable Vector Extension by Arm rico2017arm is implemented in real hardware; however, we reserve this discussion for future work.

Acknowledgments

The authors thank the support team of Dibona operating at ATOS/Bull. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), the Spanish Ministry of Science and Technology project (TIN2015-65316-P), the Generalitat de Catalunya (2017-SGR-1414), the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects (grant agreements n. 288777, 610402 and 671697), the European POP2 Center of Excellence (grant agreement n. 824080), and the Human Brain Project SGA2 (grant agreement n. 785907).

References
