An analysis of core- and chip-level architectural features in four generations of Intel server processors

02/24/2017
by   Johannes Hofmann, et al.
FAU

This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on performance with floating-point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks we study the influence of these factors on code performance. This insight can then serve as input for analytic performance models. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved considerably by tuning the Uncore clock speed without sacrificing performance, and that the Graph500 benchmark performance may profit from a suitable choice of cache snoop mode settings.


1 Introduction

Intel Xeon server CPUs dominate in the commodity HPC market. Although the microarchitecture of those processors is ubiquitous and can also be found in mobile and desktop devices, the average developer of numerical software hardly cares about architectural details and relies on the compiler to produce “decent” code with “good” performance. If we actually want to know what “good performance” means we have to build analytic models that describe the interaction between software and hardware. Despite the necessary simplifications, such models can give useful hints towards the relevant bottlenecks of code execution and thus point to viable optimization approaches. The Roofline model [6, 22] and the Execution-Cache-Memory (ECM) model [5, 18] are typical examples. Analytic modeling requires simplified machine and execution models, with details about properties of execution units, caches, memory, etc. Although much of this data is provided by manufacturers, many relevant features can only be understood via microbenchmarks, either because they are not documented or because the hardware cannot leverage its full potential in practice. One simple example is the maximum memory bandwidth of a chip, which can be calculated from the number, frequency, and width of the DRAM channels but which, in practice, may be significantly lower than this absolute limit. Hence, microbenchmarks such as STREAM [13] or likwid-bench [20] are used to measure the limits achievable in practice.
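For illustration, a minimal C sketch of such a streaming bandwidth measurement is shown below. Array size, repetition count, and timing are our illustrative choices; the measurements in this paper rely on likwid-bench with hand-written assembly kernels and explicit thread pinning.

```c
/* Minimal sketch of a STREAM-triad-like bandwidth microbenchmark.
 * The array size must be far larger than the last-level cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 512 MiB per array */
#define REPS 10

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double s = 3.0, best = 1e30;
    for (int r = 0; r < REPS; ++r) {
        double t0 = now();
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + s * c[i];        /* STREAM triad kernel */
        double t = now() - t0;
        if (t < best) best = t;
    }
    /* STREAM convention: count 3 arrays; the write-allocate transfer for a[]
     * adds a fourth stream unless nontemporal stores are used. */
    double bytes = 3.0 * N * sizeof(double);
    printf("best triad bandwidth: %.1f GB/s (a[0]=%f)\n",
           bytes / best / 1e9, a[0]);      /* print a[0] so the loop is not optimized away */
    return 0;
}
```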

Although there has been some convergence in processor microarchitectures for high performance computing, the latest CPU models show interesting differences in their performance-relevant features. Building good analytic performance models and, in general, making sense of performance data, requires intimate knowledge of such details. The main goal of this paper is to provide coverage and a critical discussion of those details on the latest four Intel architecture generations for server CPUs: Sandy Bridge (SNB), Ivy Bridge (IVB), Haswell (HSW), and Broadwell (BDW). The actual CPU models used for the analysis are described in Sect. 2.1 below.

1.1 Performance on modern multicore CPUs

Out of the many possible approaches to performance analysis and optimization (coined performance engineering [PE]) we favor concepts based on analytic performance models. For recent server multicore designs the ECM performance model allows for a very accurate description of single-core performance and scalability. In contrast to the Roofline model it drops the assumption of a single bottleneck for the steady-state execution of a loop. Instead, time contributions from in-core execution and data transfers through the memory hierarchy are calculated and then put together according to the properties of a particular processor architecture; for instance, in Intel x86 server CPUs all time contributions from data transfers including LOADs and STOREs in the L1 cache must be added to get a prediction of single-core data transfer time [18, 8]. On the other hand, the IBM Power8 processor shows almost perfect overlap [9]. A full introduction to the ECM model would exceed the scope of this paper, so we refer to the references given above. The model has been shown to work well for the analysis of implementations of several important computational kernels [19, 18, 23, 2, 9].
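As a condensed illustration, the ECM runtime prediction for a loop kernel with data in main memory on an Intel core can be sketched as follows (notation as in [18]; T_OL is the in-core time that overlaps with data transfers, T_nOL the non-overlapping LOAD/STORE time in the L1 cache):

```latex
% Condensed sketch of the ECM composition rule for Intel x86 as used in [18, 8]:
% overlapping in-core work competes with the sum of the non-overlapping L1 time
% and all inter-cache and memory transfer times.
T_\mathrm{ECM} = \max\bigl(T_\mathrm{OL},\; T_\mathrm{nOL} + T_\mathrm{L1L2} + T_\mathrm{L2L3} + T_\mathrm{L3Mem}\bigr)
```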

In order to construct analytic models accurately, data about the capabilities of the microarchitecture and how it interacts with the code at hand is needed. For floating-point centric code in scientific computing, maximum throughput and latency numbers for arithmetic and LOAD/STORE instructions are most useful in all their vectorized and non-vectorized, single (SP) and double precision (DP) variants. On Intel multicore CPUs up to Haswell, this encompasses scalar, streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX2 instructions. Modeling the memory hierarchy in the ECM model requires the maximum data bandwidth between adjacent cache levels (assuming that the hierarchy is inclusive) and the maximum (saturated) memory bandwidth. As for the caches it is usually sufficient to assume the maximum documented theoretical bandwidth (presupposing that all prefetchers work perfectly to hide latencies), although latency penalties might apply [9]. The main memory bandwidth and latency may depend on the cluster-on-die (CoD) mode and cache snoop mode settings. Finally, the latest Intel CPUs work with at least two clock speed domains: one for the core (or even individual cores) and one for the Uncore, which includes the L3 cache and memory controllers. Both are subject to automatic changes; in case of AVX code on Haswell and later CPUs the guaranteed baseline clock speed is lower than the standard speed rating of the chip. The performance and energy consumption of code depends crucially on the interplay between these clock speed domains. Moreover, especially when it comes to power dissipation and capping, considerable variations among specimens of the same CPU model can be observed.

All these intricate architectural details influence benchmark and application performance, and it is insufficient to look up the raw specs in a data sheet in order to understand this influence.

1.2 Related work

There is a large number of papers dealing with details in the architecture of CPUs and their impact on performance and energy consumption. In [1] the authors assessed the capabilities of the then-new Nehalem server processor for workloads in scientific computing and compared its capabilities with its predecessors and competing designs. In [17], tools and techniques for measuring and tuning power and energy consumption of HPC systems were discussed. The QuickPath Interconnect (QPI) snoop modes on the Haswell EP processor were investigated in [15]. Energy efficiency features, including the AVX and Uncore clock speeds, on the same architecture were studied in [4] and [7]. Our work differs from all those by systematically investigating relevant architectural features, from the core level down to memory, via microbenchmarks in view of analytic performance modeling as well as important benchmark workloads such as LINPACK, Graph500, and HPCG.

1.3 Contribution

Apart from confirming or highlighting some documented or previously published findings, this paper makes the following new contributions:

  • We present benchmark results showing the improvement in the performance of the vector gather instruction from HSW to BDW. On BDW it is now advantageous to actually use the gather instruction instead of “emulating” it.

  • We fathom the capabilities of the L2 cache on all four microarchitectures and establish practical limits for L2 bandwidth that can be used in analytic ECM modeling. These limits are far below the advertised 64 B/cy on HSW and BDW.

  • We study the bandwidth scalability of the L3 cache depending on the Cluster on Die (CoD) mode and show that, although the parallel efficiency for streaming code is never below 85%, CoD has a measurable advantage over non-CoD.

  • We present latency data for all caches and main memory under various cache snoop modes and CoD/non-CoD. We find that although CoD is best for streaming and non-uniform memory access (NUMA) aware workloads in terms of latency and bandwidth, highly irregular, NUMA-unfriendly code such as the Graph500 benchmark benefits dramatically from non-CoD mode with Home Snoop and Opportunistic Snoop Broadcast by as much as 50% on BDW.

  • We show how the Uncore clock speed on HSW and BDW has considerable impact on the power consumption of bandwidth- and cache-bound code, opening new options for energy efficient and power-capped execution.

2 Test bed

2.1 Hardware description

All measurements were performed on standard two-socket Intel Xeon servers. A summary of key specifications of the four generations of processors is shown in Table 1. According to Intel’s “tick-tock” model, a “tick” represents a shrink of the manufacturing process technology; however, it should be noted that “ticks” are often accompanied by minor microarchitectural improvements while a “tock” usually involves larger changes.

Microarchitecture          Sandy Bridge-EP      Ivy Bridge-EP        Haswell-EP           Broadwell-EP
Shorthand                  SNB                  IVB                  HSW                  BDW
Chip Model                 Xeon E5-2680         Xeon E5-2690 v2      Xeon E5-2695 v3      Xeon E5-2697 v4
Release Date               Q1/2012              Q3/2013              Q3/2014              Q1/2016
Base Freq.                 2.7 GHz              3.0 GHz              2.3 GHz              2.3 GHz
Max All Core Turbo Freq.   n/a                  n/a                  2.8 GHz              2.8 GHz
AVX Base Freq.             n/a                  n/a                  1.9 GHz              2.0 GHz
AVX All Core Turbo Freq.   n/a                  n/a                  2.6 GHz              2.7 GHz
Cores/Threads              8/16                 10/20                14/28                18/36
Latest SIMD Extensions     AVX                  AVX                  AVX2, FMA3           AVX2, FMA3
Memory Configuration       4 ch. DDR3-1600      4 ch. DDR3-1866      4 ch. DDR4-2133      4 ch. DDR4-2400
Theor. Mem. Bandwidth      51.2 GB/s            59.7 GB/s            68.2 GB/s            76.8 GB/s
L1 Cache Capacity          8×32 kB              10×32 kB             14×32 kB             18×32 kB
L2 Cache Capacity          8×256 kB             10×256 kB            14×256 kB            18×256 kB
L3 Cache Capacity          20 MB (8×2.5 MB)     25 MB (10×2.5 MB)    35 MB (14×2.5 MB)    45 MB (18×2.5 MB)
L1→Reg Bandwidth           2×16 B/cy            2×16 B/cy            2×32 B/cy            2×32 B/cy
Reg→L1 Bandwidth           1×16 B/cy            1×16 B/cy            1×32 B/cy            1×32 B/cy
L1↔L2 Bandwidth            32 B/cy              32 B/cy              64 B/cy              64 B/cy
L2↔L3 Bandwidth            32 B/cy              32 B/cy              32 B/cy              32 B/cy
Table 1: Key test machine specifications. All reported numbers are taken from data sheets; "n/a" marks specifications that only apply to HSW and BDW.

SNB (a “tock”) first introduced AVX, doubling the single instruction, multiple data (SIMD) width from SSE’s 128 bit to 256 bit. One major shortcoming of SNB is directly related to AVX: Although the SIMD register width has doubled and a second LOAD unit was added, data path widths between the L1 cache and individual LOAD/STORE units were left at 16 B/cy. This leads to AVX stores requiring two cycles to retire on SNB, and AVX LOADs block both units. IVB, a “tick”, saw an increase in core count as well as a higher memory clock; in addition, IVB brought speedups for several instructions, e.g., floating-point (FP) divide and square root; see Table 2 for details.

HSW, a “tock”, introduced AVX2, extending the existing 256 bit SIMD vectorization from floating-point to integer data types. Instructions introduced by the fused multiply-add (FMA) extension are handled by two new, AVX-capable execution units. Data path widths between the L1 cache and registers as well as the L1 and L2 caches were doubled. A vector gather instruction provides a simple means to fill SIMD registers with non-contiguous data, making it easier for the compiler to vectorize code with indirect accesses. To maintain scalability of the core interconnect, HSW chips with more than eight cores move from a single-ring core interconnect to a dual-ring design. At the same time, HSW introduced the new CoD mode, in which a chip is optionally partitioned into two equally sized NUMA domains in order to reduce latencies and increase scalability. Starting with HSW, the system’s QPI snoop mode can also be configured. HSW no longer guarantees to run at the base frequency with AVX code. The guaranteed frequency when running AVX code on all cores is referred to as “AVX base frequency,” which can be significantly lower than the nominal frequency [12, 14]. Also there is a separation of frequency domains between cores and Uncore. The Uncore clock is now independent and can either be set automatically (when Uncore frequency scaling (UFS) is enabled) or manually via model specific registers.

As a “tick,” BDW, the most recent Xeon-EP processor, offers minor architectural improvements. Floating-point and gather instruction latencies and throughput have partially improved. The dual-ring design was made symmetric and an additional QPI snoop mode is available.

2.2 Software and benchmarks

All high-level language benchmarks (Graph500, HPCG) were compiled using Intel ICC 16.0.3. For Graph500 we used the reference implementation in version 2.1.4, and for LINPACK we ran the Intel-provided binary contained in MKL 2017.1.013, the most recent version available at the time of writing.

The LIKWID tool suite (http://tiny.cc/LIKWID) in its current stable version 4.1.2 was employed heavily in many of our experiments. All low-level benchmarks consisted of hand-written assembly. When available (e.g., for streaming kernels such as the STREAM triad and others) we used the assembly implementations in the likwid-bench microbenchmarking tool. Latency measurements in the memory hierarchy were done with all prefetchers turned off (via likwid-features) and a pointer-chasing code that ensures consecutive cache line accesses. Energy consumption measurements were taken with the likwid-perfctr tool via the RAPL (Running Average Power Limit) interface, and the clock speed of the CPUs was controlled with likwid-setFrequencies.
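A minimal sketch of such a pointer-chasing kernel is shown below; working-set size, iteration count, and timing are illustrative choices, and the actual latency benchmarks are hand-written assembly.

```c
/* Sketch of a pointer-chasing latency benchmark: a chain of pointers with
 * consecutive cache-line accesses. Each load depends on the previous one,
 * so the loop time per iteration equals the access latency. Prefetchers
 * must be disabled (e.g., via likwid-features) for levels beyond L1,
 * otherwise the consecutive accesses would be prefetched. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CACHELINE 64

int main(int argc, char **argv) {
    size_t bytes = (argc > 1) ? strtoul(argv[1], NULL, 0) : 32 * 1024; /* working-set size */
    size_t n = bytes / CACHELINE;
    char *buf = aligned_alloc(CACHELINE, n * CACHELINE);

    /* Build the chain: element i points to element i+1 (consecutive cache
     * lines); the last element wraps around to the first. */
    for (size_t i = 0; i < n; ++i)
        *(void **)(buf + i * CACHELINE) = buf + ((i + 1) % n) * CACHELINE;

    const long iters = 100000000L;
    void *p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; ++k)
        p = *(void **)p;               /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* print p so the compiler cannot optimize the chase away */
    printf("avg latency: %.2f ns per access (final p=%p)\n", ns / iters, p);
    free(buf);
    return 0;
}
```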

3 In-core features

3.1 Core frequency

Starting with HSW, Intel chips offer different base and turbo frequencies for AVX and SSE or scalar instruction mixes. This is due to the higher power requirement of using all SIMD lanes in case of AVX. To reflect this behavior, Intel introduced a new frequency nomenclature for these chips.

The “base frequency,” also known as the “non-AVX base frequency” or “nominal frequency,” is the minimum frequency that is guaranteed when running scalar or SSE code on all cores. This is also the frequency the chip is advertised with, e.g., 2.30 GHz for the Xeon E5-2695 v3 in Table 1. The maximum frequency that can be achieved when running scalar or SSE code on all cores is called “max all core turbo frequency.” The “AVX base frequency” is the minimum frequency that is guaranteed when running AVX code on all cores and is typically significantly lower than the (non-AVX) base frequency. Analogously, the maximum frequency that can be attained when running AVX code is called “AVX max all core turbo frequency.”

On HSW, at least one core running AVX code results in a chip-wide frequency restriction to the AVX max all core turbo frequency. On BDW, cores running scalar or SSE code are allowed to float between the non-AVX base and max all core turbo frequencies even when other cores are running AVX code.

Figure 1: Attained chip frequency during LINPACK runs on all cores on (a) BDW and (b) HSW. (c) Variation of clock speed and package power among all 1456 Xeon E5-2630v4 CPUs in RRZE’s “Meggie” cluster running LINPACK.

All relevant values for the HSW and BDW specimens used can be found in Table 1. According to official documentation the frequency actually used depends on the workload; more specifically, it depends on the percentage of AVX instructions in a certain instruction execution window. To get a better idea about what to expect for demanding workloads, LINPACK and FIRESTARTER [3] were selected to determine those frequencies. The maximum frequency difference between both benchmarks was 20 MHz, so Figure 1 shows only results obtained with LINPACK. Figure 1a shows that BDW can maintain a frequency well above the AVX and the non-AVX base frequency for workloads running at its TDP limit of 145 W (measured package power during stress tests was 144.8 W). HSW, shown in Figure 1b, drops below the non-AVX base frequency of 2.3 GHz, but stays well above the AVX base frequency of 1.9 GHz while consuming 119.4 W out of a 120 W TDP. When running SSE LINPACK, BDW consumes 141.8 W and manages to run at the max all core turbo frequency of 2.8 GHz. On HSW, running LINPACK with SSE instructions still keeps the chip at its TDP limit (119.7 W out of 120 W); the attained frequency of 2.6 GHz is slightly below the max all core turbo frequency of 2.7 GHz.

While it might be tempting to generalize from these results, we must emphasize that statistical variations even between specimens of the same CPU type are very common [21]. When examining all 1456 Xeon E5-2630v4 (10-core, 2.2 GHz base frequency) chips of RRZE’s new “Meggie” cluster (http://www.hpc.rrze.fau.de/systeme/meggie-cluster.shtml), we found significant variations across the individual CPUs. The chip has a max all core turbo and AVX max all core turbo frequency of 2.4 GHz [14]. Figure 1c shows each chip’s frequency and package power when running LINPACK with SSE or AVX on all cores. With SSE code, each chip manages to attain the max all core turbo frequency of 2.4 GHz; however, a variation in power consumption can be observed. When running AVX code, not all chips reach the defined peak frequency, but they stay well above the AVX base frequency of 1.8 GHz. Some chips do hit the frequency ceiling; for these, a strong variation can be observed in the power domain.

3.2 Instruction throughput and latency

Accurate predictions of instruction execution (i.e., how many clock cycles it takes to execute a loop body assuming a steady state situation with all data coming from the L1 cache) are notoriously difficult in all but the simplest cases, but they are needed as input for analytic models. As a “lowest-order” and most optimistic approximation one can assume full throughput, i.e., all instructions can be executed independently and are dynamically fed to the execution ports (and the pipelines connected to them) by the out-of-order engine. The pipeline that takes the largest number of cycles to execute all its instructions determines the runtime. The worst-case assumption would be an execution fully determined by the critical path through the code, heeding all dependencies. In practice, the actual runtime will be between these limits unless other bottlenecks apply that are not covered by the in-core execution, such as data transfers from beyond the L1 cache, instruction cache misses, etc. Even if a loop body contains strong dependencies the throughput assumption may still hold if there are no loop-carried dependencies.
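The following C sketch illustrates the two limiting cases for a simple sum reduction; the functions and the unrolling factor are illustrative and not taken from the paper's benchmarks.

```c
/* With data in L1, the first loop is bound by the ADD latency because of the
 * loop-carried dependency on s; the second uses four partial sums, so
 * independent adds can be issued back to back (throughput limit), at the
 * cost of changing the summation order. */

/* latency-bound: each add must wait for the previous result */
double sum_naive(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* closer to the throughput limit: four independent dependency chains */
double sum_unrolled(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)   /* remainder loop */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```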

                      Latency [cy]              Inverse throughput [cy/inst.]
µarch                 BDW   HSW   IVB   SNB     BDW   HSW   IVB   SNB
vdivpd (AVX)          24    35    35    45      16    28    28    44
divpd (SSE)           14    20    20    22      8     14    14    22
divsd (scalar)        14    20    20    22      4.5   14    14    22
vdivps (AVX)          17    21    21    29      10    14    14    28
divps (SSE)           11    13    13    14      5     7     7     14
divss (scalar)        11    13    13    14      2.5   7     7     14
vsqrtpd (AVX)         35    35    35    44      28    28    28    43
sqrtpd (SSE)          20    20    20    23      14    14    14    22
sqrtsd (scalar)       20    20    20    23      7     14    14    22
vsqrtps (AVX)         21    21    21    23      14    14    14    22
sqrtps (SSE)          13    13    13    15      7     7     7     14
sqrtss (scalar)       13    13    13    15      4     7     7     14
vrcpps (AVX)          7     7     7     7       2     2     2     2
rcpps (SSE, scalar)   5     5     5     5       1     1     1     1
*add*                 3/4†  3     3     3       1     1     1     1
*mul*                 3     5     5     5       0.5   0.5   1     1
*fma*                 5/6‡  5/6§  n/a   n/a     0.5   0.5   n/a   n/a
† SP/DP AVX addition: 3 cycles; SP/DP SSE and scalar addition: 4 cycles
‡ SP/DP AVX FMA: 5 cycles; SP/DP SSE and scalar FMA: 6 cycles
§ SP scalar FMA: 6 cycles; all others: 5 cycles
Table 2: Measured worst-case latency and inverse throughput for floating-point arithmetic instructions. For all of these numbers, lower is better.

Calculating the throughput and critical path predictions requires information about the maximum throughput and latency of all relevant instructions as well as general limits such as decoder/retirement throughput, L1I bandwidth, and the number and types of address generation units. The Intel Architecture Code Analyzer (IACA, http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/) can help with this, but it is proprietary software with an unclear future development path and it does not always yield accurate predictions. Moreover, it can only analyze object code and does not work on high-level language constructs. Thus one must often resort to manual analysis to get predictions for the best possible code, even if the compiler cannot produce it. In Table 2 we give worst-case measured latency and inverse throughput numbers for arithmetic instructions in AVX, SSE, and scalar mode. In the following we point out some notable changes over the four processor generations.

The most profound change happened in the performance of the divide units. From SNB to BDW we observe a massive decrease in latency and an almost three-fold increase in throughput for AVX and SSE instructions, in single and double precision alike. Divides are still slow compared to multiply and add instructions, of course. The fact that the divide throughput per operation is the same for AVX and SSE is well known, but with BDW we see a significant rise in scalar divide throughput, even beyond the documented limit of one instruction every five cycles. The scalar square root instruction shows a similar improvement, but is in line with the documentation.

The standard multiply, add, and fused multiply-add instructions have not changed dramatically over four generations, with two exceptions: Together with the introduction of FMA instructions with HSW, it became possible to execute two plain multiply (but not add) instructions per cycle. The latency of the add instruction in scalar and SSE mode on BDW has increased from three to four cycles; this result is not documented by Intel for BDW but announced for AVX code in the upcoming Skylake architecture. The fma instruction shows the same characteristic (latency increase from 5 to 6 cycles when using SSE or scalar mode).

One architectural feature that is not directly evident from single-instruction measurements is the number of address generation units (AGUs). Up to IVB there are two such units, each paired with a LOAD unit with which it shares a port. As a consequence, only two addresses per cycle can be generated. HSW introduced a third AGU on the new port 7, but it can only handle simple addresses for STORE instructions, which may lead to some restrictions. See Sect. 3.3 for details.

3.3 L1 cache/AGU

The cores of all four microarchitectures feature two load units and one store unit. The data paths between each unit and the L1 cache are 16 B on SNB and IVB, and 32 B on HSW and BDW. The theoretical bandwidth is thus 48 B/cy on SNB and IVB and 96 B/cy on HSW and BDW; however, several restrictions apply.

An AVX-vectorized STREAM triad benchmark uses two AVX loads, one AVX FMA, and one AVX store instruction to update four DP elements. On HSW and BDW, only two address generation units are capable of performing the full address computations (base + scaled index + offset) typically used in streaming memory accesses; HSW’s newly introduced third store AGU can only perform offset computations. This means that only two addresses per cycle can be calculated, limiting the L1 bandwidth to 64 B/cy. STREAM triad performance using only two AGUs is shown in Figure 2a. One can make use of the new AGU by using one of the “fast LEA” units (which can perform only indexed but no offset addressing) to pre-compute an intermediate address, which is then used by the simple AGU to complete the address calculation. This way both AVX load units and the AVX store unit can be used simultaneously. When the store is paired with address generation on the new store AGU, both micro-ops are fused into a single micro-op. The front end’s retirement limit of four micro-ops per cycle should therefore not be a problem: in each cycle, two AVX load instructions, the micro-op-fused AVX store instruction, and one AVX FMA instruction are retired. With sufficient unrolling, loop instruction overhead becomes negligible and the bandwidth should approach 96 B/cy. Figure 2 shows, however, that micro-op throughput still seems to be the bottleneck, because bandwidth can be increased further by removing the FMA instructions from the loop body.

Figure 2: (a) L1 bandwidth achieved with STREAM triad and various optimizations on BDW. (b) Comparison of achieved L1 bandwidths using STREAM triad on all microarchitectures.

Figure 2b compares the bandwidths achievable by different microarchitectures (using no arithmetic instructions on HSW and BDW for the reasons described above). On SNB and IVB a regular STREAM triad code can almost reach maximum theoretical L1 performance because it only requires half the number of address calculations per cycle, i.e., two AGUs are sufficient to generate three addresses every two cycles.

3.4 Gather

Vector gather is a microcoded solution for loading non-contiguous data into vector registers. The instruction was first implemented in Intel multicore CPUs with AVX2 on HSW. The first implementation offered poor latency (i.e., the time until all data has been placed in the vector register), and using hand-written assembly to manually load distributed data into vector registers proved to be faster than the gather instruction in some cases [10].

Microarchitecture           Haswell-EP                    Broadwell-EP
Location of data            L1     L2     L3     Mem      L1     L2     L3     Mem
Distributed across 1 CL     12.3   12.3   12.4   15.5     7.3    7.3    7.7    13.3
Distributed across 2 CLs    12.5   12.5   13.2   23.0     7.5    7.6    11.0   24.5
Distributed across 4 CLs    12.5   12.7   20.6   42.7     7.5    9.9    20.0   47.5
Distributed across 8 CLs    12.3   18.4   38.5   89.3     7.3    18.1   38.2   94.4
Table 3: Time in cycles per gather instruction on HSW and BDW depending on the distribution of the data across cache lines (CLs).

Table 3 shows the gather instruction latency for both HSW and BDW. The latency depends on where the data is coming from and, in case the data is not in L1, on how many cache lines it is distributed across. We find that the instruction is 40% faster on BDW for data in L1. When data is coming from L2 on HSW and is distributed across eight CLs, the latency is dominated by the time required to transfer the eight CLs from L2 to the L1 cache. On BDW, this effect is already visible when data is coming from the L2 cache and distributed across four CLs. BDW’s improved implementation thus offers no returns when the latency is dominated by CL transfers, which is the case when loading more than four CLs from L2, two from L3, or one from memory.
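The following C intrinsics sketch contrasts the hardware gather with an "emulated" gather built from scalar loads, as discussed in [10]; the function names and the 32-bit index format are our illustrative choices (compile with -mavx2).

```c
#include <immintrin.h>

/* hardware gather: four doubles selected by 32-bit indices, scale 8 bytes */
static inline __m256d gather_hw(const double *a, const int *idx) {
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);
    return _mm256_i32gather_pd(a, vindex, 8);
}

/* "emulated" gather: scalar loads assembled into a vector register
 * (element 0 of the result is a[idx[0]], matching the gather semantics) */
static inline __m256d gather_emulated(const double *a, const int *idx) {
    return _mm256_set_pd(a[idx[3]], a[idx[2]], a[idx[1]], a[idx[0]]);
}
```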

4 L2 cache

According to official documentation, the L2 cache bandwidth on HSW was increased from 32 B/cy to 64 B/cy compared to IVB. To validate this expectation, knowledge about overlapping transfers in the cache hierarchy is required. The ECM model for x86 assumes that no CLs are transferred between L2 and L1 in any cycle in which a LOAD instruction retires. Hence, the maximum of 64 B/cy can never be attained by design but an improvement may still be expected. To derive the time spent transferring data, cycles in which load instructions are retired are subtracted from the overall runtime with an in-L2 working set. The resulting bandwidth should be compared with the documented theoretical maximum.
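Expressed as a formula (the symbol names below are ours), the effective L2 bandwidth reported in Table 4 is the L1↔L2 data volume per unit of work divided by the cycles not spent retiring LOAD instructions, i.e., the non-overlapping L1 time of the ECM model:

```latex
% Effective L2 bandwidth: L1<->L2 data volume divided by the cycles that
% remain after subtracting the cycles in which LOAD instructions retire.
B_\mathrm{L2} = \frac{V_{\mathrm{L1}\leftrightarrow\mathrm{L2}}}{T_\mathrm{total} - T_\mathrm{nOL}}
```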

Pattern         Code                  SNB   IVB   HSW   BDW
Dot product     dot += A[i]*B[i]      28    27    43    43
STREAM triad    A[i] = B[i]+s*C[i]    29    29    32    32
Table 4: Measured L1-L2 bandwidth in B/cy on different microarchitectures for dot product and STREAM triad access patterns.

Table 4 shows the measured bandwidths for a dot product (a load-only benchmark) and the STREAM triad. Both SNB and IVB operate near the specified bandwidth of 32 B/cy for both access patterns. Although HSW and BDW offer bandwidth improvements, especially in case of the dot product, measured bandwidths are significantly below the advertised 64 B/cy.

The question arises of how this result may be incorporated into the ECM model. Preliminary experiments indicate that the ECM predictions for in-L3 data are quite accurate when assuming theoretical L2 throughput. We could thus interpret the low L2 performance as a consequence of a latency penalty, which can be overlapped when the data is further out in the hierarchy. Further experiments are needed to substantiate this conjecture.

5 Uncore

5.1 L3 cache

5.1.1 Cluster-on-Die

Together with the dual-ring interconnect, HSW introduced the CoD mode, in which a single chip can be partitioned into two equally-sized NUMA clusters. HSW features a so-called “eight plus n” design, in which the first physical ring comprises eight cores and the second ring contains the remaining n cores (six for our HSW chip). This asymmetry leads to a scenario in which the seven cores of the first cluster domain are physically located on the first ring; the second cluster domain contains the remaining core from the first and the six cores from the second physical ring. The asymmetry was removed on BDW: here both physical rings are of equal size, so each cluster domain contains the cores of a dedicated ring. CoD is intended for NUMA-optimized code and impacts L3 scalability and latency as well as, implicitly, main memory bandwidth, because it uses a dedicated snoop mode that makes use of a directory to avoid unnecessary snoop requests (see Section 5.2.1 for more details).
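Since CoD makes each socket appear as two NUMA domains to the operating system, the usual NUMA placement techniques apply within a chip as well. A minimal sketch of first-touch initialization is shown below; OpenMP is used for illustration, and threads must be pinned to the intended cluster domain (e.g., with likwid-pin) for this to work as intended.

```c
/* Sketch of NUMA-aware ("first touch") allocation: under the default Linux
 * policy a page is mapped into the NUMA domain of the thread that touches
 * it first, so initialization should mirror the later access pattern.
 * Compile with -fopenmp. */
#include <stdlib.h>

double *alloc_numa_aware(long n) {
    double *a = malloc(n * sizeof(double));
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;   /* each page lands in the touching thread's domain */
    return a;
}
```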

Figure 3: (a) L3 scalability on HSW and BDW depending on whether CoD is used. (b) Comparison of microarchitectures regarding L3 scalability. (c) Absolute L3 bandwidth for STREAM triad as function of cores on different microarchitectures.

Figure 3a shows the influence of CoD on L3 bandwidth (using STREAM triad) for HSW and BDW. When data is distributed across both rings on HSW, the parallel efficiency of the L3 cache is 92%; it can be raised to 98% by using CoD. The higher core count of BDW results in a more pronounced effect; here, parallel efficiency is only 86% in non-CoD mode. Using CoD the efficiency goes above 95%. Figure 3b shows that HSW and BDW with CoD offer similar L3 scalability as SNB and IVB.

Assuming an n-core chip, the topological diameter (and with it the average distance from a core to data in an L3 segment) is smaller in each of the two n/2-core cluster domains than in the non-CoD configuration spanning all n cores. Shorter paths between cores and data result in lower latencies when using CoD mode. On BDW, the L3 latency is 41 cycles with CoD and 47 cycles without (see Table 5).

5.2 Memory

5.2.1 Snoop modes

Starting with HSW, the QPI snoop mode can be selected at boot time. HSW supports three snoop modes: early snoop (ES), home snoop (HS), and directory (DIR) (often only indirectly selectable by enabling CoD in the BIOS) [15, 11, 16]. BDW introduced a fourth snoop mode called HS+ opportunistic snoop broadcast (OSB) [16]. The remainder of this section discusses the differences among the modes and the immediate impact on memory latency and bandwidth.

On an L3 miss inside a NUMA domain, in addition to fetching the CL containing the requested data from main memory, cache coherence mandates that other NUMA domains be checked for modified copies of the CL. Attached to each L3 segment is a cache agent (CA) responsible for sending and receiving snoop information. In addition to the multiple CAs, each NUMA domain features a home agent (HA), which plays a major role in snooping.

In ES, snoop requests are sent directly from the CA of the L3 segment in which the L3 miss occurred to the responsible CAs in the other NUMA domains. (CLs are mapped to L3 segments based on their addresses according to a hashing function; thus, each CA knows which CA in another NUMA domain is responsible for a certain CL.) Queried remote CAs respond directly to the requesting CA; in addition, they report to the HA in the requesting CA’s domain so that it can resolve potential conflicts. ES involves a lot of requests and replies, but offers low latencies.

In HS, CAs forward snoop requests to their NUMA domain’s HA. The HA proceeds to fetch the requested CL from memory but stalls snoop requests to remote NUMA domains until the CL is available. For each CL, so-called directory information is stored in its memory ECC bits. These bits indicate whether a copy of the CL exists in other NUMA domains; they do not tell which NUMA domain to query, so snoops have to be broadcast to all NUMA domains. By waiting for the directory data, unnecessary snoop requests are avoided at the cost of higher latency due to delayed snoops. By reducing snoop requests, overall bandwidth can be increased. As in ES, the queried remote CAs respond to the initiating CA and the HA, which resolves potential conflicts.

In DIR, a two-step approach is used. Starting with HSW, each HA features a 14 kB directory cache (also called “HitMe” cache) holding additional directory information for CLs present in remote NUMA domains; investigations using the HITME_* performance counter events indicate that this cache is used exclusively in DIR mode. In addition to the directory information recorded in the ECC bits, the directory cache stores the particular NUMA domain in which the copy of a CL resides; this means that on a hit in the directory cache only a single snoop request has to be sent. This mechanism further reduces snoop traffic, potentially increasing bandwidth. When the directory cache is hit, latency is also improved in DIR compared to HS, because snoops are not delayed until the directory information stored in the ECC bits becomes available from main memory. In case of a directory cache miss, DIR mode proceeds like HS. Note, however, that DIR mode is recommended only for NUMA-aware workloads: the directory cache can only hold data for a small number of CLs, and if the number of CLs shared between both cluster domains exceeds its capacity, DIR mode degrades to HS mode, resulting in high latencies.

BDW’s new HS+OSB mode works similarly to HS. However, HAs will send opportunistic snoop requests while waiting for directory information stored in the ECC bits under “light” traffic conditions. Latency is reduced in case the directory information indicates snoop requests have to be sent, because they were already sent opportunistically. Redundant snoop requests are not supposed to impact performance under “light” traffic conditions.

µarch   L1   L2   L3                          MEM
SNB     4    12   40                          230
IVB     4    12   40                          208
HSW     4    12   37 (CoD)                    168 (DIR)
BDW     4    12   47 (non-CoD), 41 (CoD)      248 (ES), 280 (HS), 190 (HS+OSB), 178 (DIR)
Table 5: Measured access latencies of all memory hierarchy levels in core cycles at base frequency.

The impact of snoop modes is largest on main memory latency. As expected, DIR produces the best result with 178 cy (see Table 5). Pointer chasing in main memory does not generate much traffic on the ring interconnect, which is why HS+OSB will send opportunistic snoops, achieving a latency of 190 cy. The difference of 12 cy compared to DIR can be explained by the shorter paths inside a single cluster domain in CoD mode: we measured an L3 latency of 41 cy with CoD and 47 cy without, and since memory accesses traverse the interconnect twice (once to request the CL, once to deliver it), the memory latency in non-CoD mode is expected to be higher by twice the L3 latency penalty of six cycles. In ES, the requesting CA has to wait for its HA to acknowledge that it has received all snoop replies from the remote CAs, which causes a latency penalty; on BDW, the measured memory latency is 248 cy. As expected, HS offers the worst latency at 280 cy, because the necessary snoop broadcasts are delayed until directory information becomes available from main memory.
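The BDW numbers from Table 5 are consistent with this reasoning:

```latex
% Twice the L3 latency penalty of CoD vs. non-CoD matches the memory latency
% difference between HS+OSB (non-CoD) and DIR (CoD):
2 \times (47\,\mathrm{cy} - 41\,\mathrm{cy}) = 12\,\mathrm{cy} \approx 190\,\mathrm{cy} - 178\,\mathrm{cy}
```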

Figure 4: (a) Graph500 performance in millions of traversed edges per second (MTEP/s) as function of snoop mode on BDW. (b) Graph500 performance of all chips. (c) HPCG performance and performance per Watt as function of Uncore frequency.

Graph500 was chosen to evaluate the influence of snoop modes on the performance of latency-sensitive workloads. Figure 4a shows Graph500 performance for a single BDW chip. A direct correlation between latency and performance can be observed for HS, ES, and HS+OSB. DIR mode performs worst despite offering the best memory latency. This can be explained by the lack of NUMA awareness in the Graph500 benchmark: too much data is shared between both cluster domains, so the directory cache cannot hold information on all shared CLs. As a result, snoops are delayed until directory information from main memory becomes available. Figure 4b shows an overview of Graph500 performance on all chips and the qualitative improvement offered by the new HS+OSB snoop mode introduced with BDW.

Figure 5: Sustained main memory bandwidth on BDW for various access patterns. NT=nontemporal stores.

The effect of the snoop mode on memory bandwidth for BDW is shown in Fig. 5. The data is roughly in line with the reasoning above. For NUMA-aware workloads, DIR should produce the least snoop traffic due to the snoop information stored in the directory cache. This is reflected in a slightly better bandwidth compared to the other snoop modes (with the exception of the non-temporal (NT) store access pattern, which appears to be a pathological case for DIR mode). DIR offers up to 10 GB/s more for load-only access patterns compared to ES, which produces the most snoop traffic. The effect is less pronounced but still observable when comparing DIR to HS and HS+OSB. Figure 6 shows the evolution of sustained memory bandwidth for all examined microarchitectures, using the best snoop mode on HSW and BDW. The increase in bandwidth over the generations is explained by new DDR standards as well as increased memory clock speeds (see Table 1).

Figure 6: Comparison of sustained main memory bandwidth across microarchitectures for various access patterns.

5.3 Uncore clock, bandwidth, and energy efficiency

Before HSW, the Uncore was clocked at the same frequency as the cores. Starting with HSW, the Uncore has its own clock domain. The motivation for this lies in potential energy savings: when the cores do not require much data via the Uncore (i.e., from/to the L3 cache and main memory), the Uncore can be slowed down to save power. This mode of operation is called UFS. For our BDW chip, the Uncore frequency can vary automatically between 1.2 and 2.8 GHz, but one can also define custom minimum and maximum settings within this range via MSRs.
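For reference, a sketch of how such a setting can be applied through the msr driver is shown below. The register address (0x620) and its bit layout (bits 6:0 = maximum ratio, bits 14:8 = minimum ratio, in multiples of 100 MHz) are assumptions based on commonly reported values for HSW-EP/BDW-EP and should be verified against the documentation; in our experiments we used likwid-setFrequencies instead.

```c
/* Sketch of pinning the Uncore clock via /dev/cpu/<n>/msr (requires root and
 * the msr kernel module). The MSR address 0x620 and the ratio bit fields are
 * ASSUMPTIONS, not taken from the paper; verify before use. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int set_uncore_ratio(int cpu, unsigned min_ratio, unsigned max_ratio) {
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    /* bits 6:0 = max ratio, bits 14:8 = min ratio (x 100 MHz) */
    uint64_t val = ((uint64_t)(min_ratio & 0x7f) << 8) | (max_ratio & 0x7f);
    int ok = pwrite(fd, &val, sizeof(val), 0x620) == (ssize_t)sizeof(val);
    close(fd);
    return ok ? 0 : -1;
}

int main(void) {
    /* e.g., fix the Uncore clock of socket 0 to 1.8 GHz (ratio 18) by writing
     * the MSR on one core of that socket */
    return set_uncore_ratio(0, 18, 18);
}
```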

We examine the default UFS behavior for both extremes of the Roofline spectrum, using HPCG as a bandwidth-bound and LINPACK as a compute-bound benchmark. Our findings indicate that at both ends of the spectrum, UFS tends to select higher Uncore frequencies than necessary, needlessly increasing power consumption and, in the case of LINPACK, even hurting performance.

Figure 4c shows HPCG performance and energy efficiency versus Uncore frequency for a fixed core clock of 2.3 GHz on HSW. We find that the Uncore is the performance bottleneck only for Uncore frequencies below 2.0 GHz; increasing it beyond this point does not improve performance, because main memory is then the bottleneck. Using performance counters, the Uncore frequency was determined to be at its maximum of 2.8 GHz when running HPCG in UFS mode. The energy efficiency of 64.7 MFLOP/s/W at 2.8 GHz is 26% lower than the 87.2 MFLOP/s/W observed at 2.0 GHz Uncore frequency, at almost the same performance. Energy efficiency can be increased even more by further lowering the Uncore clock; however, below 2.0 GHz performance is degraded.

Figure 7: LINPACK performance on BDW as a function of core and uncore frequency.

For LINPACK, we observe a particularly interesting side effect of varying the Uncore frequency. Figure 7 shows LINPACK performance on BDW as a function of core and Uncore clock. Note that in Turbo mode, performance increases when going from the highest Uncore frequencies towards 1.8 GHz. This effect is caused by the Uncore and the cores competing for the chip’s TDP: when the Uncore clock speed is reduced, a larger part of the chip’s power budget can be consumed by the cores, which in turn boost their frequency. The core frequency in Turbo mode is 2479 MHz when the Uncore clock is set to 2.8 GHz (the Uncore actually only achieves a clock rate of 2475 MHz) vs. 2595 MHz when the Uncore clock is set to 1.8 GHz. Below 1.8 GHz the core frequency increases further, e.g., to 2617 MHz at an Uncore clock of 1.7 GHz and up to 2720 MHz at an Uncore clock of 1.2 GHz. Nevertheless, LINPACK performance starts to degrade at this point despite the increasing core frequency, because the Uncore becomes a data bottleneck. In UFS mode, the Uncore is clocked at 2489 MHz and the cores run at 2491 MHz. Compared to the optimum, UFS degrades performance by 3%. Energy efficiency is reduced by 6%, from 4.94 GFLOP/s/W at an Uncore clock of 1.8 GHz to 4.65 GFLOP/s/W in UFS. The most energy-efficient operating point for LINPACK is 5.74 GFLOP/s/W at a core clock of 1.6 GHz and an Uncore clock of 1.2 GHz.

6 Conclusions and outlook

We have conducted an analysis of core- and chip-level performance features of four recent Intel server CPU architectures. Previous findings about the behavior of clock speed and its interaction with thermal design limits on Haswell and Broadwell CPUs could be confirmed. Overall the documented instruction latency and throughput numbers fit our measurements, with slight deviations in scalar DP divide throughput and SSE/scalar add and fused multiply-add latency on Broadwell. We could also demonstrate the consequences of limited instruction throughput and the special properties of Haswell’s and Broadwell’s address generation units for L1 cache bandwidth.

Our microbenchmark results have unveiled that the gather instruction, which was newly introduced with the AVX2 instruction set, was finally implemented on Broadwell in a way that makes it faster than hand-crafted assembly. The L2 cache on Haswell and Broadwell does not keep its promise of doubled bandwidth to L1 but only delivers between 32 and 43 B/cy, as opposed to Sandy Bridge and Ivy Bridge, which get close to their architectural limit of 32 B/cy.

The scalable L3 cache was one of the major innovations in the Sandy Bridge architecture. On Haswell and Broadwell, the bandwidth scalability of the L3 cache is substantially improved in Cluster on Die (CoD) mode. Even without CoD the full-chip efficiency (at up to 18 cores) is never worse than 85%. In the memory domain we find, unsurprisingly, that CoD provides the lowest latency and highest memory bandwidth (except with streaming stores), but the irregular Graph500 benchmark shows a 50% speedup on Broadwell when switching to non-CoD and Home Snoop with Opportunistic Snoop Broadcast.

Finally, our analysis of core and Uncore clock speed domains has exhibited significant potential for saving energy in a sensible setting of the Uncore frequency, without sacrificing execution performance.

Future work will include a thorough evaluation of the ECM performance model on all recent Intel architectures, putting to use the insights generated in this study. Additionally, existing analytic power and energy consumption models will be extended to account for the Uncore power more accurately. Significant changes in performance and power behavior are expected for the upcoming Skylake architecture, such as (among others) an L3 victim cache and AVX-512 on selected models, and will pose challenges of their own.

References

  • [1] Barker, K., Davis, K., Hoisie, A., Kerbyson, D.J., Lang, M., Pakin, S., Sancho, J.C.: A Performance Evaluation of the Nehalem Quad-core Processor for Scientific Computing. Parallel Processing Letters 18(4), 453–469 (December 2008), http://dx.doi.org/10.1142/S012962640800351X
  • [2] Gasc, T., Vuyst, F.D., Peybernes, M., Poncet, R., Motte, R.: Building a more efficient Lagrange-remap scheme thanks to performance modeling. In: Papadrakakis, M., et al. (eds.) Proc. ECCOMAS Congress 2016, the VII. European Congress on Computational Methods in Applied Sciences and Engineering, Crete Island, Greece, 5–10 June 2016 (2016), https://www.eccomas2016.org/proceedings/pdf/12210.pdf
  • [3] Hackenberg, D., Oldenburg, R., Molka, D., Schöne, R.: Introducing FIRESTARTER: A processor stress test utility. In: 2013 International Green Computing Conference Proceedings. pp. 1–9 (June 2013)
  • [4] Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An energy efficiency feature survey of the Intel Haswell processor. In: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. pp. 896–904 (May 2015)
  • [5] Hager, G., Treibig, J., Habich, J., Wellein, G.: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency Computat.: Pract. Exper. (2013), DOI: 10.1002/cpe.3180
  • [6] Hockney, R.W., Curington, I.J.: f_1/2: A parameter to characterize memory and communication bottlenecks. Parallel Computing 10(3), 277–286 (1989)
  • [7] Hofmann, J., Fey, D.: An ECM-based energy-efficiency optimization approach for bandwidth-limited streaming kernels on recent Intel Xeon processors. In: Proceedings of the 4th International Workshop on Energy Efficient Supercomputing. pp. 31–38. E2SC ’16, IEEE Press, Piscataway, NJ, USA (2016), https://doi.org/10.1109/E2SC.2016.16
  • [8] Hofmann, J., Fey, D., Eitzinger, J., Hager, G., Wellein, G.: Analysis of Intel’s Haswell Microarchitecture Using the ECM Model and Microbenchmarks, pp. 210–222. Springer International Publishing, Cham (2016), http://dx.doi.org/10.1007/978-3-319-30695-7_16
  • [9] Hofmann, J., Fey, D., Riedmann, M., Eitzinger, J., Hager, G., Wellein, G.: Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors. Concurrency and Computation: Practice and Experience pp. n/a–n/a (2016), http://dx.doi.org/10.1002/cpe.3921
  • [10] Hofmann, J., Treibig, J., Hager, G., Wellein, G.: Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. pp. 57–64. WPMVP ’14, ACM, New York, NY, USA (2014), http://doi.acm.org/10.1145/2568058.2568068
  • [11] Intel Corporation: Intel Xeon Processor E5-1600, E5-2400, and E5-2600 v3 Product Families - Volume 2 of 2, Registers, http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf
  • [12] Intel Corporation: Intel Xeon Processor E5 v3 Product Family, http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf
  • [13] McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter pp. 19–25 (Dec 1995)
  • [14] Microway Inc.: Detailed specifications of the Intel Xeon E5-2600 v4 “Broadwell-EP” processors
  • [15] Molka, D., Hackenberg, D., Schöne, R., Nagel, W.E.: Cache coherence protocol and memory performance of the Intel Haswell-EP architecture. In: Proceedings of the 44th International Conference on Parallel Processing (ICPP’15). IEEE (2015)
  • [16] Sailesh Kottapalli, Vedaraman Geetha, Henk G. Neefs, Youngsoo Choi: Patent US20130007376 A1: Opportunistic snoop broadcast (osb) in directory enabled home snoopy systems, http://www.google.com/patents/US20130007376
  • [17] Schöne, R., Treibig, J., Dolz, M.F., Guillen, C., Navarrete, C., Knobloch, M., Rountree, B.: Tools and methods for measuring and tuning the energy efficiency of HPC systems. Scientific Programming 22(4), 273–283 (2014), http://dx.doi.org/10.3233/SPR-140393
  • [18] Stengel, H., Treibig, J., Hager, G., Wellein, G.: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. In: Proceedings of the 29th ACM International Conference on Supercomputing. ICS ’15, ACM, New York, NY, USA (2015), http://doi.acm.org/10.1145/2751205.2751240
  • [19] Treibig, J., Hager, G., Hofmann, H.G., Hornegger, J., Wellein, G.: Pushing the limits for medical image reconstruction on recent standard multicore processors. The International Journal of High Performance Computing Applications 27(2), 162–177 (2013), http://dx.doi.org/10.1177/1094342012442424
  • [20] Treibig, J., Hager, G., Wellein, G.: likwid-bench: An extensible microbenchmarking platform for x86 multicore compute nodes. In: Parallel Tools Workshop. pp. 27–36 (2011)
  • [21] Wilde, T., Auweter, A., Shoukourian, H., Bode, A.: Taking Advantage of Node Power Variation in Homogenous HPC Systems to Save Energy, pp. 376–393. Springer International Publishing, Cham (2015), http://dx.doi.org/10.1007/978-3-319-20119-1_27
  • [22] Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009), http://doi.acm.org/10.1145/1498765.1498785
  • [23] Wittmann, M., Hager, G., Zeiser, T., Treibig, J., Wellein, G.: Chip-level and multi-node analysis of energy-optimized lattice Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience 28(7), 2295–2315 (2016), http://dx.doi.org/10.1002/cpe.3489