BOPS, Not FLOPS! A New Metric and Roofline Performance Model For Datacenter Computing

01/28/2018
by Lei Wang, et al.

The past decades have witnessed FLOPS (Floating-point Operations Per Second) serving as an important computation-centric performance metric. However, for datacenter (in short, DC) computing workloads, such as Internet services or big data analytics, previous work reports extremely low floating point operation intensity: the average FLOPS efficiency is only 0.1%, while the average IPC is 1.3 (the theoretical IPC is 4 on the Intel Xeon E5600 platform). Furthermore, we reveal that the traditional FLOPS-based Roofline performance model is not suitable for modern DC workloads and gives misleading information for system optimization. These observations imply that FLOPS is inappropriate for evaluating DC computer systems. To address this issue, we propose a new computation-centric metric, BOPs (Basic OPerations), which measures the efficient work defined by the source code and includes floating-point operations as well as the arithmetic, logical, comparing, and array addressing parts of integer operations. We define BOPS as the average number of BOPs per second, and propose replacing FLOPS with BOPS to measure DC computer systems. On the basis of BOPS, we propose a new Roofline performance model for DC computing, which we call the DC-Roofline model, with which we optimize DC workloads, achieving performance improvements ranging from 1.1X to 4.4X.


1 Introduction

To perform big data analysis or provide Internet services, more and more organizations are building internal datacenters or renting hosted ones. As a result, DC (datacenter) computing has become a new paradigm of computing. In terms of market share, DC has outweighed HPC (High Performance Computing), which now accounts for only 20% of the total [1].

To measure the performance of a DC, the wall clock time is used as the ground-truth metric. In practice, several user-perceived metrics, derivatives of the wall clock time, are used to measure application-specific systems, such as transactions per minute for online transaction systems [2] and input data processed per second for big data analysis systems [25]. However, these user-perceived metrics have two limitations. First, different user-perceived metrics cannot be used for apples-to-apples comparison: for example, transactions per minute (TPM) and data processing capability per second (GB/s) are not directly comparable. Second, user-perceived metrics can hardly measure the upper bound performance of a computer system, which is the foundation of a performance model. A single computation-centric metric like FLOPS, by contrast, is simple but powerful in the system and architecture community. Its values can be obtained from the micro-architecture of the system, from a specific micro benchmark, and from real-world workloads; by combining these numbers, we can build an upper bound model, which allows us to better understand the performance ceiling of the computer system and then guide system co-design. As the most important computation-centric metric, FLOPS (FLoating-point Operations Per Second) [11] and its upper bound model, the Roofline model, have driven progress in computing technology, not limited to high performance computing (HPC), for many years. So a natural question arises: what is the right metric for DC, and is it still FLOPS?

Different from HPC, DC has many unique characteristics. For typical DC workloads, the average floating point instruction ratio is only 1% and the average FLOPS efficiency is only 0.1%, while the average IPC (Instructions Per Cycle) is 1.1 (the theoretical IPC is 4 for the experimental platform). We also find that the FLOPS gap between two systems equipped with Intel Xeon and Intel Atom processors is 12X, but the average user-perceived performance gap of the DC workloads is only 7.4X. These observations imply that FLOPS is inappropriate for measuring DC systems, whether from the perspective of performance gaps between different systems or of system efficiency. OPS (operations per second) is another computation-centric metric. OPS [27] was initially proposed for digital processing systems and is defined as the number of 16-bit addition operations per second. The definition of OPS has been extended to artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of these definitions are in terms of a specific operation, such as a specific matrix multiplication operation. However, such matrix operations are only a fraction of the diverse operations in DC workloads. So, OPS only suits specific accelerators, not general DC computing systems.

In this paper, inspired by FLOPS [11], we propose Basic OPerations per Second (BOPS for short) to evaluate DC computing systems. The contributions of this paper are as follows.

First, we propose BOPs (Basic OPerations), which include the arithmetic, logical, comparing and array addressing parts of integer and floating point computations. We define BOPS as the average number of BOPs per second. BOPs can be calculated at the source code level of the application, independent of the underlying system implementation, so it is fair for evaluating different computing systems and facilitates co-design of systems and architectures. We also take three systems equipped with three typical Intel processors as examples to illustrate that BOPS not only truly reflects the performance gaps across different DC systems, but also reflects the system efficiency of a DC system. The bias between the BOPS gap and the average user-perceived performance gap is no more than 11%, and the BOPS efficiency of the Sort workload reaches 32%.

Second, using several typical micro benchmarks, we measure the upper bound performance under different optimization methods, and then propose a BOPS-based Roofline model, named DC-Roofline. DC-Roofline not only depicts the performance ceilings of DC workloads on target systems, but also helps guide the optimization of DC computing systems. For Sort, a typical DC kernel workload, the performance improvement is 4.4X.

Third, a real-world DC workload often has millions of lines of code and tens of thousands of functions, so it is not easy to use the DC-Roofline model directly. We propose a new optimization methodology: we profile the hotspot functions of the real-world workload and extract the corresponding kernel workloads; the real-world application then gains performance benefits by merging the optimization methods of the kernel workloads, which are tuned under the guidance of DC-Roofline. Through experiments, we demonstrate that Redis, a typical real-world workload, gains a performance improvement of 1.2X.

The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background and motivations. Section 4 defines BOPs and reports how to use it. Section 5 introduces the BOPS-based DC-Roofline model and its usage. Section 6 presents how to optimize the performance of real-world DC workloads under the DC-Roofline model. Section 7 concludes.

2 Related Work

In this section, we introduce the related work from two perspectives: metrics and performance models.

2.1 Metrics

The wall clock time is the most basic performance metric for a computer system [39], and almost all other performance metrics are derived from it. Based on the wall clock time, performance metrics can be classified into two categories. One is user-perceived metrics, which can be intuitively perceived by the user, such as TPM (transactions per minute). The other is computation-centric metrics, which are related to specific computation operations, such as FLOPS (FLoating-point Operations Per Second).

User-perceived metrics can be further classified into two categories: metrics for the whole system and metrics for components of the system. Examples of the former include data sorted in one minute (MinuteSort), which measures the sorting capability of a system [5], and transactions per minute (TPM) for online transaction systems [2]. Examples of the latter include SPECspeed/SPECrate for the CPU component [3], input/output operations per second (IOPS) for the storage component [14], and data transfer latency for the network component [13].

There are many computation-centric metrics. FLOPS (FLoating-point Operations Per Second) measures computer systems especially in fields of scientific computing that make heavy use of floating-point calculations [11]; the wide recognition of FLOPS indicates the maturation of high performance computing. MIPS (Million Instructions Per Second) [22] is another famous computation-centric metric, defined as the number of millions of instructions the processor can execute per second. The main limitation of MIPS is that it is architecture-dependent. MIPS has many derivatives, including MWIPS and DMIPS [28], which use synthetic workloads to evaluate floating point operations and integer operations, respectively. WSMeter [37], defined as the quota-weighted sum of the MIPS (measured via IPC) of a job, is also a derivative of MIPS and hence also architecture-dependent. Unfortunately, modern datacenters are heterogeneous, consisting of different generations of hardware.

OPS (operations per second) is another computation-centric metric. OPS [27] was initially proposed for digital processing systems and is defined as the number of 16-bit addition operations per second. The definition of OPS has since been extended to Intel's Ubiquitous High-Performance Computing [35] and to artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of these definitions are in terms of a specific operation: OPS refers to 8-bit matrix multiplication operations in the TPU and to 16-bit integer operations in the Cambricon processor. However, the workloads of modern DCs are comprehensive and complex, and a bias toward a specific operation cannot ensure evaluation fairness.

For each kind of metric, corresponding tools or benchmarks [39] are proposed to obtain its values. For the user-perceived metrics SPECspeed/SPECrate, SPEC CPU is the benchmark suite [3] that measures the CPU component. For the TPM metric, TPC-C [2] is the benchmark suite that measures online transaction systems. Another example is the Sort benchmark [5] for MinuteSort. For computation-centric metrics, Whetstone [21] and Dhrystone [32] are the measurement tools for MWIPS and DMIPS, respectively. HPL [11] is a widely-used measurement tool for FLOPS. As a microbenchmark, HPL demonstrates a sophisticated design: the proportion of floating-point additions to multiplications in HPL is 1:1, so as to fully utilize the FPU of the modern processor.

2.2 Performance model

A performance model can depict and predict the performance of a specific system. There are two categories of performance models: the analytical model, which uses stochastic/statistical analytical methods to depict and predict system performance, and the bound model, which is relatively simpler and only depicts the performance bound or bottleneck of a system. Previous work [34, 29, 9, 37] uses stochastic/statistical analytical models to predict system performance. In practice, distributed and parallel systems exhibit many uncertain behaviors, so it is hard to build accurate prediction models for them; instead, bound and bottleneck analysis is more suitable. Amdahl's Law [7] is one of the most famous performance bound models for parallel processing computer systems, and it is also used for big data systems [16]. The Roofline model [33] is another famous performance bound model. The original Roofline model adopts FLOPS as its metric. On the basis of the operational intensity (OI), defined as the total number of floating point operations divided by the total bytes of memory accesses, the Roofline model can depict the upper bound performance of a given workload when different optimization strategies are applied on the target system.

3 Background and Motivations

3.1 Background

The diversity and complexity of modern DC workloads raise great challenges for quantitatively depicting the performance of a computer system spanning multiple domains: algorithm, programming, compiling, system development, and architecture design. The computation-centric metric and the performance model are two key elements for quantitatively depicting the performance of a system [39, 33].

3.1.1 The Computation-centric Metric

For the system and architecture community, a computation-centric metric such as FLOPS is a fundamental yardstick that reflects the attainable performance of, and the gaps across, different systems and architectures. Generally, a computation-centric metric has a performance upper bound on a specific architecture, determined by the micro-architecture design. For example, the peak FLOPS is computed as follows.

$FLOPS_{peak} = CPU\_number \times Core\_number \times Frequency \times FLOPs\_per\_cycle$ (1)

A measurement tool is used to measure the performance of systems and architectures in terms of the metric, and to report the gap between the measured value and the theoretical peak. For example, HPL [11] is a widely-used measurement tool in terms of FLOPS. The FLOPS efficiency of a specific system is the ratio of HPL's measured FLOPS to the peak FLOPS.

3.1.2 The Upper Bound Performance Model

The computation-centric metric is the foundation of a system performance model, and the performance model of a computer system can depict and predict the workload performance of that system. For example, the Roofline model [33] is a famous upper bound model based on FLOPS, and much system optimization work [19, 15] in the HPC domain has been performed on the basis of the Roofline model. The Roofline model is formulated as follows.

$Performance_{attained} = \min(FLOPS_{peak},\ Bandwidth_{peak} \times OI)$ (2)

The above equation indicates that the attained performance bound of a workload on a specific platform is limited by both the computing capacity of the processor and the bandwidth of memory. $FLOPS_{peak}$ and $Bandwidth_{peak}$ are the peak floating-point performance and peak memory bandwidth of the platform, and the operation intensity ($OI$) is the total number of floating point operations divided by the total bytes of memory accesses. To identify the bottleneck and guide the optimization, ceilings (for example, the ILP and SIMD optimizations) can be added to provide performance tuning guidance.

3.2 Requirements of the DC computing metric

We define the requirements from the following three perspectives.

First, the metric should reflect the performance gaps among different DC systems. User-perceived metrics always reflect the delivered performance. For example, data processed per second (GB/s), which divides the input data size by the total running time, is a user-perceived metric that effectively reflects data processing capability. A computation-centric metric should preserve this characteristic and reflect the performance gap.

Second, the metric should facilitate hardware and software co-design of DC systems. For the co-design of different layers of the system stack, i.e., application, system software and hardware, the metric should support measurement at different levels, spanning the source code, binary code, and hardware instruction levels.

Third, the metric should reflect the upper bound performance of a specific system. Across different system designs, the metric should be sensitive to design decisions and reflect the theoretical performance upper bound. The gap between real and theoretical values is then useful for understanding performance bottlenecks and guiding optimizations.

3.3 The characteristics of DC Workloads

In this subsection, we characterize DC workloads against traditional benchmarks. We choose the DCMIX benchmark suite as the DC workloads. DCMIX is designed for modern datacenter computing systems: it has 17 typical datacenter workloads, covering online services and data analysis, with latencies ranging from microseconds to minutes. The applications of DCMIX involve big data, artificial intelligence, transaction processing databases, and so on. As shown in Table 1, there are two types of benchmarks in DCMIX: Micro-benchmarks (kernel workloads) and Component benchmarks (real DC workloads).

Name | Type | Domain | Category
Sort | offline analytics | Big Data | MicroBench
Count | offline analytics | Big Data | MicroBench
MD5 | offline analytics | Big Data | MicroBench
Multiply | offline analytics | AI | MicroBench
FFT | offline analytics | AI | MicroBench
Union | offline analytics | OLTP | MicroBench
Redis | online service | Big Data | Component
Xapian | online service | Big Data | Component
Masstree | online service | Big Data | Component
Bayes | offline analytics | Big Data | Component
Img-dnn | online service | AI | Component
Moses | online service | AI | Component
Sphinx | online service | AI | Component
Alexnet | offline analytics | AI | Component
Silo | online service | OLTP | Component
Shore | online service | OLTP | Component
Table 1: Workloads of the DCMIX

For traditional benchmarks, we choose HPCC, PARSEC, and SPEC CPU. We use HPCC 1.4, a representative HPC benchmark suite, and run all seven of its benchmarks. PARSEC is a benchmark suite composed of multi-threaded programs; we deploy PARSEC 3.0. For SPEC CPU2006, we run the official floating-point benchmarks (SPECFP) with the first reference inputs.

The experimental platform is the same as that in Section 4: we choose the Intel Xeon E5645, a typical brawny-core processor (OoO execution, four-wide instruction issue).

We choose GIPS (Giga Instructions Per Second) and GFLOPS (Giga Floating-point Operations Per Second) as the performance metrics, both derived from the wall clock time. As efficiency metrics, we choose IPC, CPU utilization and memory bandwidth utilization.

Figure 1: GIPS and FLOPS of Workloads.

As shown in Fig. 1 (note that the Y axis is in logarithmic coordinates), the average GFLOPS of DC workloads is two orders of magnitude lower than that of traditional benchmarks, while the GIPS of DC workloads is of the same order of magnitude. Furthermore, the average IPC of DC workloads is 1.1 versus 1.4 for traditional benchmarks, and the average CPU utilization of DC workloads is 70% versus 80%. These numbers imply that DC workloads utilize system resources about as efficiently as traditional benchmarks, so the poor FLOPS efficiency does not stem from low execution efficiency. Rather, the floating point operation intensity of DC workloads (0.05 on average) is much lower, which leads to the low FLOPS efficiency.

To analyze the execution characteristics of DC workloads, we further examine the instruction mixture. Fig. 2 shows the breakdown of retired instructions, from which we make three observations. First, for DC workloads, the ratio of integer to floating point operations is 38, while the ratios for HPCC, PARSEC and SPECFP are 0.3, 0.4, and 0.02, respectively; this is the main reason why FLOPS does not work for DC computing. Second, DC workloads have more branch instructions, at a ratio of 19%, while the ratios of HPCC, PARSEC and SPECFP are 16%, 11%, and 9%, respectively. Third, data movement related operations, which include loads, stores, and address calculations (address calculation instructions come from both floating point and integer instructions), account for roughly 73% of instructions. Together with branch instructions, data movement related operations and branch instructions account for 92%.

Figure 2: Instructions Mixture of DC Workloads.

3.4 What Should Be Included in Our New Metric for DC Computing

The FLOPS of DC workloads is only 0.04 GFLOPS on average (only 0.1% of the peak), which implies that the FLOPS value is far from reflecting DC system efficiency. Furthermore, if the Roofline model is used in terms of FLOPS, the achieved performance is at most 0.1% of the peak FLOPS. So we need a new metric for DC computing systems.

The new metric should consider both integer and floating-point operations. However, floating point and integer instructions are diverse, and they are not all alike. Our fundamental principle is that the new metric should only count the efficient work defined by the source code, and it should correspond with the characteristics of DC workloads. So we do not consider all floating point and integer operations; instead, we choose a representative minimum subset of the operations of DC workloads, and BOPs is calculated by analyzing the source code (architecture independent). We analyze the floating point and integer instruction breakdown of the DCMIX microbenchmarks by inserting analysis code into the source code. We classify all operations into four classes. The first class is array addressing computations (related to data movement operations), such as loading or storing the value of an array element A[i]. The second class is arithmetic and logical operations, such as c = a + b. The third class is comparing operations (related to conditional branches), such as a > b. The fourth class is all other operations. We find that 47% of the total floating point and integer instructions belong to address calculation, 22% to branch calculation, and 30% to arithmetic or logical operations. Since the first three classes reflect the efficient work defined by the source code, we include them in BOPs.

4 BOPs

BOPS (Basic OPerations per Second) is the average number of BOPs (Basic OPerations) completed per second by a specific workload. In this section, we present the definition of BOPs, show how to measure BOPs with or without access to the source code, and then introduce how to use BOPS.

4.1 BOPs Definition

In our definition, BOPs includes the arithmetic, logical, comparing and array addressing parts of integer and floating point computations. The arithmetic/logical, comparing and array addressing operations correspond to calculation, conditional-branch-related, and data-movement-related operations, respectively. The detailed operations of BOPs are shown in Table 2. Each operation in Table 2 is counted as 1 except for N-dimensional array addressing, and all operations are normalized to 64-bit operations. For arithmetic or logical operations, the number of BOPs is the number of corresponding operations. For array addressing operations, take the one-dimensional array as an example: loading the value of A[i] implies the addition of an offset to the base address of A, so the number of BOPs increases by one; the same reasoning applies to multi-dimensional arrays. For comparing operations, we transform them into subtraction operations: we take a > b as an example and transform it to a - b > 0, so the number of BOPs increases by one. From the definition of BOPs, we can see that, in comparison with FLOPs, BOPs concerns not only floating-point operations but also integer operations. On the other hand, like FLOPs, BOPs normalizes all operations to 64-bit operations, and each operation is counted as 1.

Operations Normalized value
Add 1
Subtract 1
Multiply 1
Divide 1
Bitwise operation 1
Logic operation 1
Compare operation 1
One-dimensional array addressing 1
N-dimensional array addressing N
Table 2: Normalization Operations of BOPs.
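
As a worked illustration of these counting rules (our own example, not taken from the original text), consider a single statement:

/* Counting BOPs for one statement, following Table 2:
 *
 *     a = B[i][j] + c;
 *
 * two-dimensional array addressing B[i][j]  -> 2 BOPs
 * the addition (+ c)                        -> 1 BOP
 * total                                     -> 3 BOPs per execution
 */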

The delays of different operations are not considered in the normalization of BOPs, because delays differ widely across micro-architecture platforms. For example, the latency of a division on the Intel Xeon E5645 processor is about 7-12 cycles, while on the Intel Atom D510 processor it can reach 38 cycles [6]. Hence, considering delays in the normalization would make the metric architecture-dependent.

4.2 How to Measure BOPs

BOPs can be measured at either the source code level or the instruction level.

4.2.1 Source-code level measurement

We can calculate BOPs from the source code of a workload. This method needs some manual work (inserting counting code), but it is independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. In the following example, no BOPs are counted for lines 1 and 2, because they are variable declarations. Line 3 consists of a loop command and two integer operations (the comparison j<100 and the increment j++); the corresponding BOPs count is (1+1) * 100 = 200 for the integer operations, while the loop command itself is not counted. Line 5 consists of an array addressing operation and an addition operation, counted as 100 * 1 and 100 * 1, respectively. So the total BOPs of the example program is 200 + 200 = 400.

1 long newClusterSize[100];
2 long j;
3 for (j=0; j<100;j++)
4 {
5  newClusterSize[j]=j+1;
6 }

To measure BOPs at the source code level, we insert counting code. For the BOPs count we turn on the DEBUG flag, and for the performance test we turn it off. In the example code below, cmp_count counts comparing operations, adr_count counts array addressing operations, ari_count counts arithmetic operations, and BOPs is the sum of the three counters.

long newClusterSize[100];
long j;
for (j = 0; j < 100; j++)
{
#ifdef DEBUG
    cmp_count += 1;   /* the comparing operation: j < 100 */
    ari_count += 1;   /* the arithmetic operation: j++ */
#endif
    newClusterSize[j] = j + 1;
#ifdef DEBUG
    adr_count += 1;   /* the array addressing operation: newClusterSize[j] */
    ari_count += 1;   /* the arithmetic operation: j + 1 */
#endif
}

Another consideration is system built-in library functions. For system-level functions such as strcmp(), we implement user-level equivalents manually and count BOPs by inserting counting code into them. For microbenchmark workloads, we can insert the counting code easily by analyzing the source code. For component benchmarks or real applications, we first profile the execution time of the real DC workload and find the top hotspot functions; we then analyze these hotspot functions and insert counting code into them, which lets us count BOPs for the real DC workload (more details are in Section 6).
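
As an illustration of such a user-level replacement (our own sketch, not code from the paper), a counted strcmp() could follow the same DEBUG-guarded pattern:

long cmp_count = 0, adr_count = 0, ari_count = 0;

/* A user-level strcmp() equivalent with BOPs counting hooks. */
int counted_strcmp(const char *s1, const char *s2)
{
    long i = 0;
    while (s1[i] != '\0' && s1[i] == s2[i])
    {
#ifdef DEBUG
        cmp_count += 2;   /* s1[i] != '\0' and s1[i] == s2[i] */
        adr_count += 2;   /* the array addressing of s1[i] and s2[i] */
        ari_count += 1;   /* i++ */
#endif
        i++;
    }
#ifdef DEBUG
    adr_count += 2;       /* s1[i] and s2[i] in the return expression */
    ari_count += 1;       /* the subtraction */
#endif
    return (unsigned char)s1[i] - (unsigned char)s2[i];
}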

4.2.2 Instruction level measurement under X86_64 architecture

Source-code level measurement requires analyzing the source code, which is costly, especially for complex system stacks (e.g., Hadoop system stacks). Instruction level measurement avoids this analysis cost and the need for source code, but it is architecture-dependent. We propose an instruction-level approach that uses hardware performance counters to obtain BOPs. Since different types of processors have different performance counter events, for convenience we introduce an approximate but simple instruction-level measurement method for the X86_64 architecture: we obtain the total number of instructions (ins), branch instructions (branch_ins), load instructions (load_ins) and store instructions (store_ins) from the hardware performance counters, and calculate BOPs according to the following equation.

$BOPs \approx ins - branch\_ins - load\_ins - store\_ins$ (3)

Note that this approximate method counts all floating point and integer instructions on the X86_64 architecture, which does not exactly conform to the BOPs definition. So it is only suitable for BOPS-based optimization, not for performance evaluation across different computer systems.
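
A minimal sketch of collecting the Equation (3) inputs on Linux is shown below (our own illustration, not the authors' tool; retired load/store counts are core-specific raw events, so generic L1D access events stand in for them here):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <initializer_list>

/* Open one hardware counter for the calling thread. */
static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

static uint64_t read_counter(int fd)
{
    uint64_t value = 0;
    read(fd, &value, sizeof(value));
    return value;
}

int main()
{
    int ins = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    int brs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
    int lds = open_counter(PERF_TYPE_HW_CACHE,
                           PERF_COUNT_HW_CACHE_L1D |
                           (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                           (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16));
    int sts = open_counter(PERF_TYPE_HW_CACHE,
                           PERF_COUNT_HW_CACHE_L1D |
                           (PERF_COUNT_HW_CACHE_OP_WRITE << 8) |
                           (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16));

    for (int fd : {ins, brs, lds, sts}) ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload to be measured here ... */
    for (int fd : {ins, brs, lds, sts}) ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t bops = read_counter(ins) - read_counter(brs)
                  - read_counter(lds) - read_counter(sts);   /* Equation (3) */
    printf("approximate BOPs: %llu\n", (unsigned long long)bops);
    return 0;
}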

4.3 How to Measure the System with BOPS

4.3.1 The Peak BOPS of the System

BOPS is the average number of BOPs completed per second by a specific workload. The peak BOPS of a platform can be calculated from its micro-architecture with the following equation.

$BOPS_{peak} = CPU\_number \times Core\_number \times Frequency \times BOPs\_per\_cycle$ (4)

For our Intel Xeon E5645 experimental platform, the CPU number is 1, the core number is 6, the core frequency is 2.4 GHz, and BOPs per cycle is 6 (the E5645 has two 128-bit SSE FPUs and three 128-bit SSE ALUs, and according to the execution port design it can execute three 128-bit operations per cycle). So $BOPS_{peak} = 1 \times 6 \times 2.4G \times 6 = 86.4$ GBOPS.

4.3.2 The BOPS Measuring Tool

We provide a BOPS measurement tool to obtain the BOPS value. At present we choose Sort from DCMIX as the first BOPS measurement tool; to cope with the diversity of DC workloads, we will develop a series of representative workloads as BOPS measurement tools. The scale of the Sort workload is 8E8 records, and its BOPs count is 324E9. Note that the BOPs value changes as the data scale or request number changes.

4.3.3 Measure the System with BOPS

The measurement tool can be used to measure the real performance of a workload on a specific system. Furthermore, the BOPS efficiency can be calculated with the following equation.

$Efficiency_{BOPS} = BOPS_{attained} / BOPS_{peak} \times 100\%$ (5)

For example, Sort has 324E9 BOPs. We run Sort on the Xeon E5645 platform, and the execution time is 11.5 seconds, so the attained performance is 324E9 / 11.5 = 28.2 GBOPS. For the Xeon E5645 platform the peak BOPS is 86.4 GBOPS, so the efficiency is 32%.
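
The peak and efficiency computations (Equations 4 and 5) reduce to a few lines; the sketch below (our own) simply replays the numbers above:

#include <cstdio>

/* Equation (4): peak BOPS from micro-architectural parameters (in GBOPS). */
static double peak_bops(double cpus, double cores, double ghz, double bops_per_cycle)
{
    return cpus * cores * ghz * bops_per_cycle;
}

int main()
{
    double peak = peak_bops(1, 6, 2.4, 6);   /* E5645: 86.4 GBOPS */
    double attained = 324.0 / 11.5;          /* Sort: 324E9 BOPs / 11.5 s = 28.2 GBOPS */
    /* Equation (5): about 32% */
    printf("BOPS efficiency: %.1f%%\n", attained / peak * 100.0);
    return 0;
}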

4.4 Evaluations

4.4.1 Experimental Platforms and workloads

We choose all microbenchmarks of DCMIX as the DC workloads, and the wall clock time as the user-perceived performance metric, which we obtain by collecting the wall clock time of each workload on the specific system. Three systems equipped with three typical Intel processors are chosen as the experimental platforms: Intel Xeon E5310, Intel Xeon E5645 and Intel Atom D510. The two former processors are typical brawny-core processors (OoO execution, four-wide instruction issue), while the Intel Atom D510 is a typical wimpy-core processor (in-order execution, two-wide instruction issue). Each experimental platform consists of one node. The detailed settings are shown in Table 3.

CPU Type | CPU Core | L1 DCache | L1 ICache | L2 Cache | L3 Cache
Intel Xeon E5645 | 6 cores @ 2.4 GHz | 6 × 32 KB | 6 × 32 KB | 6 × 256 KB | 12 MB
Intel Xeon E5310 | 4 cores @ 1.6 GHz | 4 × 32 KB | 4 × 32 KB | 2 × 4 MB | None
Intel Atom D510 | 2 cores @ 1.6 GHz | 2 × 24 KB | 2 × 32 KB | 2 × 512 KB | None
Table 3: Configurations of Hardware Platforms.

4.4.2 Overview

As shown in Fig. 3, we take all microbenchmarks of DCMIX as workloads and show their performance on three different systems under the DC-Roofline model (more details of the DC-Roofline model are in Section 5). From Fig. 3, we can see that all performance metrics are unified to BOPS, including the theoretical peak performance of each system (the peaks of E5645, E5310 and D510) and the workload performance on each specific system (such as Sort on the E5645, E5310 and D510 platforms). So we can not only analyze the performance gaps across different systems, but also analyze the efficiency of a specific system.

Figure 3: Evaluating Three Intel Processors Platforms with BOPS.

4.4.3 The Performance Gaps across Different Experimental Platforms

For the performance gap between E5310 and E5645, the BOPS gap is 2.3X (38.4 GBOPS vs. 86.4 GBOPS), the FLOPS gap is 2.3X (25.6 GFLOPS vs. 57.6 GFLOPS), and the gap of the average user-perceived performance (the wall clock time) is 2.1X. This implies that both FLOPS and BOPS can reflect the user-perceived performance gap here (the bias is only 10%). But for the gap between D510 and E5645, the FLOPS gap is 12X (4.8 GFLOPS vs. 57.6 GFLOPS), the BOPS gap is 6.7X (12.8 GBOPS vs. 86.4 GBOPS), and the average user-perceived performance gap is 7.4X. This implies that FLOPS cannot reflect the user-perceived performance gap (12X vs. 7.4X, a bias of 62%), while BOPS can (6.7X vs. 7.4X, a bias of only 9%). Furthermore, for the gap between D510 and E5310, the FLOPS gap is 5.3X, the BOPS gap is 3X, and the average user-perceived performance gap is 3.4X: again FLOPS fails to reflect the user-perceived gap (5.3X vs. 3.4X, a bias of 56%) while BOPS succeeds (3X vs. 3.4X, a bias of only 11%).

This is because E5645/E5310 and D510 have totally different micro-architectures: the E5645/E5310 are designed for high performance floating point computing (OoO execution, four-wide instruction issue), while the D510 is a low power microprocessor for mobile computing (in-order execution, two-wide instruction issue). So FLOPS cannot reflect the performance gaps of DC workloads across different micro-architecture platforms (Xeon vs. Atom).

4.4.4 The Efficiency of Experimental Platforms

We use the Sort workload as the measurement tool to evaluate the efficiency of the DC systems. The peak BOPS is obtained by Equation 4, the real BOPs values are obtained by source-code level measurement, and the BOPS efficiency is obtained by Equation 5. In our experiments, the BOPS efficiencies of E5645, E5310 and D510 are 32%, 20% and 21%, respectively, while their average FLOPS efficiencies are 0.1%, 0.2%, and 0.1%, respectively. So the FLOPS value is far from the real DC system efficiency, while the BOPS value reflects it much more reasonably. Furthermore, as Fig. 3 shows, we can use BOPS to build the upper bound performance model for DC workloads.

4.5 Summary

As an effective metric for DC, BOPS not only truly reflects the performance gaps across different systems, but also reflects the efficiency of DC systems. Both are foundations of quantitative analysis for DC systems.

5 DC-Roofline Model

In this section, we present the DC-Roofline model, which depicts the upper bound performance for a given DC workload on a specific system. Then we introduce use cases of DC kernel workload optimization based on the DC-Roofline model.

5.1 DC-Roofline Definition

The DC-Roofline model is inspired by the Roofline model [33]. The definitions of DC-Roofline model are described as follows.

Definition 1: The Peak Performance of DC

We choose $BOPS_{peak}$ (obtained from Equation 4) as the peak performance metric of DC in the DC-Roofline model.

Definition 2: Operation Intensity of DC

Operation intensity (OI) in DC-Roofline is the ratio of BOPs to memory traffic. The memory traffic ($Bytes_{mem}$) is the total number of bytes exchanged between the CPU and memory. The operation intensity ($OI_{BOPs}$) is obtained from the following equation.

$OI_{BOPs} = BOPs / Bytes_{mem}$ (6)

where the memory traffic ($Bytes_{mem}$) is calculated by multiplying the total number of memory accesses by 64, and the total number of memory accesses can be obtained from hardware performance counters. Section 4.2 describes how to calculate BOPs.
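
In code, Equation (6) is a one-liner; the sketch below (our own) uses the 64-bytes-per-access factor from the text:

/* Equation (6): operation intensity from measured counts. */
static double operation_intensity(double bops, double mem_accesses)
{
    double bytes_mem = mem_accesses * 64.0;   /* memory traffic in bytes */
    return bops / bytes_mem;                  /* BOPs per byte */
}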

Definition 3: The Upper Bound Performance of DC

The attained performance bound of a given workload is depicted as follows.

$Performance_{attained} = \min(BOPS_{peak},\ Bandwidth_{peak} \times OI_{BOPs})$ (7)

where $Bandwidth_{peak}$ is the peak memory bandwidth of the system.

5.2 Adding Ceilings for DC-Roofline

We add three ceilings (ILP, SIMD, and Prefetching) to specify the performance upper bounds under specific tuning settings. ILP and SIMD reflect computation limitations, while Prefetching reflects memory access limitations. The experiments are performed on the Intel Xeon E5645 platform.

5.2.1 Prefetching Ceiling

We use the Stream benchmark as the measurement tool for the Prefetching ceiling. We improve the memory bandwidth by enabling the prefetching switch in the system BIOS, which increases the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s.

5.2.2 ILP and SIMD Ceilings

We add two calculation ceilings: SIMD and ILP. SIMD, a common method for HPC performance improvement, performs the same operation on multiple data elements simultaneously; modern processors have at least 128-bit wide SIMD instructions (i.e., SSE, AVX, etc.). In the next subsection we show that SIMD also suits DC workloads. ILP efficiency can be described by IPC efficiency (the peak IPC of the E5645 is 4). We add the ILP ceiling at an IPC of 2 (according to our experiments, the IPC of all workloads is no more than 2), and the SIMD ceiling at the SIMD upper bound performance. We use the following equation to estimate the ILP and SIMD ceilings:

$Ceiling_{BOPS} = BOPS_{peak} \times Eff_{IPC} \times Scale_{SIMD}$ (8)

where $Eff_{IPC}$ is the IPC efficiency of the workload (its IPC divided by the peak IPC), and $Scale_{SIMD}$ is the SIMD scale (for the E5645, the value is 1 under SIMD and 0.5 under SISD). Therefore, the ILP (instruction-level parallelism) ceiling is 86.4 × (2/4) × 1 = 43.2 GBOPS when the IPC is 2, and, based on the ILP ceiling, the SIMD ceiling is 86.4 × (2/4) × 0.5 = 21.6 GBOPS when SIMD is not used.

Definition 4: The Upper Bound Performance of DC Under Ceilings

The attained performance bound of a given workload under ceilings is described as follows.

$Performance_{attained} = \min(Ceiling_{BOPS},\ Bandwidth_{peak} \times OI_{BOPs})$ (9)
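
Putting Equations (7) to (9) together, a compact sketch (our own, using the E5645 numbers from this section) is:

#include <algorithm>
#include <cstdio>

/* Equations (7) and (9): the attained performance bound, in GBOPS. */
static double bound(double roof_gbops, double bw_gbps, double oi)
{
    return std::min(roof_gbops, bw_gbps * oi);
}

/* Equation (8): a computation ceiling below the peak. */
static double ceiling(double peak_gbops, double ipc_eff, double simd_scale)
{
    return peak_gbops * ipc_eff * simd_scale;
}

int main()
{
    const double peak = 86.4, bw = 13.2;          /* E5645: GBOPS and GB/s */
    double ilp  = ceiling(peak, 2.0 / 4.0, 1.0);  /* ILP ceiling: 43.2 GBOPS */
    double simd = ceiling(peak, 2.0 / 4.0, 0.5);  /* SIMD ceiling: 21.6 GBOPS */
    printf("bound at OI=2.2: %.1f GBOPS\n", bound(peak, bw, 2.2));
    printf("ILP: %.1f GBOPS, SIMD: %.1f GBOPS\n", ilp, simd);
    return 0;
}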

5.3 Visualized DC-Roofline Model

The visualized DC-Roofline model on the Intel Xeon E5645 platform is shown in Fig. 4. The diagonal line represents the memory bandwidth, and the horizontal roof line represents the peak BOPS; they meet at the ridge point. Similar to the original Roofline model, ceilings, which give the performance upper bounds under specific tuning settings, can be added to the DC-Roofline model; the figure shows three of them (ILP, SIMD, and Prefetching).

Figure 4: The Visualized DC-Roofline Model on the Intel E5645 Platform.

5.4 DC-Roofline Model Usage

Like the Roofline model, the DC-Roofline model is suitable for kernel optimization. We illustrate how to use it to optimize the performance of DC kernel workloads on the Intel Xeon E5645 platform. For this platform, the peak BOPS is 86.4 GBOPS according to Equation 4, and the peak memory bandwidth is 13.2 GB/s as measured by the Stream benchmark [26]. The kernel workloads are the microbenchmarks of DCMIX.

5.4.1 Optimizations under the DC-Roofline Model

We apply four kinds of optimizations.

Memory Bandwidth Optimization: we improve the memory bandwidth by enabling the prefetching switch, which raises the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s on the E5645 platform.

Compiled Optimization: compiler optimization is the basic optimization for calculation; we improve calculation performance by adding the -O3 compiling option (i.e., gcc -O3).

OI Optimization: for the same workload, a higher OI means better locality; we modify the workload's program to reduce data movement and increase OI.

SIMD Optimization: we apply the SIMD technique to DC workloads, changing the workload's program from SISD to SIMD by rewriting with SSE.

5.4.2 Optimizations for the Sort Workload

The Sort workload sorts an integer array of a specific scale using merge sort, and is implemented in C++. Fig. 5 shows the optimization trajectories. In the first step, we perform Memory Bandwidth Optimization, and BOPS increases from 6.4 GBOPS to 6.5 GBOPS. In the second step, we perform Compiled Optimization, improving the performance to 6.8 GBOPS. In the original source code, data are loaded and processed on disk, and the OI of Sort is 1.4; in the third step, we perform OI Optimization, revising the source code to load and process all data in memory, which increases the OI of Sort to 2.2 and its performance to 9.5 GBOPS. In the fourth step, we apply the SIMD technique, changing Sort from SISD to SIMD by rewriting with SSE (see the sketch after Fig. 5). With the SSE Sort, the attained performance is 28.2 GBOPS, which is 32% of the peak BOPS. Under the guidance of the DC-Roofline model, a 4.4X improvement is achieved.

Figure 5: Optimization Trajectories of the Sort Workload on the Intel E5645 Platform.
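
To illustrate the kind of SISD-to-SIMD rewrite involved (our own sketch, not the authors' Sort code), a 128-bit SSE register processes four 32-bit integers at once; the compare-and-exchange below is the building block of SIMD sorting networks, and the E5645 supports the required SSE4.1 instructions:

#include <emmintrin.h>   /* SSE2 */
#include <smmintrin.h>   /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* Element-wise compare-and-exchange of two blocks of four ints:
   one SIMD instruction replaces four scalar comparisons. */
static void minmax4(int *a, int *b)
{
    __m128i va = _mm_loadu_si128((__m128i *)a);
    __m128i vb = _mm_loadu_si128((__m128i *)b);
    _mm_storeu_si128((__m128i *)a, _mm_min_epi32(va, vb));
    _mm_storeu_si128((__m128i *)b, _mm_max_epi32(va, vb));
}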

5.4.3 The Optimization Summary

Finally, we consider all workloads together to show their performance improvements. For the Sort workload, we apply all four optimizations (Memory Bandwidth, Compiled, OI and SIMD Optimizations); for the other workloads, we apply the two basic optimizations (Memory Bandwidth and Compiled Optimizations). As shown in Fig. 6, all workloads achieve performance improvements, ranging from 1.1X to 4.4X.

Figure 6: DC Workloads’ Optimizations on the Intel E5645 Platform.

Moreover, we can read workload efficiency off the DC-Roofline model. As shown in Fig. 6, a workload that is closer to its ceiling has higher efficiency: for example, the efficiency of Sort is 65% under the ILP ceiling, and that of MD5 is 66% under the SIMD ceiling. The efficiency is calculated as follows.

$Efficiency_{ceiling} = BOPS_{attained} / Ceiling_{BOPS} \times 100\%$ (10)

5.5 Roofline Model vs. DC-Roofline Model

Fig. 7 shows the final results using the Roofline model (the left Y axis) and the DC-Roofline model (the right Y axis); note that the Y axis is in logarithmic coordinates. From the figure, we can see that under the FLOPS-based Roofline model, the achieved performance is at most 0.1% of the peak FLOPS, whereas under the DC-Roofline model it reaches up to 32% of the peak BOPS. So the DC-Roofline model is better suited as the upper bound performance model for DC.

Figure 7: The Roofline Model and the DC-Roofline Model on the Intel E5645 Platform.

5.6 Summary

As the upper bound performance model for DC, DC-Roofline not only truly reflects performance ceilings of the target DC system, but also helps to guide the optimization of DC workloads.

6 Optimizing the Real DC Workload Under the DC-Roofline Model

As a real DC workload often has millions of lines of code and tens of thousands of functions, it is not easy to use the DC-Roofline model directly (the Roofline and DC-Roofline models are designed for kernel program optimization). In this section, we propose an optimization methodology for real DC workloads based on the DC-Roofline model. We take the Redis workload as an example to illustrate the methodology; the experimental results show that the performance of Redis improves by 1.2X under its guidance.

6.1 The Optimization Methodology for the Real DC Workloads

Fig. 8 demonstrates the optimization methodology for real DC workloads. First, we profile the execution time of the real DC workload and find the top N hotspot functions. Second, we analyze these hotspot functions (merging functions with the same properties) and build M Kernels (M is less than or equal to N). Each Kernel is an independent workload whose code is based on the source code of the real workload and implements a part of its functionality (specific hotspot functions). Third, we optimize these Kernels under the DC-Roofline model. Fourth, we merge the optimization methods of the Kernels and apply them to the real DC workload.

Figure 8: The Optimization Methodology for the Real DC workload

6.2 The Optimization for the Redis Workload

Redis is a distributed, in-memory key-value database with durability. It supports various abstract data structures and is widely used in modern Internet services. Redis V4.0.2 has about 200,000 lines of code and thousands of functions.

6.2.1 The Experimental Methodology

Redis, version 4.0.2, is deployed in stand-alone mode. We choose Redis-Benchmark as the workload generator, configured with 10 million total requests and 1000 parallel clients to simulate a concurrent environment; each client issues SET operations. We choose queries per second (QPS) as the user-perceived performance metric. The platform is the Intel Xeon E5645, the same as in Section 5.

6.2.2 The hotspot Functions of Redis

Twenty functions occupy 69% of the execution time. These functions fall into three categories. The first is dictionary table management, such as dictFind(), dictSdsKeyCompare(), lookupKey(), and siphash(). The second is memory management, such as malloc_usable_size(), malloc(), free(), zmalloc(), and zfree(). The last is the encapsulation of system functions. The first two categories take 55% of the total execution time.

6.2.3 The Kernels of Redis

Based on the hotspot function analysis, we build two Kernels: one for memory management, called MMK, and the other for dictionary table management, called DTM. Each Kernel is constructed from the corresponding hotspot functions and can run as an independent workload. Note that the Kernels and the Redis workload share the same client queries.

6.2.4 The Optimizations of DTM

Following the optimization methods of the DC-Roofline model (proposed in Section 5), we perform the related optimizations. As the prefetching switch was already enabled in Section 5, we perform the remaining three: Compiled Optimization, OI Optimization and SIMD Optimization. The results are summarized in Table 4. We perform the Compiled Optimization by adding the -O3 compiling option (gcc -O3), which improves the OI of DTM from 1.5 to 3.5 and its BOPS from 0.4 G to 3 G. For the OI Optimization: in DTM, the rapid growth of the number of key-value pairs triggers reallocation of the dictionary table space, which brings a lot of data movement cost; we avoid this operation by pre-allocating a large table space, an optimization we call NO_REHASH (see the sketch after Table 4). Using NO_REHASH, the OI of DTM improves from 3.5 to 4 and its BOPS from 3 G to 3.2 G. For the SIMD Optimization: hash operations are the main operations in DTM, so we replace the default SipHash algorithm with the HighwayHash algorithm, which is implemented with SIMD instructions; we call this optimization SIMD_HASH. Using SIMD_HASH, the OI of DTM improves from 4 to 4.7 and its BOPS from 3.2 G to 3.7 G.

Type OI GBOPS
Original Version 1.5 0.4
Compiled Optimization 3.5 3
OI Optimization 4 3.2
SIMD Optimization 4.7 3.7
Table 4: Optimizations of DTM
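
The NO_REHASH idea has a direct analogue in standard C++ (our own illustration, not the Redis code): pre-sizing the table so that inserts never trigger rehashing and its associated data movement.

#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, std::string> dict;
    dict.reserve(10000000);   /* pre-allocate buckets for the expected keys */
    /* ... SET-like inserts now proceed without rehashing ... */
    return 0;
}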

6.2.5 The Optimizations of MMK

Following the optimization methods of the DC-Roofline model, we perform two optimizations: Compiled Optimization and OI Optimization. The results are summarized in Table 5. We perform the Compiled Optimization by adding the -O3 compiling option (gcc -O3), which improves the OI of MMK from 3.1 to 3.2 and its BOPS from 2.2 G to 2.4 G. To reduce data movement cost, we replace the default malloc with Jemalloc in MMK; Jemalloc is a general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. We call this optimization JE_MALLOC. Using JE_MALLOC, the OI of MMK improves from 3.2 to 90 and its BOPS from 2.4 G to 2.7 G.

Type OI GBOPS
Original Version 3.1 2.2
Compiled Optimization 3.2 2.4
OI Optimization 90 2.7
Table 5: Optimizations of MMK

6.2.6 The Optimizations of Redis

Merging the above optimizations of DTM and MMK, we optimize the Redis workload. Fig. 9 shows the optimization trajectories: we apply the Compiled Optimization (the -O3 compiling option), the OI Optimization (NO_REHASH and JE_MALLOC), and the SIMD Optimization (SIMD_HASH) one by one. Note that the peak performance of the system in Fig. 9 is 14.4 GBOPS, because Redis is a single-threaded server and we deploy it on a single CPU core (86.4 GBOPS / 6 cores = 14.4 GBOPS). As shown in Fig. 9, the OI of the Redis workload improves from 2.9 to 3.8 and its BOPS from 2.8 G to 3.4 G. Accordingly, the QPS of Redis improves from 122,000 requests/s to 146,000 requests/s.

Figure 9: Optimization Trajectories of the Redis Workload on the Intel E5645 Platform.

7 Conclusion

This paper proposes a new computation-centric metric, BOPS, that measures DC computing system efficiency. The metric is independent of the underlying system and hardware implementation, and can be calculated by analyzing the source code.

Using several typical micro benchmarks, we attain the upper bound performance, and then propose a BOPS-based Roofline model, named DC-Roofline, as a ceiling performance model to guide DC computing system design and optimization.

As a real-world DC workload often has millions of lines of code and tens of thousands of functions, it is not easy to use the DC-Roofline model directly. We propose a new optimization methodology: we profile the hotspot functions of the real-world workload and extract the corresponding kernel workloads; the real-world application then gains performance benefits by merging the optimization methods of the kernel workloads. Through experiments, we demonstrate that Redis, a typical real-world workload, gains a 1.2X performance improvement.

References

  • [1] Data center growth. https://www.enterprisetech.com.
  • [2] http://www.tpc.org/tpcc.
  • [3] http://www.spec.org/cpu.
  • [4] http://top500.org/.
  • [5] “Sort benchmark home page,” http://sortbenchmark.org/.
  • [6] “Technical report,” http://www.agner.org/optimize/microarchitecture.pdf.
  • [7] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” pp. 483–485, 1967.
  • [8] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: characterization and architectural implications,” pp. 72–81, 2008.
  • [9] E. L. Boyd, W. Azeem, H. S. Lee, T. Shih, S. Hung, and E. S. Davidson, “A hierarchical approach to modeling and improving the performance of scientific applications on the ksr1,” vol. 3, pp. 188–192, 1994.
  • [10] S. P. E. Corporation, “Specweb2005: Spec benchmark for evaluating the performance of world wide web servers,” http://www.spec.org/web2005/, 2005.
  • [11] J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
  • [12] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, “A dwarf-based scalable big data benchmarking methodology,” arXiv preprint arXiv:1711.03229, 2017.
  • [13] Neal Cardwell, Stefan Savage, and Thomas E Anderson. Modeling tcp latency. 3:1742–1751, 2000.
  • [14] William Josephson, Lars Ailo Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. ACM Transactions on Storage, 6(3):14, 2010.
  • [15] Samuel Williams, Dhiraj D Kalamkar, Amik Singh, Anand M Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann S Almgren, Pradeep Dubey, John Shalf, and Leonid Oliker. Optimization of geometric multigrid for emerging multi- and manycore processors. page 96, 2012.
  • [16] Daniel Richins, Tahrina Ahmed, Russell Clapp, and Vijay Janapa Reddi. Amdahl’s law in big data analytics: Alive and kicking in tpcx-bb (bigbench). In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 630–642. IEEE, 2018.
  • [17] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In International Symposium on Microarchitecture, pages 609–622, 2014.
  • [18] Ethan R Mollick. Establishing moore’s law. IEEE Annals of the History of Computing, 28(3):62–75, 2006.
  • [19] Shoaib Kamil, Cy P Chan, Leonid Oliker, John Shalf, and Samuel Williams. An auto-tuning framework for parallel multicore stencil computations. pages 1–12, 2010.
  • [20] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, R. Ren, C. Zheng, G. Lu, J. Li, Z. Cao, Z. Shujie, and H. Tang, “Bigdatabench: A dwarf-based big data and artificial intelligence benchmark suite,” Technical Report, Institute of Computing Technology, Chinese Academy of Sciences, 2017.
  • [21] S. Harbaugh and J. A. Forakis, “Timing studies using a synthetic whetstone benchmark,” ACM Sigada Ada Letters, no. 2, pp. 23–34, 1984.
  • [22] R. Jain, The art of computer systems performance analysis.   John Wiley & Sons Chichester, 1991, vol. 182.
  • [23] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, and A. Borchers, “In-datacenter performance analysis of a tensor processing unit,” pp. 1–12, 2017.
  • [24] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in International Symposium on Computer Architecture, 2016, pp. 393–405.
  • [25] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, C. Xu, and N. Sun, “Cloudrank-d: benchmarking and ranking cloud computing systems for data processing applications,” Frontiers of Computer Science, vol. 6, no. 4, pp. 347–362, 2012.
  • [26] J. D. Mccalpin, “Stream: Sustainable memory bandwidth in high performance computers,” 1995.
  • [27] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto et al., “A 40gops 250mw massively parallel processor based on matrix architecture,” in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International.   IEEE, 2006, pp. 1616–1625.
  • [28] U. Pesovic, Z. Jovanovic, S. Randjic, and D. Markovic, “Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications,” in MIPRO, 2012 Proceedings of the 35th International Convention.   IEEE, 2012, pp. 743–747.
  • [29] M. M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, “A genetic algorithms approach to modeling the performance of memory-bound computations,” pp. 1–12, 2007.
  • [30] K. Ueno and T. Suzumura, “Highly scalable graph search for the graph500 benchmark,” in International Symposium on High-Performance Parallel and Distributed Computing, 2012, pp. 149–160.
  • [31] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “BigDataBench: a big data benchmark suite from internet services,” in HPCA 2014.   IEEE, 2014.
  • [32] R. Weicker, “Dhrystone: a synthetic systems programming benchmark,” Communications of The ACM, vol. 27, no. 10, pp. 1013–1030, 1984.
  • [33] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
  • [34] Thomasian, “Analytic Queueing Network Models for Parallel Processing of Task Systems,” IEEE Transactions on Computers, vol. 35, no. 12, pp. 1045–1054, 1986.
  • [35] Nicholas P Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua B Fryman, Ivan Ganev, Roger A Golliver, Rob C Knauerhase, et al. Runnemede: An architecture for ubiquitous high-performance computing. pages 198–209, 2013.
  • [36] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. operating systems design and implementation, pages 265–283, 2016.
  • [37] Jaewon Lee, Changkyu Kim, Kun Lin, Liqun Cheng, Rama Govindaraju, and Jangwoo Kim. Wsmeter: A performance evaluation methodology for google’s production warehouse-scale computers. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 549–563. ACM, 2018.
  • [38] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Rui Ren, Chen Zheng, Gang Lu, Jingwei Li, Zheng Cao, et al. Bigdatabench: A dwarf-based big data and ai benchmark suite. arXiv preprint arXiv:1802.08254, 2018.
  • [39] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.