BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing

01/28/2018
by Lei Wang, et al.

The past decades witness that FLOPS (Floating-point Operations per Second), as an important computation-centric performance metric, guides computer architecture evolution, bridges hardware and software co-design, and provides quantitative performance numbers for system optimization. However, for emerging datacenter computing (in short, DC) workloads, such as internet services or big data analytics, previous work reports that on modern CPU architectures the average proportion of floating-point instructions is only 1% and the average FLOPS efficiency is only 0.1%, while the average CPU utilization is as high as 63%. These contradictory performance numbers imply that FLOPS is inappropriate for evaluating DC computer systems. To address this issue, we propose a new computation-centric metric, BOPS (Basic OPerations per Second). In our definition, Basic Operations include all of the arithmetic, logical, comparing, and array addressing operations for integer and floating point. BOPS is the average number of BOPs (Basic OPerations) completed each second. On this basis, we present a dwarf-based measuring tool to evaluate DC computer systems in terms of our new metric. We also propose a new BOPS-based roofline performance model for DC computing. Through experiments, we demonstrate that our new metric BOPS, measuring tool, and new performance model indeed facilitate DC computer system design and optimization.


1 Introduction

In the past decades, the FLOPS (FLoating-point Operations Per Second) [11] metric has been used to evaluate the performance of modern computer systems. FLOPS has driven the progress of computing technology, not limited to high performance computing (HPC), for many years [11], and history witnesses that FLOPS defines the concrete R&D objectives and road-maps for HPC (Gflops in the 1990s, Tflops in the 2000s, Pflops in the 2010s, and Eflops in the 2020s).

To date, to perform big data analytics or provide Internet services, more and more organizations in the world build internal datacenters or rent hosted datacenters. As a result, DC computing has become a new paradigm of computing, and the fraction of DC appears to have outweighed HPC in terms of market share (HPC takes only 20% of the total) [1]. So a natural question arises: what is the metric for DC? Is it still FLOPS?

Different from HPC, DC has unique features. For example, it is reported that DC workloads have a very low floating point operation intensity [31], which is defined as the total floating point operations divided by the total memory access bytes. Gao et al. [38, 31] have performed a comprehensive and hierarchical Top-Down analysis of big data analytics, AI, and internet service workloads from an architectural perspective. It shows that for typical DC workloads, the average floating point instruction ratio is only 1% and the average FLOPS efficiency is only 0.1%, while the average IPC (Instructions Per Cycle) is 1.3 (the theoretical IPC is 4 for the experimental platform). In Section 3.2, we also reveal that the traditional FLOPS-based Roofline performance model gives misleading guidance for DC workloads' performance optimization, namely that the bottleneck of every workload is memory access. These observations imply that FLOPS is inappropriate for measuring DC computer systems.

For measuring performance, the wall clock time is the ground truth. In practice, several user-perceived metrics, which are derivatives of the wall clock time, are used to measure application-specific systems, e.g., transactions per minute for online transaction systems [2] and input data processed per second for big data analysis systems [25]. However, user-perceived metrics have two limitations. First, different user-perceived metrics cannot be compared side by side; for example, transactions per minute (TPM) and data processed per second (GB/s) cannot be used for an apples-to-apples comparison. Second, user-perceived metrics can hardly measure the ceiling performance of computer systems.

In this paper, inspired by FLOPS [11], we define Basic OPerations per Second (in short, BOPS) to evaluate DC computing systems. The contributions are as follows:

First, we propose a computation-centric metric, BOPs, that measures the efficient work defined by the source code, including floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. BOPs are independent of the underlying system and hardware implementation, and can be counted through analyzing the source code. We also use hardware performance counters to approximate BOPs when the source code is not available. We define BOPS as the average number of BOPs per second, and propose replacing FLOPS with BOPS to measure DC computer systems. BOPS can quantitatively measure both the theoretical peak performance of a DC computer system, through analyzing the micro-architecture of the system, and the real performance, through running workloads on the system.

Second, with several typical micro benchmarks, we attain the upper bound performance through different optimization approaches, and then we propose a BOPS-based Roofline model, which we call DC-Roofline, as a quantitative performance model for guiding DC computer system design and optimization. DC-Roofline can depict the performance ceilings of DC workloads under specific tuning settings on the target system. Through experiments, we demonstrate that the DC-Roofline model indeed helps optimize the performance of typical DC workloads by 119% to 325%.

For the DC computer system and architecture community, a single computation-centric metric like BOPS is simple but powerful for exploring innovative systems and architectures. First, BOPs can be calculated at the application's source code level, independently of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. Second, it can be calculated at different levels independently: at the source code level, at the software's binary code level, and at the hardware's instruction level, which facilitates the co-design of systems and architecture. Last but not least, it helps people understand the performance ceiling of a computer system, and hence guides system optimization.

The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states background and motivations. Section 4 defines BOPs, reports how to calculate it, and presents BOPS usages. Section 5 introduces BOPS-based DC-Roofline model. Section 6 presents how to optimize the performance with the DC-Roofline model. Section 7 draws a conclusion.

2 Related Work

Related work is summarized from two perspectives: metrics, and performance models.

2.1 Metrics

The wall clock time is the most basic performance metric for a computer system [39], and almost all other performance metrics are derived from it. Based on the wall clock time, performance metrics can be classified into two categories. One is the user-perceived metric, which can be intuitively perceived by the user, such as the TPM (transactions per minute) metric. The other is the computation-centric metric, which is related to a specific computation operation, e.g., FLOPS (FLoating-point Operations Per Second).

User-perceived metrics can be further classified into two categories: metrics for the whole system and metrics for components of the system. Examples of the former include data sorted in one minute (MinuteSort), which measures the sorting capability of a system [5], and transactions per minute (TPM) for online transaction systems [2]. Examples of the latter include the SPECspeed/SPECrate metrics for the CPU component [3], the input/output operations per second (IOPS) metric for the storage component [14], and the data transfer latency metric for the network component [13].

There are many computation-centric metrics. FLOPS (FLoating-point Operations Per Second) is a computation-centric metric for measuring computer systems, especially in fields of scientific computing that make heavy use of floating-point calculations [11]. The wide recognition of FLOPS indicates the maturation of high performance computing. MIPS (Million Instructions Per Second) [22] is another famous computation-centric metric, defined as the number of million instructions the processor can execute per second. The main defect of MIPS is that it is architecture-dependent. There are many derivatives of MIPS, including MWIPS and DMIPS [28], which use synthetic workloads to evaluate floating point operations and integer operations, respectively. The metric of WSMeter [37], defined as the quota-weighted sum of per-job MIPS (IPC), is also a derivative of MIPS, and hence architecture-dependent. Unfortunately, modern datacenters are heterogeneous, consisting of different generations of hardware.

OPS (operations per second) is a computation-centric metric too. OPS [27] was initially proposed for digital processing systems, defined as the number of 16-bit addition operations per second. The definition of OPS was later extended for Intel Ubiquitous High-Performance Computing [35] and for recent artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of them are defined in terms of a specific operation: OPS counts 8-bit matrix multiplication operations in the TPU and 16-bit integer operations in the Cambricon processor, respectively. However, modern datacenter workloads are comprehensive and complex, and the bias toward a specific operation will not justify the fairness of evaluation.

For each metric, corresponding tools or benchmarks [39] are proposed to report the numbers. For the SPECspeed/SPECrate metrics, SPECCPU [3] is the benchmark suite measuring the CPU component. For the TPM metric, TPC-C [2] is the benchmark suite measuring online transaction systems. Another example is the Sort benchmark [5] for MinuteSort. For computation-centric metrics, Whetstone [21] and Dhrystone [32] are the measurement tools for the MWIPS and DMIPS metrics, respectively, and HPL [11] is the widely-used measurement tool for FLOPS. As a micro-benchmark, HPL demonstrates a sophisticated design: the proportion of floating-point addition and multiplication operations in HPL is 1:1, so as to fully utilize the FPU of a modern processor.

2.2 Performance model

A performance model can depict and predict the performance of a specific system. There are two categories of performance models: one is the analytical model, which uses a stochastic/statistical analytical method to depict and predict system performance; the other is the bound model, which is relatively simpler and only depicts the performance bound or bottleneck of the system. Previous work [34, 29, 9, 37] uses stochastic/statistical analytical models to predict system performance. However, distributed and parallel systems always have many uncertain behaviors, which makes it hard to build an accurate prediction model; instead, bound and bottleneck analysis is more suitable. Amdahl's Law [7] is one of the famous performance bound models for parallel processing computer systems, and it is also used for big data systems [16]. The Roofline model [33] is another famous performance bound model. The original Roofline model [33] adopts FLOPS as the metric. On the basis of the definition of operational intensity (OI), the total number of floating point operations divided by the total number of bytes of memory access, the Roofline model can depict a given workload's upper bound performance when different optimization strategies are adopted on the target system.

3 Background and Motivations

3.1 Background

Moore's Law reveals that the number of transistors per chip doubles approximately every two years [18]. However, the diversity and complexity of modern DC workloads raise great challenges in depicting the ceiling performance of computer systems, spanning multiple domains: algorithm, programming, compiling, system development, and architecture design. The quantitative performance analysis of a computer system relies on a performance metric and a performance model [39] [33].

3.1.1 The Computation-centric Metric

For the system and architecture community, a computation-centric metric such as FLOPS is a fundamental yardstick that reflects the running performance and the gaps across different systems or architectures. Generally, a computation-centric metric has a performance upper bound on a specific architecture, determined by the micro-architecture design. For example, the peak FLOPS is computed by:

$$Peak\ FLOPS = Frequency \times Core\_number \times FLOPs\_per\_cycle \qquad (1)$$

On the Intel Xeon E5645 (6 cores at 2.40 GHz, with 4 FLOPs per cycle per core), this yields 2.4G × 6 × 4 = 57.6 GFLOPS.

For each metric, a measuring tool is used to measure the performance of systems and architectures in terms of the metric, and it reports the gap between the actual number and the theoretical peak. For example, HPL [11] is a widely-used measuring tool in terms of FLOPS: the FLOPS efficiency of a specific system is the ratio of HPL's achieved FLOPS to the peak FLOPS. On the Intel Xeon E5645, for instance, HPL achieves 38.9 GFLOPS against the 57.6 GFLOPS peak, a 68% FLOPS efficiency (see Table 10).

3.1.2 The Upper Bound Performance Model

The computation-centric metric is the foundation of the system performance model, and the performance model of a computer system can depict and predict a workload's performance on the specific system. For example, the Roofline model [33] is a famous upper bound model based on FLOPS, and much system optimization work [19, 15] has been performed on the basis of the Roofline model in the HPC domain.

$$Attainable\ FLOPS = \min(Peak\ FLOPS,\ Peak\ MemBand \times OI) \qquad (2)$$

The above formula indicates that the attainable performance of a workload on a specific platform is limited by the processor's computing capacity and the memory bandwidth. Peak FLOPS and Peak MemBand are the peak computation performance and the peak memory bandwidth of the platform, and the operational intensity (in short, OI) is the total number of floating point operations divided by the total number of bytes of memory access. The Roofline model can be visualized as shown in Fig. 1. The x-axis is the operational intensity, and the y-axis is the floating-point performance. The horizontal line shows the peak floating-point performance of the platform, and the diagonal line gives the peak memory bandwidth. To identify the bottleneck and guide the optimization, ceilings (for example, the ILP and SIMD optimizations in the figure) can be added to provide performance tuning guidance.

Figure 1: FLOPS Based Roofline Model. [33]
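To make the bound concrete, the following minimal C sketch (ours, not from the original paper; the constants are the E5645 figures reported in Section 3.2.2) evaluates Formula 2 for a workload:

#include <stdio.h>

/* Roofline bound (Formula 2): attainable FLOPS is capped either by the
   processor's peak FLOPS or by peak memory bandwidth times the
   workload's operational intensity (OI, FLOPs per byte). */
double roofline_bound(double peak_flops, double peak_memband, double oi) {
    double mem_bound = peak_memband * oi;
    return mem_bound < peak_flops ? mem_bound : peak_flops;
}

int main(void) {
    double peak_flops   = 57.6e9;  /* Xeon E5645 peak FLOPS              */
    double peak_memband = 13.2e9;  /* peak memory bandwidth, bytes/s     */
    double oi           = 0.05;    /* average OI of the six DC workloads */
    /* 13.2e9 * 0.05 = 0.66 GFLOPS << 57.6 GFLOPS: the model places the
       DC workloads on the memory-bound side of the roofline. */
    printf("bound = %.2f GFLOPS\n",
           roofline_bound(peak_flops, peak_memband, oi) / 1e9);
    return 0;
}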

3.2 Motivation

Based on the workload characterization of modern DC workloads, we validate the effectiveness of FLOPS and the corresponding Roofline model, and explain why we propose a new metric, BOPs, for DC computing.

3.2.1 Experimental workloads and platforms

Previous work has performed comprehensive characterization of modern DC workloads [38, 31]. Hereby, we choose six workloads from BigDataBench to summarize their workload characteristics. BigDataBench [38] is an open-source big data and AI benchmark suite based on frequently-appearing units of computation, which consists of a suite of micro and component benchmarks. The details of the chosen workloads are shown in Table 1. Among these six workloads, Sort, Grep, and WordCount are three famous micro-benchmarks, while Bayes, Kmeans, and RankServ are three typical component benchmarks. RankServ implements a simple search engine service, Bayes classifies a specified input text according to a learned model, and Kmeans is a popular clustering algorithm in machine learning.

ID Workload Software Stack Type
1 Sort Hadoop,MPI MicroBenchmark
2 Grep Hadoop,MPI MicroBenchmark
3 WordCount Hadoop,MPI MicroBenchmark
4 RankServ C++ Component Benchmark
5 Kmeans Hadoop,MPI Component Benchmark
6 Bayes Hadoop,MPI Component Benchmark
Table 1: DC Workloads

The experimental platform is the same as that in Section 4, and we choose the Intel Xeon E5645, which is a typical brawny-core processor (OoO execution, four-wide instruction issue).

3.2.2 The Limitation of FLOPS for DC

One of the main goals of a computation-centric metric is to quantitatively measure system performance. We use the six DC workloads to reveal the limitations of FLOPS for DC. The peak value is calculated by Formula 1, and we also report the BOPS numbers for comparison. The FLOPS of the DC workloads is only 0.16 GFLOPS on average (only 0.1% of the peak), while the average IPC (Instructions Per Cycle) is 1.3 (32% of the peak). These performance data imply that the FLOPS number is far from reflecting the DC system efficiency. In contrast, the average BOPS number of the DC workloads is 9.5 GBOPS (10% of the peak).

We use the FLOPS-based Roofline model to measure the ceiling performance and locate the bottlenecks of the DC workloads. The details are shown in Table 2.

Workload FLOPS OI Bottleneck
Sort 0.01G 0.01 Memory Access
Grep 0.01G 0.002 Memory Access
WordCount 0.01G 0.01 Memory Access
Bayes 0.02G 0.2 Memory Access
Kmeans 0.1G 0.5 Memory Access
RankServ 0.5G 0.6 Memory Access
Table 2: DC Workloads under The Roofline Model.

In our experiments, the peak FLOPS of the E5645 is 57.6 GFLOPS, and the peak memory bandwidth is 13.2 GB/s. From Table 2, we observe: first, the operational intensity (OI) of the six workloads is very low, with an average of only 0.05; second, the model indicates that the bottleneck of all six workloads is memory access. Furthermore, following the hint of the Roofline model, we increase the memory bandwidth through hardware pre-fetching, raising the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s. However, only Grep and WordCount gain obvious performance improvements, of 16% and 10%, respectively. For the other four workloads, the average performance improvement is no more than 3%, which indicates that memory access is not their real bottleneck. In Section 6, using our new BOPS-based Roofline model, we confirm that the bottleneck of the latter four workloads is in fact calculation. Table 3 reports the performance improvement through ILP and hardware pre-fetching in terms of FLOPS and BOPS; in Table 3, Sort denotes the original performance, while Sort″ denotes the performance after optimization.

Workload BOPS OI(BOPS) FLOPS OI(FLOPS)
Sort 8.8G 11.7 0.01G 0.01
Sort” 13.2G 14.6 0.01G 0.01
Grep 4.6G 1.2 0.01G 0.002
Grep” 5.5G 1 0.016G 0.003
WordCount 3.8G 3.2 0.01G 0.008
WordCount” 5G 4.2 0.012G 0.007
Kmeans 5G 25 0.1G 0.5
Kmeans” 11.3G 56 0.2G 0.5
Bayes 5.1G 51 0.02G 0.2
Bayes” 9.3G 47 0.02G 0.1
RankServ 8.2G 9.5 0.5G 0.6
RankServ” 13G 9.6 1G 0.9
Table 3: DC-Roofline Model Vs. Roofline Model.

Fig. 2 shows the final results using the Roofline and DC-Roofline models, respectively. From the figure, we can see that when using the Roofline model in terms of FLOPS, the achieved performance is at most 0.1% of the peak FLOPS, while using the DC-Roofline model it reaches up to 10% of the peak BOPS (these results are consistent with Section 3.2.2).

Figure 2: FLOPS Based Roofline Model for DC Workloads.

3.2.3 What we should include in our new metric for DC computing

Previous workload characterization work [38, 31] reveals that DC workloads have a high ratio of integer to floating point operations, so we do not intend to repeat that research here. For the purpose of explaining why FLOPS does not work for DC, we run the six DC workloads (detailed in Table 1) and compare them with HPL, SPECFP, and PARSEC.

Figure 3: Instruction Mix of DC Workloads.

For the new metric, our fundamental principle is that we only consider the efficient work defined by the source code. Fig. 3 shows the instruction mix of the DC workloads. As our concern is a computation-centric metric, the new metric considers integer and FP operations, as they execute on the ALU or FPU of the processor; we do not consider branch and load/store operations. For DC workloads, the ratio of integer to floating point operations is 38, while the number is 0.3, 0.4, and 0.02 for HPL, PARSEC, and SPECFP, respectively. That is the main reason why FLOPS does not work for DC, and it implies that our new metric should consider both integer and floating-point operations.

Unlike FP operations, integer operations are more diverse (such as Add, Shuffle, and Move operations). We analyze the integer instruction breakdown of the DC workloads by inserting analysis code into the source code, and classify all operations into four classes: the first class is array addressing operations (related to data movement); the second class is integer arithmetic or logical operations; the third class is comparing operations (related to branches); the fourth class is all other operations. The top three classes can be counted at the source code level, so we include them in BOPs. We find that 47% of integer instructions belong to address calculation, 22% to branch calculation, and 30% to integer arithmetic or logical operations. A short example of this classification is given below.
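As an illustration (our example, not from the original paper), the following C fragment is annotated with these four classes:

/* Labeling the operations of a simple loop with the four
   integer-operation classes defined above. */
long classify_example(const long *a, long n, long threshold) {
    long sum = 0;                  /* variable assignment: class 4, not counted */
    for (long i = 0; i < n; i++) { /* i < n: comparing (class 3); i++: arithmetic (class 2) */
        if (a[i] > threshold)      /* a[i]: array addressing (class 1); >: comparing (class 3) */
            sum += a[i];           /* a[i]: array addressing (class 1); +: arithmetic (class 2) */
    }
    return sum;                    /* return command: class 4, not counted */
}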

4 BOPs

BOPS (Basic OPerations per Second) is the average number of BOPs (Basic OPerations) of a specific workload completed per second. In this section, we present the definition of BOPs, show how to measure BOPs with or without the source code available, and then introduce how to use BOPS.

4.1 BOPs Definition

In our definition, BOPs include floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. For integer operations, the arithmetic/logical part corresponds to calculation, comparing operations correspond to branch-related operations, and array addressing operations correspond to data movement-related operations. The detailed comparison between the FLOPS and BOPS definitions is shown in Table 4.

Category BOPS FLOPS
Calculation Integer & FP FP
Data movement Addressing operations None
Branch Integer & FP comparing FP comparing
Table 4: Operations Included in BOPS vs. FLOPS.

The detailed definitions of BOPs are shown in Table 5. Each operation in Table 5 is counted as 1, except for N-dimensional array addressing, and all operations are normalized to 64-bit operations. For arithmetic or logical operations, the number of BOPs is the number of the corresponding arithmetic or logical operations. For array addressing operations, take the one-dimensional array P[i] as an example: loading the value of P[i] implies the addition of offset i to the address of P, so the number of BOPs increments by one. For comparing operations, we transform them into subtraction operations: taking X < Y as an example, we transform it into X - Y < 0, so the number of BOPs increments by one. Each operation is counted as 1 regardless of the different delays of different operations on a real system.

Operations Normalized value
Add 1
Subtract 1
Multiply 1
Divide 1
Bitwise operation 1
Logic operation 1
Compare operation 1
One-dimensional array addressing 1
N-dimensional array addressing N
Table 5: Normalization Operations of BOPs.

Several other operations are not included in BOPs, including variable declarations, variable assignments, type conversions, branch commands, loop commands, skip commands, function call commands, and return commands, as illustrated in Table 6.

Category Descriptions
Variable declaration int i
Variable assignment i=10
Type conversion (int*) X
Branch command goto, if-else, switch-case
Loop command for, while
Function call Fun() call
Return return command
Table 6: Operations Not Included in BOPs.

Delays of different operations are not considered in the normalized calculation of BOPs, because delays can differ greatly across micro-architecture platforms. For example, the delay of a division on the Intel Xeon E5645 is about 7-12 cycles, while on the Intel Atom D510 it can reach up to 38 cycles [6]. Hence, factoring delays into the normalization would make the metric architecture-dependent. In our definition, BOPs normalize all operations to 64-bit operations, and each operation is counted as 1. A short worked example of these counting rules follows.
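As a further worked example (ours, not from the original paper), consider applying the rules of Tables 5 and 6 to a two-dimensional loop nest in C:

/* Counting BOPs with the Table 5 rules. */
void scale_matrix(double c[100][100], const double a[100][100], double b) {
    for (int i = 0; i < 100; i++)      /* (1 compare + 1 add) * 100 = 200 BOPs */
        for (int j = 0; j < 100; j++)  /* (1 compare + 1 add) * 100 * 100 = 20,000 BOPs */
            c[i][j] = a[i][j] * b;     /* two 2-D addressings (2 + 2) + 1 multiply = 5 BOPs
                                          per iteration, 5 * 100 * 100 = 50,000;
                                          the assignment itself is not counted */
}
/* Total: 200 + 20,000 + 50,000 = 70,200 BOPs. */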

4.2 How to Measure BOPs

BOPs can be measured at either the source code level or the instruction level.

4.2.1 Source-code level measurement

We can calculate BOPs from the source code of a workload. This method needs some manual work (analyzing the source code), but it is independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. As the example below shows, no BOPs are counted for the first and second lines because they are variable declarations. Line 3 consists of a loop command and two integer operations; the corresponding number of BOPs is (1+1) * 100 = 200 for the integer operations, while the loop command itself is not counted. Line 5 consists of an array addressing operation and a variable assignment: the addressing is counted as 100 * 1, while the variable assignment is not counted. So the total BOPs of the sample program is 100 + 200 = 300.

1 long newClusterSize[100];   // variable declaration: not counted
2 int j;                      // variable declaration: not counted
3 for (j=0; j<100; j++)       // j<100 and j++: (1+1)*100 = 200 BOPs; the loop command is not counted
4 {
5  newClusterSize[j]=0;       // 1-D array addressing: 100*1 = 100 BOPs; the assignment is not counted
6 }

To verify the reasonableness of the above calculation, the corresponding binary code is presented below. There are six instructions: movq, addq, cmpq, jne, movl, and addq; we count BOPs for the addq, cmpq, and addq instructions. The binary code level measurement is in accordance with the source code level one.

movq $0, (%rax)       # store: not counted
addq $8, %rax         # address increment: counted as one BOP
cmpq %rdx, %rax       # compare: counted as one BOP
jne .L2               # branch: not counted
movl 664(%rsp), %eax  # load: not counted
addq $688, %rsp       # stack-pointer addition: counted as one BOP

Another thing to take into consideration is system built-in library functions. For a system-level function such as strcmp, we implement a user-level version manually and then count its BOPs, which may result in a small deviation in the BOPs number. Our implementation of the strcmp function is shown below.

int strcmp(const char *s, const char *t)
{
  unsigned char c1, c2;
  while (1) {
    c1 = *s++;                 /* pointer increment (add): 1 BOP */
    c2 = *t++;                 /* pointer increment (add): 1 BOP */
    if (c1 != c2)              /* compare: 1 BOP */
      return c1 < c2 ? -1 : 1; /* compare: 1 BOP (mismatch path only) */
    if (!c1)                   /* compare with zero: 1 BOP */
      break;
  }
  return 0;                    /* return command: not counted */
}

4.2.2 Instruction level measurement

Source-code or binary-code level measurement needs to analyze the source code, which is especially costly for complex system stacks (e.g., Hadoop stacks). Instruction level measurement avoids this high analysis cost and the restriction of needing the source code. We propose an instruction-level approach to measuring BOPs, which uses hardware performance counters. As different types of processors have different performance counter events, for convenience we introduce an approximate but simple instruction-level measurement method here: obtain the total numbers of instructions (ins), branch instructions (branch_ins), load instructions (load_ins), and store instructions (store_ins) through hardware performance counters, and then compute BOPs according to the following formula:

$$BOPs = ins - branch\_ins - load\_ins - store\_ins \qquad (3)$$

Note that this approximate measurement includes all remaining integer instructions, which does not exactly conform to the BOPs definition. However, from our observation, the deviation of the instruction-level measurement is small. A minimal sketch of this computation is given below.
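The following C sketch (ours; in practice the four counter values would be read with a facility such as Linux perf or PAPI, which we omit here) applies Formula 3 to raw counter readings:

#include <stdint.h>
#include <stdio.h>

/* Approximate BOPs from hardware performance counters (Formula 3):
   every retired instruction that is not a branch, load, or store is
   counted as one basic operation. */
uint64_t approx_bops(uint64_t ins, uint64_t branch_ins,
                     uint64_t load_ins, uint64_t store_ins) {
    return ins - branch_ins - load_ins - store_ins;
}

int main(void) {
    /* hypothetical counter readings for one run */
    uint64_t bops = approx_bops(1000000000ULL, 150000000ULL,
                                250000000ULL, 100000000ULL);
    printf("approx BOPs = %llu\n", (unsigned long long)bops);
    return 0;
}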

4.3 How to use BOPS

BOPS is the average number of BOPs of a specific workload completed per second. We count floating point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations as BOPs. The peak BOPS is computed by the following formula:

$$Peak\ BOPS = Frequency \times Core\_number \times BOPs\_per\_cycle \qquad (4)$$

For the Intel Xeon E5645 platform, the peak BOPS is 2.4G × 6 × 6 = 86.4 GBOPS (6 cores at 2.40 GHz; the 86.4 GBOPS figure reported in Section 4.4 corresponds to 6 basic operations per cycle per core).

4.3.1 The BOPS Measuring Tools

Based on the definition, we need BOPS measuring tools to report the number; with the measuring tools, the ceilings of the model can also be instantiated. For example, the HPL benchmark [11] is the widely-used measurement tool for FLOPS and its Roofline model, thanks to its sophisticated design (the 1:1 mix of floating-point additions and multiplications fully utilizes the FPU, providing the FLOPS ceiling) and its representativeness (the dense linear algebra in HPL is a typical HPC computation). Because of the diversity of DC workloads, it is difficult to find one workload that represents all DC workloads, so we develop the BOPS measuring tools as a series of representative workloads. We choose typical micro benchmarks from BigDataBench as the BOPS measuring tools, from the perspectives of complexity, workload type, and computation pattern. Currently, our BOPS measurement tools provide three workloads: Sort, an I/O intensive workload; WordCount, a CPU intensive workload; and Grep, a hybrid (both I/O and CPU intensive) workload. For each workload (a measurement tool), Table 7 summarizes the description and the BOPs number.

Workload BOPs Scale Description
Sort 529E9 10E8 records IO-Intensive
Grep 142E9 15GB text Hybrid
WordCount 179E9 15GB text CPU-Intensive
Table 7: BOPS Measuring Tools.

Sort: The Sort workload sorts an integer array of a specified scale, using the quicksort and merge sort algorithms; for example, sorting an integer array of 10E8 elements. The program is implemented in C++ with MPI.

WordCount: The WordCount workload counts the words in a specified input text; for example, counting the frequency of every word appearing in a 15GB txt file. The program is implemented in C++ with MPI.

Grep: The Grep workload searches a plain text file for lines that match a regular expression. The program is implemented in C++ with MPI.

Please note that the BOPs number changes as the data scale or the number of requests increases or decreases. The measuring tools can be used to measure the real performance. For example, Sort in Table 7 has 529E9 BOPs; we run Sort on the Xeon E5645 node with an execution time of 40 seconds, so its real performance is 529E9 / 40 ≈ 13.2 GBOPS. The BOPS efficiency is then calculated by the formula:

$$BOPS\ Efficiency = \frac{Real\ BOPS}{Peak\ BOPS} \qquad (5)$$

For this Sort run, the BOPS efficiency is 13.2 / 86.4 ≈ 15%.

4.4 Evaluations

4.4.1 Experimental Platforms

We choose three typical processor platforms for the DC experiments: Intel Xeon E5310, Intel Xeon E5645, and Intel Atom D510. The Intel Xeon E5310 and Intel Xeon E5645 are typical brawny-core processors (OoO execution, four-wide instruction issue), while the Intel Atom D510 is a typical wimpy-core processor (in-order execution, two-wide instruction issue). Each experimental platform is equipped with four nodes. The details are shown in Table 8.

CPU Type Intel CPU Core
Intel® Xeon E5645 6 cores @ 2.40 GHz
L1 DCache L1 ICache L2 Cache L3 Cache
6 × 32 KB 6 × 32 KB 6 × 256 KB 12 MB
CPU Type Intel CPU Core
Intel® Xeon E5310 4 cores @ 1.60 GHz
L1 DCache L1 ICache L2 Cache L3 Cache
4 × 32 KB 4 × 32 KB 2 × 4 MB None
CPU Type Intel CPU Core
Intel® Atom D510 4 cores @ 1.60 GHz
L1 DCache L1 ICache L2 Cache L3 Cache
6 × 32 KB 6 × 32 KB 6 × 256 KB None
Table 8: Configurations of Hardware Platforms.

4.4.2 BOPS for DC

For performance evaluation, we compare the BOPS metric with FLOPS and IPC. We choose the Intel Xeon E5645, Intel Xeon E5310, and Intel Atom D510 as the experimental platforms, using the three BOPS measuring tools in Table 7. As shown in Table 9, the peak BOPS is obtained by Formula 4; for the Intel Xeon E5645 platform, the peak BOPS is 86.4 GBOPS. The real BOPs numbers are reported in Table 7, from which we calculate the average BOPS of each platform; the BOPS efficiency is obtained by Formula 5. In Table 9, the BOPS efficiency of the E5645, E5310, and D510 is 9.4%, 9.3%, and 10%, respectively; the FLOPS efficiency is 0.1%, 0.2%, and 0.1%, respectively; and the IPC efficiency is 32%, 40%, and 25%, respectively. We can see that the FLOPS number is far from reflecting the efficiency of a real DC system, while the BOPS number reflects it much more reasonably.

E5645 D510 E5310
Peak BOPS 86.4G 12.8G 38.4G
Real BOPS 8.2G 1.3G 4.1G
BOPS Efficiency 9.4% 9.3% 10%
Peak FLOPS 57.6G 4.8G 25.6G
Real FLOPS 0.1G 0.003G 0.03G
FLOPS Efficiency 0.1% 0.2% 0.1%
Peak IPC 4 2 4
Real IPC 1.3 0.5 1
IPC Efficiency 32% 40% 25%
Table 9: BOPS, FLOPS and IPC for DC.

4.4.3 BOPS for traditional workloads

We also measure the BOPS of traditional benchmarks. As shown in Table 10, we choose HPL, Graph500 [30], and Stream [26], and compare the BOPS performance and efficiency of each workload with those of the FLOPS metric. From Table 10, we can see that BOPS is also suitable for measuring traditional workloads.

HPL Graph500 Stream
GFLOPS 38.9 0.05 0.8
FLOPS efficiency 68% 0.04% 0.7%
GBOPS 41 12 13
BOPS efficiency 47% 18% 20%
Table 10: BOPS for Traditional Workloads.

5 DC-Roofline Model

In this section, we present the DC-Roofline model, which depicts the upper bound performance of a given DC workload under a specific setting of the target system.

5.1 DC-Roofline Definition

The DC-Roofline model is inspired by the Roofline model; the main idea is to replace FLOPS with BOPS. The definitions of the DC-Roofline model are as follows:

Definition 1: The Peak Performance of DC

We choose the peak BOPS (which can be obtained from Formula 4) as the peak performance of DC in the DC-Roofline model.

Definition 2: Operation Intensity of DC

Operational intensity (OI) in DC-Roofline is the ratio of BOPs to memory traffic. The memory traffic ($M_{bytes}$) is the total number of bytes swapped between the CPU and the memory. The operational intensity is obtained according to the following formula:

$$OI = \frac{BOPs}{M_{bytes}} \qquad (6)$$

The memory traffic $M_{bytes}$ is calculated as the total number of memory accesses * 64 (64 bytes per access, one cache line on our platforms), where the total number of memory accesses is obtained through hardware performance counters. How to count BOPs is introduced in Section 4.2.

Definition 3: The Upper Bound Performance of DC

The attainable performance bound of a given workload is depicted as:

$$Attainable\ BOPS = \min(Peak\ BOPS,\ Peak\ MemBand \times OI) \qquad (7)$$

where $Peak\ MemBand$ is the peak memory bandwidth of the system.
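As a concrete illustration of Definitions 2 and 3, the following minimal C sketch (ours, not from the original paper) computes OI per Formula 6 and the DC-Roofline bound per Formula 7; the constants are the E5645 figures used in Section 6:

#include <stdio.h>
#include <stdint.h>

/* Formula 6: OI = BOPs / memory traffic, with traffic = accesses * 64 bytes. */
double operational_intensity(uint64_t bops, uint64_t mem_accesses) {
    return (double)bops / ((double)mem_accesses * 64.0);
}

/* Formula 7: attainable BOPS = min(peak BOPS, peak bandwidth * OI). */
double dc_roofline_bound(double peak_bops, double peak_memband, double workload_oi) {
    double mem_bound = peak_memband * workload_oi;
    return mem_bound < peak_bops ? mem_bound : peak_bops;
}

int main(void) {
    double peak_bops    = 86.4e9;  /* Xeon E5645 peak BOPS           */
    double peak_memband = 13.2e9;  /* peak memory bandwidth, bytes/s */
    /* Sort's measured OI is 11.7; since 13.2e9 * 11.7 > 86.4e9, the
       bound is the computation roof: Sort is calculation-bound,
       matching Table 12. */
    printf("Sort bound = %.1f GBOPS\n",
           dc_roofline_bound(peak_bops, peak_memband, 11.7) / 1e9);
    return 0;
}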

5.2 Adding Ceilings for DC-Roofline

We add three ceilings, ILP, SIMD, and pre-fetching, to specify the performance upper bounds under specific tuning settings. Among them, the ILP and SIMD ceilings reflect computation limitations, while the pre-fetching ceiling reflects memory access limitations. We conduct our experiments on the Intel Xeon E5645 platform.

5.2.1 Pre-fetching Ceiling

We use the Stream benchmark as the measuring tool for the pre-fetching ceiling. We improve the memory bandwidth by turning on the pre-fetching switch in the system BIOS, which raises the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s.

5.2.2 ILP and SIMD Ceilings

We use the following formula to estimate the ILP and SIMD ceilings:

$$Ceiling\ BOPS = Peak\ BOPS \times Port\ Efficiency \times ILP\ Efficiency \times SIMD\ Scale \qquad (8)$$

Port Efficiency is the port efficiency of the pipeline (according to the user manual of the Intel E5645 and the work of Gao et al. [38], port efficiency is always lower than 50%), ILP Efficiency is the ratio of the achieved IPC to the peak IPC (the E5645's peak IPC is 4), and SIMD Scale is the scale of SIMD (for the E5645, the value is 2 under SIMD and 1 otherwise). Therefore, the ILP (instruction-level parallelism) ceiling is 21.6 GBOPS when the IPC number is 2. Based on the ILP ceiling, the SIMD ceiling is 43.2 GBOPS.
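Plugging in the E5645 numbers gives a worked check of Formula 8, under the stated assumptions of 50% port efficiency and a measured IPC of 2 (ILP efficiency 2/4):

$$ILP\ ceiling = 86.4 \times 50\% \times \frac{2}{4} \times 1 = 21.6\ \mathrm{GBOPS}$$

$$SIMD\ ceiling = 86.4 \times 50\% \times \frac{2}{4} \times 2 = 43.2\ \mathrm{GBOPS}$$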

5.3 Visualized DC-Roofline Model

The visualized DC-Roofline model is depicted in Fig. 4. The y-axis is the peak BOPS the target system can achieve; the diagonal line represents the memory bandwidth, and the horizontal roof line shows the peak BOPS performance. The ridge point, where the diagonal and horizontal lines meet, allows us to evaluate the performance of the system. Similar to the original Roofline model, ceilings, which give the performance upper bounds under specific tuning settings, can be added to the DC-Roofline model; there are three ceilings (ILP, SIMD, and pre-fetching) in the figure.

Figure 4: Visualized DC-Roofline Model.

6 DC-Roofline Model Usage

In this section, we illustrate how to use the DC-Roofline model to optimize the performance of six DC workloads on the Intel Xeon E5645 platform. As shown in Table 11, the first five are typical DC analytical workloads, while RankServ is a service workload. The peak BOPS is 86.4 GBOPS according to Formula 4, and the peak memory bandwidth is 13.2 GB/s as measured by the Stream benchmark [26].

Workload Stack Scale Description
Sort MPI 10E8 records IO-Intensive
Grep MPI 15GB text Hybrid
WordCount MPI 15GB text CPU-Intensive
Bayes MPI 1GB text CPU-Intensive
Kmeans MPI 1GB text CPU-Intensive
RankServ C++ 20*100 requests Thread-Intensive
Table 11: Features of Six Workloads.

6.1 Performance Analysis under DC-Roofline

The six workloads' BOPS numbers and operational intensities are shown in Table 12. We find that the operational intensity of the six workloads ranges from 1.2 to 51. The performance bottlenecks of Sort, Bayes, Kmeans, and RankServ are calculation, while those of Grep and WordCount are memory access.

Workload BOPS OI Bottleneck
Sort 8.8G 11.7 Calculation
Grep 4.6G 1.2 Memory Access
WordCount 3.8G 3.2 Memory Access
Bayes 5.1G 51 Calculation
Kmeans 5.0G 25 Calculation
RankServ 8.2G 9.57 Calculation
Table 12: DC Workloads under the DC-Roofline Model.

6.2 Calculation Optimizations

6.2.1 ILP Optimizations

We improve ILP by adding the compiler optimization option -O2 (i.e., gcc -O2). As shown in Table 13, the BOPS numbers of Sort, Bayes, Kmeans, and RankServ improve significantly, by 50%, 68%, 120%, and 48%, respectively. This implies that the bottlenecks of Sort, Bayes, Kmeans, and RankServ are indeed calculation, because they benefit greatly from the ILP optimization.

Workload BOPS OI IPC
Original_Sort 8.8G 11.7 1.6
ILP_Sort 13.2G 17.6 1.7
Original_Grep 4.6G 1.2 0.6
ILP_Grep 4.9G 1.3 0.6
Original_WordCount 3.8G 3.2 0.9
ILP_WordCount 4.5G 4 0.9
Original_Bayes 5.1G 51 1.2
ILP_Bayes 8.6G 86 1.5
Original_Kmeans 5.0G 25 1.7
ILP_Kmeans 11.3G 56 1.8
Original_RankServ 9.5G 9.5 1.3
ILP_RankServ 12.2G 9.3 1.4
Table 13: ILP Optimization for DC Workloads.

6.2.2 SIMD Optimization

SIMD is a common method for HPC performance improvement; it performs the same operation on multiple data items simultaneously. Modern processors have at least 128-bit wide SIMD instructions (i.e., SSE, AVX, etc.). We apply the SIMD technique to the DC workloads and change Sort from SISD to SIMD by rewriting it with SSE (as it takes time to revise all workloads with SSE, we developed the SSE version of Sort first). Note that the SIMD optimization is applied after the ILP optimization, and we still achieve a 2.2X performance improvement over the SISD version with only the ILP optimization. Using SSE_Sort, the attainable performance is 28.6 GBOPS, which is 33% of the peak BOPS.
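To give a flavor of this kind of rewriting, here is a minimal SSE2 sketch of ours (not the paper's actual SSE_Sort code) that processes two 64-bit integers per instruction:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add a constant to an array two 64-bit lanes at a time (n must be even).
   The same load/operate/store pattern underlies SSE rewrites of the
   comparison and merge steps in a sorting kernel. */
void add_constant_sse(int64_t *a, int64_t n, int64_t c) {
    __m128i vc = _mm_set1_epi64x(c);                    /* broadcast c into both lanes */
    for (int64_t i = 0; i < n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)&a[i]);  /* load two elements */
        v = _mm_add_epi64(v, vc);                       /* two adds at once  */
        _mm_storeu_si128((__m128i *)&a[i], v);          /* store two elements */
    }
}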

6.3 Memory Optimizations

We improve the memory bandwidth by turning on the hardware pre-fetching switch, which raises the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s. Note that hardware pre-fetching is applied after the ILP optimization. As shown in Table 14, Grep and WordCount improve significantly, by 16% and 10%, respectively, while the performance gains of Sort, Bayes, RankServ, and Kmeans are not obvious; Sort and Kmeans in particular show almost no gain. This implies that the bottlenecks of Grep and WordCount are indeed memory access, because they benefit greatly from hardware pre-fetching.

Workload BOPS OI Bandwidth
ILP_Sort 13.2G 17.6 0.75 GB/s
Prefetch_Sort 13.2G 14.6 0.9 GB/s
ILP_Grep 4.9G 1.3 3.8 GB/s
Prefetch_Grep 5.5G 1 5.6 GB/s
ILP_WordCount 4.5G 4 1.1 GB/s
Prefetch_WordCount 5G 4.2 1.2 GB/s
ILP_Bayes 8.6G 86 0.1 GB/s
Prefetch_Bayes 9.3G 47 0.2 GB/s
ILP_Kmeans 11.3G 56 0.2 GB/s
Prefetch_Kmeans 11.3G 56 0.2 GB/s
ILP_RankServ 12.2G 9.3 0.2 GB/s
Prefetch_RankServ 13G 9.6 0.2 GB/s
Table 14: Pre-fetching Optimization for DC Workloads.

Optimization Summary. Finally, we take all six workloads as a whole to show their performance improvements. As shown in Fig. 5, all workloads have performance gains, varying from 119% to 325%.

Figure 5: DC Workloads Optimizations Under the DC-Roofline Model.

6.4 Different Hardware Platforms Under DC-Roofline Model

We evaluate different hardware platforms under the DC-Roofline model, choosing Sort, Grep, and WordCount with the MPI software stack. The hardware platforms are the Intel Xeon E5645, Intel Xeon E5310, and Intel Atom D510. In Table 15, Sort-5310 denotes the Sort workload running on the E5310 platform. From the table, we can see that the workloads have the same performance bottlenecks on the Xeon E5645 and Xeon E5310 (Sort's bottleneck is calculation, while the others' are memory access), but the BOPS and OI of the same workload differ across platforms. On the other hand, all workloads on the Atom D510 are bottlenecked by memory access (as the D510 platform is equipped with DDR2 memory). These results imply that although the number of BOPs of a workload is independent of the underlying system and architecture implementation, the DC-Roofline model itself is hardware-dependent, and we need to build different models for different hardware platforms.

Workload BOPS OI Bottleneck
Sort-5310 8.7G 10.8 Calculation
Grep-5310 1.4G 0.7 Memory Access
WordCount-5310 1.4G 3.5 Memory Access
Sort-5645 13.2G 14.6 Calculation
Grep-5645 5.5G 1 Memory Access
WordCount-5645 5G 4.2 Memory Access
Sort-510 2.6G 6.5 Memory Access
Grep-510 0.6G 3 Memory Access
WordCount-510 0.5G 5 Memory Access
Table 15: DC-Roofline Model for Different Hardware Platforms.

7 Conclusion

This paper proposes a new computation-centric metric, BOPs, that measures the efficient work defined by the source code, including floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. BOPs are independent of the underlying system and hardware implementation, and can be counted through analyzing the source code. We define BOPS as the average number of BOPs per second, and propose replacing FLOPS with BOPS to measure DC computer systems.

With several typical micro benchmarks, we attain the upper bound performance through different optimization approaches, and then propose a BOPS-based Roofline model, which we call DC-Roofline, as a quantitative ceiling performance model for guiding DC computer system design and optimization. Through experiments, we demonstrate that the DC-Roofline model indeed helps optimize DC computer systems, with improvements varying from 119% to 325%.

References

  • [1] Data center growth. https://www.enterprisetech.com.
  • [2] http://www.tpc.org/tpcc.
  • [3] http://www.spec.org/cpu.
  • [4] http://top500.org/.
  • [5] “Sort benchmark home page,” http://sortbenchmark.org/.
  • [6] “Technical report,” http://www.agner.org/optimize/microarchitecture.pdf.
  • [7] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” pp. 483–485, 1967.
  • [8] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: characterization and architectural implications,” pp. 72–81, 2008.
  • [9] E. L. Boyd, W. Azeem, H. S. Lee, T. Shih, S. Hung, and E. S. Davidson, “A hierarchical approach to modeling and improving the performance of scientific applications on the ksr1,” vol. 3, pp. 188–192, 1994.
  • [10] S. P. E. Corporation, “Specweb2005: Spec benchmark for evaluating the performance of world wide web servers,” http://www.spec.org/web2005/, 2005.
  • [11] J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
  • [12] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, “A dwarf-based scalable big data benchmarking methodology,” arXiv preprint arXiv:1711.03229, 2017.
  • [13] Neal Cardwell, Stefan Savage, and Thomas E Anderson. Modeling tcp latency. 3:1742–1751, 2000.
  • [14] William Josephson, Lars Ailo Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. ACM Transactions on Storage, 6(3):14, 2010.
  • [15] Samuel Williams, Dhiraj D Kalamkar, Amik Singh, Anand M Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann S Almgren, Pradeep Dubey, John Shalf, and Leonid Oliker. Optimization of geometric multigrid for emerging multi- and manycore processors. page 96, 2012.
  • [16] Daniel Richins, Tahrina Ahmed, Russell Clapp, and Vijay Janapa Reddi. Amdahl’s law in big data analytics: Alive and kicking in tpcx-bb (bigbench). In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 630–642. IEEE, 2018.
  • [17] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. international symposium on microarchitecture, pages 609–622, 2014.
  • [18] Ethan R Mollick. Establishing moore’s law. IEEE Annals of the History of Computing, 28(3):62–75, 2006.
  • [19] Shoaib Kamil, Cy P Chan, Leonid Oliker, John Shalf, and Samuel Williams. An auto-tuning framework for parallel multicore stencil computations. pages 1–12, 2010.
  • [20] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, R. Ren, C. Zheng, G. Lu, J. Li, Z. Cao, Z. Shujie, and H. Tang, “Bigdatabench: A dwarf-based big data and artificial intelligence benchmark suite,” Technical Report, Institute of Computing Technology, Chinese Academy of Sciences, 2017.
  • [21] S. Harbaugh and J. A. Forakis, “Timing studies using a synthetic whetstone benchmark,” ACM Sigada Ada Letters, no. 2, pp. 23–34, 1984.
  • [22] R. Jain, The art of computer systems performance analysis.   John Wiley & Sons Chichester, 1991, vol. 182.
  • [23] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, and A. Borchers, “In-datacenter performance analysis of a tensor processing unit,” pp. 1–12, 2017.
  • [24] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in International Symposium on Computer Architecture, 2016, pp. 393–405.
  • [25] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, C. Xu, and N. Sun, “Cloudrank-d: benchmarking and ranking cloud computing systems for data processing applications,” Frontiers of Computer Science, vol. 6, no. 4, pp. 347–362, 2012.
  • [26] J. D. Mccalpin, “Stream: Sustainable memory bandwidth in high performance computers,” 1995.
  • [27] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto et al., “A 40gops 250mw massively parallel processor based on matrix architecture,” in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International.   IEEE, 2006, pp. 1616–1625.
  • [28] U. Pesovic, Z. Jovanovic, S. Randjic, and D. Markovic, “Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications,” in MIPRO, 2012 Proceedings of the 35th International Convention.   IEEE, 2012, pp. 743–747.
  • [29] M. M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, “A genetic algorithms approach to modeling the performance of memory-bound computations,” pp. 1–12, 2007.
  • [30] K. Ueno and T. Suzumura, “Highly scalable graph search for the graph500 benchmark,” in International Symposium on High-Performance Parallel and Distributed Computing, 2012, pp. 149–160.
  • [31] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “BigDataBench: a big data benchmark suite from internet services,” in HPCA 2014.   IEEE, 2014.
  • [32] R. Weicker, “Dhrystone: a synthetic systems programming benchmark,” Communications of The ACM, vol. 27, no. 10, pp. 1013–1030, 1984.
  • [33] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
  • [34] Thomasian, “Analytic Queueing Network Models for Parallel Processing of Task Systems,” IEEE Transactions on Computers, vol. 35, no. 12, pp. 1045–1054, 1986.
  • [35] Nicholas P Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua B Fryman, Ivan Ganev, Roger A Golliver, Rob C Knauerhase, et al. Runnemede: An architecture for ubiquitous high-performance computing. pages 198–209, 2013.
  • [36] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. operating systems design and implementation, pages 265–283, 2016.
  • [37] Jaewon Lee, Changkyu Kim, Kun Lin, Liqun Cheng, Rama Govindaraju, and Jangwoo Kim. Wsmeter: A performance evaluation methodology for google’s production warehouse-scale computers. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 549–563. ACM, 2018.
  • [38] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Rui Ren, Chen Zheng, Gang Lu, Jingwei Li, Zheng Cao, et al. Bigdatabench: A dwarf-based big data and ai benchmark suite. arXiv preprint arXiv:1802.08254, 2018.
  • [39] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.