1 Introduction
In the past decades, the FLOPS (FLoating-point Operations Per Second) [11] metric has been used to evaluate the performance of modern computer systems. FLOPS has driven the progress of computing technology, not limited to high performance computing (HPC), for many years [11], and history witnesses that FLOPS defines concrete R&D objectives and roadmaps for HPC (Gflops in the 1990s, Tflops in the 2000s, Pflops in the 2010s, and Eflops in the 2020s).
To date, to perform big data analytics or provide Internet services, more and more organizations around the world build internal datacenters or rent hosted ones. As a result, datacenter (DC) computing has become a new paradigm of computing, and the fraction of DC has outweighed HPC in terms of market share (HPC takes only 20% of the total) [1]. So a natural question arises: what is the metric for DC? Is it still FLOPS?
Different from HPC, DC has unique features. For example, it is reported that DC workloads have very low floating-point operation intensity [31], defined as the total floating-point operations divided by the total memory access bytes. Gao et al. [38, 31] performed a comprehensive and hierarchical Top-Down analysis of big data analytics, AI, and Internet service workloads from an architectural perspective. It shows that for typical DC workloads, the average floating-point instruction ratio is only 1% and the average FLOPS efficiency is only 0.1%, while the average IPC (Instructions Per Cycle) is 1.3 (the theoretical IPC is 4 for the experimental platform). In Section 3.2, we also reveal that the traditional FLOPS-based Roofline performance model gives misleading guidance for DC workloads' performance optimization: it reports memory access as the bottleneck of every workload. These observations imply that FLOPS is inappropriate for measuring DC computer systems.
For measuring performance, the wall clock time is the ground truth. In practice, several user-perceived metrics, which are derivatives of the wall clock time, are used to measure application-specific systems, e.g., transactions per minute for online transaction systems [2] and input data processed per second for big data analysis systems [25]. However, user-perceived metrics have two limitations. First, different user-perceived metrics cannot be compared side by side. For example, transactions per minute (TPM) and data processing capability per second (GB/s) cannot be used for an apples-to-apples comparison. Second, user-perceived metrics can hardly measure the ceiling performance of computer systems.
In this paper, inspired by FLOPS [11], we define Basic OPerations per Second (in short, BOPS) to evaluate DC computing systems. The contributions are as follows:
First, we propose a computation-centric metric, BOPs, that measures the efficient work defined by the source code; it includes floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. BOPs is independent of the underlying system and hardware implementation, and can be counted by analyzing the source code. We also use hardware performance counters to approximate BOPs when the source code is not available. We define BOPS as the average number of BOPs per second, and propose replacing FLOPS with BOPS to measure DC computer systems. BOPS can quantitatively measure the theoretical peak performance of a DC computer system by analyzing its microarchitecture, and the real performance by running workloads on the system.
Second, with several typical micro benchmarks, we attain the upper bound performance through different optimization approaches, and then propose a BOPS-based Roofline model, which we call DC-Roofline, as a quantitative performance model for guiding DC computer system design and optimization. DC-Roofline can depict the performance ceilings of DC workloads with specific tuning settings on the target system. Through experiments, we demonstrate that the DC-Roofline model indeed helps optimize the performance of typical DC workloads by 119% to 325%.
For the DC computer system and architecture community, a single computation-centric metric like BOPS is simple but powerful for exploring innovative systems and architectures. First, BOPs can be calculated at the application's source code level, independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. Second, it can be calculated at different levels independently, e.g., at the source code level, the software's binary code level, and the hardware's instruction level, which facilitates co-design of systems and architecture. Last but not least, it helps people understand the performance ceiling of a computer system, and hence guides system optimization.
The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background and motivations. Section 4 defines BOPs, reports how to calculate it, and presents BOPS usages. Section 5 introduces the BOPS-based DC-Roofline model. Section 6 presents how to optimize performance with the DC-Roofline model. Section 7 draws a conclusion.
2 Related Work
Related work is summarized from two perspectives: metrics and performance models.
2.1 Metrics
The wall clock time is the most basic performance metric for a computer system [39], and almost all other performance metrics are derived from it. Based on the wall clock time, performance metrics can be classified into two categories. One is the user-perceived metric, which can be intuitively perceived by the user, such as TPM (transactions per minute). The other is the computation-centric metric, which is related to a specific computation operation, e.g., FLOPS (FLoating-point Operations Per Second).
User-perceived metrics can be further classified into two categories: metrics for the whole system and metrics for components of the system. Examples of the former include data sorted in one minute (MinuteSort), which measures the sorting capability of a system [5], and transactions per minute (TPM) for online transaction systems [2]. Examples of the latter include the SPECspeed/SPECrate metrics for the CPU component [3], the input/output operations per second (IOPS) metric for the storage component [14], and the data transfer latency metric for the network component [13].
There are many computation-centric metrics. FLOPS (FLoating-point Operations Per Second) is a computation-centric metric for measuring computer systems, especially in fields of scientific computing that make heavy use of floating-point calculations [11]. The wide recognition of FLOPS indicates the maturation of high performance computing. MIPS (Million Instructions Per Second) [22] is another famous computation-centric metric, defined as the number of millions of instructions the processor can execute per second. The main defect of MIPS is that it is architecture-dependent. There are many derivatives of MIPS, including MWIPS and DMIPS [28], which use synthetic workloads to evaluate floating-point and integer operations, respectively. The WSMeter metric [37], defined as the quota-weighted sum of per-job MIPS (IPC), is also derived from MIPS, and hence it is architecture-dependent. Unfortunately, modern datacenters are heterogeneous, consisting of different generations of hardware.
OPS (Operations Per Second) is a computation-centric metric too. OPS [27] was initially proposed for digital processing systems, defined as the number of 16-bit addition operations per second. The definition of OPS was later extended for Intel Ubiquitous High-Performance Computing [35] and for recent artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of them are defined in terms of a specific operation: for example, OPS counts 8-bit matrix multiplication operations on the TPU and 16-bit integer operations on the Cambricon processor, respectively. However, modern datacenter workloads are comprehensive and complex, and the bias toward a specific operation undermines the fairness of evaluation.

For each metric, corresponding tools or benchmarks [39] are proposed to report the numbers. For the user-perceived SPECspeed/SPECrate metrics, SPEC CPU [3] is the benchmark suite measuring the CPU component. For the TPM metric, TPC-C [2] is the benchmark suite measuring online transaction systems. Another example is the Sort benchmark [5] for MinuteSort. For computation-centric metrics, Whetstone [21] and Dhrystone [32] are the measurement tools for the MWIPS and DMIPS metrics, respectively, and HPL [11] is the widely-used measurement tool for FLOPS. As a micro-benchmark, HPL demonstrates a sophisticated design: the proportion of floating-point addition to multiplication in HPL is 1:1 so as to fully utilize the FPU of a modern processor.
2.2 Performance Models
A performance model can depict and predict the performance of a specific system. There are two categories of performance models: one is the analytical model, which uses stochastic/statistical methods to depict and predict system performance; the other is the bound model, which is relatively simpler and only depicts the performance bound or bottleneck of the system. Previous work [34, 29, 9, 37] uses stochastic/statistical analytical models to predict system performance. However, distributed and parallel systems always have many uncertain behaviors, which makes it hard to build an accurate prediction model. Instead, bound and bottleneck analysis is more suitable. Amdahl's Law [7] is one of the most famous performance bound models for parallel processing computer systems, and it is also used for big data systems [16]. The Roofline model [33] is another famous performance bound model; the original Roofline model adopts FLOPS as its metric. On the basis of the definition of operational intensity (OI), the total number of floating-point operations divided by the total number of bytes of memory access, the Roofline model can depict a given workload's upper bound performance when different optimization strategies are adopted on the target system.
3 Background and Motivations
3.1 Background
Moore's Law reveals that the number of transistors per chip doubles approximately every two years [18]. However, the diversity and complexity of modern DC workloads raise great challenges in depicting the ceiling performance of computer systems, spanning multiple domains: algorithm, programming, compiling, system development, and architecture design. The quantitative performance analysis of a computer system involves a performance metric and a performance model [39, 33].
3.1.1 The Computation-centric Metric
For the system and architecture community, a computation-centric metric, such as FLOPS, is a fundamental yardstick reflecting the achieved performance of, and the gaps across, different systems or architectures. Generally, a computation-centric metric has a performance upper bound on a specific architecture, determined by the microarchitecture design. For example, the peak FLOPS is computed by:
Peak FLOPS = #Cores × Frequency × FLOPs per Cycle    (1)
For each metric, a measuring tool is used to measure the performance of systems and architectures in terms of the metric, and reports the gap between the actual number and the theoretical peak. For example, HPL [11] is a widely-used measuring tool in terms of FLOPS. The FLOPS efficiency of a specific system is the ratio of HPL's achieved FLOPS to the peak FLOPS.
3.1.2 The Upper Bound Performance Model
The computation-centric metric is the foundation of the system performance model, and the performance model of a computer system can depict and predict a workload's performance on the specific system. For example, the Roofline model [33] is a famous upper bound model based on FLOPS, and much system optimization work [19, 15] has been performed on the basis of the Roofline model in the HPC domain. The model is defined as:
Attainable FLOPS = min(Peak FLOPS, Peak MemBand × OI)    (2)
The above formula indicates that the attainable performance of a workload on a specific platform is limited by the processor's computing capacity and the memory bandwidth. Peak FLOPS and Peak MemBand are the peak computing performance and peak memory bandwidth of the platform, and the operational intensity (in short, OI) is the total number of floating-point operations divided by the total number of bytes of memory access. The Roofline model can be visualized as shown in Fig. 1. The x-axis is the operational intensity, and the y-axis is the floating-point performance. The horizontal line shows the peak floating-point performance of the platform, and the diagonal line gives the peak memory bandwidth. To identify the bottleneck and guide optimization, ceilings (for example, the ILP and SIMD optimizations in the figure) can be added to provide performance tuning guidance.
3.2 Motivation
Based on the workload characterization of modern DC workloads, we validate the effectiveness of FLOPS and the corresponding Roofline model, and explain why we propose a new metric, BOPs, for DC computing.
3.2.1 Experimental workloads and platforms
Previous work has performed comprehensive characterization of modern DC workloads [38, 31]. Hereby, we choose six workloads from BigDataBench to summarize their workload characteristics. BigDataBench [38] is an open-source big data and AI benchmark suite based on frequently-appearing units of computation, which consists of a suite of micro and component benchmarks. The details of the chosen workloads are shown in Table 1. Among these six workloads, Sort, Grep, and WordCount are three famous micro-benchmarks, while Bayes, Kmeans, and RankServ are three typical component benchmarks. RankServ implements a simple search engine service. Bayes classifies a specified input text according to a learned model. Kmeans is a popular clustering algorithm in machine learning.
ID | Workload  | Software Stack | Type
1  | Sort      | Hadoop, MPI    | Micro benchmark
2  | Grep      | Hadoop, MPI    | Micro benchmark
3  | WordCount | Hadoop, MPI    | Micro benchmark
4  | RankServ  | C++            | Component benchmark
5  | Kmeans    | Hadoop, MPI    | Component benchmark
6  | Bayes     | Hadoop, MPI    | Component benchmark
The experimental platform is the same as that in Section 4; we choose the Intel Xeon E5645, a typical brawny-core processor (OoO execution, four-wide instruction issue).
3.2.2 The Limitation of FLOPS for DC
One of the main targets of a computation-centric metric is to quantitatively measure system performance. We use six DC workloads to reveal the limitations of FLOPS for DC. The peak value in the figure is calculated by Formula 1, and we also report the BOPS numbers for comparison. The FLOPS of the DC workloads is only 0.16 GFLOPS on average (only 0.1% of the peak), while the average IPC (Instructions Per Cycle) is 1.3 (32% of the peak). These performance data imply that the FLOPS number is far from reflecting the DC system efficiency. By contrast, the average BOPS number of the DC workloads is 9.5 GBOPS (10% of the peak).
We use the FLOPS-based Roofline model to measure the ceiling performance and locate the bottlenecks of the DC workloads. The details are shown in Table 2.
Workload  | FLOPS | OI    | Bottleneck
Sort      | 0.01G | 0.01  | Memory access
Grep      | 0.01G | 0.002 | Memory access
WordCount | 0.01G | 0.01  | Memory access
Bayes     | 0.02G | 0.2   | Memory access
Kmeans    | 0.1G  | 0.5   | Memory access
RankServ  | 0.5G  | 0.6   | Memory access
In our experiments, the peak FLOPS of the E5645 is 57.6 GFLOPS, and the peak memory bandwidth is 13.2 GB/s. From Table 2, we observe: first, the operational intensity (OI) of the six workloads is very low, only 0.05 on average; second, the model indicates that the bottleneck of all six workloads is memory access. Furthermore, following the hint of the Roofline model, we increase the memory bandwidth through hardware prefetching, and the peak memory bandwidth increases from 13.2 GB/s to 13.8 GB/s. However, only Grep and WordCount gain obvious performance improvements, of 16% and 10%, respectively. For the other four workloads, the average performance improvement is no more than 3%, which indicates that memory access is not their exact bottleneck. In Section 6, using our new BOPS-based Roofline model, we confirm that for the latter four workloads the bottleneck is in fact calculation. Table 3 reports the performance improvements through ILP optimization and hardware prefetching in terms of FLOPS and BOPS. In Table 3, Sort denotes the original performance, while Sort'' denotes the performance after optimization.
Workload    | BOPS  | OI (BOPS-based) | FLOPS  | OI (FLOPS-based)
Sort        | 8.8G  | 11.7            | 0.01G  | 0.01
Sort''      | 13.2G | 14.6            | 0.01G  | 0.01
Grep        | 4.6G  | 1.2             | 0.01G  | 0.002
Grep''      | 5.5G  | 1               | 0.016G | 0.003
WordCount   | 3.8G  | 3.2             | 0.01G  | 0.008
WordCount'' | 5G    | 4.2             | 0.012G | 0.007
Kmeans      | 5G    | 25              | 0.1G   | 0.5
Kmeans''    | 11.3G | 56              | 0.2G   | 0.5
Bayes       | 5.1G  | 51              | 0.02G  | 0.2
Bayes''     | 9.3G  | 47              | 0.02G  | 0.1
RankServ    | 8.2G  | 9.5             | 0.5G   | 0.6
RankServ''  | 13G   | 9.6             | 1G     | 0.9
Fig. 2 shows the final results using the Roofline and DC-Roofline models, respectively. From the figure, we can see that with the Roofline model in terms of FLOPS, the achieved performance is at most 0.1% of the peak FLOPS. For comparison, the result using the DC-Roofline model is up to 10% of the peak BOPS (these results are consistent with Section 3.2.2).
3.2.3 What We Should Include in the New Metric for DC Computing
The previous workload characterization work [38, 31] reveals that DC workloads have a high ratio of integer to floating-point operations, so we do not intend to repeat that research here. For the purpose of explaining why FLOPS does not work in DC, we run the six DC workloads (detailed in Table 1) and compare them with HPL, SPECFP, and PARSEC.
For the new metric, our fundamental principle is that we only consider the efficient work defined by the source code. Fig. 3 shows the instruction mix of the DC workloads. As our concern is a computation-centric metric, the new metric considers integer and FP operations, as they execute on the ALU or FPU of the processor; we do not consider branch and load/store operations. For the DC workloads, the ratio of integer to floating-point operations is 38, while the number is 0.3, 0.4, and 0.02 for HPL, PARSEC, and SPECFP, respectively. That is the main reason why FLOPS does not work in DC, and it implies that our new metric should consider both integer and floating-point operations.
Unlike FP operations, integer operations are more diverse (such as add, shuffle, and move operations). We analyze the integer instruction breakdown of the DC workloads by inserting analysis code into the source code. We classify all operations into four classes: the first class is array addressing operations (related to data movement); the second class is integer arithmetic or logical operations; the third class is comparing operations (related to branches); the fourth class is all other operations. The top three classes can be counted at the source code level, so we include them in BOPs. We also find that 47% of integer instructions belong to address calculation, 22% to branch calculation, and 30% to integer arithmetic or logical operations.
4 BOPs
BOPS (Basic OPerations per Second) is the average number of BOPs (Basic OPerations) of a specific workload completed per second. In this section, we present the definition of BOPs, show how to measure BOPs with or without the source code available, and then introduce how to use BOPS.
4.1 BOPs Definition
In our definition, BOPs include floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. For integer operations, the arithmetic/logical part corresponds to calculation, comparing operations correspond to branch-related operations, and array addressing operations correspond to data movement-related operations. A detailed comparison between the FLOPS and BOPS definitions is shown in Table 4.
Category      | BOPS                  | FLOPS
Calculation   | Integer & FP          | FP
Data movement | Addressing operations | None
Branch        | Integer & FP comparing | FP comparing
The detailed definition of BOPs is shown in Table 5. Each operation in Table 5 is counted as 1 except for N-dimensional array addressing, and all operations are normalized to 64-bit operations. For arithmetic or logical operations, the number of BOPs is the number of corresponding arithmetic or logical operations. For array addressing operations, take the one-dimensional array P[i] as an example: loading the value of P[i] implies the addition of offset i to the base address of P, so the number of BOPs increments by one. For comparing operations, we transform them into subtraction operations; for example, X < Y is transformed into X − Y < 0, so the number of BOPs increments by one. Each operation is counted as 1 regardless of the different delays of different operations in a real system.
Operation                        | Normalized value
Add                              | 1
Subtract                         | 1
Multiply                         | 1
Divide                           | 1
Bitwise operation                | 1
Logic operation                  | 1
Compare operation                | 1
One-dimensional array addressing | 1
N-dimensional array addressing   | N
Several other operations are not included in BOPs, including variable declarations, variable assignments, type conversions, branch commands, loop commands, skip commands, function calls, and return commands, as illustrated in Table 6.
Category             | Description
Variable declaration | int i
Variable assignment  | i = 10
Type conversion      | (int*) X
Branch command       | goto, if-else, switch-case
Loop command         | for, while
Function call        | fun() call
Return command       | return
Delays of different operations are not considered in the normalized calculation of BOPs, because delays can be extremely different across microarchitecture platforms. For example, the delay of a division on the Intel Xeon E5645 processor is about 7-12 cycles, while on the Intel Atom D510 processor the delay can reach up to 38 cycles [6]. Hence, normalizing by delay would make the metric architecture-dependent. In our definition, BOPs normalize all operations into 64-bit operations, and each operation is counted as 1.
4.2 How to Measure BOPs
BOPs can be measured at either the source code level or the instruction level.
4.2.1 Source code level measurement
We can calculate BOPs from the source code of a workload. This method needs some manual work (analyzing the source code), but it is independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. In the example below, no BOPs are counted for the first and second lines because they are variable declarations. Line 3 consists of a loop command and two integer operations; the corresponding BOPs count is (1 + 1) × 100 = 200 for the integer operations, while the loop command itself is not counted. Line 5 consists of an array addressing operation and a variable assignment: the address calculation is counted as 100 × 1, while the variable assignment is not counted. So the total BOPs of the sample program is 200 + 100 = 300.
To verify the reasonableness of the above calculation, we examine the compiled binary code of the loop, which contains six operations: movq, addq, addq, movl, cmpq, and jne. We count BOPs for addq, cmpq, and addq. The binary code level measurement is in accordance with the source code level one.
Another thing to take into consideration is system built-in library functions. For a system-level function such as strcmp, we implement a user-level version manually and then count its BOPs, which may result in a small deviation in the BOPs number. The implementation of the strcmp function is shown as follows.
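A user-level sketch of strcmp, written so that its BOPs can be counted from the source by the rules of Table 5, might look as follows (the per-line counts in the comments follow those rules; the paper's exact listing may differ):

```c
/* User-level strcmp for source-level BOPs counting (a sketch). */
int my_strcmp(const char *s1, const char *s2) {
    int i = 0;                       /* declaration/assignment: 0 BOPs       */
    while (s1[i] != '\0' && s2[i] != '\0') {
                                     /* 2 addressings + 2 compares + 1 logic
                                        operation: 5 BOPs per iteration      */
        if (s1[i] != s2[i])          /* 2 addressings + 1 compare: 3 BOPs    */
            break;                   /* branch command: 0 BOPs               */
        i++;                         /* 1 add: 1 BOP                         */
    }
    return s1[i] - s2[i];            /* 2 addressings + 1 subtract: 3 BOPs   */
}
```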
4.2.2 Instruction level measurement
Source code or binary code level measurement needs to analyze the source code, which is especially costly for complex system stacks (e.g., Hadoop stacks). Instruction level measurement avoids this high analysis cost and the requirement of having the source code. We propose an instruction-level approach to measuring BOPs, which uses hardware performance counters. As different types of processors have different performance counter events, for convenience we introduce an approximate but simple instruction-level measurement method here: obtain the total numbers of instructions (ins), branch instructions (branch_ins), load instructions (load_ins), and store instructions (store_ins) through hardware performance counters. Then BOPs can be obtained according to the following formula:
BOPs = ins − branch_ins − load_ins − store_ins    (3)
Note that our approximate measurement method includes all remaining integer instructions, which does not exactly conform to the BOPs definition. However, from our observations, the deviation of instruction level measurement is small.
4.3 How to Use BOPS
BOPS is the average number of BOPs of a specific workload completed per second; BOPs count the floating-point operations and the arithmetic, logical, comparing, and array addressing parts of integer operations. The peak BOPS is computed by the following formula:
Peak BOPS = #Cores × Frequency × BOPs per Cycle    (4)
For the Intel Xeon E5645 platform, the peak BOPS is 6 cores × 2.4 GHz × 6 BOPs per cycle = 86.4 GBOPS.
4.3.1 The BOPS Measuring Tools
Based on the definition, we need BOPS measuring tools to report the number; with such tools, the ceilings can be instantiated. For example, the HPL benchmark [11] is a widely-used measurement tool for FLOPS and its Roofline model, with a sophisticated design (it fully utilizes the FPU, the ceiling for FLOPS) and representativeness (the dense linear algebra in HPL is a typical HPC algorithm). Because of the diversity of DC workloads, it is difficult to find one workload to represent them all. We therefore develop BOPS measuring tools consisting of a series of representative workloads, choosing typical micro-benchmarks from BigDataBench from the perspectives of complexity, workload type, and computation pattern. Currently, our BOPS measuring tools provide three workloads: Sort, an I/O-intensive workload; WordCount, a CPU-intensive workload; and Grep, a hybrid (both I/O- and CPU-intensive) workload. For each workload (a measuring tool), Table 7 summarizes the description and the BOPs number.
Workload  | BOPs  | Scale        | Description
Sort      | 529E9 | 10E8 records | I/O-intensive
Grep      | 142E9 | 15 GB text   | Hybrid
WordCount | 179E9 | 15 GB text   | CPU-intensive
Sort: The Sort workload sorts an integer array of a specific scale (e.g., an array of 10E8 elements), using quick sort and merge algorithms. The program is implemented with C++ and MPI.
WordCount: The WordCount workload counts the words in a specified input text, e.g., the frequency of each word appearing in a 15 GB plain text file. The program is implemented with C++ and MPI.
Grep: The Grep workload searches a plain text file for lines that match a regular expression. The program is implemented with C++ and MPI.
Please note that the BOPs number changes as the data scale or request number increases or decreases. The measuring tools can be used to measure real performance. For example, Sort in Table 7 has 529E9 BOPs; we run Sort on a Xeon E5645 node, the execution time is 40 seconds, and so the real BOPS is 529E9 / 40 ≈ 13.2 GBOPS. The BOPS efficiency is calculated by the formula:
BOPS Efficiency = Real BOPS / Peak BOPS    (5)
4.4 Evaluations
4.4.1 Experimental Platforms
We choose three typical representative processor platforms for the DC experiments: Intel Xeon E5310, Intel Xeon E5645, and Intel Atom D510. The Intel Xeon E5310 and E5645 are typical brawny-core processors (OoO execution, four-wide instruction issue), while the Intel Atom D510 is a typical wimpy-core processor (in-order execution, two-wide instruction issue). Each experimental platform is equipped with four nodes. The details are shown in Table 8.
CPU Type | Intel CPU Core
Intel® Xeon E5645 | 6 cores @ 2.40 GHz
L1 DCache | L1 ICache | L2 Cache | L3 Cache
6 × 32 KB | 6 × 32 KB | 6 × 256 KB | 12 MB

CPU Type | Intel CPU Core
Intel® Xeon E5310 | 4 cores @ 1.60 GHz
L1 DCache | L1 ICache | L2 Cache | L3 Cache
4 × 32 KB | 4 × 32 KB | 2 × 4 MB | None

CPU Type | Intel CPU Core
Intel® Atom D510 | 4 cores @ 1.60 GHz
L1 DCache | L1 ICache | L2 Cache | L3 Cache
6 × 32 KB | 6 × 32 KB | 6 × 256 KB | None
4.4.2 BOPS for DC
For performance evaluation, we compare the BOPS metric with FLOPS and IPC. We choose the Intel Xeon E5645, Intel Xeon E5310, and Intel Atom D510 as the experimental platforms, using the three BOPS measuring tools in Table 7. As shown in Table 9, the peak BOPS is obtained by Formula 4; for the Intel Xeon E5645 platform, the peak BOPS is 86.4 GBOPS. The real BOPs numbers are reported in Table 7, and we calculate their average BOPS. The BOPS efficiency is obtained by Formula 5. In Table 9, the BOPS efficiency of the E5645, E5310, and D510 is 9.4%, 9.3%, and 10%, respectively, while the FLOPS efficiency is 0.1%, 0.2%, and 0.1%, and the IPC efficiency is 32%, 40%, and 25%, respectively. So we can see that the FLOPS number is far from reflecting the DC system efficiency, while the BOPS number more reasonably reflects the efficiency of a real DC system.
Metric           | E5645 | D510   | E5310
Peak BOPS        | 86.4G | 12.8G  | 38.4G
Real BOPS        | 8.2G  | 1.3G   | 4.1G
BOPS efficiency  | 9.4%  | 9.3%   | 10%
Peak FLOPS       | 57.6G | 4.8G   | 25.6G
Real FLOPS       | 0.1G  | 0.003G | 0.03G
FLOPS efficiency | 0.1%  | 0.2%   | 0.1%
Peak IPC         | 4     | 2      | 4
Real IPC         | 1.3   | 0.5    | 1
IPC efficiency   | 32%   | 40%    | 25%
4.4.3 BOPS for traditional workloads
We also measure the BOPS of traditional benchmarks. As shown in Table 10, we choose HPL, Graph500 [30], and Stream [26], and compare the BOPS performance and efficiency of each workload with those of the FLOPS metric. From Table 10, we can see that BOPS is also suited for measuring traditional workloads.
Metric           | HPL  | Graph500 | Stream
GFLOPS           | 38.9 | 0.05     | 0.8
FLOPS efficiency | 68%  | 0.04%    | 0.7%
GBOPS            | 41   | 12       | 13
BOPS efficiency  | 47%  | 18%      | 20%
5 DC-Roofline Model
In this section, we present the DC-Roofline model, which depicts the upper bound performance of a given DC workload under a specific setting of the target system.
5.1 DC-Roofline Definition
The DC-Roofline model is inspired by the Roofline model, and its main idea is replacing FLOPS with BOPS. The definitions of the DC-Roofline model are as follows:
Definition 1: The Peak Performance of DC
We choose the peak BOPS (obtained from Formula 4) as the peak performance metric of DC in the DC-Roofline model.
Definition 2: Operation Intensity of DC
Operation intensity (OI) in DC-Roofline is the ratio of the total BOPs to the total memory traffic. The memory traffic (M_Bytes) is the total number of bytes exchanged between the CPU and the memory. The operation intensity (OI_DC) is obtained according to the following formula:
OI_DC = BOPs / M_Bytes    (6)
The memory traffic (M_Bytes) is calculated as (total number of memory accesses × 64 bytes), where the total number of memory accesses is obtained through hardware performance counters. How to count BOPs is introduced in Section 4.2.
Definition 3: The Upper Bound Performance of DC
The attainable performance bound of a given workload is depicted as:
Attainable BOPS = min(Peak BOPS, Peak MemBand × OI_DC)    (7)
Peak MemBand is the peak memory bandwidth of the system.
5.2 Adding Ceilings for DC-Roofline
We add three ceilings (ILP, SIMD, and prefetching) to specify the performance upper bounds under specific tuning settings. Among them, ILP and SIMD reflect computation limitations, while prefetching reflects memory access limitations. We run our experiments on the Intel Xeon E5645 platform.
5.2.1 Prefetching Ceiling
We use the Stream benchmark as the measuring tool for the prefetching ceiling. We improve the memory bandwidth by enabling the prefetching switch in the system BIOS, which increases the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s.
5.2.2 ILP and SIMD Ceilings
We use the following formula to estimate the ILP and SIMD ceilings:

Ceiling = BOPS_peak * Port_Efficiency * ILP_Efficiency * SIMD_Scale  (8)

Port_Efficiency is the port efficiency of the pipeline (according to the user manual of the Intel E5645 and the work of Gao et al. [38], port efficiency is always lower than 50%); ILP_Efficiency is the ratio of the measured IPC to the peak IPC (the E5645's peak IPC is 4); SIMD_Scale is the SIMD width factor (2 with SIMD enabled on the E5645, 1 otherwise). Therefore, the ILP (instruction-level parallelism) ceiling is 21.6 GBOPS when the IPC is 2, and, based on the ILP ceiling, the SIMD ceiling is 43.2 GBOPS.
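The two ceiling values can be reproduced from Formula 8 (a minimal sketch, assuming the 86.4 GBOPS peak, 50% port efficiency bound, and peak IPC of 4 quoted above):

```python
def ceiling_gbops(peak_gbops, port_eff, ipc, peak_ipc=4, simd_scale=1):
    """Formula 8: ceiling = peak * port efficiency * ILP efficiency * SIMD scale."""
    ilp_eff = ipc / peak_ipc  # ILP efficiency: measured IPC over peak IPC
    return peak_gbops * port_eff * ilp_eff * simd_scale

# Peak of 86.4 GBOPS, port efficiency bounded at 50%, IPC of 2:
ilp_ceiling = ceiling_gbops(86.4, port_eff=0.5, ipc=2)                 # 21.6 GBOPS
simd_ceiling = ceiling_gbops(86.4, port_eff=0.5, ipc=2, simd_scale=2)  # 43.2 GBOPS
```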
5.3 Visualized DC-Roofline Model
The visualized DC-Roofline model is depicted in Fig. 4, whose y-axis is the BOPS that the target system achieves. The diagonal line represents the memory bandwidth, and the horizontal roof line shows the peak BOPS performance. The ridge point, where the diagonal and horizontal lines meet, allows us to evaluate the performance of the system. As in the original Roofline model, ceilings, which imply the performance upper bounds for specific tuning settings, can be added to the DC-Roofline model; the figure shows three ceilings (ILP, SIMD, and prefetching).
6 DC-Roofline Model Usage
In this section, we illustrate how to use the DC-Roofline model to optimize the performance of six DC workloads on the Intel Xeon E5645 platform. As shown in Table 11, the top five are typical DC analytical workloads, while RankServ is a service workload. The peak BOPS is 86.4 GBOPS according to Formula 4, and the peak memory bandwidth is 13.2 GB/s measured with the Stream benchmark [26].
Workload  Stack  Scale  Description 
Sort  MPI  10E8 records  IOIntensive 
Grep  MPI  15GB text  Hybrid 
WordCount  MPI  15GB text  CPUIntensive 
Bayes  MPI  1GB text  CPUIntensive 
Kmeans  MPI  1GB text  CPUIntensive 
RankServ  C++  20*100 requests  ThreadIntensive 
6.1 Performance Analysis under DC-Roofline
The six workloads' BOPS and operation intensity are shown in Table 12. We find that the operation intensity of the six workloads ranges from 1.2 to 51. The performance bottlenecks of Sort, Bayes, Kmeans, and RankServ are calculation, while those of Grep and WordCount are memory access.
Workload  BOPS  OI  Bottleneck
Sort  8.8G  11.7  Calculation 
Grep  4.6G  1.2  Memory Access 
WordCount  3.8G  3.2  Memory Access 
Bayes  5.1G  51  Calculation 
Kmeans  5.0G  25  Calculation 
RankServ  8.2G  9.57  Calculation 
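The bottleneck column in Table 12 follows directly from the ridge point of the E5645's DC-Roofline, i.e., 86.4 GBOPS / 13.2 GB/s, or about 6.5 BOPs per byte: a workload whose OI lies to the right of the ridge point is calculation-bound, otherwise memory-access-bound. A minimal sketch of this classification:

```python
PEAK_GBOPS = 86.4  # Xeon E5645 peak BOPS (Section 6)
PEAK_BW = 13.2     # peak memory bandwidth in GB/s (Stream)
RIDGE = PEAK_GBOPS / PEAK_BW  # ridge point, about 6.5 BOPs per byte

# Operation intensity per workload, from Table 12.
oi = {"Sort": 11.7, "Grep": 1.2, "WordCount": 3.2,
      "Bayes": 51, "Kmeans": 25, "RankServ": 9.57}

bottleneck = {w: ("Calculation" if x > RIDGE else "Memory Access")
              for w, x in oi.items()}
```

This reproduces the bottleneck labels of Table 12.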
6.2 Calculation Optimizations
6.2.1 ILP Optimizations
We improve ILP by adding the compiler optimization option -O2 (i.e., gcc -O2). As shown in Table 13, the BOPS of Sort, Bayes, Kmeans, and RankServ improves significantly, by 50%, 68%, 120%, and 48%, respectively. This implies that the bottlenecks of Sort, Bayes, Kmeans, and RankServ are indeed calculation, because they benefit greatly from the ILP optimization.
Workload  BOPS  OI  IPC
Original_Sort  8.8G  11.7  1.6 
ILP_Sort  13.2G  17.6  1.7 
Original_Grep  4.6G  1.2  0.6 
ILP_Grep  4.9G  1.3  0.6 
Original_WordCount  3.8G  3.2  0.9 
ILP_WordCount  4.5G  4  0.9 
Original_Bayes  5.1G  51  1.2 
ILP_Bayes  8.6G  86  1.5 
Original_Kmeans  5.0G  25  1.7 
ILP_Kmeans  11.3G  56  1.8 
Original_RankServ  8.2G  9.57  1.3
ILP_RankServ  12.2G  9.3  1.4 
6.2.2 SIMD Optimization
SIMD is a common method for HPC performance improvement: it performs the same operation on multiple data elements simultaneously. Modern processors have at least 128-bit-wide SIMD instructions (e.g., SSE, AVX). We apply the SIMD technique to the DC workloads and change Sort from SISD to SIMD by rewriting it with SSE (since it takes time to revise all of the workloads with SSE, we develop the SSE version of Sort first). Note that the SIMD optimization is applied on top of the ILP optimization, and it still achieves a 2.2X performance improvement over the SISD version with only the ILP optimization. With SSE_Sort, the attainable performance is 28.6 GBOPS, which is 33% of the peak BOPS.
6.3 Memory Optimizations
We improve the memory bandwidth by enabling the hardware prefetching option, which raises the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s. Note that hardware prefetching is applied on top of the ILP optimization. As shown in Table 14, Grep and WordCount improve significantly, by 16% and 10%, respectively. The performance gains of Sort, Bayes, RankServ, and Kmeans are not obvious; Sort and Kmeans in particular show almost no gain. This implies that the bottlenecks of Grep and WordCount are indeed memory access, because they benefit greatly from hardware prefetching.
Workload  BOPS  OI  Bandwidth
ILP_Sort  13.2G  17.6  0.75 GB/s
Prefetch_Sort  13.2G  14.6  0.9 GB/s
ILP_Grep  4.9G  1.3  3.8 GB/s
Prefetch_Grep  5.5G  1  5.6 GB/s
ILP_WordCount  4.5G  4  1.1 GB/s
Prefetch_WordCount  5G  4.2  1.2 GB/s
ILP_Bayes  8.6G  86  0.1 GB/s
Prefetch_Bayes  9.3G  47  0.2 GB/s
ILP_Kmeans  11.3G  56  0.2 GB/s
Prefetch_Kmeans  11.3G  56  0.2 GB/s
ILP_RankServ  12.2G  9.3  0.2 GB/s
Prefetch_RankServ  13G  9.6  0.2 GB/s
Optimization Summary. Finally, we take all six workloads as a whole to show their performance improvements. As shown in Fig. 5, all workloads have performance gains, varying from 119% to 325%.
6.4 Different Hardware Platforms under the DC-Roofline Model
We evaluate different hardware platforms under the DC-Roofline model, choosing Sort, Grep, and WordCount with the MPI software stack. The hardware platforms are the Intel E5645, Intel E5310, and Intel D510. In Table 15, Sort5310 denotes the Sort workload running on the E5310 platform. From the table, we can see that workloads on the Xeon E5645 and Xeon E5310 have the same performance bottlenecks (Sort's bottleneck is calculation, while the others' are memory access), but the BOPS and OI of the same workload differ across platforms. On the other hand, all workloads on the Atom D510 are memory-access bound (the D510 platform is equipped with DDR2 memory). These results imply that although the number of BOPs is independent of the underlying system and architecture implementation, the DC-Roofline model is hardware-dependent, and we need to build a separate model for each hardware platform.
Workload  BOPS  OI  Bottleneck
Sort5310  8.7G  10.8  Calculation 
Grep5310  1.4G  0.7  Memory Access 
WordCount5310  1.4G  3.5  Memory Access 
Sort5645  13.2G  14.6  Calculation 
Grep5645  5.5G  1  Memory Access 
WordCount5645  5G  4.2  Memory Access 
Sort510  2.6G  6.5  Memory Access 
Grep510  0.6G  3  Memory Access 
WordCount510  0.5G  5  Memory Access 
7 Conclusion
This paper proposes a new computation-centric metric, BOPs, which measures the efficient work defined by the source code, including floating-point operations and the arithmetic, logical, comparison, and array-addressing parts of integer operations. BOPs is independent of the underlying system and hardware implementation and can be counted by analyzing the source code. We define BOPS as the average number of BOPs per second, and propose replacing FLOPS with BOPS to measure DC computer systems.
With several typical micro-benchmarks, we attain the upper-bound performance through different optimization approaches, and then we propose a BOPS-based Roofline model, called DC-Roofline, as a quantitative ceiling performance model for guiding DC computer system design and optimization. Through experiments, we demonstrate that the DC-Roofline model indeed helps optimize DC computer systems, with improvements varying from 119% to 325%.
References
 [1] Data center growth. https://www.enterprisetech.com.
 [2] http://www.tpc.org/tpcc.
 [3] http://www.spec.org/cpu.
 [4] http://top500.org/.
 [5] “Sort benchmark home page,” http://sortbenchmark.org/.
 [6] “Technical report,” http://www.agner.org/optimize/microarchitecture.pdf.
 [7] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” pp. 483–485, 1967.
 [8] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: characterization and architectural implications,” pp. 72–81, 2008.
 [9] E. L. Boyd, W. Azeem, H. S. Lee, T. Shih, S. Hung, and E. S. Davidson, “A hierarchical approach to modeling and improving the performance of scientific applications on the ksr1,” vol. 3, pp. 188–192, 1994.
 [10] S. P. E. Corporation, “Specweb2005: Spec benchmark for evaluating the performance of world wide web servers,” http://www.spec.org/web2005/, 2005.
 [11] J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
 [12] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, “A dwarf-based scalable big data benchmarking methodology,” arXiv preprint arXiv:1711.03229, 2017.
 [13] Neal Cardwell, Stefan Savage, and Thomas E Anderson. Modeling tcp latency. 3:1742–1751, 2000.
 [14] William Josephson, Lars Ailo Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. ACM Transactions on Storage, 6(3):14, 2010.
 [15] Samuel Williams, Dhiraj D Kalamkar, Amik Singh, Anand M Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann S Almgren, Pradeep Dubey, John Shalf, and Leonid Oliker. Optimization of geometric multigrid for emerging multi and manycore processors. page 96, 2012.
 [16] Daniel Richins, Tahrina Ahmed, Russell Clapp, and Vijay Janapa Reddi. Amdahl’s law in big data analytics: Alive and kicking in TPCx-BB (BigBench). In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 630–642. IEEE, 2018.
 [17] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In International Symposium on Microarchitecture, pages 609–622, 2014.
 [18] Ethan R Mollick. Establishing moore’s law. IEEE Annals of the History of Computing, 28(3):62–75, 2006.
 [19] Shoaib Kamil, Cy P Chan, Leonid Oliker, John Shalf, and Samuel Williams. An autotuning framework for parallel multicore stencil computations. pages 1–12, 2010.
 [20] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, R. Ren, C. Zheng, G. Lu, J. Li, Z. Cao, Z. Shujie, and H. Tang, “Bigdatabench: A dwarfbased big data and artificial intelligence benchmark suite,” Technical Report, Institute of Computing Technology, Chinese Academy of Sciences, 2017.
 [21] S. Harbaugh and J. A. Forakis, “Timing studies using a synthetic whetstone benchmark,” ACM Sigada Ada Letters, no. 2, pp. 23–34, 1984.
 [22] R. Jain, The art of computer systems performance analysis. John Wiley & Sons Chichester, 1991, vol. 182.
 [23] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, and A. Borchers, “Indatacenter performance analysis of a tensor processing unit,” pp. 1–12, 2017.
 [24] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in International Symposium on Computer Architecture, 2016, pp. 393–405.
 [25] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, C. Xu, and N. Sun, “CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications,” Frontiers of Computer Science, vol. 6, no. 4, pp. 347–362, 2012.
 [26] J. D. Mccalpin, “Stream: Sustainable memory bandwidth in high performance computers,” 1995.
 [27] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto et al., “A 40gops 250mw massively parallel processor based on matrix architecture,” in SolidState Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International. IEEE, 2006, pp. 1616–1625.
 [28] U. Pesovic, Z. Jovanovic, S. Randjic, and D. Markovic, “Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications,” in MIPRO, 2012 Proceedings of the 35th International Convention. IEEE, 2012, pp. 743–747.
 [29] M. M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, “A genetic algorithms approach to modeling the performance of memory-bound computations,” pp. 1–12, 2007.
 [30] K. Ueno and T. Suzumura, “Highly scalable graph search for the graph500 benchmark,” in International Symposium on HighPerformance Parallel and Distributed Computing, 2012, pp. 149–160.
 [31] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “BigDataBench: a big data benchmark suite from internet services,” in HPCA 2014. IEEE, 2014.
 [32] R. Weicker, “Dhrystone: a synthetic systems programming benchmark,” Communications of The ACM, vol. 27, no. 10, pp. 1013–1030, 1984.
 [33] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
 [34] Thomasian, “Analytic Queueing Network Models for Parallel Processing of Task Systems,” IEEE Transactions on Computers, vol. 35, no. 12, pp. 1045–1054, 1986.
 [35] Nicholas P Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua B Fryman, Ivan Ganev, Roger A Golliver, Rob C Knauerhase, et al. Runnemede: An architecture for ubiquitous highperformance computing. pages 198–209, 2013.
 [36] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In Operating Systems Design and Implementation, pages 265–283, 2016.
 [37] Jaewon Lee, Changkyu Kim, Kun Lin, Liqun Cheng, Rama Govindaraju, and Jangwoo Kim. WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 549–563. ACM, 2018.
 [38] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Rui Ren, Chen Zheng, Gang Lu, Jingwei Li, Zheng Cao, et al. BigDataBench: A dwarf-based big data and AI benchmark suite. arXiv preprint arXiv:1802.08254, 2018.
 [39] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.