1 Introduction
To perform big data analysis or provide Internet services, more and more organizations are building internal datacenters or renting hosted ones. As a result, DC (datacenter) computing has become a new paradigm of computing. The market share of DC appears to have outweighed that of HPC (High Performance Computing), with HPC accounting for only about 20% of the total [1].
To measure the performance of DC, the wall clock time is used as the ground truth metric. In practice, several user-perceived metrics, derivatives of the wall clock time, are used to measure application-specific systems, such as transactions per minute for online transaction systems [2] and input data processed per second for big data analysis systems [25]. However, these user-perceived metrics have two limitations. First, different user-perceived metrics cannot be used for apples-to-apples comparisons. For example, transactions per minute (TPM) and data processing capability per second (GB/s) cannot be compared directly. Second, user-perceived metrics can hardly measure the upper bound performance of computer systems, which is the foundation of the performance model. In contrast, a single computation-centric metric like FLOPS is simple but powerful in the system and architecture community. The values of such a metric can be obtained from the micro-architecture of the system, from a dedicated micro-benchmark, or from a real-world workload. By combining these different values, we can build an upper bound model, which allows us to better understand the performance ceiling of a computer system and then guide the system co-design. As the most important computation-centric metric, FLOPS (FLoating-point Operations Per Second) [11] and its upper bound model, the Roofline model, have driven the progress of computing technology, not limited to high performance computing (HPC), for many years. So a natural question arises: what is the metric for DC, and is it still FLOPS?
Different from HPC, DC has many unique characteristics. For typical DC workloads, the average floating point instruction ratio is only 1% and the average FLOPS efficiency is only 0.1%, while the average IPC (Instructions Per Cycle) is 1.1 (the theoretical IPC of the experimental platform is 4). We also found that the FLOPS gap between two systems equipped with Intel Xeon and Intel Atom processors is 12X, but the average user-perceived performance gap of the DC workloads is only 7.4X. These observations imply that FLOPS is inappropriate for measuring DC systems, from the perspectives of both the performance gaps between different systems and the system efficiency. OPS (operations per second) is another computation-centric metric. OPS [27] was initially proposed for digital processing systems and is defined as the number of 16-bit addition operations per second. The definition of OPS has been extended to artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of them are defined in terms of a specific operation, such as specific matrix multiplication operations. However, these matrix operations are only a fraction of the diverse operations in DC workloads. So OPS is only suitable for specific accelerators, not generalized DC computing systems. In this paper, inspired by FLOPS [11], Basic OPerations per Second (BOPS for short) is proposed to evaluate DC computing systems. The contributions of this paper are as follows.
First, we propose BOPs (Basic OPerations), which include the integer and floating point computations of arithmetic, logical, comparing and array addressing operations. We define BOPS as the average number of BOPs completed per second. BOPs can be calculated at the source code level of the application, independently of the underlying system implementation, so it is fair for evaluating different computing systems and facilitates the co-design of systems and architectures. We also take three systems equipped with three typical Intel processors as examples to illustrate that BOPS not only truly reflects the performance gaps across different DC systems, but also reflects the system efficiency of a DC system. The bias between the BOPS gap and the average user-perceived performance gap is no more than 11%, and the BOPS efficiency of the Sort workload reaches 32%.
Second, by using several typical micro-benchmarks, we measure the upper bound performance under different optimization methods, and then we propose a BOPS-based Roofline model, named DC-Roofline. DC-Roofline not only depicts the performance ceilings of DC workloads on the target systems, but also helps guide the optimization of DC computing systems. For Sort, a typical DC kernel workload, the performance improvement reaches 4.4X.
Third, a real-world DC workload often has millions of lines of code and tens of thousands of functions, so it is not easy to use the DC-Roofline model directly. We propose a new optimization methodology: we profile the hotspot functions of the real-world workload and extract the corresponding kernel workloads; the real-world application then gains a performance benefit by merging the optimization methods of the kernel workloads, which are obtained under the guidance of DC-Roofline. Through experiments, we demonstrate that Redis, a typical real-world workload, gains a 1.2X performance improvement.
The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background and motivations. Section 4 defines BOPs and reports how to use it. Section 5 introduces the BOPS-based DC-Roofline model and its usage. Section 6 presents how to optimize the performance of real-world DC workloads under the DC-Roofline model. Section 7 draws the conclusion.
2 Related Work
In this section, the related work of this paper is introduced from two perspectives: metrics, and performance models.
2.1 Metrics
The wall clock time is the most basic performance metric for a computer system [39], and almost all other performance metrics are derived from it. Based on the wall clock time, performance metrics can be classified into two categories. One is the user-perceived metrics, which can be intuitively perceived by the user, such as the TPM (transactions per minute) metric. The other is the computation-centric metrics, which are related to specific computation operations, such as FLOPS (FLoating-point Operations Per Second).
User-perceived metrics can be further classified into two categories: metrics for the whole system and metrics for components of the system. Examples of the former include the data sorted in one minute (MinuteSort), which measures the sorting capability of a system [5], and transactions per minute (TPM) for online transaction systems [2]. Examples of the latter include SPECspeed/SPECrate for the CPU component [3], input/output operations per second (IOPS) for the storage component [14], and the data transfer latency for the network component [13].
There are many computation-centric metrics. FLOPS (FLoating-point Operations Per Second) is a computation-centric metric for measuring computer systems, especially in fields of scientific computing that make heavy use of floating-point calculations [11]. The wide recognition of FLOPS indicates the maturation of high performance computing. MIPS (Million Instructions Per Second) [22] is another famous computation-centric metric, defined as the number of millions of instructions the processor can execute per second. The main limitation of MIPS is that it is architecture-dependent. There are many derivatives of MIPS, including MWIPS and DMIPS [28], which use synthetic workloads to evaluate floating point operations and integer operations, respectively. The WSMeter metric [37], which is defined as the quota-weighted sum of the MIPS (IPC) of a job, is also a derivative of MIPS, and hence is also architecture-dependent. Unfortunately, modern datacenters are heterogeneous, consisting of different generations of hardware.
OPS (operations per second) is another computation-centric metric. OPS [27] was initially proposed for digital processing systems and is defined as the number of 16-bit addition operations per second. The definition of OPS has since been extended to Intel's Ubiquitous High-Performance Computing [35] and to artificial intelligence processors, such as Google's Tensor Processing Unit [36, 23] and the Cambricon processor [24, 17]. All of these definitions are in terms of a specific operation. For example, OPS is defined in terms of 8-bit matrix multiplication operations for the TPU and 16-bit integer operations for the Cambricon processor. However, the workloads of modern DCs are comprehensive and complex, and a bias toward a specific operation cannot ensure evaluation fairness.
For each kind of metric, corresponding tools or benchmarks [39] are proposed to calculate its values. For the user-perceived SPECspeed/SPECrate metrics, SPEC CPU [3] is the benchmark suite that measures the CPU component. For the TPM metric, TPC-C [2] is the benchmark suite that measures online transaction systems. Another example is the Sort benchmark [5] for MinuteSort. For computation-centric metrics, Whetstone [21] and Dhrystone [32] are the measurement tools for MWIPS and DMIPS, respectively. HPL [11] is a widely-used measurement tool for FLOPS. As a micro-benchmark, HPL demonstrates a sophisticated design: the proportion of floating-point additions to multiplications in HPL is 1:1, so as to fully utilize the FPU of modern processors.
2.2 Performance model
A performance model can depict and predict the performance of a specific system. There are two categories of performance models: the analytical model, which uses stochastic/statistical analytical methods to depict and predict system performance, and the bound model, which is relatively simpler and only depicts the performance bound or bottleneck of a system. Previous works [34, 29, 9, 37] use stochastic/statistical analytical models to predict system performance. However, distributed and parallel systems always have many uncertain behaviors, so it is hard to build accurate prediction models for them; instead, bound and bottleneck analysis is more suitable. Amdahl's Law [7] is one of the famous performance bound models for parallel processing computer systems, and it has also been used for big data systems [16]. The Roofline model [33] is another famous performance bound model. The original Roofline model adopts FLOPS as the metric. Based on the definition of operational intensity (OI), i.e., the total number of floating point operations divided by the total number of bytes of memory access, the Roofline model can depict the upper bound performance of a given workload when different optimization strategies are applied on the target system.
3 Background and Motivations
3.1 Background
The diversity and complexity of modern DC workloads raise great challenges for quantitatively depicting the performance of a computer system across multiple domains: algorithm, programming, compiling, system development, and architecture design. The computation-centric metric and the performance model are two key elements for quantitatively depicting the performance of a system [39, 33].
3.1.1 The Computation-centric Metric
For the system and architecture community, a computation-centric metric, such as FLOPS, is a fundamental yardstick to reflect the running performance and the gaps across different systems or architectures. Generally, a computation-centric metric has a performance upper bound on a specific architecture according to the micro-architecture design. For example, the peak FLOPS is computed as follows.
$$FLOPS_{peak} = CPU\_number \times Core\_number \times Core\_frequency \times FLOPs\_per\_cycle \qquad (1)$$
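For instance, for the Intel Xeon E5645 platform used later in this paper (1 CPU, 6 cores, 2.4 GHz), and assuming 4 double-precision floating-point operations per cycle per core, which is consistent with the 57.6 GFLOPS peak reported in Section 4:
$$FLOPS_{peak} = 1 \times 6 \times 2.4\,\mathrm{GHz} \times 4 = 57.6\ \mathrm{GFLOPS}$$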
A measurement tool is used to measure the performance of systems and architectures in terms of the metric, and to report the gap between the real value and the theoretical peak. For example, HPL [11] is a widely-used measurement tool in terms of FLOPS. The FLOPS efficiency of a specific system is the ratio of HPL's FLOPS to the peak FLOPS.
3.1.2 The Upper Bound Performance Model
The computation-centric metric is the foundation of the system performance model, and the performance model of a computer system can depict and predict the workload performance on the specific system. For example, the Roofline model [33] is a famous upper bound model based on FLOPS. Many system optimization works [19, 15] have been performed on the basis of the Roofline model in the HPC domain.
$$Attainable\ FLOPS = \min(FLOPS_{peak},\ BW_{peak} \times OI) \qquad (2)$$
The above equation indicates that the attainable performance bound of a workload on a specific platform is limited by the computing capacity of the processor and the bandwidth of the memory. $FLOPS_{peak}$ and $BW_{peak}$ are the peak computation performance and the peak memory bandwidth of the platform, and the operation intensity (i.e., $OI$) is the total number of floating point operations divided by the total number of bytes of memory access. To identify the bottleneck and guide the optimization, ceilings (for example, the ILP and SIMD optimizations) can be added to provide performance tuning guidance.
3.2 Requirements of the DC computing metric
We define the requirements from the following three perspectives.
First, the metric should reflect the performance gaps among different DC systems. User-perceived metrics always reflect the running performance. For example, data processed per second (GB/s), which divides the input data size by the total running time, is a user-perceived metric and effectively reflects the data processing capability. The computation-centric metric should preserve this characteristic and reflect the performance gap.
Second, the metric should facilitate hardware and software co-design of DC systems. For the co-design of different layers of the system stack, i.e., application, software system and hardware system, the metric should support measurement at different levels, spanning the source code, binary code, and hardware instruction levels.
Third, the metric should reflect the upper bound performance of a specific system. The metric should be sensitive to design decisions and reflect the theoretical performance upper bound. The gap between the real and theoretical values is then useful for understanding performance bottlenecks and guiding optimizations.
3.3 The characteristics of DC Workloads
In this subsection, the characteristics of DC workloads and traditional benchmarks are depicted. We choose the DCMIX benchmark suite as the DC workloads. DCMIX is designed for modern datacenter computing systems; it has 17 typical datacenter workloads and includes both online service and data analysis workloads. The latencies of DCMIX workloads range from microseconds to minutes, and the applications involve big data, artificial intelligence, transaction processing databases, and so on. As shown in Table 1, there are two types of benchmarks in DCMIX: Micro-benchmarks (kernel workloads) and Component benchmarks (real DC workloads).
Table 1: The DCMIX workloads.
Name  Type  Domain  Category 
Sort  offline analytics  Big Data  MicroBench 
Count  offline analytics  Big Data  MicroBench 
MD5  offline analytics  Big Data  MicroBench 
Multiply  offline analytics  AI  MicroBench 
FFT  offline analytics  AI  MicroBench 
Union  offline analytics  OLTP  MicroBench 
Redis  online service  Big Data  Component 
Xapian  online service  Big Data  Component 
Masstree  online service  Big Data  Component 
Bayes  offline analytics  Big Data  Component 
Imgdnn  online service  AI  Component 
Moses  online service  AI  Component 
Sphinx  online service  AI  Component 
Alexnet  offline analytics  AI  Component 
Silo  online service  OLTP  Component 
Shore  online service  OLTP  Component 
For traditional benchmarks, we choose HPCC, PARSEC, and SPEC CPU. We use HPCC 1.4, a representative HPC benchmark suite, and run all seven of its benchmarks. PARSEC is a benchmark suite composed of multi-threaded programs, and we deploy PARSEC 3.0. For SPEC CPU2006, we run the official floating-point benchmarks (SPECFP) with the first reference inputs.
The experimental platform is the same as that in Section 4, and we choose the Intel Xeon E5645, a typical brawny-core processor (out-of-order execution, four-wide instruction issue).
We choose GIPS (Giga Instructions Per Second) and GFLOPS (Giga FLoating-point Operations Per Second) as the performance metrics, both derived from the wall clock time. Corresponding to the performance metrics, we choose IPC, CPU utilization and memory bandwidth utilization as the efficiency metrics.
As shown in Fig. 1 (note that the Y axis is logarithmic), the average GFLOPS of DC workloads is two orders of magnitude lower than that of the traditional benchmarks, while the GIPS of DC workloads is of the same order of magnitude as the traditional benchmarks. Furthermore, the average IPC of DC workloads is 1.1 versus 1.4 for the traditional benchmarks, and the average CPU utilization of DC workloads is 70% versus 80% for the traditional benchmarks. These metrics imply that DC workloads utilize system resources about as efficiently as the traditional benchmarks, so the poor FLOPS efficiency does not stem from low execution efficiency. In fact, the floating point operation intensity of DC workloads (0.05 on average) is much lower, which leads to the low FLOPS efficiency.
In order to analyze the execution characteristics of DC workloads, we further examine the instruction mixture. Fig. 2 shows the retired instruction breakdown, from which we make three observations. First, for DC workloads, the ratio of integer to floating point operations is 38, while the ratios for HPCC, PARSEC and SPECFP are 0.3, 0.4, and 0.02, respectively. This is the main reason why FLOPS does not work in DC computing. Second, DC workloads have more branch instructions, with a ratio of 19%, while the ratios of HPCC, PARSEC and SPECFP are 16%, 11%, and 9%, respectively. Third, the data-movement-related operations, whose ratio is roughly 73%, include load, store, and address calculation (the address calculation instructions come from the floating point and integer instructions). Adding branch instructions, the combined ratio of data-movement-related operations and branch instructions reaches 92%.
3.4 What should be included in our new metric for DC computing
The FLOPS of DC workloads is only 0.04 GFLOPS on average (only 0.1% of the peak), which implies that the FLOPS value is far from reflecting the real efficiency of DC systems. Furthermore, if the Roofline model is used in terms of FLOPS, the achieved performance is at most 0.1% of the peak FLOPS, so we need a new metric for DC computing systems.
The new metric should consider both integer and floating-point operations. However, floating point and integer instructions are diverse, and not all of them should be counted. Our fundamental principle is that we only consider the efficient work defined by the source code, and the metric should correspond with the characteristics of DC workloads, so we will not include every floating point and integer operation. Following the rule of choosing a representative minimum subset of the operations in DC workloads, BOPs should be calculated through analyzing the source code (architecture-independent). We analyze the floating point and integer instruction breakdown of the DCMIX micro-benchmarks by inserting analysis code into the source code, and we classify all operations into four classes. The first class is array addressing computations (related to data movement operations), such as loading or storing the value of an array element A[i]; the second class is arithmetic or logical operations, such as a + b; the third class is comparing operations (related to conditional branches), such as i < j; and the fourth class is all the other operations. We find that 47% of the total floating point and integer instructions belong to address calculation, 22% belong to branch-related calculation, and 30% belong to arithmetic or logical operations. The first three classes reflect the efficient work defined by the source code, so we include them in BOPs.
4 BOPs
BOPS (Basic OPerations per Second) is the average number of BOPs (Basic OPerations) completed per second by a specific workload. In this section, we present the definition of BOPs and how to measure BOPs with or without available source code, and then introduce how to use BOPS.
4.1 BOPs Definition
In our definition, BOPs include the integer and floating point computations of arithmetic, logical, comparing and array addressing operations. The arithmetic or logical, comparing, and array addressing operations correspond to calculation, conditional-branch-related, and data-movement-related operations, respectively. The detailed operations of BOPs are shown in Table 2. Each operation in Table 2 is counted as 1 except for N-dimensional array addressing, and all operations are normalized to 64-bit operations. For arithmetic or logical operations, the number of BOPs is counted as the number of corresponding arithmetic or logical operations. For array addressing operations, we take a one-dimensional array as the example: loading the value of A[i] implies the addition of an offset to the base address of A, so the number of BOPs increases by one; the same reasoning applies to multi-dimensional arrays. For comparing operations, we transform them into subtraction operations; for example, we transform i < j into the subtraction i - j, so the number of BOPs increases by one. Through the definition of BOPs, we can see that, in comparison with FLOPS, BOPS concerns not only the floating-point operations but also the integer operations. On the other hand, like FLOPs, BOPs normalizes all operations into 64-bit operations, and each operation is counted as 1.
Table 2: The operations included in BOPs and their normalized values.
Operations  Normalized value 
Add  1 
Subtract  1 
Multiply  1 
Divide  1 
Bitwise operation  1 
Logic operation  1 
Compare operation  1 
One-dimensional array addressing  1 
N-dimensional array addressing  N 
The delays of different operations are not considered in the normalized calculation of BOPs, because the delays can be extremely different across diverse micro-architecture platforms. For example, the delay of a division on the Intel Xeon E5645 processor is about 7-12 cycles, while on the Intel Atom D510 processor the delay can reach up to 38 cycles [6]. Hence, considering delays in the normalization would make the calculation architecture-dependent.
4.2 How to Measure BOPs
BOPs can be measured at either the source code level or the instruction level.
4.2.1 Source-code level measurement
We can calculate BOPs from the source code of a workload. This method needs some manual work (inserting the counting code), but it is independent of the underlying system implementation, so it is fair for evaluating and comparing different system and architecture implementations. In the example sketched below, no BOPs are counted in lines 1 and 2, because they are variable declarations. Line 3 consists of a loop command and two integer operations, and the number of corresponding BOPs is (1+1) * 100 = 200 for the integer operations, while the loop command itself is not counted. Line 5 consists of array addressing operations and addition operations: the array addressing operations are counted as 100 * 1 and the addition operations as 100 * 1. So the total number of BOPs in the example program is 200 + 200 = 400.
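The original example program is not reproduced here; the following minimal C++ sketch is consistent with the description above, with the statements numbered in the comments so that the references to lines 1, 2, 3 and 5 can be followed:

```cpp
int count_example() {
    int a[100] = {0};              // line 1: declaration, no BOPs counted
    int sum = 0;                   // line 2: declaration, no BOPs counted
    for (int i = 0; i < 100; i++)  // line 3: i < 100 and i++ are two integer
    {                              //         operations: (1+1) * 100 = 200 BOPs
        sum = sum + a[i];          // line 5: a[i] addressing (100 * 1) plus the
    }                              //         addition (100 * 1) = 200 BOPs
    return sum;                    // total BOPs for this fragment: 200 + 200 = 400
}
```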
To measure BOPs at the source code level, we need to insert the counting code. For the BOPs count we turn on the debug flag, and for the performance test we turn off the debug flag. In the example code, cmp_count is the counter for the comparing operations, adr_count for the array addressing operations, and ari_count for the arithmetic operations; BOPs is the sum of these three counters.
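A possible instrumentation of the sketch above is shown below (our own illustration of the approach; the counter names follow the text, while the BOPS_COUNT debug flag is an assumed name):

```cpp
#ifdef BOPS_COUNT
long long cmp_count = 0, adr_count = 0, ari_count = 0;  // BOPs counters
#endif

int count_example_instrumented(const int a[100]) {
    int sum = 0;
    for (int i = 0; i < 100; i++) {
#ifdef BOPS_COUNT
        cmp_count++;   // comparing operation: i < 100
        ari_count++;   // arithmetic operation: i++
        adr_count++;   // array addressing: a[i]
        ari_count++;   // arithmetic operation: sum + a[i]
#endif
        sum = sum + a[i];
    }
    // BOPs = cmp_count + adr_count + ari_count; with the flag turned off,
    // the counting code disappears and the performance run is unaffected.
    return sum;
}
```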
Another issue we need to take into consideration is the system built-in library functions. For system-level functions such as strcmp(), we implement equivalent user-level functions manually and then count the number of BOPs by inserting the counting code. For the micro-benchmark workloads, we can insert the counting code easily by analyzing the source code. For the component benchmarks or real applications, we first profile the execution time of the real DC workload and find out the top hotspot functions; second, we analyze these hotspot functions and insert the counting code into them; then we can count the BOPs of the real DC workload (more details are in Section 6).
4.2.2 Instruction-level measurement under the X86_64 architecture
The source-code level measurement requires analyzing the source code, which is costly, especially for complex system stacks (e.g., Hadoop software stacks). Instruction-level measurement avoids this analysis cost and the requirement of having the source code, but it is architecture-dependent. We propose an instruction-level approach that obtains BOPs from hardware performance counters. Since different types of processors provide different performance counter events, for convenience we introduce an approximate but simple instruction-level measurement method for the X86_64 architecture: we obtain the total number of instructions (ins), branch instructions (branch_ins), load instructions (load_ins) and store instructions (store_ins) from the hardware performance counters, and BOPs can then be calculated according to the following equation.
$$BOPs \approx ins - branch\_ins - load\_ins - store\_ins \qquad (3)$$
Note that this approximate method counts all of the floating point and integer instructions under the X86_64 architecture, which does not exactly conform to the BOPs definition. So it is only suitable for BOPS-based optimizations, not for performance evaluation across different computer systems.
4.3 How to Measure the system with BOPS
4.3.1 The Peak BOPS of the System
BOPS is the average number of BOPs completed per second by a specific workload. The peak BOPS can be calculated from the micro-architecture parameters with the following equation.
$$BOPS_{peak} = CPU\_number \times Core\_number \times Core\_frequency \times BOPs\_per\_cycle \qquad (4)$$
For our Intel Xeon E5645 experimental platform, the CPU number is 1, the core number is 6, the core frequency is 2.4 GHz, and BOPs per cycle is 6 (the E5645 is equipped with two 128-bit SSE FPUs and three 128-bit SSE ALUs, and according to the execution port design it can execute three 128-bit operations, i.e., six 64-bit operations, per cycle). So $BOPS_{peak} = 1 \times 6 \times 2.4\,\mathrm{GHz} \times 6 = 86.4$ GBOPS.
4.3.2 The BOPS Measuring Tool
We provide a BOPS measurement tool to obtain the BOPS value. At present, we choose Sort from DCMIX as the first BOPS measurement tool; to deal with the diversity of DC workloads, we will develop a series of representative workloads as BOPS measurement tools. The scale of the Sort workload is 8E8 records, and its BOPs count is 324E9. Please note that the BOPs value changes as the data scale or request number changes.
4.3.3 Measure the System with BOPS
The measurement tool can be used to measure the real performance of the workload on the specific system. Furthermore, the BOPS efficiency can be calculated by the following equation.
$$BOPS\_efficiency = \frac{BOPS_{real}}{BOPS_{peak}} \qquad (5)$$
For example, Sort has 324E9 BOPs. We run Sort on the Xeon E5645 platform and the execution time is 11.5 seconds, so $BOPS_{real} = 324 \times 10^9 / 11.5 \approx 28.2$ GBOPS. For the Xeon E5645 platform, the peak BOPS is 86.4 GBOPS and the real performance of Sort is 28.2 GBOPS, so the efficiency is 32%.
4.4 Evaluations
4.4.1 Experimental Platforms and workloads
We choose all micro-benchmarks of DCMIX as the DC workloads, and choose the wall clock time as the user-perceived performance metric, which we obtain by collecting the wall clock time of each workload on the specific system. Three systems equipped with three typical Intel processors are chosen as the experimental platforms: Intel Xeon E5310, Intel Xeon E5645 and Intel Atom D510. The former two are typical brawny-core processors (out-of-order execution, four-wide instruction issue), while the Intel Atom D510 is a typical wimpy-core processor (in-order execution, two-wide instruction issue). Each experimental platform consists of one node. The detailed settings of the platforms are shown in Table 3.
Table 3: Configurations of the experimental platforms.
Intel Xeon E5645: 6 cores @ 2.4 GHz; L1 D-Cache 6 × 32 KB; L1 I-Cache 6 × 32 KB; L2 Cache 6 × 256 KB; L3 Cache 12 MB 
Intel Xeon E5310: 4 cores @ 1.6 GHz; L1 D-Cache 4 × 32 KB; L1 I-Cache 4 × 32 KB; L2 Cache 2 × 4 MB; no L3 Cache 
Intel Atom D510: 2 cores @ 1.6 GHz; L1 D-Cache 2 × 24 KB; L1 I-Cache 2 × 32 KB; L2 Cache 2 × 512 KB; no L3 Cache 
4.4.2 Overview
As shown in Fig. 3, we take all micro-benchmarks of DCMIX as workloads and show their performance on the three different systems under the DC-Roofline model (more details of the DC-Roofline model are given in Section 5). From Fig. 3, we can see that all performance metrics are unified to BOPS, including the theoretical peak performance of each system (the peaks of E5645, E5310 and D510) and the workload performance on each specific system (such as Sort on the E5645, E5310 and D510 platforms). So we can not only analyze the performance gaps across different systems, but also analyze the efficiency of a specific system.
4.4.3 The Performance Gaps across Different Experimental Platforms
For the performance gap between E5310 and E5645, the BOPS gap is 2.3X (38.4 GBOPS vs. 86.4 GBOPS), the FLOPS gap is 2.3X (25.6 GFLOPS vs. 57.6 GFLOPS), and the gap of the average user-perceived performance metric (the wall clock time) is 2.1X. This implies that both the FLOPS and BOPS metrics can reflect the user-perceived performance gap (the bias is only 10%). However, for the performance gap between D510 and E5645, the FLOPS gap is 12X (4.8 GFLOPS vs. 57.6 GFLOPS), the BOPS gap is 6.7X (12.8 GBOPS vs. 86.4 GBOPS), and the gap of the average user-perceived performance metric is 7.4X. This implies that the FLOPS metric cannot reflect the user-perceived performance gap (12X vs. 7.4X, a bias of 62%), while BOPS can (6.7X vs. 7.4X, a bias of only 9%). Furthermore, for the performance gap between D510 and E5310, the FLOPS gap is 5.3X, the BOPS gap is 3X, and the gap of the average user-perceived performance metric is 3.4X. Again, the FLOPS metric cannot reflect the user-perceived performance gap (5.3X vs. 3.4X, a bias of 56%), but BOPS can (3X vs. 3.4X, a bias of only 11%).
This is because E5645/E5310 and D510 have totally different micro-architectures: E5645/E5310 are designed for high performance floating point computing (out-of-order execution, four-wide instruction issue), while D510 is a low-power microprocessor for mobile computing (in-order execution, two-wide instruction issue). So FLOPS cannot reflect the performance gaps of DC workloads across different micro-architecture platforms (Xeon vs. Atom).
4.4.4 The Efficiency of Experimental Platforms
We use the Sort workload as the measurement tool to evaluate the efficiency of the DC systems. The peak BOPS is obtained by Equation 4, the real BOPs values are obtained by the source-code level measurement, and the BOPS efficiency is obtained by Equation 5. In our experiments, the BOPS efficiencies of E5645, E5310 and D510 are 32%, 20% and 21%, respectively, while their average FLOPS efficiencies are 0.1%, 0.2%, and 0.1%, respectively. So the FLOPS value is far from reflecting the real efficiency of DC systems, while the BOPS value is more reasonable. Furthermore, as shown in Fig. 3, we can use BOPS to build the upper bound performance model for DC workloads.
4.5 Summary
As an effective metric for DC, BOPS not only truly reflects the performance gaps across different systems, but also reflects the efficiency of DC systems. All of these are foundations of quantitative analysis for DC systems.
5 DC-Roofline Model
In this section, we present the DC-Roofline model, which depicts the upper bound performance for a given DC workload on a specific system. Then, we introduce use cases of DC kernel workload optimization based on the DC-Roofline model.
5.1 DC-Roofline Definition
The DC-Roofline model is inspired by the Roofline model [33]. The definitions of the DC-Roofline model are described as follows.
Definition 1: The Peak Performance of DC
We choose $BOPS_{peak}$ (which can be obtained from Equation 4) as the peak performance metric of DC in the DC-Roofline model.
Definition 2: Operation Intensity of DC
Operation intensity (OI) in the DC-Roofline model is the ratio of BOPs to the memory traffic. The memory traffic ($M$) is the total number of bytes exchanged between the CPU and the memory. The operation intensity ($OI$) is obtained from the following equation.
$$OI = \frac{BOPs}{M} \qquad (6)$$
where the memory traffic ($M$) is calculated by multiplying the total number of memory accesses by 64 bytes, and the total number of memory accesses can be obtained from hardware performance counters. Section 4.2 describes how to calculate BOPs.
Definition 3: The Upper Bound Performance of DC. The attained performance bound of a given workload is depicted as follows.
$$Attainable\ BOPS = \min(BOPS_{peak},\ BW_{peak} \times OI) \qquad (7)$$
where $BW_{peak}$ is the peak memory bandwidth of the system.
5.2 Adding Ceilings for DC-Roofline
We add three ceilings, ILP, SIMD, and Prefetching, to specify the performance upper bounds under specific tuning settings. Among them, ILP and SIMD reflect computation limitations, while Prefetching reflects memory access limitations. We conduct the experiments on the Intel Xeon E5645 platform.
5.2.1 Prefetching Ceiling
We use the Stream benchmark as the measurement tool for the Prefetching ceiling. We improve the memory bandwidth by enabling the prefetching option in the system BIOS, which increases the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s.
5.2.2 ILP and SIMD Ceilings
We add two calculation ceilings: SIMD and ILP. SIMD is a common method for HPC performance improvement, which performs the same operation on multiple data elements simultaneously; modern processors have at least 128-bit wide SIMD instructions (i.e., SSE, AVX, etc.). In the next subsection, we will show that SIMD also suits DC workloads. ILP efficiency can be described by the IPC efficiency (the peak IPC of the E5645 is 4). We add the ILP ceiling with an IPC of no more than 2 (according to our experiments, the IPC of all workloads is no more than 2), and the SIMD ceiling with the SIMD upper bound performance. We use the following equation to estimate the ILP and SIMD ceilings:
$$Ceiling = BOPS_{peak} \times E_{IPC} \times S_{SIMD} \qquad (8)$$
where $E_{IPC}$ is the IPC efficiency of the workload and $S_{SIMD}$ is the SIMD scale (for the E5645, the value is 1 under SIMD and 0.5 under SISD). Therefore, the ILP (Instruction-Level Parallelism) ceiling is $86.4 \times (2/4) = 43.2$ GBOPS when the IPC is 2. Based on the ILP ceiling, the SIMD ceiling is $43.2 \times 0.5 = 21.6$ GBOPS when SIMD is not used.
Definition 4: The Upper Bound Performance of DC Under Ceilings
The attained performance bound of a given workload under ceilings is described as follows.
$$Attainable\ BOPS_{ceiling} = \min(Ceiling_{compute},\ Ceiling_{memory} \times OI) \qquad (9)$$
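To make Definitions 2-4 concrete, the following minimal C++ sketch (our own illustration, not a tool from the paper) evaluates the DC-Roofline bound; the roof and ceiling values plugged in below are the Xeon E5645 numbers reported in this section, and the function and variable names are our own:

```cpp
#include <algorithm>
#include <cstdio>

// Equation 6: operation intensity is BOPs divided by memory traffic in bytes.
double operation_intensity(double bops, double mem_bytes) {
    return bops / mem_bytes;
}

// Equations 7 and 9: the attainable performance is bounded by the (possibly
// lowered) compute roof and by the (possibly lowered) memory roof times OI.
double attainable_gbops(double compute_roof_gbops, double mem_roof_gbps, double oi) {
    return std::min(compute_roof_gbops, mem_roof_gbps * oi);
}

int main() {
    double oi_sort = 2.2;  // OI of the optimized Sort workload (Section 5.4)
    // Full roof (Equation 7): 86.4 GBOPS peak, 13.2 GB/s peak bandwidth.
    std::printf("bound at peak roof: %.1f GBOPS\n",
                attainable_gbops(86.4, 13.2, oi_sort));
    // ILP ceiling of 43.2 GBOPS with prefetching-improved bandwidth (Equation 9).
    std::printf("bound under ILP ceiling: %.1f GBOPS\n",
                attainable_gbops(43.2, 13.8, oi_sort));
    return 0;
}
```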
5.3 Visualized DC-Roofline Model
The visualized DC-Roofline model on the Intel Xeon E5645 platform is shown in Fig. 4. The diagonal line represents the memory bandwidth bound, and the horizontal roof line shows the peak BOPS; the ridge point, where the diagonal and the horizontal roof meet, allows us to evaluate the performance of the system. Similar to the original Roofline model, ceilings, which imply the performance upper bounds under specific tuning settings, can be added to the DC-Roofline model. There are three ceilings (ILP, SIMD, and Prefetching) in the figure.
5.4 DC-Roofline Model Usage
Like the Roofline model, the DC-Roofline model is also suitable for kernel optimization. We illustrate how to use the DC-Roofline model to optimize the performance of DC kernel workloads on the Intel Xeon E5645 platform. For this platform, the peak BOPS is 86.4 GBOPS according to Equation 4, and the peak memory bandwidth is 13.2 GB/s as measured by the Stream benchmark [26]. The kernel workloads are the micro-benchmarks of DCMIX.
5.4.1 Optimizations under the DC-Roofline Model
Memory Bandwidth Optimization: we improve the memory bandwidth by enabling the prefetching option, which increases the peak memory bandwidth from 13.2 GB/s to 13.8 GB/s on the E5645 platform. Compiled Optimization: compiler optimization is the basic optimization for the calculation; we improve the calculation performance by adding the -O3 compiling option (i.e., gcc -O3). OI Optimization: for the same workload, a higher OI means better locality; we modify the programs of the workload to reduce data movement and increase the OI. SIMD Optimization: we apply the SIMD technique to the DC workloads and change the programs of the workload from SISD to SIMD through SSE rewriting.
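As an illustration of the SIMD Optimization step (a generic sketch of SSE rewriting, not the actual DCMIX source code), the fragment below converts a scalar integer addition loop into its SSE2 form, performing four 32-bit additions per instruction:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar (SISD) version: one 32-bit addition per loop iteration.
void add_sisd(const int* a, const int* b, int* c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// SIMD version: four 32-bit additions per iteration using 128-bit SSE registers.
void add_simd(const int* a, const int* b, int* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)  // handle the remaining elements
        c[i] = a[i] + b[i];
}
```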
5.4.2 Optimizations for the Sort Workload
The Sort workload sorts an integer array of a specific scale, and the sorting algorithm is merge sort; the program is implemented in C++. Fig. 5 shows the optimization trajectory. In the first step, we perform the Memory Bandwidth Optimization, and BOPS increases from 6.4 GBOPS to 6.5 GBOPS. In the second step, we perform the Compiled Optimization, improving the performance to 6.8 GBOPS. In the original source code, data is loaded and processed on disk, and the OI of Sort is 1.4; in the third step, we perform the OI Optimization by revising the source code to load and process all data in memory, which increases the OI of Sort to 2.2 and its performance to 9.5 GBOPS. In the fourth step, we apply the SIMD technique and change Sort from SISD to SIMD through SSE rewriting. With the SSE Sort, the attained performance is 28.2 GBOPS, which is 32% of the peak BOPS. Under the guidance of the DC-Roofline model, the overall improvement is 4.4X.
5.4.3 The Optimization Summary
Finally, we take all workloads as a whole to show their performance improvements. For the Sort workload, we execute four optimizations (Memory Bandwidth Optimizations, Compiled Optimizations, OI Optimizations and SIMD Optimizations). For other workloads, we execute two basic optimizations (Memory Bandwidth Optimizations and Compiled Optimizations). As shown in Fig. 6, all workloads have achieved performance improvements ranging from 1.1X to 4.4X.
Moreover, we can observe the workload efficiency from the DC-Roofline model. As shown in Fig. 6, a workload that is closer to its ceiling has a higher efficiency. For example, the efficiency of Sort is 65% under the ILP ceiling, and that of MD5 is 66% under the SIMD ceiling. The efficiency is calculated as follows.
$$Workload\_efficiency = \frac{BOPS_{attained}}{Ceiling} \qquad (10)$$
5.5 Roofline Model vs. DC-Roofline Model
Fig. 7 shows the final results using the Roofline model (the left Y axis) and the DC-Roofline model (the right Y axis), respectively; note that the Y axis is logarithmic. From the figure, we can see that, using the Roofline model in terms of FLOPS, the achieved performance is at most 0.1% of the peak FLOPS. In comparison, the result using the DC-Roofline model reaches 32% of the peak BOPS. So the DC-Roofline model is better suited as the upper bound performance model for DC.
5.6 Summary
As the upper bound performance model for DC, DC-Roofline not only truly reflects the performance ceilings of the target DC system, but also helps to guide the optimization of DC workloads.
6 Optimizing the Real DC Workload under the DC-Roofline Model
As a real DC workload often has millions of lines of code and tens of thousands of functions, it is not easy to use the DC-Roofline model directly (the Roofline or DC-Roofline model is designed for kernel program optimization). In this section, we propose an optimization methodology for real DC workloads based on the DC-Roofline model. We take the Redis workload as an example to illustrate the methodology, and the experimental results show that the performance of Redis is improved by 1.2X under its guidance.
6.1 The Optimization Methodology for the Real DC Workloads
Fig. 8 demonstrates the optimization methodology for real DC workloads. First, we profile the execution time of the real DC workload and find out the top N hotspot functions. Second, we analyze these hotspot functions (merging functions with the same properties) and build K Kernels (K is less than or equal to N). As an independent workload, each Kernel's code is based on the source code of the real workload and implements a part of its functions (the specific hotspot functions). Third, we optimize these Kernels through the DC-Roofline model, respectively. Fourth, we merge the optimization methods of the Kernels and optimize the real DC workload.
6.2 The Optimization for the Redis Workload
Redis is a distributed, in-memory key-value database with durability. It supports different kinds of abstract data structures and is widely used in modern Internet services. Redis V4.0.2 has about 200,000 lines of code and thousands of functions.
6.2.1 The Experimental Methodology
Redis, version 4.0.2, is deployed in standalone mode. We choose Redis-Benchmark as the workload generator. For the Redis-Benchmark settings, the total request number is 10 million and 1,000 parallel clients are created to simulate a concurrent environment; the query operation of each client is the SET operation. We choose queries per second (QPS) as the user-perceived performance metric. The platform is the Intel Xeon E5645, the same as in Section 5.
6.2.2 The hotspot Functions of Redis
There are 20 functions that occupy 69% of the execution time. These functions can be classified into three categories. The first category is dictionary table management, such as dictFind(), dictSdsKeyCompare(), lookupKey(), and siphash(). The second category is memory management, such as malloc_usable_size(), malloc(), free(), zmalloc(), and zfree(). The last category is the encapsulation of system functions. The first two categories take 55% of the total execution time.
6.2.3 The Kernels of Redis
Based on the hotspot function analysis, we build two Kernels: one for memory management, called MMK, and the other for dictionary table management, called DTM. Each Kernel is constructed from the corresponding hotspot functions and can run as an independent workload. Please note that the Kernels and the Redis workload share the same client queries.
6.2.4 The Optimizations of DTM
According to the optimization methods of the DC-Roofline model (proposed in Section 5), we perform the related optimizations. As the prefetching option has already been enabled in Section 5, we execute the following three optimizations: Compiled Optimization, OI Optimization and SIMD Optimization. We perform the Compiled Optimization by adding the -O3 compiling option (gcc -O3); as shown in Table 4, it improves the OI of DTM from 1.5 to 3.5 and BOPS from 0.4 G to 3 G. In the DTM, the rapid increase of key-value pairs triggers reallocation of the dictionary table space, which brings a large data movement cost; we avoid this operation by pre-allocating a large table space, an optimization we call NO_REHASH (sketched after Table 4). Using NO_REHASH, we improve the OI of DTM from 3.5 to 4 and BOPS from 3 G to 3.1 G. Hash operations are the main operations in the DTM; we replace the default SipHash algorithm with the HighwayHash algorithm, which is implemented with SIMD instructions, an optimization we call SIMD_HASH. Using SIMD_HASH, we improve the OI of DTM from 4 to 4.7 and BOPS from 3.1 G to 3.7 G.
Table 4: Optimizations of DTM.
Type  OI  GBOPS 
Original Version  1.5  0.4 
Compiled Optimization  3.5  3 
OI Optimization  4  3.2 
SIMD Optimization  4.7  3.7 
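The NO_REHASH idea can be illustrated with the following sketch, which is our own illustration using a standard C++ container rather than Redis' dict implementation. Pre-allocating the bucket array for the expected number of key-value pairs avoids the repeated rehashing, and thus the data movement, that incremental growth would otherwise trigger:

```cpp
#include <string>
#include <unordered_map>

// Without reserve(), the table grows and rehashes repeatedly as entries are
// added, moving every stored element each time. Allocating the expected
// capacity up front avoids that data movement.
void build_table(std::unordered_map<std::string, std::string>& table,
                 std::size_t expected_entries) {
    table.reserve(expected_entries);  // NO_REHASH: allocate buckets once
    for (std::size_t i = 0; i < expected_entries; i++)
        table.emplace("key:" + std::to_string(i), "value");
}
```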
6.2.5 The Optimizations of MMK
According to the optimization methods of the DC-Roofline model, we perform two optimizations: Compiled Optimization and OI Optimization. We perform the Compiled Optimization by adding the -O3 compiling option (gcc -O3); as shown in Table 5, it improves the OI of MMK from 3.1 to 3.2 and BOPS from 2.2 G to 2.4 G. To reduce the data movement cost, we replace the default malloc implementation with the Jemalloc allocator in the MMK. Jemalloc is a general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support; we call this optimization JE_MALLOC. Using JE_MALLOC, we improve the OI of MMK from 3.2 to 90 and BOPS from 2.4 G to 2.7 G.
Table 5: Optimizations of MMK.
Type  OI  GBOPS 
Original Version  3.1  2.2 
Compiled Optimization  3.2  2.4 
OI Optimization  90  2.7 
6.2.6 The Optimizations of Redis
Merging the above optimizations of DTM and MMK, we optimize the Redis workload. Fig. 9 shows the optimization trajectory: we execute the Compiled Optimization (adding the -O3 compiling option), the OI Optimization (NO_REHASH and JE_MALLOC), and the SIMD Optimization (SIMD_HASH) one by one. Please note that the peak performance of the system is 14.4 GBOPS in Fig. 9, because Redis is a single-threaded server and we deploy it on a single CPU core. As shown in Fig. 9, the OI of the Redis workload is improved from 2.9 to 3.8, and BOPS from 2.8 G to 3.4 G. Accordingly, the QPS of Redis is improved from 122,000 requests/s to 146,000 requests/s.
7 Conclusion
This paper proposes a new computation-centric metric, BOPS, to measure the efficiency of DC computing systems. The metric is independent of the underlying system and hardware implementations, and can be calculated through analyzing the source code.
With several typical micro-benchmarks, we attain the upper bound performance, and then we propose a BOPS-based Roofline model, named DC-Roofline, as a ceiling performance model to guide the design and optimization of DC computing systems.
As a real-world DC workload often has millions of lines of code and tens of thousands of functions, it is not easy to use the DC-Roofline model directly. We propose a new optimization methodology: we profile the hotspot functions of the real-world workload and extract the corresponding kernel workloads; the real-world application then gains a performance benefit by merging the optimization methods of the kernel workloads. Through experiments, we demonstrate that Redis, a typical real-world workload, gains a 1.2X performance improvement.
References
 [1] Data center growth. https://www.enterprisetech.com.
 [2] http://www.tpc.org/tpcc.
 [3] http://www.spec.org/cpu.
 [4] http://top500.org/.
 [5] “Sort benchmark home page,” http://sortbenchmark.org/.
 [6] “Technical report,” http://www.agner.org/optimize/microarchitecture.pdf.
 [7] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” pp. 483–485, 1967.
 [8] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: characterization and architectural implications,” pp. 72–81, 2008.
 [9] E. L. Boyd, W. Azeem, H. S. Lee, T. Shih, S. Hung, and E. S. Davidson, “A hierarchical approach to modeling and improving the performance of scientific applications on the ksr1,” vol. 3, pp. 188–192, 1994.
 [10] S. P. E. Corporation, “Specweb2005: Spec benchmark for evaluating the performance of world wide web servers,” http://www.spec.org/web2005/, 2005.
 [11] J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
 [12] W. Gao, L. Wang, J. Zhan, C. Luo, D. Zheng, Z. Jia, B. Xie, C. Zheng, Q. Yang, and H. Wang, “A dwarfbased scalable big data benchmarking methodology,” arXiv preprint arXiv:1711.03229, 2017.
 [13] Neal Cardwell, Stefan Savage, and Thomas E Anderson. Modeling tcp latency. 3:1742–1751, 2000.
 [14] William Josephson, Lars Ailo Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. ACM Transactions on Storage, 6(3):14, 2010.
 [15] Samuel Williams, Dhiraj D Kalamkar, Amik Singh, Anand M Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann S Almgren, Pradeep Dubey, John Shalf, and Leonid Oliker. Optimization of geometric multigrid for emerging multi and manycore processors. page 96, 2012.
 [16] Daniel Richins, Tahrina Ahmed, Russell Clapp, and Vijay Janapa Reddi. Amdahl’s law in big data analytics: Alive and kicking in tpcxbb (bigbench). In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 630–642. IEEE, 2018.

 [17] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In International Symposium on Microarchitecture, pages 609–622, 2014.
 [18] Ethan R Mollick. Establishing Moore's law. IEEE Annals of the History of Computing, 28(3):62–75, 2006.
 [19] Shoaib Kamil, Cy P Chan, Leonid Oliker, John Shalf, and Samuel Williams. An autotuning framework for parallel multicore stencil computations. pages 1–12, 2010.
 [20] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, R. Ren, C. Zheng, G. Lu, J. Li, Z. Cao, Z. Shujie, and H. Tang, “Bigdatabench: A dwarfbased big data and artificial intelligence benchmark suite,” Technical Report, Institute of Computing Technology, Chinese Academy of Sciences, 2017.
 [21] S. Harbaugh and J. A. Forakis, “Timing studies using a synthetic whetstone benchmark,” ACM Sigada Ada Letters, no. 2, pp. 23–34, 1984.
 [22] R. Jain, The art of computer systems performance analysis. John Wiley & Sons Chichester, 1991, vol. 182.
 [23] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, and A. Borchers, “Indatacenter performance analysis of a tensor processing unit,” pp. 1–12, 2017.

 [24] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in International Symposium on Computer Architecture, 2016, pp. 393–405.
 [25] C. Luo, J. Zhan, Z. Jia, L. Wang, G. Lu, L. Zhang, C. Xu, and N. Sun, “Cloudrank-d: benchmarking and ranking cloud computing systems for data processing applications,” Frontiers of Computer Science, vol. 6, no. 4, pp. 347–362, 2012.
 [26] J. D. Mccalpin, “Stream: Sustainable memory bandwidth in high performance computers,” 1995.
 [27] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto et al., “A 40gops 250mw massively parallel processor based on matrix architecture,” in SolidState Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International. IEEE, 2006, pp. 1616–1625.
 [28] U. Pesovic, Z. Jovanovic, S. Randjic, and D. Markovic, “Benchmarking performance and energy efficiency of microprocessors for wireless sensor network applications,” in MIPRO, 2012 Proceedings of the 35th International Convention. IEEE, 2012, pp. 743–747.

 [29] M. M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, “A genetic algorithms approach to modeling the performance of memory-bound computations,” pp. 1–12, 2007.
 [30] K. Ueno and T. Suzumura, “Highly scalable graph search for the graph500 benchmark,” in International Symposium on HighPerformance Parallel and Distributed Computing, 2012, pp. 149–160.
 [31] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “BigDataBench: a big data benchmark suite from internet services,” in HPCA 2014. IEEE, 2014.
 [32] R. Weicker, “Dhrystone: a synthetic systems programming benchmark,” Communications of The ACM, vol. 27, no. 10, pp. 1013–1030, 1984.
 [33] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
 [34] Thomasian, “Analytic Queueing Network Models for Parallel Processing of Task Systems,” IEEE Transactions on Computers, vol. 35, no. 12, pp. 1045–1054, 1986.
 [35] Nicholas P Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua B Fryman, Ivan Ganev, Roger A Golliver, Rob C Knauerhase, et al. Runnemede: An architecture for ubiquitous highperformance computing. pages 198–209, 2013.
 [36] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for largescale machine learning. operating systems design and implementation, pages 265–283, 2016.
 [37] Jaewon Lee, Changkyu Kim, Kun Lin, Liqun Cheng, Rama Govindaraju, and Jangwoo Kim. Wsmeter: A performance evaluation methodology for google’s production warehousescale computers. In Proceedings of the TwentyThird International Conference on Architectural Support for Programming Languages and Operating Systems, pages 549–563. ACM, 2018.
 [38] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Rui Ren, Chen Zheng, Gang Lu, Jingwei Li, Zheng Cao, et al. Bigdatabench: A dwarfbased big data and ai benchmark suite. arXiv preprint arXiv:1802.08254, 2018.
 [39] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.