Big Data Dwarfs: Towards Fully Understanding Big Data Analytics Workloads

02/01/2018 · by Wanling Gao, et al.

Though big data benchmark suites like BigDataBench and CloudSuite have been used in architecture and system research, we have not yet answered a fundamental question: what are the abstractions of frequently-appearing units of computation in big data analytics, which we call big data dwarfs? For the first time, we identify eight big data dwarfs, each of which captures the common requirements of a class of units of computation while being reasonably divorced from individual implementations, among a wide variety of big data analytics workloads. We implement the eight dwarfs on different software stacks as the dwarf components. We present the application of the big data dwarfs to construct big data proxy benchmarks, using directed acyclic graph (DAG)-like combinations of the dwarf components with different weights to mimic the benchmarks in BigDataBench. Our proxy benchmarks shorten the execution time by hundreds of times on real systems while remaining qualified for both earlier architecture design and later system evaluation across different architectures.


1 Introduction

The complexity and diversity of big data analytics workloads make understanding them difficult and challenging. First, modern big data workloads expand and change very fast, and it is impossible to create a new benchmark or proxy for every possible workload. Second, whether early in the architecture design process or later in system evaluation, running a comprehensive benchmark suite is time-consuming, and the complex software stacks of modern workloads aggravate this issue. Big data benchmark suites like BigDataBench [1] or CloudSuite [2] are too huge to run on simulators, making time-constrained simulation challenging or even impossible. Third, overly complex workloads hinder both the reproducibility and the interpretability of performance data.

Identifying abstractions of frequently-appearing units of computation, which we call big data dwarfs, is an important step toward fully understanding big data analytics workloads. Much previous work [3, 4, 5, 6, 7] has illustrated the importance of abstracting workloads in the corresponding domains. TPC-C [4] is a successful benchmark built on the basis of frequently-appearing operations in the OLTP domain. HPCC [8] adopts a similar method to design a benchmark suite for high performance computing. Unfortunately, to the best of our knowledge, none of the existing work has identified dwarfs in big data analytics. The National Research Council proposed seven major tasks in massive data analysis [9], but these are macroscopic definitions of problems from a mathematical perspective.

In this paper, after thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia and bioinformatics), we identify eight big data dwarfs that frequently appear: matrix, sampling, logic, transform, set, graph, sort and basic statistic computations (we acknowledge that our eight dwarfs may not suffice for applications we did not investigate). Combinations of these dwarfs describe most of the big data workloads we investigated. We implement the eight dwarfs on different software stacks with diverse data generation tools for text, matrix and graph data; we call these implementations the dwarf components.

Just like relational algebra in databases, big data dwarfs are promising fundamental concepts and tools for benchmarking, designing, measuring, and optimizing big data systems. In this paper, we focus on the application of the big data dwarfs to build big data proxy benchmarks that shorten the execution time while remaining qualified for both architecture and system evaluation.

We employ a DAG-like method to construct our proxy benchmarks, where each node represents the original data set or an intermediate data set being processed, and each edge represents a dwarf component. A combination of one or more dwarf components with different weights forms a proxy benchmark. We develop an auto-tuning tool to generate qualified proxy benchmarks that satisfy the requirements on execution time and on micro-architectural and system data accuracy, by training a model using a neural network. Though previous work discusses using dwarfs to represent the computation patterns of real workloads, none of it actually pulls this off and builds dwarf-based proxy benchmarks [5, 6].

On typical X86_64 and ARMv8 processors, the proxy benchmarks shorten the execution time by hundreds of times while keeping the average micro-architectural and system data accuracy above 90% with respect to the benchmarks from BigDataBench. Our proxy benchmarks have been applied to ARM processor design and implementation by our industry partner. Different from previous benchmarking methodologies that create a new benchmark or proxy for every possible workload [10, 11, 12, 13, 14, 15, 16], our methodology is scalable. In addition, our proxy benchmarks are qualified for both architecture and system evaluation with higher data accuracy, whereas the data accuracy of kernel benchmarks is lower for complex big data workloads [17, 18], and synthetic traces like SimPoint [19] and synthetic benchmarks like PerfProx [20] can only be used for architecture research.

Our contributions are threefold:

  • We identify eight big data dwarfs, and implement the dwarf components for each dwarf on different software stacks with diverse data inputs.

  • We propose a dwarf-based scalable big data benchmarking methodology, using a DAG-like combination of the dwarf components with different weights to mimic big data analytics workloads.

  • We construct big data proxy benchmarks that shorten the execution time by hundreds of times while being qualified for both architecture and system evaluation.

The rest of the paper is organized as follows. Section 2 presents the big data dwarfs identified from big data analytics workloads. Section 3 introduces our scalable dwarf-based benchmarking methodology. Section 4 performs evaluations on a five-node X86_64 cluster. In Section 5, we report using the proxy benchmarks on the ARMv8 processor. Section 6 introduces the related work. Finally, we draw a conclusion in Section 7.

2 Identifying Big Data Dwarfs

In this section, we illustrate how to identify big data dwarfs from big data analytics workloads, and show their corresponding dwarf components. We also take SIFT as an example to demonstrate our methodology.

| Category | Application Domain | Workload | Unit of Computation |
|---|---|---|---|
| Graph Mining | Search Engine; Community Detection | PageRank | Matrix, Graph, Sort |
| | | BFS, Connected Component (CC) | Graph |
| Dimension Reduction | Image Processing; Text Processing | Principal Components Analysis (PCA) | Matrix |
| | | Latent Dirichlet Allocation (LDA) | Basic Statistic, Sampling |
| Deep Learning | Image Recognition; Speech Recognition | Convolutional Neural Network (CNN) | Matrix, Sampling, Transform |
| | | Deep Belief Network (DBN) | Matrix, Sampling |
| Recommendation; Association Rules Mining | Electronic Commerce | Apriori | Basic Statistic, Set |
| | | FP-Growth | Graph, Set, Basic Statistic |
| | | Collaborative Filtering (CF) | Graph, Matrix |
| Classification | Image Recognition; Speech Recognition; Text Recognition | Support Vector Machine (SVM) | Matrix |
| | | K-Nearest Neighbors (KNN) | Matrix, Sort, Basic Statistic |
| | | Naive Bayes | Basic Statistic |
| | | Random Forest | Graph, Basic Statistic |
| | | Decision Tree (C4.5/CART/ID3) | Graph, Basic Statistic |
| Clustering | Data Mining | K-means | Matrix, Sort |
| Feature Preprocess | Image Processing; Signal Processing; Text Processing | Image Segmentation (GrabCut) | Matrix, Graph |
| | | Scale-Invariant Feature Transform (SIFT) | Matrix, Transform, Sampling, Sort, Basic Statistic |
| | | Image Transform | Matrix, Transform |
| | | Term Frequency-Inverse Document Frequency (TF-IDF) | Basic Statistic |
| Sequence Tagging | Bioinformatics; Language Processing | Hidden Markov Model (HMM) | Matrix |
| | | Conditional Random Fields (CRF) | Matrix, Sampling |
| Indexing | Search Engine | Inverted Index, Forward Index | Basic Statistic, Logic, Set, Sort |
| Encoding/Decoding | Multimedia Processing; Security; Cryptography; Digital Signature | MPEG-2 | Matrix, Transform |
| | | Encryption | Matrix, Logic |
| | | SimHash, MinHash | Set, Logic |
| | | Locality-Sensitive Hashing (LSH) | Set, Logic |
| Data Warehouse | Business Intelligence | Project, Filter, OrderBy, Union | Set, Sort |

Table 1: Eight Classes of Units of Computation.

2.1 Big Data Dwarf Abstraction

Figure 1: Identifying Big Data Dwarfs.

Fig. 1 overviews the methodology of big data dwarf identification. We first single out a broad spectrum of big data analytics workloads by investigating typical components in five application domains (search engine, social network, e-commerce, multimedia, and bioinformatics) and representative algorithms in four processing techniques (machine learning, data mining, computer vision and natural language processing). Then we analyze these workloads at the algorithmic and experimental levels. At the algorithmic level, we decompose each algorithm into multiple operations and their DAG-like combination according to the algorithm flow. At the experimental level, we adopt multi-dimensional tracing and profiling, including runtime tracing (i.e., JVM tracing and logging), system profiling (CPU time breakdown) and hardware profiling (CPU cycle breakdown), to find the hotspot operations of the workloads.

According to their frequency and importance, we finalize eight big data dwarfs, which are abstractions of frequently-appearing classes of units of computation. Table 1 shows the importance of the eight classes of units of computation (dwarfs) across a majority of big data analytics workloads; these eight dwarfs cover the major classes of units of computation in a wide variety of big data analytics workloads.

2.1.1 Eight Big Data Dwarfs

In this subsection, we summarize eight big data dwarfs frequently appearing in big data analytics workloads.

Matrix Computations In big data analytics, many problems involve matrix computations, such as matrix multiplication and matrix transposition.

Sampling Computations Sampling plays an essential role in big data processing; it can obtain an approximate solution when a problem cannot be solved analytically.

Logic Computations We name computations performing bit manipulation as logic computations, such as hash, data compression and encryption.

Transform Computations The transform computations here mean conversion from an original domain (such as time) to another domain (such as frequency). Common transform computations include the discrete Fourier transform (DFT), the discrete cosine transform (DCT) and the wavelet transform.

Set Computations In mathematics, a set is a collection of distinct objects, and the concept of set is likewise widely used in computer science. For example, similarity analysis of two data sets involves set computations, such as Jaccard similarity. Furthermore, fuzzy sets and rough sets play important roles in computer science.

Graph Computations A lot of applications involve graphs, with nodes representing entities and edges representing dependencies. Graph computations are notorious for having irregular memory access patterns.

Sort Computations Sort is widely used in many areas. Jim Gray considered sort to be the core of modern databases [6], which shows how fundamental it is.

Basic Statistic Computations Basic statistic computations are used to obtain summary information through statistical computations, such as counting and probability statistics.
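To make these abstractions concrete, here is a minimal sketch of two of the dwarfs (transform and set computations) in Python; the function names and examples are ours for illustration, not part of the actual dwarf components.

```python
import numpy as np

def transform_dwarf(signal):
    """Transform computations: convert a time-domain signal into the
    frequency domain with a discrete Fourier transform."""
    return np.fft.fft(signal)

def set_dwarf_jaccard(a, b):
    """Set computations: Jaccard similarity |A intersect B| / |A union B|,
    a typical similarity-analysis kernel."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Example usage with toy inputs.
spectrum = transform_dwarf(np.sin(2 * np.pi * np.arange(64) / 8.0))
similarity = set_dwarf_jaccard({"big", "data", "dwarf"}, {"data", "dwarf"})
```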

2.2 Dwarf Components

Fig. 2 presents the overview of our dwarf components, which consist of two parts: data generation tools and dwarf implementations. The data generation tools provide various data inputs with different data types and distributions to the dwarf components, covering text, graph and matrix data. Since the software stack has a great influence on workload behaviors [1, 21], our dwarf component implementations consider the execution model of the software stacks and the programming styles of workloads using specific software stacks. Fig. 2 lists all dwarf components. For example, we provide distance calculation (i.e., Euclidean and cosine distance) and matrix multiplication for matrix computations. We implement the dwarf components with the POSIX threads model, considering the phases of the Hadoop framework: input data partition, chunk data allocation per thread, intermediate data output to disk, and data combination; a sketch of this execution model appears after Fig. 2. As JVM garbage collection (GC) is an important step for automatic memory management, we equip each dwarf component with a unified memory management module whose mechanism is similar to GC. For the purpose of system evaluation, we also implement the dwarf components on several other software stacks, including MPI [22], Hadoop [23] and Spark [24].

Figure 2: The Overview of the Dwarf Components.
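As a rough illustration of the execution model just described, the sketch below (in Python for brevity; the actual dwarf components are POSIX-threads C programs) partitions the input into chunks, processes one chunk per thread, spills intermediate results to disk, and then combines them. All helper names are ours.

```python
import concurrent.futures
import json
import os
import tempfile

def run_dwarf(data, chunk_size, num_tasks, process_chunk, combine):
    # Input data partition: split the input into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    spill_files = []
    # Chunk allocation per thread, mirroring the Hadoop map phase.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_tasks) as pool:
        for result in pool.map(process_chunk, chunks):
            # Intermediate data output to disk, as in Hadoop's spill step.
            fd, path = tempfile.mkstemp(suffix=".part")
            with os.fdopen(fd, "w") as f:
                json.dump(result, f)
            spill_files.append(path)
    # Data combination: read the spilled partial results back and merge.
    partials = []
    for path in spill_files:
        with open(path) as f:
            partials.append(json.load(f))
        os.remove(path)
    return combine(partials)

# Example: a sort-computation dwarf from per-chunk sorts plus a merge.
out = run_dwarf(list(range(100, 0, -1)), chunk_size=25, num_tasks=4,
                process_chunk=sorted, combine=lambda ps: sorted(sum(ps, [])))
```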

2.3 Understanding Big Data Analytics Workload using Dwarfs

We understand big data analytics workloads using a DAG-like structure that combines one or more dwarfs. Taking the SIFT workload as an example, we explain how to use the eight dwarfs to compose the original workload. Proposed by D. G. Lowe [25], SIFT is used to detect and describe local features of input images. As illustrated in Fig. 3, a DAG-like structure specifies how data sets or intermediate data sets are operated on by different dwarfs. In total, the SIFT workload involves five dwarfs.

Figure 3: The DAG-like Structure of the SIFT Workload. SIFT, a representative workload in computer vision, is decomposed into several dwarfs: transform computations (FFT, IFFT), sampling computations (downsampling), matrix computations (matrix multiplication/subtraction), sort computations (sort), and basic statistic computations (count).
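A minimal sketch of how such a DAG-like structure can be encoded: nodes are data sets (original or intermediate) and each edge applies a dwarf component. The node names and component identifiers below are illustrative, loosely following Fig. 3.

```python
# A SIFT-like pipeline expressed as a DAG; the structure is our illustration.
sift_dag = {
    "nodes": ["image", "spectrum", "filtered", "pyramid", "keypoints", "features"],
    "edges": [
        ("image",     "spectrum",  "transform.fft"),
        ("spectrum",  "filtered",  "matrix.multiply"),     # filtering in frequency domain
        ("filtered",  "pyramid",   "sampling.downsample"),
        ("pyramid",   "keypoints", "sort.quick_sort"),
        ("keypoints", "features",  "statistic.count"),
    ],
}

def execute_dag(dag, data, components):
    """Walk the DAG in edge order, applying each dwarf component to the
    data set produced by its source node."""
    values = {dag["nodes"][0]: data}
    for src, dst, dwarf in dag["edges"]:
        values[dst] = components[dwarf](values[src])
    return values[dag["nodes"][-1]]
```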

3 Dwarf-based Scalable Big Data Benchmarking Methodology

Figure 4: Methodology Overview.

Big data dwarfs are promising fundamental tools for benchmarking, designing, measuring, and optimizing big data systems. In this section, we present the application of the big data dwarfs to construct big data proxy benchmarks.

3.1 Dwarf-based Benchmarking Methodology

Fig. 4 presents our benchmarking methodology. First, based on the big data dwarfs and the dwarf components, we can understand the behaviors of big data analytics workloads at the architecture and system levels. Second, we construct proxy benchmarks using DAG-like combinations of the dwarf components with different weights to mimic the real-world workloads. A DAG-like structure uses a node to represent the original data set or an intermediate data set being processed, and uses an edge to represent a dwarf component.

Given a big data analytics workload, we obtain its running trace and execution time through tracing and profiling tools. According to the execution ratios, we identify the hotspot functions and correlate them to the code fragments of the workload through bottom-up analysis. We then analyze these code fragments to choose the specific dwarf components and set their initial weights according to the execution ratios. Based on the dwarf components and initial weights, we construct our proxy benchmarks using a DAG-like combination of dwarf components. For the targeted performance data accuracy, such as cache behaviors or I/O behaviors, we provide an auto-tuning tool that tunes the parameters of the proxy benchmark and generates a qualified proxy benchmark satisfying the requirements. The qualified proxy benchmark mimics the behaviors of the original big data workload, including both system and micro-architectural behaviors.

3.2 Proxy Benchmarks Construction

Fig. 5 presents the process of proxy benchmark construction, including a decomposing process and a tuning process. We first break down the big data benchmark into a group of dwarfs and then tune them to approximate the original big data benchmark. We measure the proxy benchmark's accuracy by comparing its performance data with that of the original workload at both the system and micro-architecture levels. To improve the accuracy, making the proxy more similar to the original workload, we further provide an auto-tuning tool.

3.2.1 Benchmark Decomposing

Given a big data analytics workload, we obtain its hotspot functions and execution time through a multi-dimensional tracing and profiling method, including runtime tracing (e.g., JVM tracing and logging), system profiling (e.g., CPU time breakdown) and hardware profiling (e.g., CPU cycle breakdown). Based on the hotspot analysis, we correlate the hotspot functions to the code fragments of the workload and choose the corresponding dwarf components by analyzing the computation logic of the code fragments. Our proxy benchmark is a DAG-like combination of the selected dwarf components, with initial weights set according to their execution ratios, as in the sketch below.
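A sketch of this weight-initialization step, assuming we already have per-function execution ratios from the profiling tools; the hotspot names and the fragment-to-dwarf mapping are hypothetical.

```python
# Hypothetical hotspot profile: function name -> fraction of execution time,
# and a hand-written mapping from code fragments to dwarf components.
hotspots = {"TeraSort$SortReducer.reduce": 0.70,
            "TotalOrderPartitioner.sample": 0.10,
            "ShuffleGraph.traverse": 0.20}
fragment_to_dwarf = {"TeraSort$SortReducer.reduce": "sort.quick_sort",
                     "TotalOrderPartitioner.sample": "sampling.interval",
                     "ShuffleGraph.traverse": "graph.traversal"}

def initial_weights(hotspots, mapping):
    """Aggregate execution ratios per dwarf component to get initial weights."""
    weights = {}
    for func, ratio in hotspots.items():
        dwarf = mapping[func]
        weights[dwarf] = weights.get(dwarf, 0.0) + ratio
    return weights  # e.g. {'sort.quick_sort': 0.7, 'sampling.interval': 0.1, ...}
```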

3.2.2 Feature Selecting

Figure 5: Proxy Benchmarks Construction.

System and Micro-architectural Metrics The main purpose of constructing proxy benchmarks is to mimic the system and micro-architectural behaviors of real workloads using dwarf combinations. Depending on which aspects of the workloads we care about, we can choose different metrics to tune a qualified proxy benchmark. For example, if our proxy benchmarks focus on the cache behaviors of the workload, we can choose metrics that reflect cache behaviors, such as cache hit ratios. We use M to denote the performance data of the selected metrics, defined as follows.

For system-level metrics, we choose running time, memory bandwidth, and disk I/O behavior. For micro-architectural metrics, we choose instruction mix, cache behavior, branch prediction, and processor performance (i.e., IPC and MIPS).

Parameters of Dwarf Components Every dwarf component accepts several configurable runtime parameters. We use P to denote the parameter configuration, which consists of four parameters for each dwarf component (listed in Table 2).

To tune a qualified proxy benchmark, we need to obtain an optimal parameter configuration P for each dwarf component. To keep the weights of the dwarf components rational, we set the initial weights proportional to their corresponding execution ratios. For example, in Hadoop TeraSort, the initial weights are 70% for sort computations, 10% for sampling computations, and 20% for graph computations. During the modelling process, the weight of each dwarf component can be adjusted within a reasonable range (e.g., plus or minus 10%), as sketched after Table 2.

| Parameter | Description |
|---|---|
| dataSize | The input data size for each dwarf component |
| chunkSize | The data block size processed by each thread for each dwarf component |
| numTasks | The process and thread numbers for each dwarf component |
| weight | The contribution of each dwarf component |

Table 2: Tunable Parameters for Each Dwarf Component.
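As a sketch, the four tunable parameters of Table 2 might be represented as follows; the class, units and bounds are our assumptions, with only the parameter names taken from the table.

```python
from dataclasses import dataclass

@dataclass
class DwarfConfig:
    data_size: int    # input data size in MB (dataSize)
    chunk_size: int   # per-thread block size in MB (chunkSize)
    num_tasks: int    # process/thread count (numTasks)
    weight: float     # contribution of this dwarf component (weight)

def clamp_weight(cfg: DwarfConfig, initial_weight: float) -> DwarfConfig:
    """Keep the tuned weight within +/-10 percentage points of the initial
    weight derived from execution ratios, as described above."""
    low, high = initial_weight - 10, initial_weight + 10
    cfg.weight = max(low, min(high, cfg.weight))
    return cfg

# Example: Hadoop TeraSort's initial weights (70% sort, 10% sampling, 20% graph).
terasort = {"sort": DwarfConfig(700, 50, 8, 70.0),
            "sampling": DwarfConfig(100, 50, 8, 10.0),
            "graph": DwarfConfig(200, 50, 8, 20.0)}
```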

3.2.3 Modelling and Predicting

We use a neural network to model the relationship between M and P:

P = F(M)

where F refers to the neural network, which uses linear regression as its output layer. To obtain the training data, we first generate a set of parameter configurations P by changing the parameters one by one with a fixed stride within an acceptable range. For example, we could change one parameter at a time (increase dataSize by 50 MB, increase chunkSize by 10 MB, increase numTasks by 1, or increase/decrease weight by 1) while keeping the other parameters unchanged, to generate multiple configurations P. Then we run the proxy benchmark under each configuration P and collect the corresponding metrics M. Finally, we use these (M, P) pairs to train the neural network by standard back propagation.

To generate a proxy benchmark for a real workload, we first run the workload on a physical machine to collect its system and architecture metrics M. Note that the runtime in M can be specified so as to fulfill different requirements on running time. Then we use the model P = F(M) to predict the corresponding P, which is the configuration that reproduces the system and architectural behaviors of the real workload.

Note that the model F is specific to the workload we aim to mimic, which means F needs to be rebuilt for each new workload. To get sufficient training data for neural network modelling, we need to run the proxy benchmark multiple times to collect the (M, P) pairs. However, the whole modelling process is quite straightforward and fast, due to the lightweight design of the dwarf components (about ten seconds for one run).
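A minimal sketch of this modelling loop, assuming scikit-learn's MLPRegressor as the neural network (the paper does not name a framework); run_proxy and the stride values mirror the example above and are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def build_training_set(base_p, run_proxy, strides, steps=5):
    """Sweep one parameter at a time with a fixed stride, run the proxy
    benchmark, and record (metrics M, parameters P) pairs."""
    P, M = [], []
    for i, stride in enumerate(strides):
        for k in range(1, steps + 1):
            p = list(base_p)
            p[i] += k * stride       # e.g. dataSize +50 MB, chunkSize +10 MB, ...
            P.append(p)
            M.append(run_proxy(p))   # collect the measured metrics under p
    return np.array(M), np.array(P)

def fit_model(M, P):
    """F maps metrics M to parameters P; MLPRegressor's identity output
    activation plays the role of the linear-regression output layer."""
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000)
    model.fit(M, P)
    return model

# Prediction: measure the real workload's metrics, then predict the
# proxy-benchmark configuration that reproduces them.
# m_real = measure_real_workload()              # hypothetical helper
# p_star = fit_model(M, P).predict([m_real])[0]
```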

3.3 Proxy Benchmark Implementation

| Big Data Benchmark | Workload Patterns | Data Set | Involved Dwarfs | Involved Dwarf Components |
|---|---|---|---|---|
| Hadoop TeraSort | I/O Intensive | Text | Sort computations; Sampling computations; Graph computations | Quick sort, merge sort; random sampling, interval sampling; graph construction, graph traversal |
| Hadoop Kmeans | CPU Intensive | Vectors | Matrix computations; Sort computations; Basic Statistic | Vector Euclidean distance, cosine distance; quick sort, merge sort; cluster count, average computation |
| Hadoop PageRank | Hybrid | Graph | Matrix computations; Sort computations; Basic Statistic | Matrix construction, matrix multiplication; quick sort, min/max calculation; out-degree and in-degree count of nodes |
| Hadoop SIFT | CPU and Memory Intensive | Image | Matrix computations; Sort computations; Sampling computations; Transform computations; Basic Statistic | Matrix construction, matrix multiplication; quick sort, min/max calculation; interval sampling; FFT/IFFT transformation; count statistics |

Table 3: Four Hadoop Benchmarks from BigDataBench and Their Corresponding Proxy Benchmarks.

Considering the mandatory requirements of simulation time and performance data accuracy for the architecture community, we implement four proxy benchmarks, corresponding to four representative Hadoop benchmarks from BigDataBench [1] (TeraSort, Kmeans, PageRank, and SIFT), according to our benchmarking methodology. At the request of our industry partners, we implemented these four proxy benchmarks first, for the following reasons; other proxy benchmarks are being built using the same methodology.

Representative Application Domains

They are all widely used in important application domains. TeraSort is a widely-used workload in many application domains; PageRank is a famous workload for search engines; Kmeans is a simple but useful workload used in internet services; SIFT is a fundamental workload for image feature extraction.

Various Workload Patterns They have different workload patterns. Hadoop TeraSort is an I/O-intensive workload; Hadoop Kmeans is a CPU-intensive workload; Hadoop PageRank is a hybrid workload which falls between CPU-intensive and I/O-intensive; Hadoop SIFT is a CPU-intensive and memory-intensive workload.

Diverse Data Inputs They take different data inputs. Hadoop TeraSort uses text data generated by gensort [26]; Hadoop Kmeans uses vector data while Hadoop PageRank uses graph data; Hadoop SIFT uses image data from ImageNet [27]. These benchmarks are of great significance for measuring big data systems and architectures [21].

In the rest of this paper, we use Proxy TeraSort, Proxy Kmeans, Proxy PageRank and Proxy SIFT to denote the proxy benchmarks for Hadoop TeraSort, Hadoop Kmeans, Hadoop PageRank and Hadoop SIFT from BigDataBench, respectively. The input data to each proxy benchmark has the same data type and distribution as that of the corresponding Hadoop benchmark, so as to preserve the impact of data on workload behaviors. As described in Section 2.2, the dwarf components are implemented with the POSIX threads model, considering the phases of the Hadoop framework (input data partition, chunk data allocation per thread, intermediate data output to disk, and data combination), and each dwarf component includes a unified memory management module whose mechanism is similar to JVM garbage collection.

Finally, we use the auto-tuning tool to generate the proxy benchmarks. The process of constructing these four proxy benchmarks shows that our neural network model is effective. Table 3 lists the benchmark details from the perspectives of data set, involved dwarfs, and involved dwarf components.

3.4 Discussion

| Methodology | Typical Benchmark | Input Data | Different Microarchitecture | Multi-core Scalability | System Evaluation | Accuracy |
|---|---|---|---|---|---|---|
| Kernel Benchmark | NPB [17] | Fixed | Recompile | Yes | Yes | Low |
| Synthetic Trace | SimPoint [19] | Fixed | Regenerate | No | No | High |
| Synthetic Benchmark | PerfProx [20] | Fixed | Regenerate | No | No | High |
| Dwarf-Based Proxy Benchmark | Dwarf Benchmark | On-demand | Auto-tuning | Yes | Yes | High |

Table 4: Comparison of Different Simulation Methodologies for Big Data Workloads.

Table 4 compares the four methodologies from the perspectives of input data, support for different micro-architectures, multi-core scalability, system evaluation, and accuracy.

Kernel benchmarks, which consist of a set of kernels extracted from original applications [18, 28], are widely used in high performance computing. However, they are insufficient to completely reflect workload behaviors and have limited usefulness in making overall comparisons [17, 18], especially for complex big data workloads.

Synthetic trace methods generate an instruction stream from a real trace or a statistical profile, as in SimPoint [19]. However, the real trace or statistical profile is obtained with the aid of a functional simulator or a binary instrumentation tool (e.g., Pin [29]), which is time-consuming and costly, and complex big data software stacks and distributed deployments further aggravate this challenge, especially for multiple architecture or workload configurations [16]. For example, previous work uses Pin and SimPoint to generate synthetic traces, but Pin lacks support for diverse architectures (e.g., the ARM architecture) and the Java environment [20], so it is difficult to use Pin in big data systems like Hadoop. Another method is to use a functional simulator (e.g., GEM5 [30] SE mode) and SimPoint to obtain traces; however, GEM5 has limited support for distributed deployment and also takes a long time.

Synthetic benchmark methods generate assembly code or C code according to workload profiling [31], which can be executed on real hardware as well as on execution-driven simulators. However, existing synthetic benchmarks merely mimic micro-architectural metrics and generate synthetic code without computation logic or a multi-thread model, so they are limited in reflecting system-level behaviors such as multi-core scalability and user-observed performance speedups. Also, synthetic benchmarks need to be regenerated for each architecture or workload configuration.

Our dwarf-based benchmarking methodology uses multi-threaded programs that preserve similar computation logic to mimic the behaviors of big data workloads. Our proxy benchmarks can adapt to different data inputs and support cross-architecture comparison with only recompilation. As for simulation accuracy, they reflect not only the micro-architectural behaviors but also the system-level behaviors of real big data analytics workloads.

4 Evaluation

In this section, we evaluate our proxy benchmarks from the perspectives of runtime speedup and accuracy.

4.1 Experiment Setups

We deploy a five-node cluster with one master node and four slave nodes, connected by a 1 Gb Ethernet network. Each node is equipped with two Intel Xeon E5645 (Westmere) processors, and each processor has six physical out-of-order cores. The memory of each node is 32 GB. Each node runs Linux CentOS 6.4 with Linux kernel version 3.11.10. The JDK and Hadoop versions are 1.7.0 and 2.7.1, respectively. The GCC version is 4.8.0, and the proxy benchmarks are compiled with the "-O2" optimization option. The hardware and software details are listed in Table 5.

To evaluate the performance data accuracy, we run the proxy benchmarks against the benchmarks from BigDataBench. We run the four Hadoop benchmarks from BigDataBench on the above five-node cluster with optimized Hadoop configurations, tuning the data block size of the Hadoop distributed file system, the memory allocation for each map/reduce job, and the number of reduce jobs according to the cluster scale and memory size. For Hadoop TeraSort, we choose 100 GB of text data produced by gensort [26]. For Hadoop Kmeans and PageRank, we choose 100 GB of sparse vector data with 90% sparsity (the sparsity of a vector indicates the proportion of zero-valued elements) and a -vertex graph, respectively, both generated by BDGS [32]. For Hadoop SIFT, we use one hundred thousand images from ImageNet [27]. For comparison, we run each of the four proxy benchmarks on one of the slave nodes.

Hardware Configurations
| CPU Type | Intel Xeon E5645, 6 cores @ 2.40 GHz |
|---|---|
| L1 DCache | 6 × 32 KB |
| L1 ICache | 6 × 32 KB |
| L2 Cache | 6 × 256 KB |
| L3 Cache | 12 MB |
| Memory | 32 GB, DDR3 |
| Disk | SATA @ 7200 RPM |
| Ethernet | 1 Gb |
| Hyper-Threading | Disabled |

Software Configurations
| Operating System | CentOS 6.4 |
|---|---|
| Linux Kernel | 3.11.10 |
| JDK Version | 1.7.0 |
| Hadoop Version | 2.7.1 |

Table 5: Node Configuration Details of Xeon E5645.

4.2 Metrics Selection and Collection

To evaluate accuracy, we choose micro-architectural and system metrics covering instruction mix, cache behavior, branch prediction, processor performance, memory bandwidth and disk I/O behavior. Table 6 presents the metrics we choose.

Processor Performance. We choose two metrics to measure overall processor performance. Instructions per cycle (IPC) indicates the average number of instructions executed per clock cycle. Million instructions per second (MIPS) indicates the instruction execution speed.

Instruction Mix. We consider the instruction mix breakdown including the percentage of integer instructions, floating-point instructions, load instructions, store instructions and branch instructions.

Branch Prediction. Branch prediction is an important strategy used in modern processors. We track the misprediction ratio of branch instructions (br_miss for short).

Cache Behavior. We evaluate cache efficiency using cache hit ratios, including L1 instruction cache, L1 data cache, L2 cache and L3 cache.

Memory Bandwidth. We measure the data load rate from memory and the data store rate into memory, in bytes per second. We choose the metrics of memory read bandwidth (read_bw for short), memory write bandwidth (write_bw for short), and total memory bandwidth including both read and write (mem_bw for short).

Disk I/O Behavior. We employ I/O bandwidth to reflect the I/O behaviors of workloads.

We collect the micro-architectural metrics from hardware performance monitoring counters (PMCs), looking up the hardware event encodings in the Intel Developer's Manual [33]. Perf [34] is used to collect these hardware events. To guarantee accuracy and validity, we run each workload three times and collect the performance data of the workloads on all slave nodes during the whole runtime; we report and analyze the average values.
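For instance, the two processor-performance metrics reduce to simple ratios over the collected counters; a sketch with hypothetical counter values:

```python
def processor_performance(instructions, cycles, runtime_seconds):
    """Derive IPC and MIPS from perf's 'instructions' and 'cycles'
    counters and the measured wall-clock runtime."""
    ipc = instructions / cycles
    mips = instructions / (runtime_seconds * 1e6)
    return ipc, mips

# Example with hypothetical counter values from one run:
ipc, mips = processor_performance(instructions=3.2e12, cycles=2.9e12,
                                  runtime_seconds=1500.0)
```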

Micro-architectural Metrics
| Category | Metric Name | Description |
|---|---|---|
| Processor Performance | IPC | Instructions per cycle |
| | MIPS | Million instructions per second |
| Instruction Mix | Instruction ratios | Ratios of floating-point, load, store, branch and integer instructions |
| Branch Prediction | Branch Miss | Branch misprediction ratio |
| Cache Behavior | L1I Hit Ratio | L1 instruction cache hit ratio |
| | L1D Hit Ratio | L1 data cache hit ratio |
| | L2 Hit Ratio | L2 cache hit ratio |
| | L3 Hit Ratio | L3 cache hit ratio |

System Metrics
| Category | Metric Name | Description |
|---|---|---|
| Memory Bandwidth | Read Bandwidth | Memory load bandwidth |
| | Write Bandwidth | Memory store bandwidth |
| | Total Bandwidth | Memory load and store bandwidth |
| Disk I/O Behavior | Disk I/O Bandwidth | Disk read and write bandwidth |

Table 6: System and Micro-architectural Metrics.

4.3 Runtime Speedup

Table 7 presents the execution time of the Hadoop benchmarks and the proxy benchmarks on the Xeon E5645. Hadoop TeraSort with 100 GB of text data runs 1500 seconds on the five-node cluster. Hadoop Kmeans with 100 GB of vectors runs 5971 seconds per iteration. Hadoop PageRank with a -vertex graph runs 1444 seconds per iteration. Hadoop SIFT with one hundred thousand images runs 721 seconds. The four corresponding proxy benchmarks each run in about ten seconds on a physical machine. For TeraSort, Kmeans, PageRank and SIFT, the speedups are 136X (1500/11.02), 743X (5971/8.03), 160X (1444/9.03) and 90X (721/8.02), respectively.

| Workloads | Hadoop version (s) | Proxy version (s) |
|---|---|---|
| TeraSort | 1500 | 11.02 |
| Kmeans | 5971 | 8.03 |
| PageRank | 1444 | 9.03 |
| SIFT | 721 | 8.02 |

Table 7: Execution Time on Xeon E5645.

4.4 Accuracy

We evaluate the accuracy of all metrics listed in Table 6. For each metric, the accuracy of the proxy benchmark with respect to the Hadoop benchmark is computed by Equation 1, where Val_H represents the average value of the metric for the Hadoop benchmark on all slave nodes, and Val_P represents the average value for the proxy benchmark on a slave node. Its absolute value ranges from 0 to 1; the closer to 1, the higher the accuracy.

Accuracy(Val_H, Val_P) = 1 - |(Val_P - Val_H) / Val_H|    (1)
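Expressed as code, the per-metric accuracy is a direct transcription of Equation 1:

```python
def accuracy(val_hadoop, val_proxy):
    """Equation 1: 1 - |(Val_P - Val_H) / Val_H|, where Val_H is the metric's
    average over all slave nodes for the Hadoop benchmark and Val_P is the
    proxy benchmark's average on a slave node."""
    return 1 - abs((val_proxy - val_hadoop) / val_hadoop)

# Example: disk I/O bandwidth of Hadoop TeraSort (33.99 MB/s) vs.
# Proxy TeraSort (32.04 MB/s) yields roughly 94% accuracy.
print(accuracy(33.99, 32.04))  # ~0.943
```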

Fig. 6 presents the system and micro-architectural data accuracy of the proxy benchmarks on the Xeon E5645. We find that the average accuracy of all metrics is greater than 90%; for TeraSort, Kmeans, PageRank and SIFT, the average accuracy is 94%, 91%, 93% and 94%, respectively. Fig. 7 shows the instruction mix breakdown of the proxy benchmarks and the Hadoop benchmarks. From Fig. 7, we find that the four proxy benchmarks preserve the instruction mix characteristics of the four Hadoop benchmarks. For example, integer instructions account for 44% of Hadoop TeraSort and 46% of Proxy TeraSort, while floating-point instructions account for less than 1% of both. For instructions involving data movement, Hadoop TeraSort contains 39% load and store instructions, and Proxy TeraSort contains 37%. The SIFT workload, widely used in computer vision for image processing, has many floating-point instructions; Proxy SIFT likewise preserves the instruction mix characteristics of Hadoop SIFT.

Figure 6: System and Micro-architectural Data Accuracy on Xeon E5645.
Figure 7: Instruction Mix Breakdown on Xeon E5645.

4.4.1 Disk I/O Behaviors

Big data applications exert significant pressure on disk I/O. To evaluate the disk I/O behaviors of the proxy benchmarks, we compute the disk I/O bandwidth using Equation 2, where N_sector is the total number of sector reads and sector writes, and S_sector is the sector size (512 bytes on our nodes).

DiskBandwidth = (N_sector × S_sector) / RunTime    (2)
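A sketch of how the sector counts can be read on Linux, assuming the standard /proc/diskstats layout (sectors read and sectors written are the 6th and 10th whitespace-separated fields of each line); this is our illustration, not necessarily how the paper's tooling collects them.

```python
import time

SECTOR_SIZE = 512  # bytes per sector on our nodes

def total_sectors(device="sda"):
    """Sum sector reads and sector writes for a device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[5]) + int(parts[9])
    raise ValueError(f"device {device} not found")

def disk_bandwidth(device, run_workload):
    """Equation 2: (sector count * sector size) / runtime, in bytes/s."""
    before, start = total_sectors(device), time.time()
    run_workload()
    return (total_sectors(device) - before) * SECTOR_SIZE / (time.time() - start)
```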

Fig. 8 presents the disk I/O bandwidth of the proxy benchmarks and the Hadoop benchmarks on the Xeon E5645. We find that they exert similar average disk I/O pressure: the disk I/O bandwidth of Proxy TeraSort and Hadoop TeraSort is 32.04 MB/s and 33.99 MB/s, respectively.

Figure 8: Disk I/O Bandwidth on Xeon E5645.

4.4.2 The Impact of Input Data

In this section, we demonstrate that when we change the input data sparsity, our proxy benchmarks still mimic the big data workloads with high accuracy. We run Hadoop Kmeans with the same Hadoop configurations on 100 GB of input data, using two different inputs: sparse vectors (the original configuration, with 90% of elements zero) and dense vectors (with no zero-valued elements). Fig. 9 presents the performance difference between the two inputs. We find that the memory bandwidth with sparse vectors is nearly half of that with dense vectors, which confirms the impact of input data on micro-architectural performance.

Figure 9: Data Impact on Memory Bandwidth Using Sparse and Dense Data for Hadoop Kmeans on Xeon E5645.

On the other hand, Fig. 10 presents the accuracy of the proxy benchmark using different input data. The average micro-architectural data accuracy of Proxy Kmeans is above 91% with respect to the fully-distributed Hadoop Kmeans using dense input data with no zero-valued elements. When we change the input data sparsity from 0% to 90%, the data accuracy of Proxy Kmeans remains above 91% with respect to the original workload. Hence, Proxy Kmeans can mimic Hadoop Kmeans under different input data.

Figure 10: System and Micro-architectural Data Accuracy Using Different Data Input on Xeon E5645.

5 Case Studies on ARM Processors

This section demonstrates that our proxy benchmarks can also mimic the original benchmarks on ARMv8 processors. We report our joint evaluation with our industry partner on ARMv8 processors using Hadoop TeraSort, Kmeans and PageRank, and the corresponding proxy benchmarks. Our evaluation covers widely accepted metrics: runtime speedup, performance accuracy, and several other concerns such as multi-core scalability and system evaluation across different processors.

5.1 Experiment Setup

Due to the resource limitations of ARMv8 processors, we use a two-node (one master and one slave) cluster in which each node is equipped with one ARMv8 processor. In addition, we deploy a two-node (one master and one slave) cluster in which each node is equipped with one Xeon E5-2690 v3 (Haswell) processor for speedup comparison. Each ARMv8 processor has 32 physical cores; each core has independent L1 instruction and L1 data caches, every four cores share an L2 cache, and all cores share the last-level cache. The memory of each node is 64 GB. Each Haswell processor has 12 physical cores; each core has independent L1 and L2 caches, and all cores share the last-level cache. The memory of each node is also 64 GB. To narrow the gap in logical core numbers between the two architectures, we enable hyper-threading on the Haswell processor. Table 8 lists the hardware and software details of the two platforms.

Considering the memory size of the cluster, we use 50 GB of text data generated by gensort for Hadoop TeraSort, 50 GB of dense vectors for Hadoop Kmeans, and -vertex graph data for Hadoop PageRank. We run the Hadoop benchmarks with optimized configurations, tuning the data block size of the Hadoop distributed file system, the memory allocation for each job, and the number of reduce tasks according to the cluster scale and memory size. For comparison, we run the proxy benchmarks on the slave node. Our industry partner pays close attention to cache and memory access patterns, which are important micro-architectural and system metrics for chip design, so we mainly collect cache-related and memory-related performance data.

Hardware Configurations
| | ARMv8 | Xeon E5-2690 V3 |
|---|---|---|
| Number of Processors | 1 | 1 |
| Number of Cores | 32 | 12 |
| Frequency | 2.1 GHz | 2.6 GHz |
| L1 Cache (I/D) | 48 KB / 32 KB | 32 KB / 32 KB |
| L2 Cache | 8 × 1024 KB | 12 × 256 KB |
| L3 Cache | 32 MB | 30 MB |
| Architecture | ARM | X86_64 |
| Memory | 64 GB, DDR4 | 64 GB, DDR4 |
| Ethernet | 1 Gb | 1 Gb |
| Hyper-Threading | None | Enabled |

Software Configurations
| | ARMv8 | Xeon E5-2690 V3 |
|---|---|---|
| Operating System | EulerOS V2.0 | Red Hat Enterprise Linux Server release 7.0 |
| Linux Kernel | 4.1.23-vhulk3.6.3.aarch64 | 3.10.0-123.el7.x86_64 |
| GCC Version | 4.9.3 | 4.8.2 |
| JDK Version | jdk1.8.0_101 | jdk1.7.0_79 |
| Hadoop Version | 2.5.2 | 2.5.2 |

Table 8: Platform Configurations.

5.2 Runtime Speedup on ARMv8

Table 9 presents the execution time of the Hadoop benchmarks and the proxy benchmarks on ARMv8. Our proxy benchmarks run within 10 seconds on the ARMv8 processor. On the two-node cluster equipped with ARMv8 processors, Hadoop TeraSort with 50 GB of text data runs 1378 seconds, Hadoop Kmeans with 50 GB of vectors runs 3347 seconds for one iteration, and Hadoop PageRank with a -vertex graph runs 4291 seconds for five iterations. In contrast, the corresponding proxy benchmarks run 4102, 8677 and 6219 milliseconds, respectively. For TeraSort, Kmeans and PageRank, the speedups are 336X (1378/4.10), 386X (3347/8.68) and 690X (4291/6.22), respectively.

| Workloads | Hadoop version (s) | Proxy version (s) |
|---|---|---|
| TeraSort | 1378 | 4.10 |
| Kmeans | 3347 | 8.68 |
| PageRank | 4291 | 6.22 |

Table 9: Execution Time on ARMv8.

5.3 Accuracy on ARMv8

We report the system and micro-architectural data accuracy of the Hadoop benchmarks and the proxy benchmarks. Likewise, we evaluate the accuracy by Equation 1. Fig. 11 presents the accuracy of the proxy benchmarks on the ARMv8 processor. We find that the average data accuracy is above 90% for all workloads: for TeraSort, Kmeans and PageRank, the average accuracy is 93%, 95% and 92%, respectively.

Figure 11: System and Micro-architectural Data Accuracy on ARMv8.

5.4 Multi-core Scalability on ARMv8

ARMv8 has 32 physical cores, and we evaluate its multi-core scalability using the Hadoop benchmarks and the proxy benchmarks on 4, 8, 16 and 32 cores. For each experiment, we disable the specified number of CPU cores through the CPU hotplug mechanism. For the Hadoop benchmarks, we adjust the Hadoop configurations to obtain peak performance. For the proxy benchmarks, we run them directly without any modification.

Fig. 12 reports multi-core scalability in terms of runtime and MIPS. The horizontal axis represents the core number and the vertical axis represents runtime or MIPS. Due to the large runtime gap between the Hadoop benchmarks and the proxy benchmarks, we list their runtimes on different sides of the vertical axis: the left side indicates the runtime of the Hadoop benchmarks, while the right side indicates the runtime of the proxy benchmarks. We find that they have similar multi-core scalability trends in terms of both runtime and instruction execution speed.

(a) TeraSort
(b) Kmeans
(c) PageRank
Figure 12: Multi-core Scalability of the Hadoop benchmarks and Proxy Benchmarks on ARMv8.

5.5 System Evaluation across Different Processors

System evaluation across different processors is another concern of our industry partner. We use the proxy benchmarks to evaluate the runtime speedup across two different architectures: ARMv8 and Xeon E5-2690 V3 (Haswell). The runtime speedup is computed using Equation 3. The Hadoop configurations are optimized for each hardware environment, and the proxy benchmarks use the same version on both architectures.

Speedup = RunTime_ARMv8 / RunTime_Haswell    (3)

Fig. 13 shows the runtime speedups of the Hadoop benchmarks and the proxy benchmarks across ARM and X86_64 architectures. We can find that they have consistent speedup trends. For example, Hadoop TeraSort runs 1378 seconds and 856 seconds on ARMv8 and Haswell, respectively. Proxy TeraSort runs 4.1 seconds and 2.56 seconds on ARMv8 and Haswell, respectively. The runtime speedups between ARMv8 and Haswell are 1.61 (1378/856) running Hadoop TeraSort, and 1.60 (4.1/2.56) running Proxy TeraSort.

Figure 13: Runtime Speedup across Different Processors.

6 Related Work

Multiple benchmarking methodologies have been proposed over the past few decades. The simplest one is to create a new benchmark for every possible workload. PARSEC [35] provides a series of shared-memory programs for chip-multiprocessors. BigDataBench [1] is a benchmark suite providing dozens of big data workloads. CloudSuite [2] consists of eight applications selected based on popularity. These benchmarking methods need to provide individual implementations for every possible workload and must keep expanding the benchmark set to cover emerging workloads. Moreover, it is prohibitively time-consuming to run (component or application) benchmarks like BigDataBench or CloudSuite on simulators because of their complex software stacks and long running times. Using reduced data input is one way to shorten execution time: previous work [36, 37] adopts reduced data sets for the SPEC benchmarks while maintaining architecture behaviors similar to those with the full reference data sets.

Kernel benchmarks are widely used in high performance computing. The Livermore kernels [38] use Fortran applications to measure the floating-point performance range. The NAS parallel benchmarks [17] consist of several separate tests, including five kernels and three pseudo-applications derived from computational fluid dynamics (CFD) applications. Linpack [39] provides a collection of Fortran subroutines. Kernel benchmarks are insufficient to completely reflect workload behaviors, considering the complexity and diversity of big data workloads [17, 18].

In terms of micro-architectural simulation, many previous studies generate synthetic benchmarks as proxies [40, 41]. Statistical simulation [14, 15, 16, 42, 43, 44] generates synthetic traces or synthetic benchmarks to mimic the micro-architectural performance of long-running real workloads; it targets one workload on a specific architecture with certain configurations, and thus each benchmark needs to be regenerated for other architectures or configurations [45]. Sampled simulation selects a series of sample units for simulation instead of the entire instruction stream, with units sampled randomly [10], periodically [11, 12] or based on phase behavior [13]. Seongbeom et al. [46] accelerate full-system simulation by characterizing and predicting the performance behavior of OS services. For emerging big data workloads, PerfProx [47] proposes a proxy benchmark generation framework for real-world database applications through characterizing low-level dynamic execution characteristics.

Our big data dwarfs are inspired by previous successful abstractions in other application scenarios. The set concept in relational algebra [3] abstracted five primitive and fundamental operators, setting off a wave of relational database research; the set abstraction is the basis of relational algebra and the theoretical foundation of databases. Phil Colella [5] identified seven dwarfs of numerical methods that he thought would be important for the next decade. Based on that, a multidisciplinary group of Berkeley researchers proposed 13 dwarfs as high-level abstractions of parallel computing, capturing the computation and communication patterns of a great mass of applications [6].

7 Conclusions

In this paper, we answer what the abstractions of frequently-appearing units of computation in big data analytics are. We identify eight big data dwarfs among a wide variety of big data analytics workloads: matrix, sampling, logic, transform, set, graph, sort and basic statistic computations. We propose a dwarf-based scalable big data benchmarking methodology and construct big data proxy benchmarks using DAG-like combinations of dwarf components with different weights to mimic the benchmarks in BigDataBench. Our proxy benchmarks shorten the execution time by hundreds of times with respect to the benchmarks from BigDataBench, while the average micro-architectural and system data accuracy is above 90% on both X86_64 and ARM architectures.

References

  • [1] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “Bigdatabench: A big data benchmark suite from internet services,” in IEEE International Symposium On High Performance Computer Architecture (HPCA), 2014.
  • [2] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study of emerging workloads on modern hardware,” in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
  • [3] E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
  • [4] Y. Chen, F. Raab, and R. Katz, “From tpc-c to big data benchmarks: A functional workload model,” in Specifying Big Data Benchmarks, pp. 28–43, Springer, 2014.
  • [5] P. Colella, “Defining software requirements for scientific computing,” 2004.
  • [6] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and Y. Katherine, “The landscape of parallel computing research: A view from berkeley,” tech. rep., Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
  • [7] M. Shah, P. Ranganathan, J. Chang, N. Tolia, D. Roberts, and T. Mudge, “Data dwarfs: Motivating a coverage set for future large data center workloads,” in Proc. Workshop Architectural Concerns in Large Datacenters, 2010.
  • [8] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi, “The hpc challenge (hpcc) benchmark suite,” in Proceedings of the 2006 ACM/IEEE conference on Supercomputing, p. 213, Citeseer, 2006.
  • [9] National Research Council, “Frontiers in massive data analysis,” The National Academies Press, Washington, DC, 2013.
  • [10] T. M. Conte, M. A. Hirsch, and K. N. Menezes, “Reducing state loss for effective trace sampling of superscalar processors,” in IEEE International Conference on Computer Design (ICCD), 1996.
  • [11] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “Smarts: Accelerating microarchitecture simulation via rigorous statistical sampling,” in IEEE International Symposium on Computer Architecture (ISCA), 2003.
  • [12] Z. Yu, H. Jin, J. Chen, and L. K. John, “Tss: Applying two-stage sampling in micro-architecture simulations,” in IEEE International Symposium on Modeling, Analysis, Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
  • [13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in ACM SIGARCH Computer Architecture News, vol. 30, pp. 45–57, 2002.
  • [14] K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and V. S. Pai, “Challenges in computer architecture evaluation,” IEEE Computer, vol. 36, no. 8, pp. 30–36, 2003.
  • [15] L. Eeckhout, R. H. Bell Jr, B. Stougie, K. De Bosschere, and L. K. John, “Control flow modeling in statistical simulation for accurate and efficient processor design studies,” ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 350, 2004.
  • [16] L. Eeckhout, K. De Bosschere, and H. Neefs, “Performance analysis through synthetic trace generation,” in Performance Analysis of Systems and Software, 2000. ISPASS. 2000 IEEE International Symposium on, pp. 1–6, IEEE, 2000.
  • [17] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, “The nas parallel benchmarks,” The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.
  • [18] D. J. Lilja, Measuring computer performance: a practitioner’s guide. Cambridge university press, 2005.
  • [19] G. Hamerly, E. Perelman, J. Lau, and B. Calder, “Simpoint 3.0: Faster and more flexible program phase analysis,” Journal of Instruction Level Parallelism, vol. 7, no. 4, pp. 1–28, 2005.
  • [20] R. Panda and L. K. John, “Proxy benchmarks for emerging big-data workloads,” in Parallel Architectures and Compilation Techniques (PACT), 2017 26th International Conference on, pp. 105–116, IEEE, 2017.
  • [21] Z. Jia, J. Zhan, L. Wang, R. Han, S. A. McKee, Q. Yang, C. Luo, and J. Li, “Characterizing and subsetting big data workloads,” in IEEE International Symposium on Workload Characterization (IISWC), 2014.
  • [22] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the mpi message passing interface standard,” Parallel computing, vol. 22, no. 6, pp. 789–828, 1996.
  • [23] “Hadoop.” http://hadoop.apache.org/.
  • [24] “Spark.” https://spark.apache.org/.
  • [25] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [26] “Gensort.” http://www.ordinal.com/gensort.html.
  • [27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
  • [28] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011.
  • [29] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” in Acm sigplan notices, vol. 40, pp. 190–200, ACM, 2005.
  • [30] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
  • [31] L. Van Ertvelde and L. Eeckhout, “Benchmark synthesis for architecture and compiler exploration,” in Workload Characterization (IISWC), 2010 IEEE International Symposium on, pp. 1–11, IEEE, 2010.
  • [32] Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, “Bdgs: A scalable big data generator suite in big data benchmarking,” arXiv preprint arXiv:1401.5465, 2014.
  • [33] Intel, “Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A,” System Programming Guide, Part 1, 2010.
  • [34] “Perf tool.” https://perf.wiki.kernel.org/index.php/Main_Page.
  • [35] C. Bienia, Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
  • [36] A. J. KleinOsowski, J. Flynn, N. Meares, and D. J. Lilja, Adapting the SPEC 2000 Benchmark Suite for Simulation-Based Computer Architecture Research, pp. 83–100. Boston, MA: Springer US, 2001.
  • [37] A. KleinOsowski and D. J. Lilja, “Minnespec: A new spec benchmark workload for simulation-based computer architecture research,” IEEE Computer Architecture Letters, vol. 1, no. 1, pp. 7–7, 2002.
  • [38] F. H. McMahon, “The livermore fortran kernels: A computer test of the numerical performance range,” tech. rep., Lawrence Livermore National Lab., CA (USA), 1986.
  • [39] J. J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: practice and experience, vol. 15, no. 9, pp. 803–820, 2003.
  • [40] R. H. Bell Jr and L. K. John, “Improved automatic testcase synthesis for performance model validation,” in ACM International Conference on Supercomputing (ICS), 2005.
  • [41] K. Ganesan, J. Jo, and L. K. John, “Synthesizing memory-level parallelism aware miniature clones for spec cpu2006 and implantbench workloads,” in IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), 2010.
  • [42] S. Nussbaum and J. E. Smith, “Modeling superscalar processors via statistical simulation,” in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2001.
  • [43] M. Oskin, F. T. Chong, and M. Farrens, HLS: Combining statistical and symbolic simulation to guide microprocessor designs, vol. 28. ACM, 2000.
  • [44] L. Eeckhout and K. De Bosschere, “Early design phase power/performance modeling through statistical simulation.,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2001.
  • [45] A. M. Joshi, Constructing adaptable and scalable synthetic benchmarks for microprocessor performance evaluation. ProQuest, 2007.
  • [46] S. Kim, F. Liu, Y. Solihin, R. Iyer, L. Zhao, and W. Cohen, “Accelerating full-system simulation through characterizing and predicting operating system performance,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2007.
  • [47] R. Panda and L. K. John, “Proxy benchmarks for emerging big-data workloads,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017.