Several fundamental changes in technology, namely the end of Dennard scaling, the ending of Moore’s Law, and Amdahl’s Law with its implications for the end of the ‘easy’ multi-core era, indicate that domain-specific hardware and software co-design is the only path left [1, 2]. Among many domains, Big Data and AI attract the most attention, and hence the architecture, system, data management, and machine learning communities pay great attention to innovative big data and AI algorithms, architectures, and systems [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. Unfortunately, the complexity, diversity, and frequent change of workloads (so-called workload churn), together with the rapid evolution of big data and AI systems, raise great challenges in benchmarking and in domain-specific hardware and software co-design.
|Big data systems||Application model||1||5||Proposal||Proposal||Proposal|
|Big data analytics||Application model||1||1||10||3 data generators||3|
|Cloud services||Popularity||N/A||4||8||3 data generators||3|
|Big data systems||Popularity||N/A||6||19||Random generate or with specific distribution||5|
|CALDA ||MapReduce system and parallel DBMSs||Popularity||N/A||1||5||N/A||
|YCSB ||Cloud serving systems||Performance model||N/A||1||6||N/A||4|
|LinkBench ||Database systems||Application model||N/A||1||10||1 data generator||2|
|AMP Benchmarks ||Data analytic systems||Popularity||N/A||1||4||N/A||
|Fathom ||AI systems||Popularity||N/A||1||8||N/A||1|
|DeepBench ||AI systems||Popularity||N/A||1||4||N/A||1|
|BenchNN ||AI systems||Popularity||N/A||1||5||N/A||1|
|DNNMark ||AI systems||Popularity||N/A||1||8||N/A||1|
|DAWNBench ||AI systems||Popularity||N/A||1||2||N/A||2|
The seven workload types are online service, offline analytics, graph analytics, artificial intelligence (AI), data warehouse, NoSQL, and streaming.
The traditional benchmark methodology that creates a new benchmark or proxy for every possible workload is prohibitively costly and hence not scalable, or even impossible, for Big Data and AI benchmarking. (Footnote 1: The meaning of scalable differs from scaleable. As one of four properties of domain-specific benchmarks defined by Jim Gray, the latter refers to scaling the benchmark up to larger systems.) First, there are many classes of big data and AI applications. Even for Internet services, there are several important application domains, e.g., search engines, social networks, and e-commerce. The value of big data and AI also drives the emergence of innovative application domains. Meanwhile, data (sizes, types, sources, and patterns) have a great impact on workload behaviors and performance [2, 26], so comprehensive and representative real-world data sets should be included.
Second, at an early stage, it is usually difficult to justify porting a full-scale end-to-end Big Data or AI application to a new computer system or architecture simply to obtain a benchmark number; at a later stage, kernels alone are insufficient to completely assess the performance potential of a new system or architecture on real-world data sets and applications. Meanwhile, the benchmarks should be consistent across different communities for the co-design of software and hardware.
Third, the correctness of results and performance figures must be easily verifiable. To some extent, overly complex workloads, i.e., full-scale end-to-end Big Data or AI applications, raise difficulties in the reproducibility and interpretability of performance data.
Because modern big data and AI workloads are not only diverse but also fast-changing and expanding, they also raise great challenges in domain-specific hardware and software co-design. Even if agile hardware development methodologies and tools are adopted, it is prohibitively expensive to tailor the architecture to the characteristics of one or more applications, or even a domain of applications, and hence building domain-specific hardware and software systems case by case should be avoided.
This paper presents our joint research efforts with several industrial partners on a scalable and unified Big Data and AI benchmark suite. On the basis of our previous work, which identifies eight data motifs that take up most of the run time among a wide variety of big data and AI workloads, we propose a scalable benchmarking methodology that uses combinations of one or more data motifs (Matrix, Sampling, Transform, Graph, Logic, Set, Sort, and Statistic computation) to represent the diversity of big data and AI workloads. Our benchmark suite includes micro benchmarks, each of which is a single data motif; component benchmarks, each of which combines one or more data motifs with different weights in terms of run time; and end-to-end application benchmarks, which are combinations of component benchmarks.
Following this methodology, we present a unified big data and AI benchmark suite, BigDataBench 4.0, publicly available from http://prof.ict.ac.cn/BigDataBench. BigDataBench 4.0 provides 13 representative real-world data sets and 47 big data and AI benchmarks of seven workload types: online service, offline analytics, graph analytics, AI, data warehouse, NoSQL, and streaming. For each workload type, we also provide diverse implementations using state-of-the-art and state-of-the-practice software stacks. Data variety is covered across the whole spectrum of data types, including structured, semi-structured, and unstructured data. Using real data sets as the seed, data generators are provided to generate data at a specified scale.
On a typical state-of-practice processor, the Intel Xeon E5-2620 V3, we comprehensively characterize the benchmarks of seven workload types in BigDataBench, in addition to SPECCPU, PARSEC, and HPCC, using the Top-Down method. We classify each issued micro-operation (uop) into retiring, bad speculation, frontend bound, or backend bound, among which only retiring represents useful work. To explore the characteristics of AI workloads thoroughly, we run them on both CPUs and GPUs to evaluate their micro-architectural performance.
We have the following observations. First, the average ILP (instruction-level parallelism) and MLP (memory-level parallelism) of the AI benchmarks are almost 1.5 times higher than those of Big Data. Compared with the traditional benchmarks, i.e., SPECCPU, PARSEC, and HPCC, the average ILP of AI is lower, and the AI frameworks need further optimizations such as instruction mix balance and memory access locality.
Second, at the uppermost level of the breakdown, AI shows pipeline behaviors similar to those of the traditional benchmarks. However, a deeper look shows that the bottlenecks that incur the frontend and backend stalls are different, which means that AI benchmarks have computation patterns distinct from those of the traditional benchmarks. Corroborating the observations in previous work [4, 31, 32], the frontend bound of Big Data is more severe than that of the traditional benchmarks. However, we notice that the frontend bound varies across different workload types.
Third, Big Data and AI have more CISC instructions that cannot be decoded by the default decoders, almost 10 times more than the traditional benchmarks, so they suffer from more penalties because of switching to a special unit. Fourth, corroborating previous work, the first bottleneck is backend bound for Big Data and AI. However, different from previous work, we observe that the data movement delay across the memory hierarchy is the main reason for backend bound, especially the latency of accesses to DRAM. Fifth, the utilization of GPU resources varies across AI benchmarks. Stalls caused by data movement limit their performance on GPUs. In addition, the iteration number has little impact on the architectural behaviors of AI.
In summary, our contributions are three-fold as follows.
1) We propose a data motif-based scalable benchmarking methodology.
2) We present a unified big data and AI benchmark suite—BigDataBench 4.0.
3) We thoroughly perform workload characterizations of big data and AI benchmarks on CPUs and GPUs, respectively.
The rest of this paper is organized as follows. In Section 2, we present the related work and background. Section 3 summarizes our benchmarking methodology. Section 4 presents our unified Big Data and AI benchmark suite, BigDataBench 4.0. Section 5 illustrates the experiment configurations. In Section 6, we present the characterization results. Finally, we draw the conclusion in Section 7.
2 Related Work
Identifying units of computation in corresponding domains is an important step towards understanding various workloads [37, 28, 2]. The TPC-C benchmark is based on the units of computation in the OLTP domain. The HPCC benchmark abstracts seven basic operations in high performance computing. Following the ‘pencil and paper’ specification, the NAS parallel benchmark consists of five ‘parallel kernel’ benchmarks and three ‘simulated application’ benchmarks, which together mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. Our previous work identifies eight data motifs among a wide variety of big data and AI workloads, including Matrix, Sampling, Transform, Graph, Logic, Set, Sort and Statistic computations. The National Research Council identifies seven major tasks in massive data analysis, which are macroscopic definitions of problems from a mathematical perspective. Fox et al. build a set of Big Data application characteristics with 50 features, which they call facets and divide into 4 views. As a machine learning framework, TensorFlow adopts a dataflow-based programming abstraction, using individual mathematical operators as nodes in the dataflow graph. Cambricon is an instruction set architecture for neural networks, which is an abstraction at the instruction level.
Big Data and AI attract great attention and have motivated many research efforts on big data and AI benchmarking, as illustrated in Table I. Our previous work, BigDataBench 2.0, abstracts three application domains and provides nineteen workloads covering offline analytics, online services, and data warehouse, targeting big data systems and architecture. BigBench 1.0 adopts a product retailer business model based on TPC-DS and targets big data analytics workloads. BigBench 2.0 is a proposal that still focuses on the retail business model and adds four workload types of streaming, key-value processing, graph processing, and a multimedia data type. CloudSuite 3.0 is a benchmark suite for cloud services that chooses workloads according to popularity, including four workload types and eight workloads in total. It evaluates server inefficiencies from the frontend and backend; however, the analysis does not drill down into the deeper levels. HiBench 6.0 also chooses workloads according to popularity, containing six workload types and nineteen workloads, including micro benchmarks, machine learning, SQL, graph, web search, and streaming categories. YCSB, released by Yahoo!, is a benchmark for data storage systems and only includes online service workloads, i.e., Cloud OLTP. Its workloads are mixes of read/write operations that cover a wide performance space. CALDA is a benchmarking effort targeting MapReduce systems and parallel DBMSs. Its workloads come from the original MapReduce paper, with four complex analytical tasks added. LinkBench is a synthetic benchmark for database systems that models the data schema and workload patterns of Facebook. The AMP benchmark is a big data benchmark proposed by UC Berkeley, which focuses on real-time analytic applications. Its workloads are from the CALDA benchmark.
A series of AI benchmarks have also been proposed. Fathom provides eight deep learning workloads implemented with TensorFlow. DeepBench consists of four operations involved in training deep neural networks, including three basic operations and recurrent layer types. BenchNN develops and evaluates software neural network implementations of 5 (out of 12) high-performance applications from the PARSEC benchmark suite. DNNMark is a GPU benchmark suite that consists of a collection of deep neural network primitives. Tonic Suite presents seven neural network workloads that use the DjiNN service. DAWNBench is a benchmark and competition focusing on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference time with that accuracy. It focuses on two tasks: image classification on CIFAR10 and ImageNet, and question answering on SQuAD. SLAB (Scalable Linear Algebra Benchmarking) presents a suite of linear algebra (LA) tests based on the analysis of data access and communication patterns of LA workloads.
3 Data motif-based scalable benchmarking Methodology
In this section, we introduce our data motif-based scalable benchmarking methodology.
We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being specified only algorithmically in a ‘paper-and-pencil’ approach and reasonably divorced from individual implementations, and hence we call it a data motif. Significantly different from traditional kernels, a data motif’s behaviors are affected by its data sizes, patterns, types, and sources, reflecting not only computation and memory access patterns but also disk and network I/O patterns.
3.1 Background of Eight Data Motifs
After profiling forty big data and AI workloads with a broad spectrum, our previous work identifies eight unified data motifs among big data and AI workloads, including Matrix, Sampling, Transform, Graph, Logic, Set, Sort and Statistic computations. Among them, matrix computation involves vector-vector, vector-matrix and matrix-matrix computations. Sampling selects a subset of the original data from within a statistical population. Transform computation converts data from the original domain to another domain, such as FFT. Graph computation uses nodes to represent entities and edges to represent dependencies. Logic computation performs bit manipulation. Set computation covers operations on one or more collections of data. Note that the primitive operators of relational algebra are also classified as set computation in our motif taxonomy. Sort and statistic computations are fundamental units of computation in big data and AI. For online services, get, put, post, and delete are identified as basic and abstract operations in previous work, so we use them directly to construct online service benchmarks and do not include those four in our motif set.
3.2 Benchmarking Methodology
Fig. 1 summarizes our data motif-based scalable benchmarking methodology for BigDataBench 4.0, which separates the specification from the implementation. First, by investigating typical application domains using widely accepted metrics, e.g., page views for Internet services, we thoroughly analyze these domains in terms of processing logic and data pipeline. Second, we choose representative workloads from these domains. After profiling these workloads, we analyze their computation dependency graphs and run time breakdowns, and find the hotspot functions. Combining this with algorithmic analysis, we decompose the workloads and summarize the frequently appearing and time-consuming units of computation within them as data motifs. Finally, around the data motifs identified from these application domains, we define the specifications of micro, component, and end-to-end application benchmarks as the guidelines for benchmark implementation. The specifications of micro, component, and application benchmarks are as follows.
Micro Benchmark Specification. As illustrated in Subsection 3.1, data motifs are fundamental concepts and unified units of computation among a majority of big data and AI workloads. We design a suite of micro benchmarks, each of which is a single data motif widely used in the investigated application domains, as listed in Table II.
Component Benchmark Specification. For the sake of benchmarking scalability, we use motif combinations to compose the original complex workloads with a DAG-like structure that follows the data pipeline. In the DAG-like structure, a node represents an original or intermediate data set being processed, and an edge represents a data motif. Table III lists the component benchmarks. For example, SIFT is a combination of five data motifs, including matrix, sampling, transform, sort and statistic computations. Fig. 2 presents its DAG-like structure, which specifies how the original or intermediate data sets are operated on by different motifs.
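The DAG-like structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ComponentBenchmark`, the node names, and the run time weights below are hypothetical and are not taken from Fig. 2 or from the BigDataBench implementation.

```python
from collections import defaultdict

# The eight data motifs named in Subsection 3.1.
MOTIFS = {"matrix", "sampling", "transform", "graph",
          "logic", "set", "sort", "statistic"}


class ComponentBenchmark:
    """A component benchmark as a DAG: nodes are data sets, edges are data motifs."""

    def __init__(self, name):
        self.name = name
        # Each edge: (source data set, motif, destination data set, run time weight).
        self.edges = []

    def add_edge(self, src, motif, dst, weight):
        if motif not in MOTIFS:
            raise ValueError(f"unknown motif: {motif}")
        self.edges.append((src, motif, dst, weight))

    def motif_weights(self):
        """Return the normalized share of run time attributed to each motif."""
        totals = defaultdict(float)
        for _, motif, _, weight in self.edges:
            totals[motif] += weight
        total = sum(totals.values())
        return {motif: w / total for motif, w in totals.items()}

    def is_acyclic(self):
        """Check the DAG property with Kahn's topological-sort algorithm."""
        indegree = defaultdict(int)
        successors = defaultdict(list)
        nodes = set()
        for src, _, dst, _ in self.edges:
            nodes.update((src, dst))
            successors[src].append(dst)
            indegree[dst] += 1
        ready = [n for n in nodes if indegree[n] == 0]
        visited = 0
        while ready:
            node = ready.pop()
            visited += 1
            for nxt in successors[node]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
        return visited == len(nodes)


# Hypothetical SIFT-like pipeline; node names and weights are made up for illustration.
sift = ComponentBenchmark("SIFT-like example")
sift.add_edge("input image", "transform", "scale space", 0.4)
sift.add_edge("scale space", "sampling", "downsampled octaves", 0.1)
sift.add_edge("scale space", "matrix", "gradients", 0.2)
sift.add_edge("downsampled octaves", "sort", "candidate keypoints", 0.1)
sift.add_edge("gradients", "statistic", "descriptors", 0.2)

print(sift.motif_weights())
print(sift.is_acyclic())
```

Keeping the weights on the edges makes it straightforward to derive the per-motif run time share that the methodology uses to characterize a component benchmark.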
|Micro Benchmark||Involved Motif||Application Domain||Workload Type||Data Set||Software Stack|
|Sort||Sort||SE, SN, EC, MP, BI||Offline analytics||Wikipedia entries||Hadoop, Spark, Flink, MPI|
|Grep||Set||Offline analytics||Wikipedia entries||Hadoop, Spark, Flink, MPI|
|Streaming||Random Generate||Spark streaming|
|WordCount||Basic statistics||Offline analytics||Wikipedia entries||Hadoop, Spark, Flink, MPI|
|MD5||Logic||Offline analytics||Wikipedia entries||Hadoop, Spark, MPI|
|Connected Component||Graph||SN||Graph analytics||Facebook social network||Hadoop, Spark, Flink, GraphLab, MPI|
|RandSample||Sampling||SE, MP, BI||Offline analytics||Wikipedia entries||Hadoop, Spark, MPI|
|FFT||Transform||MP||Offline analytics||Two-dimensional matrix||Hadoop, Spark, MPI|
|Offline analytics||Two-dimensional matrix||Hadoop, Spark, MPI|
|Read / Write / Scan||Set||SE, SN, EC||NoSQL||ProfSearch resumes||HBase, MongoDB|
|Convolution||Transform||SN, EC, MP, BI||AI||Cifar, ImageNet|
|Fully Connected||Matrix||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|Relu||Logic||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|Sigmoid / Tanh||Matrix||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|MaxPooling||Sampling||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|AvgPooling||Sampling||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|CosineNorm ||Basic Statistics||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|BatchNorm ||Basic Statistics||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|Dropout ||Sampling||SN, EC, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
Application Benchmark Specification. To model an application domain, we define an end-to-end application benchmark specification that considers user characteristics and processing logic, based on the real process of the application domain. We abstract the primary processes of an application domain, and then propose portable and usable end-to-end benchmarks. For example, for online service, we generate queries considering query number, rate, distribution and locality to reflect the user characteristics.
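As a rough illustration of how the four query properties (number, rate, distribution, locality) could drive a generator, consider the sketch below. It is not the BigDataBench query generator: the function `generate_queries`, the Zipf-like popularity model, the exponential inter-arrival times, and the recent-query window used to model locality are all assumptions made for this example.

```python
import random
from collections import deque


def generate_queries(terms, num_queries, rate_qps, zipf_s=1.0,
                     locality=0.2, window=10, seed=0):
    """Return a list of (timestamp_seconds, term) pairs.

    num_queries : total number of queries to emit
    rate_qps    : mean arrival rate; inter-arrival times are exponential (assumed)
    zipf_s      : skew of the Zipf-like popularity distribution (assumed)
    locality    : probability of re-issuing one of the last `window` queries (assumed)
    """
    rng = random.Random(seed)
    # Popularity weight of the term at rank r is 1 / r^s.
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, len(terms) + 1)]
    recent = deque(maxlen=window)
    now = 0.0
    queries = []
    for _ in range(num_queries):
        now += rng.expovariate(rate_qps)
        if recent and rng.random() < locality:
            term = rng.choice(list(recent))  # temporal locality: repeat a recent query
        else:
            term = rng.choices(terms, weights=weights, k=1)[0]
        recent.append(term)
        queries.append((now, term))
    return queries


# Example: 5 queries at roughly 100 queries per second over a tiny vocabulary.
for ts, term in generate_queries(["news", "weather", "maps", "music"], 5, 100.0):
    print(f"{ts:.4f}s {term}")
```

A fixed seed keeps the generated trace reproducible, which matters when the same query stream must be replayed against different systems.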
Due to the space limitation, we take the search engine as an example to illustrate how our methodology constructs benchmarks. As shown in Fig. 3, we first abstract a search engine application model, including the online search server (e.g., image search, text search) and offline analytics (e.g., indexing, classification, recommendation). At the algorithm and profiling levels, we identify the data motifs mainly used in the search engine. Then we define the benchmark specification at three levels: 1) choosing a single data motif as a micro benchmark, such as sort or statistics; 2) choosing data motif combinations with different weights as the primary component benchmarks in the search engine, such as pagerank, index, and search server; 3) combining component benchmarks, following the processing logic, to build a search engine as the application benchmark.
3.3 Why a Scalable Benchmarking Methodology
The traditional benchmarking methodology provides a case-by-case solution and creates a new benchmark for each workload. However, this is costly and even impossible due to the complexity and diversity of big data and AI applications. Moreover, the emergence of innovative applications aggravates this issue and incurs great difficulties and development costs in keeping pace.
The NAS benchmark adopts a “paper and pencil” specification, which specifies a set of problems only algorithmically and provides kernel-based benchmarks. However, a kernel-based methodology is insufficient for big data and AI benchmarking, considering the data varieties.
Our benchmarking methodology is a significant departure from the traditional benchmark methodology. First, for the sake of conciseness, representativeness, and benchmarking cost, our methodology captures the common classes of units of computation, so a new workload can easily be composed from them, and hence it is scalable. Second, at an early stage, it is easy to port micro benchmarks to a new computer system or architecture, while at a later stage, component benchmarks and application benchmarks are sufficient for complete performance evaluations. Third, the evaluation results of data motif-based benchmarks are easily reproducible and verifiable, because of the interpretability of data motif behaviors.
4 Unified Big Data and AI benchmark suite
In this section, first, we discuss why we propose a unified benchmark suite, and then we summarize our benchmark decisions in BigDataBench 4.0.
4.1 Why a Unified Benchmark Suite
There are three reasons why we need a unified benchmark suite for both Big Data and AI. First, by specifying benchmarks algorithmically in a ‘paper-and-pencil’ approach, we can state the common requirements of both Big Data and AI. Second, the unified benchmark suite sheds new light on domain-specific hardware and software co-design, in terms of tailoring the system and architecture to the characteristics of data motifs rather than to one or more applications case by case. Third, the unified benchmark suite helps perform apple-to-apple comparisons of different system and architecture implementations.
4.2 Benchmark Decisions
On the basis of the benchmarking methodology, we make benchmark decisions and build BigDataBench 4.0. As there are many emerging big data and AI applications, we take an incremental and iterative approach. We choose five important and emerging application domains according to their share and growth rate. Search engine, social network, and e-commerce, from Internet services, account for 80% of page views and daily visitors. Multimedia processing and bioinformatics are emerging big data domains [47, 48]. Then we build domain-specific benchmarks considering workloads, data, and state-of-the-art techniques.
4.2.1 Workloads Diversity
After investigating the fundamental components in these application domains, we provide a suite of micro benchmarks and component benchmarks. Table II and Table III present the micro and component benchmarks of BigDataBench 4.0, respectively, from the perspectives of workloads, involved data motifs, application domains, workload types, data sets and software stacks. Note that we use SE, SN, EC, MP and BI as abbreviations for the search engine, social network, e-commerce, multimedia processing and bioinformatics domains, respectively. In total, we provide 47 big data and AI benchmarks, each of which has diverse implementations. Because of the page limitation, we do not report the application benchmarks.
|Component Benchmark||Involved Motif||Application Domain||Workload Type||Data Set||Software Stack|
|Xapian Server||Get, Put, Post||SE||Online service||Wikipedia entries||Xapian|
|PageRank||Matrix, Sort, Basic statistics, Graph||SE||Graph analytics||Google web graph||Hadoop, Spark, Flink, GraphLab, MPI|
|Index||Logic, Sort, Basic statistics, Set||SE||Offline analytics||Wikipedia entries||Hadoop, Spark|
|Rolling top words||Sort, Basic statistics||SN||Streaming||Random generate||Spark streaming, JStorm|
|Kmeans||Matrix, Sort, Basic statistics||SE, SN, EC, MP, BI||Offline analytics||Facebook social network||Hadoop, Spark, Flink, MPI|
|Streaming||Random generate||Spark streaming|
|Collaborative Filtering||Graph, Matrix||EC||Offline analytics||Amazon movie review||Hadoop, Spark|
|Naive Bayes||Basic statistics, Sort||SE, SN, EC||Offline analytics||Amazon movie review||Hadoop, Spark, Flink, MPI|
|SIFT||Matrix, Sampling, Transform, Sort||MP||Offline analytics||ImageNet||Hadoop, Spark, MPI|
|LDA||Matrix, Graph, Sampling||SE||Offline analytics||Wikipedia entries||Hadoop, Spark, MPI|
|OrderBy||Set, Sort||EC||Data warehouse||E-commerce transaction||Hive, Spark-SQL, Impala|
|Aggregation||Set, Basic statistics||EC||E-commerce transaction||Hive, Spark-SQL, Impala|
|Project, Filter||Set||EC||Data warehouse||E-commerce transaction||Hive, Spark-SQL, Impala|
|Select, Union||Set||EC||E-commerce transaction||Hive, Spark-SQL, Impala|
|Alexnet / Googlenet||Matrix, Transform, Sampling, Logic, Basic statistics||SN, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|Resnet / VGG16||SN, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|Inception Resnet V2||SN, MP, BI||AI||Cifar, ImageNet||TensorFlow, Caffe, PyTorch|
|DCGAN / WGAN||SN, MP, BI||AI||LSUN||TensorFlow, Caffe, PyTorch|
|GAN||Matrix, Sampling, Logic, Basic statistics||SN, MP, BI||AI||LSUN||TensorFlow, Caffe, PyTorch|
|Seq2Seq||SE, EC, BI||AI||TED Talks||TensorFlow, Caffe, PyTorch|
|Word2vec||Matrix, Basic statistics, Logic||SE, SN, EC||AI||Wikipedia entries, Sogou data||TensorFlow, Caffe, PyTorch|
4.2.2 Representative Real-world Data Set
To cover a full spectrum of data characteristics, we collect 13 representative data sets, covering different data sources (text, table, graph, and image) and structured, unstructured, and semi-structured data types. Further, big data generation tools, including text, table, matrix and graph generators, are provided to suit different cluster scales.
Wikipedia Entry is an unstructured data set, consisting of 4,300,000 English articles.
Amazon Movie Review is a semi-structured data set, consisting of 7,911,684 reviews on 889,176 movies by 253,059 users.
Google Web Graph (Directed graph) is an unstructured data set which contains 875,713 nodes representing web pages and 5,105,039 edges representing the web links.
Facebook Social Graph (Undirected graph)  contains 4,039 nodes, which represent users, and 88,234 edges, which represent friendship between users.
E-commerce Transaction Data is a structured data set from an e-commerce web site, consisting of two tables: ORDER and ITEM.
ProfSearch Person Resumé is a semi-structured data set from a vertical search engine for scientists developed by ourselves, consisting of 278,956 resumés automatically extracted from 20,000,000 web pages of about 200 universities and research institutions.
CIFAR-10 is a tiny image data set, which has 60,000 color images with the dimension of $32 \times 32$. They are classified into 10 classes and each class has 6,000 examples.
LSUN  contains about one million labelled images, classified into 10 scene categories and 20 object categories.
TED Talk  comes from translated TED talks, provided by IWSLT evaluation campaign.
SoGou Data is an unstructured data set, including corpus and search query data from Sogou Lab. The total data size is 4.98 GB.
MNIST  is a database of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples.
MovieLens Dataset  is score data for movies, which has 9,518,231 training examples and 386,835 test examples (semi-structured text).
4.2.3 State-of-the-art Techniques
To perform apple-to-apple comparisons, we provide diverse implementations using state-of-the-art techniques. For offline analytics, we provide Hadoop, Spark, Flink and MPI implementations. For graph analytics, we provide Hadoop, Spark GraphX, Flink Gelly and GraphLab implementations. For AI, we provide TensorFlow, Caffe and PyTorch implementations. For data warehouse, we provide Hive, Spark-SQL and Impala implementations. For NoSQL, we provide MongoDB and HBase implementations. For streaming, we provide Spark streaming and JStorm implementations.
For AI, we identify representative and widely used data motifs in a wide variety of deep learning networks (i.e., convolution, relu, sigmoid, tanh, fully connected, max/avg pooling, cosine/batch normalization and dropout) and then implement each single motif and the motif combinations as micro benchmarks and component benchmarks, respectively. The AI component benchmarks include Alexnet, Googlenet, Resnet, Inception_Resnet V2, VGG16, DCGAN, WGAN, Seq2Seq and Word2vec, which are important state-of-the-art networks in AI.
5 Experiment Setup
In this section, we present our experiment configurations and methodology on characterizing the processor pipeline efficiency of big data and AI, in comparison to traditional benchmarks including SPEC CPU2006, PARSEC, and HPCC.
5.1 Experiment Configurations
We run a series of characterization experiments using BigDataBench 4.0 to obtain insights for architectural studies. From BigDataBench 4.0, we test a majority of micro and component benchmarks with all seven workload types.
|CPU Type||Intel CPU Core|
|Intel® Xeon E5-2620 V3||12 cores@2.40G|
|L1 DCache||L1 ICache||L2 Cache||L3 Cache|
|12 × 32 KB||12 × 32 KB||12 × 256 KB||15 MB|
In our experiments, we deploy a one-master-two-slave cluster for architecture evaluation instead of a larger cluster, for the following reasons. First, a larger cluster may lead to data skew, which results in load imbalance in the cluster and deviations in the experimental results. Second, the deployment and running cost of collecting all hardware events is extremely high, because each benchmark must be run multiple times to ensure the accuracy of the collected data; a larger cluster aggravates this cost. Third, most previous architecture studies [92, 4, 32] also use a small-scale cluster.
In our experiments, each slave node has two Xeon E5-2620 V3 processors equipped with 64 GB memory and 6 TB disk. The detailed hardware configuration of each node is listed in Table IV. The software and compiler configurations are as follows: CentOS 7.2 with Linux kernel 4.1.13, JDK 1.8.0_65, Hadoop 2.7.1, Apache Mahout 0.10.2, Hive 0.9.0, HBase 1.0.1, Scala 2.10.4, Spark 1.5.2, Python 2.7.5, TensorFlow 1.0, GCC 4.8.5. The level of optimization is “-O2”. With regard to the input data, we use 100 GB data for offline analytics, except that matrix multiplication uses 10000*10000 matrix data. Data warehouse uses 100 GB E-commerce transaction data. Graph analytics uses
-vertex graph data. For AI, we use the CIFAR-10 data set and run 10 epochs for Alexnet, Googlenet and Inception_Resnet V2. For Resnet, we run 10000 training steps, because each training step takes a short time. Word2vec uses the text8 Wikipedia corpus. We evaluate HBase with ten million records using the NoSQL read and write benchmarks. Online service processes million search requests. Spark streaming takes thousands of seconds of streaming data as input and processes every 10 seconds of streaming data as a batch.
5.2 Experiment Methodology
We adopt the Top-Down methodology to evaluate the pipeline efficiency of big data and AI, which identifies the bottlenecks in a hierarchical manner. At the uppermost level, it classifies each issued micro-operation into one of four categories: retiring, bad speculation, frontend bound, and backend bound. In total, it has five levels, drilling down into the sub-tree of each category. Modern processors provide hardware performance counters to support micro-architectural level profiling. We use Perf, a Linux profiling tool, to collect the hardware events, referring to the Intel Developer’s Manual and pmu-tools. To obtain more accurate performance counter values, we run each workload three times separately, sampling the events over the whole run time of the workload, and report the average of the three runs.
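For readers unfamiliar with the uppermost level of the Top-Down breakdown, the sketch below shows how the four categories can be derived from raw counter values and averaged over three runs. The formulas follow the commonly published Top-Down level-1 formulation for a 4-wide Intel core; the event names, the pipeline width, and the helper names `top_level_breakdown` and `average_runs` are assumptions for this example and are not quoted from the paper's own scripts. The counter values are made up.

```python
from statistics import mean

# Issue width assumed for the core (4 pipeline slots per cycle).
PIPELINE_WIDTH = 4


def top_level_breakdown(counters):
    """Return level-1 Top-Down fractions from a dict of raw event counts."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_speculation = (
        counters["UOPS_ISSUED.ANY"]
        - counters["UOPS_RETIRED.RETIRE_SLOTS"]
        + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]
    ) / slots
    # Whatever is left of the pipeline slots is attributed to the backend.
    backend_bound = 1.0 - (retiring + frontend_bound + bad_speculation)
    return {
        "retiring": retiring,
        "bad_speculation": bad_speculation,
        "frontend_bound": frontend_bound,
        "backend_bound": backend_bound,
    }


def average_runs(runs):
    """Average the breakdown over several runs (three in the setup above)."""
    breakdowns = [top_level_breakdown(r) for r in runs]
    return {key: mean(b[key] for b in breakdowns) for key in breakdowns[0]}


# Made-up counter values for one run, for illustration only.
example_run = {
    "CPU_CLK_UNHALTED.THREAD": 1000,
    "UOPS_RETIRED.RETIRE_SLOTS": 1200,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800,
    "UOPS_ISSUED.ANY": 1400,
    "INT_MISC.RECOVERY_CYCLES": 50,
}
print(average_runs([example_run, example_run, example_run]))
```

Because backend bound is computed as the remainder, the four fractions always sum to one, which is a quick sanity check on collected counter data.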
5.3 Compared Benchmarks Setup
SPEC CPU2006: We run SPEC CPU2006 with the reference input and report results averaged over two groups, i.e., integer benchmarks (SPECINT) and floating-point benchmarks (SPECFP). The gcc version is 4.8.5.
HPCC: We run all seven benchmarks of HPCC version 1.4, including HPL, STREAM, PTRANS, RandomAccess, DGEMM, FFT, and COMM.
PARSEC: We deploy PARSEC 3.0 Beta Release, a benchmark suite composed of multithreaded programs. We run all 12 benchmarks with the native input data sets and compile them with gcc version 4.8.5.
6 Characterization Results
We perform Top-Down analysis on seven workload types of big data and AI, drilling down through the five levels, and report our characterization results. The seven types and their corresponding software stacks are online service (Xapian), offline analytics (Hadoop, Spark), graph analytics (Hadoop, Spark), data warehouse (Hive, Spark sql), AI (TensorFlow), NoSQL (HBase) and streaming (Spark streaming). For each software stack, we also report the average over all of its benchmarks as an AVG bar (e.g., Hadoop-AVG). In the rest of the paper, we use Inception to denote the Inception_Resnet V2 benchmark. We also run all workloads of the traditional benchmark suites and present their averages, listed as SPECCPU-Int, SPECCPU-Float, PARSEC-AVG and HPCC-AVG, respectively.
In the rest of the paper, we distinguish the software stacks of the same workload type when they behave differently; otherwise, we use the workload type alone to represent all of its software stacks. The average execution performance of each workload type is shown in Fig. 4, from the perspectives of ILP (instruction-level parallelism) and MLP (memory-level parallelism). We use IPC (retired instructions per cycle) to reflect the instruction-level parallelism. MLP is measured as the average number of outstanding memory accesses during cycles in which at least one such access is outstanding, which indicates the dependencies among missing loads. As shown in Fig. 4, different workload types and software stacks of big data exhibit different execution performance. For example, online service has low ILP but high MLP compared to the other types of big data, because it suffers from notable data cache misses and has a low percentage of retired instructions. Both the ILP and the MLP of AI are almost 1.5 times higher than those of big data on average. Several AI micro benchmarks, such as Multiply and Pooling, perform simple computations with few data dependencies, so they generate many concurrent data loads and incur many data cache misses; thus AI has higher MLP than big data. However, compared to the traditional benchmarks, the ILP of AI is lower on average. This is because the AI framework implementation pays little attention to instruction mix balance and memory access locality, while traditional benchmarks such as HPCC provide computation-intensive kernels that are optimized to fully utilize the hardware resources.
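The two metrics can be made concrete with a small sketch. It assumes a hypothetical per-cycle trace of the number of outstanding memory accesses; on real hardware the same ratio is usually approximated from a pair of counters (for example, an occupancy counter divided by a counter of cycles with at least one pending miss), which is an assumption here and not a statement of the exact events we collected.

```python
def ipc(retired_instructions, cycles):
    """Retired instructions per cycle, used as the ILP indicator."""
    return retired_instructions / cycles


def mlp_from_trace(outstanding_per_cycle):
    """Average number of outstanding memory accesses, counted only over
    cycles in which at least one access is outstanding."""
    busy = [n for n in outstanding_per_cycle if n > 0]
    if not busy:
        return 0.0  # no memory access was ever outstanding
    return sum(busy) / len(busy)


if __name__ == "__main__":
    # Hypothetical five-cycle trace: idle, 2 misses, 3 misses, idle, 1 miss.
    print(mlp_from_trace([0, 2, 3, 0, 1]))
    print(ipc(1500, 1000))
```

Excluding idle cycles from the denominator is what distinguishes MLP from a plain average queue depth: a workload with rare but heavily overlapped misses still reports a high MLP.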
The uppermost-level breakdown of all the benchmarks we evaluated is shown in Fig. 5. The retiring percentage of big data is 22.9% on average, lower than that of the traditional benchmarks (39.8% on average), which was also observed by previous work on Hadoop-based benchmarks. In particular, NoSQL, online service and streaming have extremely low retiring percentages, approximately 20%. NoSQL has poor instruction locality, so it generates a large number of instruction cache misses, which greatly impacts performance. Online service and streaming suffer from notable backend stalls. Further, we find that different workload types exhibit diverse pipeline behaviors, indicating that they have different bottlenecks and need specific optimization strategies. Corroborating the previous work, backend bound is the first bottleneck and frontend bound is the second bottleneck for all the big data workloads we investigated. However, the frontend bound percentages vary across workload types and software stacks. For example, eight out of twelve Spark-based benchmarks have low frontend bound percentages, less than 8% on average. NoSQL (about 35%) and data warehouse (about 25%) suffer from higher frontend bound than the other big data workloads (15% on average), mainly because of instruction cache misses. In addition, both software stacks and algorithms have a great impact on pipeline behaviors. For example, frontend bound and bad speculation are 17% and 11% on average for Hadoop-based benchmarks, but 9% and 3% for Spark-based benchmarks. Also, within the same software stack, the frontend bound percentage is 20% for Spark grep but 6% for Spark FFT.
AI has a higher retiring percentage (35% on average) than big data, approximately equal to that of the traditional benchmarks (39.8%). Backend bound is the first bottleneck for AI; however, frontend bound is not always the second bottleneck. For example, the percentages of frontend bound and bad speculation for Alexnet are 11% and 14%, respectively. On average, at the uppermost level, the percentages of frontend bound (both about 9%) and backend bound (49.7% vs. 45.1%) of AI are close to those of the traditional benchmarks, while their bottlenecks at deeper levels are different. Neural network structures have a great impact on pipeline behaviors. For example, frontend bound and bad speculation are each about 1% for VGG16, while each is more than 10% for Alexnet. This is because VGG16 has many more consecutive convolution computations than Alexnet.
Deeper analysis for each category is performed in the following subsections.
6.1 Retiring
A pipeline slot represents the hardware resources needed to process one micro operation. Retiring is the fraction of pipeline slots utilized by useful work. Increasing the retiring percentage often increases the IPC metric and thus improves execution efficiency. Retiring is composed of retiring regular uops and retiring uops fetched by the Microcode Sequencer (MS) unit. The MS unit is used to decode the CISC instructions that are not supported by the default decoders. However, switches to the MS unit incur penalties and hurt performance. We find that the number of uops decoded by the MS unit for big data and AI is about 10 times larger than that for the traditional benchmarks. This result indicates that big data and AI have more CISC instructions needing microcode assists, and thus may suffer from more switch stalls.
6.2 Bad Speculation
Bad speculation is the fraction of slots wasted due to incorrect speculation, including branch mispredictions and machine clears. From our experimental results, we find that machine clears account for only about 0.1% for all benchmarks. Bad speculation therefore occurs mainly due to branch mispredictions, whose percentages are nearly equal to the Bad_Speculation values in Fig. 5. Overall, big data and AI have a small fraction of bad speculation: about 10% for Hive and Hadoop benchmarks, and 3% for the other types and software stacks we evaluated. For AI, different neural networks have different percentages of bad speculation, with 6% on average.
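A minimal sketch of how bad speculation can be attributed to its two sub-categories follows. It assumes the split is proportional to the counts of retired mispredicted branches and machine clears (for example, the BR_MISP_RETIRED.ALL_BRANCHES and MACHINE_CLEARS.COUNT events); the function name and the input values are hypothetical.

```python
def split_bad_speculation(bad_speculation, mispredicted_branches, machine_clears):
    """Attribute the bad-speculation fraction to branch mispredictions and
    machine clears in proportion to how often each event occurred."""
    events = mispredicted_branches + machine_clears
    if events == 0:
        return {"branch_mispredicts": 0.0, "machine_clears": 0.0}
    mispredict_share = mispredicted_branches / events
    branch = bad_speculation * mispredict_share
    return {"branch_mispredicts": branch,
            "machine_clears": bad_speculation - branch}


if __name__ == "__main__":
    # Hypothetical counts: 10% bad speculation, machine clears are rare.
    print(split_bad_speculation(0.10, 990, 10))
```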
6.3 Frontend Bound
Frontend bound occurs when the frontend undersupplies the backend in a cycle. It is composed of two categories: frontend latency bound (i.e., the frontend delivers no uops) and frontend bandwidth bound (i.e., the frontend delivers a non-optimal number of uops). Fig. 6 presents the frontend bound breakdown. Note that the height of the black-bordered box indicates the percentage of frontend latency bound, and the length of the bar above the black-bordered box indicates the percentage of frontend bandwidth bound. Taking Hadoop-Sort as an example, its frontend bound is 12%, with 7% for latency bound and 5% for bandwidth bound. From Fig. 6 we find that big data has more severe frontend bound than the traditional benchmarks, especially frontend latency bound, which was also found by previous work [4, 31, 32]. However, the frontend bound percentage varies across workload types. Big data suffers from more frontend bound for two reasons. First, the software stack changes the programming style compared to the original algorithm implementations, for example through the map/reduce interfaces in Hadoop. Second, the software stack itself executes many more instructions, so the frontend bears the pressure of fetching and decoding them. AI benchmarks have different frontend bound percentages depending on their layers and the proportions of their computation kernels. Frontend latency bound and bandwidth bound contribute to frontend bound equally. Different from previous work [4, 31, 32], which mainly attributed frontend inefficiencies to high instruction miss ratios and the long latency introduced by caches, we thoroughly drill down on the sub-trees of frontend latency bound and frontend bandwidth bound.
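The latency/bandwidth split can be sketched as follows, assuming a four-wide core and the commonly published formulation in which latency bound is derived from the cycles that deliver zero uops and bandwidth bound is the remainder. The counter values are chosen only to reproduce the Hadoop-Sort percentages quoted above; they are not raw measurements, and `split_frontend_bound` is a hypothetical helper.

```python
PIPELINE_WIDTH = 4  # assumption: four issue slots per cycle


def split_frontend_bound(uops_not_delivered, cycles_zero_uops_delivered, cycles):
    """Split frontend bound into latency bound and bandwidth bound."""
    slots = PIPELINE_WIDTH * cycles
    frontend_bound = uops_not_delivered / slots
    # A cycle that delivers no uops wastes all PIPELINE_WIDTH slots.
    latency = PIPELINE_WIDTH * cycles_zero_uops_delivered / slots
    bandwidth = frontend_bound - latency
    return {"frontend_bound": frontend_bound,
            "latency_bound": latency,
            "bandwidth_bound": bandwidth}


if __name__ == "__main__":
    # Illustrative inputs yielding 12% = 7% latency + 5% bandwidth.
    print(split_frontend_bound(uops_not_delivered=480,
                               cycles_zero_uops_delivered=70,
                               cycles=1000))
```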
6.3.1 Frontend Latency Bound
Frontend latency bound indicates that the frontend delivers no uops to the backend, which may occur for six reasons: ICache misses, ITLB misses, branch resteers, DSB (decoded stream buffer) switches, LCP (length changing prefixes), and MS (microcode sequencer) switches. ICache misses are stalls due to instruction cache misses. ITLB misses are stalls due to instruction TLB misses. Branch resteers are stalls due to frontend delay when fetching instructions from the correct path, which may occur because of branch mispredictions. DSB switches are stalls due to switches from the DSB to the MITE (Micro-instruction Translation Engine) pipeline. The DSB is a decoded ICache that stores uops that have already been decoded, so as to avoid the penalties of the legacy decode pipeline, which is also called MITE. The DSB switches metric measures the penalty of switching from DSB to MITE. LCP denotes stalls due to length changing prefixes, which can be avoided by using compiler flags. MS switches are stalls due to switching uop delivery to the microcode sequencer. As mentioned in Subsection 6.1, retiring includes retiring regular uops and uops fetched by the MS unit. Generally, uops come from the DSB or the MITE pipeline. CISC instructions that cannot be decoded by the default decoders must be handled by the MS unit. However, frequent MS switches hurt performance, so the MS switches metric measures this penalty.
The breakdown within the black-bordered box in Fig. 6 shows the proportions of the above six reasons for frontend latency bound. We find that for big data except NoSQL, branch resteers, ICache misses and MS switches are the three main reasons for frontend latency bound, while for NoSQL the main reasons are ICache misses, ITLB misses and MS switches. For AI, the main reason for frontend latency bound is branch resteers and the second reason is MS switches, which is consistent with the much larger number of uops that big data and AI retire from the MS unit.
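The ranking of latency causes described above can be sketched as follows. The sketch assumes that a stall-cycle estimate is already available for each of the six causes; how each estimate is derived from raw events is microarchitecture-specific and is not shown. The function name and the numbers are hypothetical.

```python
def rank_latency_causes(stall_cycles, top=3):
    """Return the `top` causes of frontend latency bound together with
    their share of all frontend-latency stall cycles."""
    total = sum(stall_cycles.values())
    if total == 0:
        return []
    ranked = sorted(stall_cycles.items(), key=lambda item: item[1], reverse=True)
    return [(cause, cycles / total) for cause, cycles in ranked[:top]]


if __name__ == "__main__":
    # Hypothetical stall-cycle estimates for one benchmark.
    example = {
        "ICache misses": 300,
        "ITLB misses": 50,
        "Branch resteers": 400,
        "DSB switches": 30,
        "LCP": 20,
        "MS switches": 200,
    }
    for cause, share in rank_latency_causes(example):
        print(f"{cause}: {share:.1%}")
```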
6.3.2 Frontend Bandwidth Bound
Frontend bandwidth bound indicates that the number of uops delivered to the backend per cycle is less than the theoretical maximum, which is four for the Haswell architecture. Frontend bandwidth bound is mainly due to three reasons: MITE, DSB and LSD. MITE denotes stalls due to the MITE pipeline, such as inefficiency of the instruction decoders. DSB denotes stalls due to the DSB fetch pipeline, such as inefficient utilization of the DSB. LSD denotes stalls due to the loop stream detector unit, which generally accounts for only a small share.
The breakdown of frontend bandwidth bound in Fig. 6 shows the proportions of the above three reasons. DSB and MITE are the two main reasons for nearly all listed benchmarks. However, the first frontend bandwidth bottleneck differs across workload types. For offline analytics and graph analytics, it is DSB. For data warehouse, NoSQL, online service and streaming, it is MITE. For AI, it is DSB, except for the Word2Vec benchmark, where it is MITE. To reduce the frontend bandwidth bound and improve the performance of big data and AI, DSB utilization and MITE pipeline efficiency need to be optimized.
6.4 Backend Bound
Backend bound occurs when the backend lacks the resources required to process new uops. It can be divided into backend core bound and backend memory bound. Backend core bound refers to non-memory core issues, such as a lack of out-of-order resources. Backend memory bound refers to stalls due to load or store instructions.
Fig. 7 shows the backend bound breakdown of all benchmarks. The black-bordered boxes indicate the percentage of backend core bound slots, and the green boxes above them indicate the percentage of backend memory bound slots. The first bottleneck of big data and AI is backend bound. Previous work found that core bound and memory bound contribute nearly equally to backend bound. However, we find that memory bound is more severe than core bound for all big data and AI benchmarks, except that online service has nearly equal core bound and memory bound.
6.4.1 Backend Core Bound
Backend core bound can be further split into Divider and Port utilization. Divider is the fraction of cycles in which the Divider unit is in use; division has longer latency than other integer or floating-point operations. Port utilization denotes the stalls due to low utilization of the execution ports. For example, Haswell has eight execution ports, and each port can execute specific uops (four ports for computation and four ports for load/store operations). These execution ports may be under-utilized in a cycle due to data dependencies between instructions or non-divider-related resource contention.
The breakdown within the black-bordered box in Fig. 7 shows the proportions of Divider and Port utilization within the backend core bound. Divider accounts for a small proportion, except for some computation-intensive workloads such as Hadoop Kmeans. From Fig. 7 we find that the utilization of the execution ports is low for big data and AI, further indicating that the instruction mix balance needs to be improved.
6.4.2 Backend Memory Bound
Backend memory bound can be further divided into L1 Bound, L2 Bound, L3 Bound, DRAM Bound, and Store Bound, each of which denotes stalls related to one part of the memory hierarchy.
Fig. 8 shows the normalized backend memory bound breakdown. Note that L2 Bound is negative due to a PMU erratum on the L1 prefetchers. We find that the main reason for backend memory bound is DRAM Bound for big data and AI, except that online service suffers more from Store Bound than from DRAM Bound. Different from the traditional benchmarks, big data and AI also suffer from more stalls due to L1 Bound, L3 Bound and Store Bound.
DRAM Bound is the first backend memory bound bottleneck for most benchmarks, so we further analyze the two factors behind it: DRAM latency and DRAM bandwidth. DRAM latency denotes the stalls due to the latency of dynamic random access memory; it can be further classified into stalls due to loads from local memory (Local_DRAM), remote memory (Remote_DRAM) and remote cache (Remote_Cache). DRAM bandwidth denotes the stalls due to memory bandwidth limitations. Fig. 9 presents the DRAM bound breakdown, including DRAM bandwidth bound and the three kinds of DRAM latency bound: local DRAM, remote DRAM and remote cache. Different from the traditional benchmarks, the first DRAM bound bottleneck of big data and AI is DRAM latency bound. AI suffers from more DRAM bandwidth bound than big data. In terms of DRAM latency bound, the main reason for big data is local DRAM latency on average, except that Spark sql suffers more from remote cache latency. For AI, the main reason is remote cache latency. Remote cache and remote DRAM latencies are mainly due to non-optimal NUMA allocations. Processor affinity and NUMA-friendly data placement may reduce the latency and improve the performance.
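As a minimal illustration of processor affinity, the sketch below pins a process to the cores of one NUMA node on Linux. It assumes the standard sysfs file `/sys/devices/system/node/node<N>/cpulist` is present and that the kernel's default local-allocation policy then places newly touched pages on that node; neither assumption was part of our experimental setup, and the helper names are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-5,12-17' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node, pid=0):
    """Restrict process `pid` (0 means the caller) to the cores of `node`.

    Linux-only: relies on sysfs and os.sched_setaffinity.
    """
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(pid, cpus)
    return cpus

# Example (not executed here): pin_to_numa_node(0) before allocating data,
# so that the threads and the pages they first touch share one node.
```

Pinning only controls where threads run; data that was allocated before pinning, or by threads on another node, can still be remote.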
6.5 Discussion on AI Benchmarks
To explore the performance of AI benchmarks considering different hardware architectures and running configurations, we first characterize them on GPUs and then evaluate the impacts of iteration numbers on architecture behaviors.
6.5.1 AI Benchmarks on GPUs
Since a significant portion of computationally intensive AI tasks is performed on GPUs, we further evaluate the AI benchmarks on GPUs, using the PAPI CUDA Component and the CUDA profiling tool nvprof. The CPU platform is the same as in Section 5.1. The GPU platform is an NVIDIA Tesla K80 with two Tesla GK210 GPUs. Each GPU has 13 streaming multiprocessors (SMs), and each SM includes 192 cores. The total memory is 24 GB GDDR5.
SM efficiency and IPC are two important metrics for evaluating execution performance on GPUs. SM efficiency indicates the percentage of time during which an SM has one or more active warps. IPC is the number of instructions executed per cycle. We evaluate the AI benchmarks on GPUs; as shown in Fig. 10(a) and (b), different neural network structures exhibit different performance on GPUs. Resnet and Word2Vec have lower IPC and SM efficiency than the other AI benchmarks. We further evaluate their memory access and computation patterns to explore why their performance differs. Fig. 10(c) and (d) show their global memory load and store throughput, respectively. Fig. 10(e) and (f) present the average number of instructions executed by each warp and the average number of warps that are eligible to be issued per cycle, respectively. We find that Resnet, Inception and Word2Vec have higher memory access requirements, so they suffer from more stalls due to data loads and stores, and thus have insufficient instructions or warps ready to be executed.
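The SM efficiency definition can be written down as a short sketch. It assumes per-SM counts of cycles with at least one active warp are available and averages the resulting percentages over all SMs; the cycle counts are hypothetical, and the core count simply restates the 13 SMs with 192 cores each given above.

```python
SMS_PER_GPU = 13    # from the platform description
CORES_PER_SM = 192  # from the platform description


def sm_efficiency(active_cycles_per_sm, elapsed_cycles):
    """Percentage of time with at least one active warp, averaged over SMs."""
    shares = [active / elapsed_cycles for active in active_cycles_per_sm]
    return 100.0 * sum(shares) / len(shares)


def cores_per_gpu():
    """Number of cores on one GPU of the board."""
    return SMS_PER_GPU * CORES_PER_SM


if __name__ == "__main__":
    # Hypothetical: one SM busy half of the time, one busy all of the time.
    print(sm_efficiency([500, 1000], elapsed_cycles=1000))
    print(cores_per_gpu())
```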
To explore why they have different computation and memory access patterns, we analyze the runtime breakdown of the kernels within each AI benchmark. As shown in Fig. 11, we classify the most time-consuming kernels into five categories: GEMM, Convolution, Gradient, Matrix Transform, and Data load/store. GEMM represents the matrix multiplication kernels, including cgemm kernels and sgemm kernels. Convolution denotes the convolution-related kernels, including convolve, fft and winograd kernels. Gradient denotes the gradient computations, including the dgrad engine and the wgrad engine; for example, the backward kernel of convolution belongs to this category. Matrix Transform includes matrix transpose, pooling and normalization. Data load/store includes the kernels that perform data load and store operations, such as memcpy and tensor evaluator. From Fig. 11, we find that Alexnet, Googlenet and VGG16 spend the majority of their runtime on computation kernels and involve few data loads and stores, so they have the highest IPC and SM efficiency. However, Resnet, Inception and Word2Vec spend too much time on data movement. Resnet and Inception use batch normalization to speed up deep neural networks, while Alexnet and Googlenet use quite a few local response normalizations (LRN). Batch normalization calls assign_moving_avg to update variables, which involves many more data loads and stores than LRN. So Resnet and Inception spend much time on data movement, which further impacts performance. In addition, Resnet has deeper layers and mainly uses the winograd algorithm to compute convolution, while the others either use fft and matrix multiplication to compute convolution or have simpler structures, so Resnet spends less time on GEMM kernels than the others.
In conclusion, memory access patterns greatly impact performance on GPUs. The performance can be improved at two levels. At the GPU architecture design level, the stalls due to memory accesses need to be reduced. At the application level, the implementation of kernels and frameworks needs to consider more efficient memory access, such as better locality.
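The five-way kernel classification can be sketched as a keyword match on kernel names followed by a runtime aggregation. The keyword lists below are assumptions derived from the kernel families named in the text, the "Other" bucket is added for unmatched names, and the kernel names and times in the example are hypothetical rather than profiler output.

```python
# Checked in order; Gradient comes first so that backward convolution
# kernels are not counted as forward Convolution.
CATEGORY_KEYWORDS = [
    ("Gradient", ("dgrad", "wgrad")),
    ("GEMM", ("cgemm", "sgemm")),
    ("Convolution", ("convolve", "fft", "winograd")),
    ("Matrix Transform", ("transpose", "pooling", "normalization")),
    ("Data load/store", ("memcpy", "evaluator")),
]


def classify_kernel(name):
    """Map a kernel name to one of the five categories, or 'Other'."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


def runtime_breakdown(kernel_times):
    """Return each category's share of the total kernel runtime."""
    totals = {}
    for name, seconds in kernel_times:
        category = classify_kernel(name)
        totals[category] = totals.get(category, 0.0) + seconds
    overall = sum(totals.values())
    return {category: t / overall for category, t in totals.items()}


if __name__ == "__main__":
    # Hypothetical kernel names and runtimes in seconds.
    example = [("sgemm_largek", 30.0), ("fft2d_r2c_32x32", 10.0),
               ("wgrad_alg0_engine", 20.0), ("memcpy_htod", 40.0)]
    print(runtime_breakdown(example))
```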
6.5.2 Iteration Impact on Architecture Behaviors
AI benchmarks typically need hundreds of iterations to obtain higher prediction precision and lower training loss. However, for architecture research, AI benchmarks are too time-consuming even when running on GPUs. To evaluate the impact of the number of iterations on the microarchitectural characteristics of AI, we run five neural networks with three different iteration counts: Small, Medium and Large. For the Alexnet, Googlenet, Inception and VGG16 networks, we run 1 (Small), 10 (Medium) and 100 (Large) epochs, respectively. For the Resnet network, we run 2000 (Small), 10000 (Medium) and 50000 (Large) training steps, respectively. We use PCA [102] to measure the similarity, using all fifty micro-architectural metrics we collect according to the Top-Down method. Fig. 12 presents the linkage distance of all AI benchmarks, where a smaller distance means a higher similarity. We find that runs of the same neural network with different iteration numbers are clustered together and have shorter distances, which means that a small number of iterations is enough for micro-architectural evaluation of AI benchmarks.
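The notion of linkage distance can be illustrated with a simplified sketch. It z-score normalizes each metric and then performs single-linkage agglomerative clustering on Euclidean distances; the PCA projection step is deliberately omitted to keep the example dependency-free, so this is not the exact procedure behind Fig. 12. The benchmark labels and the two metric columns are hypothetical.

```python
import math


def zscore_columns(rows):
    """Normalize every metric (column) to zero mean and unit variance."""
    normalized = []
    for column in zip(*rows):
        mu = sum(column) / len(column)
        sd = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
        normalized.append([(x - mu) / sd if sd > 0 else 0.0 for x in column])
    return [list(row) for row in zip(*normalized)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(names, rows):
    """Merge the two closest clusters repeatedly and return the list of
    (sorted member names, linkage distance) in merge order."""
    points = zscore_columns(rows)
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        members = clusters[i] + clusters[j]
        merges.append((sorted(names[k] for k in members), d))
        clusters[i] = members
        del clusters[j]
    return merges


if __name__ == "__main__":
    # Hypothetical metric vectors (e.g., IPC and a backend bound fraction).
    names = ["Alexnet-Small", "Alexnet-Large", "Resnet-Small", "Resnet-Large"]
    rows = [[1.00, 0.30], [1.10, 0.31], [0.50, 0.60], [0.52, 0.62]]
    for members, distance in single_linkage(names, rows):
        print(members, round(distance, 3))
```

In this toy input the two runs of each network merge first, and the two networks merge last at a much larger distance, which mirrors the clustering pattern described above.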
In this paper, we propose a data motif-based scalable benchmarking methodology to build micro, component, and end-to-end application benchmarks. Following this methodology, we set up a unified open source big data and AI benchmark suite – BigDataBench 4.0. Finally, we comprehensively characterize BigDataBench 4.0 on CPUs and GPUs, respectively.
-  J. Hennessy and D. Patterson, “A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development.” 2018.
-  W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, F. Tang, B. Xie, C. Zheng, X. Wen, X. He, H. Ye, and R. Ren, “Data motifs: A lens towards fully understanding big data and ai workloads,” Parallel Architectures and Compilation Techniques (PACT), 2018 27th International Conference on, 2018.
-  L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–108, 2009.
-  M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study of emerging workloads on modern hardware,” ASPLOS, 2012.
-  N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017, pp. 1–12.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
-  A. Ghazal, M. Hu, T. Rabl, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen, “Bigbench: Towards an industry standard benchmark for big data analytics,” in SIGMOD 2013, 2013.
-  P. Wang, D. Meng, J. Han, J. Zhan, B. Tu, X. Shi, and L. Wan, “Transformer: a new paradigm for building data-parallel programming models,” Micro, IEEE, vol. 30, no. 4, pp. 55–64, 2010.
-  L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang et al., “Bigdatabench: A big data benchmark suite from internet services,” IEEE International Symposium On High Performance Computer Architecture (HPCA), 2014.
-  Z. Jia, C. Xue, G. Chen, J. Zhan, L. Zhang, Y. Lin, and P. Hofstee, “Auto-tuning spark big data workloads on power8: Prediction-based dynamic smt threading,” in Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016, pp. 387–400.
-  C. Luo, J. Zhan, L. Wang, and Q. Yang, “Cosine normalization: Using cosine similarity instead of dot product in neural networks,” arXiv preprint arXiv:1702.05870, 2017.
-  X.-X. Zhou, W.-F. Zeng, H. Chi, C. Luo, C. Liu, J. Zhan, S.-M. He, and Z. Zhang, “pdeep: Predicting ms/ms spectra of peptides with deep learning,” Analytical chemistry, vol. 89, no. 23, pp. 12 690–12 697, 2017.
-  T. Rabl, M. Frank, M. Danisch, H.-A. Jacobsen, and B. Gowda, “The vision of bigbench 2.0,” in Proceedings of the Fourth Workshop on Data analytics in the Cloud. ACM, 2015, p. 3.
-  S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, “The hibench benchmark suite: Characterization of the mapreduce-based data analysis,” in Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on. IEEE, 2010, pp. 41–51.
-  A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, “A comparison of approaches to large-scale data analysis,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009, pp. 165–178.
-  B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with ycsb,” in Proceedings of the 1st ACM symposium on Cloud computing, ser. SoCC ’10, 2010.
-  T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan, “Linkbench: a database benchmark based on the facebook social graph,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013, pp. 1185–1196.
-  https://amplab.cs.berkeley.edu/benchmark/.
-  R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: reference workloads for modern deep learning methods,” in Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.
-  “Deepbench,” https://svail.github.io/DeepBench/.
-  T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “Benchnn: On the broad potential application scope of hardware neural network accelerators,” in Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012, pp. 36–45.
-  S. Dong and D. Kaeli, “Dnnmark: A deep neural network benchmark suite for gpus,” in Proceedings of the General Purpose GPUs. ACM, 2017, pp. 63–72.
-  C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-to-end deep learning benchmark and competition,” Training, vol. 100, no. 101, p. 102, 2017.
-  J. Gray, Benchmark handbook: for database and transaction processing systems. Morgan Kaufmann Publishers Inc., 1992.
-  B. Xie, J. Zhan, X. Liu, W. Gao, Z. Jia, X. He, and L. Zhang, “Cvr: Efficient vectorization of spmv on x86 processors,” in 2018 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2018.
-  D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber et al., “The nas parallel benchmarks,” The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.
-  K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape of parallel computing research: A view from berkeley,” Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
-  Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, “Bdgs: A scalable big data generator suite in big data benchmarking,” arXiv preprint arXiv:1401.5465, 2014.
-  A. Yasin, “A top-down method for performance analysis and counters architecture,” in Performance Analysis of Systems and Software (ISPASS). IEEE, 2014, pp. 35–44.
-  S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling a warehouse-scale computer,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 158–169.
-  Z. Jia, J. Zhan, L. Wang, C. Luo, W. Gao, Y. Jin, R. Han, and L. Zhang, “Understanding big data analytics workloads on modern processors,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1797–1810, 2017.
-  “Tpc-ds benchmark,” http://www.tpc.org/tpcds/.
-  J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
-  J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge, R. G. Dreslinski, J. Mars, and L. Tang, “Djinn and tonic: Dnn as a service and its implications for future warehouse scale computers,” in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM, 2015, pp. 27–40.
-  A. Thomas and A. Kumar, “A comparative evaluation of systems for scalable linear algebra-based analytics.”
-  P. Colella, “Defining software requirements for scientific computing,” 2004.
-  Y. Chen, F. Raab, and R. Katz, “From tpc-c to big data benchmarks: A functional workload model,” in Specifying Big Data Benchmarks. Springer, 2014, pp. 28–43.
-  P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi, “The hpc challenge (hpcc) benchmark suite,” in Proceedings of the 2006 ACM/IEEE conference on Supercomputing. Citeseer, 2006, p. 213.
-  N. Council, “Frontiers in massive data analysis.” The National Academies Press Washington, DC, 2013.
-  G. Fox, J. Qiu, S. Jha, S. Ekanayake, and S. Kamburugamuve, “Big data, simulations and hpc convergence,” in Big Data Benchmarking. Springer, 2015, pp. 3–17.
-  S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 393–405.
-  E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
-  D. Guinard, V. Trifa, and E. Wilde, “A resource oriented architecture for the web of things,” in Internet of Things (IOT). IEEE, 2010.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  “Alexa topsites,” http://www.alexa.com/topsites/global;0.
-  “Multimedia,” http://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-DKB-v2.pdf.
-  “Bioinformatics,” http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#dbgrowth-graph.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  “wikipedia,” http://en.wikipedia.org.
-  http://snap.stanford.edu/data/web-Amazon.html.
-  “google web graph,” http://snap.stanford.edu/data/web-Google.html.
-  http://snap.stanford.edu/data/egonets-Facebook.html.
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” arXiv preprint arXiv:1409.0575, 2014.
-  F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
-  M. Cettolo, C. Girardi, and M. Federico, “Wit3: Web inventory of transcribed and translated talks,” in Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), vol. 261, 2012, p. 268.
-  “Sogou labs,” http://www.sogou.com/labs/.
-  “mnist,” http://yann.lecun.com/exdb/mnist/.
-  F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
-  “Hadoop,” http://hadoop.apache.org/.
-  M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010.
-  P. Mika, “Flink: Semantic web technology for the extraction and analysis of social networks,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 3, no. 2, pp. 211–223, 2005.
-  “Mpich,” https://www.mpich.org.
-  R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, “Graphx: A resilient distributed graph system on spark,” in First International Workshop on Graph Data Management Experiences and Systems. ACM, 2013, p. 2.
-  “Flink gelly,” https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html.
-  Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
-  “Pytorch,” http://pytorch.org.
-  A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, 2009.
-  “Spark sql,” https://spark.apache.org/sql/.
-  M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder, “Impala: A modern, open-source sql engine for hadoop,” in Proceedings of the 7th Biennial Conference on Innovative Data Systems Research, 2015.
-  K. Chodorow, MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O’Reilly Media, Inc., 2013.
-  L. George, HBase: the definitive guide: random access to your planet-size data. O’Reilly Media, Inc., 2011.
-  “Spark streaming,” https://spark.apache.org/streaming/.
-  “Jstorm,” https://github.com/alibaba/jstorm.
-  http://mahout.apache.org.
-  R. Gu, Y. Tang, Z. Wang, S. Wang, X. Yin, C. Yuan, and Y. Huang, “Efficient large scale distributed matrix computation with spark,” in Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015, pp. 2327–2336.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, 2017, pp. 4278–4284.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  http://e.huawei.com/en/products/cloud-computing-dc/cloud-computing/bigdata/fusioninsight.
-  http://www.dca.org.cn/content/100190.html.
-  “Vtune,” https://software.intel.com/en-us/vtune-amplifier-help-allow-multiple-runs-or-multiplex-events.
-  A. Yasin, Y. Ben-Asher, and A. Mendelson, “Deep-dive analysis of the data analytics workload in cloudsuite,” in Workload Characterization (IISWC). IEEE, 2014, pp. 202–211.
-  “Perf,” https://perf.wiki.kernel.org/index.php/Main_Page.
-  “Pmu tools,” https://github.com/andikleen/pmu-tools.
-  C. Spec, “Spec cpu2006,” Retrieved February, vol. 23, p. 2015, 2006.
-  C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 72–81.
-  Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture optimizations for exploiting memory-level parallelism,” in ACM SIGARCH Computer Architecture News, vol. 32, no. 2. IEEE Computer Society, 2004, p. 76.
-  https://software.intel.com/en-us/vtune-amplifier-help-pipeline-slots.
-  “Papi cuda component,” http://icl.cs.utk.edu/papi/.
-  https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
-  I. T. Jolliffe, “Principal component analysis and factor analysis,” in Principal component analysis. Springer, 1986, pp. 115–128.
-  S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.