Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads

The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistics. We implement the eight data motifs on different software stacks as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench 4.0 (publicly available from http://prof.ict.ac.cn/BigDataBench), and perform a comprehensive characterization of those data motifs from the perspectives of data sizes, types, sources, and patterns, as a lens towards fully understanding big data and AI workloads. We believe the eight data motifs are promising abstractions and tools not only for big data and AI benchmarking, but also for domain-specific hardware and software co-design.

1. Introduction

The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. First, modern big data and AI workloads expand and change very fast, and it is impossible to create a new benchmark or proxy for every possible workload. Second, several fundamental changes, i.e., the end of Dennard scaling, the ending of Moore's Law, and Amdahl's Law and its implications for ending the "easy" multicore era, indicate that the only hardware-centric path left is Domain-specific Architectures (Hennessy and Patterson, 2018). To achieve higher efficiency, we need to tailor the architecture to the characteristics of a domain of applications (Hennessy and Patterson, 2018). However, the first step is to understand big data and AI workloads. Third, whether early in the architecture design process or later in the system evaluation, it is time-consuming to run a comprehensive benchmark suite. The complex software stacks of modern workloads aggravate this issue. Modern big data and AI benchmark suites (Wang et al., 2014; Ferdman et al., 2012) are too large to run on simulators, which challenges time-constrained simulation and can even make it impossible. Fourth, overly complex workloads raise challenges in both the reproducibility and the interpretability of performance data in benchmarking systems.

Identifying abstractions of time-consuming units of computation is an important step toward fully understanding complex workloads. Much previous work (Codd, 1970; Chen et al., 2014b; Colella, 2004; Asanovic et al., 2006; Shah et al., 2010) has illustrated the importance of abstracting workloads in their corresponding domains. TPC-C (Chen et al., 2014b) is a successful benchmark built on frequently-appearing operations in the OLTP domain. HPCC (Luszczek et al., 2006) adopts a similar method to design a benchmark suite for high performance computing. The National Research Council proposes seven major tasks in massive data analysis (Council, 2013), but these are macroscopic definitions of problems from the perspective of mathematics. Unfortunately, to the best of our knowledge, no previous work has identified the time-consuming classes of unit of computation in big data and AI workloads.

Identifying abstractions of time-consuming units of computation is also an important step toward domain-specific hardware and software co-design. Straightforwardly, we can tailor the architecture to the characteristics of an application, several applications, or even a domain of applications (Hennessy and Patterson, 2018). The past has witnessed the success of neural network processors for machine learning (Jouppi et al., 2017; Chen et al., 2014a), GPUs for graphics and virtual reality (Owens et al., 2008), and programmable network switches and interfaces (Hennessy and Patterson, 2018). Moreover, if we can identify abstractions of time-consuming units of computation in big data and AI workloads and design domain-specific hardware and software systems for them, our target will be much more general-purpose. Meanwhile, optimizing the most time-consuming units of computation, rather than many algorithms case by case on different hardware or software systems, will be much more efficient.

In this paper, we propose a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of unit of computation performed on different initial or intermediate data inputs, each of which captures the common requirements while being reasonably divorced from individual implementations (Asanovic et al., 2006). We call this abstraction a data motif. Significantly different from traditional kernels, a data motif's behaviors are affected by the sizes, patterns, types, and sources of different data inputs. Moreover, it reflects not only computation patterns and memory access patterns, but also disk and network I/O patterns.

After thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia and bioinformatics), we identify eight data motifs that take up most of the run time: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistics. We found that combinations of one or more data motifs, with different weights in terms of runtime, can describe most of the big data and AI workloads we investigated (Gao et al., 2018a). Considering various data inputs (text, sequence, graph, matrix and image data) with different data types and distributions, we implement the eight data motifs on different software stacks, including Hadoop (had, 2018), Spark (Zaharia et al., 2010), TensorFlow (Abadi et al., 2016) and POSIX-thread (Pthread) (Barney, 2009). For big data, the implemented data motifs include sort (Sort), wordcount (Statistics), grep (Set), MD5 hash (Logic), matrix multiplication (Matrix), random sampling (Sampling), graph traversal (Graph) and FFT transformation (Transform), while for AI, we implement 2-dimensional convolution (Transform), max pooling (Sampling), average pooling (Sampling), ReLU activation (Logic), sigmoid activation (Matrix), tanh activation (Matrix), fully connected (Matrix), and element-wise multiplication (Matrix), which are frequently-used computations in neural network modelling. We release the implemented data motifs as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench. In the rest of the paper, we use "big data motifs" to indicate the motif implementations for big data, and "AI motifs" to indicate the motif implementations for AI.

Just like relational algebra in databases, the data motifs are promising fundamental concepts and tools for benchmarking, designing, measuring, and optimizing big data and AI systems. Based on the data motifs, we build the fourth version of BigDataBench (Gao et al., 2018b), which includes micro benchmarks, each of which is a data motif; component benchmarks, each of which is a combination of several data motifs; and end-to-end application benchmarks, each of which is a combination of component benchmarks. We also build proxy benchmarks (Gao et al., 2018a) for big data and AI workloads, which achieve a speedup of up to 1000 times in terms of runtime and a micro-architectural data accuracy of more than 90%. In this paper, as the first step, we call attention to performing a comprehensive characterization of those data motifs from the perspectives of data sizes, types, sources, and patterns, as a lens towards fully understanding big data and AI workloads. On a typical state-of-practice processor, the Intel Xeon E5-2620 V3, we comprehensively characterize all data motif implementations and identify their bottlenecks.

Our contributions are five-fold as follows:

  • We identify eight data motifs through profiling a wide variety of big data and AI workloads.

  • We provide diverse data motif implementations on the software stacks of Hadoop, Spark, TensorFlow, and Pthread.

  • From the system and micro-architecture perspectives, we comprehensively characterize the behaviors of data motifs and identify their bottlenecks. We find that these data motifs cover a wide variety of performance space, from the perspectives of system and micro-architecture behaviors. Moreover, the behavior of each motif is not only influenced by its algorithm, but also largely affected by the type, source, size, and pattern of input data.

  • From the system aspect, we find that some AI motifs, such as convolution and fully connected, are CPU-intensive, while other AI motifs, such as the ReLU and sigmoid activation layers, are not. Further, the AI motifs put little pressure on disk I/O, since they load only one batch (e.g., 128 images) from disk in every iteration.

  • From the micro-architecture aspect, we find that these motifs show various computation and memory access patterns, exploiting different degrees of ILP and MLP. As the data size grows, the percentage of frontend bound decreases while that of backend bound increases.

The rest of the paper is organized as follows. Section 2 illustrates the motivation for identifying data motifs. Section 3 introduces the data motif identification methodology. Section 4 performs system and micro-architecture evaluations on the data motif implementations. In Section 5, we report the impact of data on the data motifs' behaviors from the perspectives of data size, data pattern, data type and data source. Section 6 introduces the related work. Finally, we draw a conclusion in Section 7.

2. Motivation

We take two examples to explain why we should call attention to performing comprehensive characterization of those data motifs.

2.1. SIFT Workload in Computer Vision

Figure 1. The Computation Dependency Graph and Run Time Breakdown of SIFT Workload.

SIFT (Lowe, 2004) is a typical workload for feature extraction, and is widely used to detect local features of input images.

Fig. 1 shows the computation dependency graph and run time breakdown of the SIFT workload. In total, SIFT involves five data motifs. Gaussian filters with different space scale factors are convolved with the input image to generate a group of image scale spaces. The image pyramid downsamples these image scale spaces. A DOG (difference-of-Gaussian) image is produced by matrix subtraction of adjacent image scale spaces in the image pyramid. After that, every point in one DOG scale space is sorted against its eight adjacent points in the same scale space and the points in the two adjacent scale spaces, to find the key points in the image. Through profiling, we find that computing descriptors, finding keypoints and building the Gaussian pyramid are the three main time-consuming parts of the SIFT workload. Furthermore, we analyze those three parts and find that they consist of several classes of unit of computation, namely Matrix, Sampling, Transform, Sort and Statistics, which sum up to 83.23% of the total SIFT run time.

2.2. AlexNet in AI

Figure 2. The Computation Dependency Graph and Run Time Breakdown of One Iteration of TensorFlow AlexNet Workload.

AlexNet (Krizhevsky et al., 2012) is a representative and widely-used convolutional neural network in deep learning. In total, it has eight layers, including five convolutional layers and three fully connected layers.

We profile one iteration of the AlexNet workload (implemented with TensorFlow) using the TensorBoard toolkit. Fig. 2 presents its computation dependency graph and run time breakdown. For each operator, we report its run time and its percentage of the total run time, such as 6.57 ms and 1.35% for the first convolution operator. We find that each iteration involves Transform (conv2d), Sampling (max pooling, dropout), Statistics (normalization), and Matrix (fully connected). Among them, matrix and transform computations occupy a large proportion: 48.87% and 36.91%, respectively.
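The aggregation behind a breakdown such as the one in Fig. 2 can be sketched as follows. This is a minimal illustration and not the profiling code used in the paper: the operator-to-motif mapping follows the text above, while the operator names and all timings are hypothetical placeholders rather than measured values.

```python
from collections import defaultdict

# Operator-to-motif mapping taken from the text; operator names are illustrative.
OP_TO_MOTIF = {
    "conv2d": "Transform",
    "max_pool": "Sampling",
    "dropout": "Sampling",
    "normalization": "Statistics",
    "fully_connected": "Matrix",
}


def motif_shares(op_times_ms):
    """Return each motif's percentage of the total run time."""
    totals = defaultdict(float)
    for op, ms in op_times_ms:
        totals[OP_TO_MOTIF[op]] += ms
    total = sum(totals.values())
    return {motif: 100.0 * ms / total for motif, ms in totals.items()}


# Hypothetical per-operator timings (ms) for one iteration, NOT measured data.
example_profile = [
    ("conv2d", 40.0),
    ("max_pool", 5.0),
    ("dropout", 5.0),
    ("normalization", 10.0),
    ("fully_connected", 40.0),
]
shares = motif_shares(example_profile)
```

With real profiler output in place of `example_profile`, the same grouping would yield the per-motif percentages reported for a workload.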

Through the above analysis, we have the following observation. Though big data and AI workloads are very complex and fast-changing, we can consider them as a pipeline of one or more fundamental classes of unit of computation performed on different initial or intermediate data inputs. Those classes of unit of computation, which we call data motifs, occupy most of the run time of the workloads, so we should pay more attention to them. In the next section, we will investigate more extensive big data and AI workloads, and elaborate the design of data motifs.

3. Methodology

Data motifs are frequently-appearing classes of unit of computation handling different data inputs. In this section, we illustrate how to identify data motifs from big data and AI workloads, and illustrate our data motif implementations.

| Category | Application Domain | Workload | Unit of Computation |
|---|---|---|---|
| Deep Learning | Image Recognition, Speech Recognition | Convolutional neural network (CNN) | Matrix, Sampling, Transform |
| | | Deep belief network (DBN) | Matrix, Sampling |
| Graph Mining | Search Engine, Community Detection | PageRank | Matrix, Graph, Sort |
| | | BFS, Connected component (CC) | Graph |
| Dimension Reduction | Image Processing, Text Processing | Principal components analysis (PCA) | Matrix |
| | | Latent Dirichlet allocation (LDA) | Statistics, Sampling |
| Recommendation | Association Rules Mining, Electronic Commerce | Apriori | Statistics, Set |
| | | FP-Growth | Graph, Set, Statistics |
| | | Collaborative filtering (CF) | Graph, Matrix |
| Classification | Image Recognition, Speech Recognition, Text Recognition | Support vector machine (SVM) | Matrix |
| | | K-nearest neighbors (KNN) | Matrix, Sort, Statistics |
| | | Naive Bayes | Statistics |
| | | Random forest | Graph, Statistics |
| | | Decision tree (C4.5/CART/ID3) | Graph, Statistics |
| Clustering | Data Mining | K-means | Matrix, Sort |
| Feature Preprocess | Image Processing, Signal Processing, Text Processing | Image segmentation (GrabCut) | Matrix, Graph |
| | | Scale-invariant feature transform (SIFT) | Matrix, Transform, Sampling, Sort, Statistics |
| | | Image Transform | Matrix, Transform |
| | | Term frequency-inverse document frequency (TF-IDF) | Statistics |
| Sequence Tagging | Bioinformatics, Language Processing | Hidden Markov Model (HMM) | Matrix |
| | | Conditional random fields (CRF) | Matrix, Sampling |
| Indexing | Search Engine | Inverted index, Forward index | Statistics, Logic, Set, Sort |
| Encoding/Decoding | Multimedia Processing, Security, Cryptography, Digital Signature | MPEG-2 | Matrix, Transform |
| | | Encryption | Matrix, Logic |
| | | SimHash, MinHash | Set, Logic |
| | | Locality-sensitive hashing (LSH) | Set, Logic |
| Data Warehouse | Business Intelligence | Project, Filter, OrderBy, Union | Set, Sort |

Table 1. The Importance of Eight Data Motifs in Big Data and AI Workloads.

3.1. Motif Identification Methodology

Figure 3. Identifying Data Motifs.

Fig. 3 overviews the methodology of motif identification. We first single out a broad spectrum of big data and AI workloads by investigating five typical application domains (search engine, social network, e-commerce, multimedia, and bioinformatics) and representative algorithms in four processing techniques (machine learning, data mining, computer vision and natural language processing). Then we conduct algorithmic analysis and profiling analysis on these workloads. We profile each workload to analyze its computation dependency graph and run time breakdown, and to find the hotspot functions and correlate them to code segments. Combining this with algorithmic analysis, we decompose the workload into a pipeline of units of computation, focusing on the input and intermediate data as well. Then we summarize the frequently-appearing and time-consuming units as data motifs. We repeat this procedure on forty workloads covering a broad spectrum to guarantee the representativeness of our data motifs.

According to the pipelines of units of computation and the run time breakdowns, we finalize eight big data and AI motifs, which are essential computations that take up most of the run time. Table 1 shows the importance of the eight data motifs in a majority of big data and AI workloads. Note that previous work (Guinard et al., 2010) has identified four basic units of computation in online services: get, put, post, and delete. We do not include those four in our motif set.
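The "frequently-appearing" half of this selection can be illustrated with a simple count over the workload decompositions. The sketch below uses only a subset of the rows of Table 1 and counts appearances; it does not model the run time weighting that the methodology also relies on, and the function and variable names are illustrative.

```python
from collections import Counter

# A subset of Table 1: workload -> units of computation.
WORKLOAD_MOTIFS = {
    "CNN": ["Matrix", "Sampling", "Transform"],
    "DBN": ["Matrix", "Sampling"],
    "PageRank": ["Matrix", "Graph", "Sort"],
    "BFS/CC": ["Graph"],
    "PCA": ["Matrix"],
    "LDA": ["Statistics", "Sampling"],
    "K-means": ["Matrix", "Sort"],
    "SIFT": ["Matrix", "Transform", "Sampling", "Sort", "Statistics"],
}


def motif_frequency(workload_motifs):
    """Count in how many workloads each motif appears."""
    counts = Counter()
    for motifs in workload_motifs.values():
        counts.update(set(motifs))  # count each motif once per workload
    return counts


frequency = motif_frequency(WORKLOAD_MOTIFS)
for motif, count in frequency.most_common():
    print(f"{motif}: appears in {count} of {len(WORKLOAD_MOTIFS)} workloads")
```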

3.2. Eight Data Motifs

In this subsection, we summarize eight data motifs that frequently appear in big data and AI workloads.

Matrix In big data and AI workloads, many problems involve matrix computations, such as vector-vector, matrix-vector and matrix-matrix operations.

Sampling Sampling plays an essential role in big data and AI processing. It selects a subset of samples from a statistical population, and can be used to obtain an approximate solution when a problem cannot be solved by a deterministic method.

Logic We name computations performing bit manipulation as logic computations, such as hash, data compression and encryption.
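As a small illustration of a logic computation, the sketch below computes MD5 digests with Python's standard `hashlib` module. It only mirrors the idea of the MD5 hash motif; it is not the Hadoop, Spark or Pthread implementation from the benchmark suite, and the helper names are illustrative.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def md5_per_line(text: str):
    """Hash every line of a text independently, as a record-wise hash job would."""
    return [md5_hex(line.encode("utf-8")) for line in text.splitlines()]


digests = md5_per_line("first record\nsecond record")
```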

Transform The transform computations here mean the conversion from the original domain (such as time) to another domain (such as frequency). Common transform computations include the discrete Fourier transform (DFT), the discrete cosine transform (DCT) and the wavelet transform.
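To make the time-to-frequency conversion concrete, the sketch below evaluates the DFT directly from its definition, X_k = sum over t of x_t * exp(-2*pi*i*k*t/N). It is a naive O(N^2) illustration only; the FFT used by the motif implementations computes the same result with far fewer operations.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform, straight from the definition."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# A unit impulse has a flat spectrum; a constant signal has only a DC component.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```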

Set In mathematics, Set means a collection of distinct objects. Likewise, the concept of Set is widely used in computer science. Set is also the foundation of relational algebra (Maier, 1983). In addition, similarity analysis of two data sets involves set computations, such as Jaccard similarity. Furthermore, fuzzy set and rough set play very important roles in computer science.
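For example, the Jaccard similarity mentioned above is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; returning 1.0 for two empty sets is a convention chosen here, not something the paper specifies.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


similarity = jaccard({"big", "data", "motif"}, {"data", "motif", "benchmark"})
```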

Graph A lot of applications involve graphs, with nodes representing entities and edges representing dependencies. Graph computation is notorious for having irregular memory access patterns.
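A breadth-first traversal, one of the graph workloads listed in Table 1, illustrates why access patterns are irregular: the next memory location visited depends on the adjacency data rather than on a fixed stride. The adjacency-list representation below is an assumption made for this sketch.

```python
from collections import deque


def bfs_order(adjacency, source):
    """Return the nodes reachable from source in breadth-first order."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Which entries are touched next depends on the graph's structure.
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


example_graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
order = bfs_order(example_graph, 0)
```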

Sort Sort is widely used in many areas. Jim Gray considered sort to be the core of modern databases (Asanovic et al., 2006), which shows how fundamental it is.

Statistics Statistics computations are used to obtain summary information, such as counts and probability statistics.

3.3. Data Motif Implementations

Data motifs are the fundamental components of big data and AI workloads, which makes them of great significance for evaluation, considering the complexity and diversity of those workloads. We provide the data motif implementations for big data and AI separately, according to their computation specialties. For the big data motifs, we provide Hadoop (had, 2018), Spark (Zaharia et al., 2010), and Pthread (Barney, 2009) implementations. These data motifs include sort, wordcount, grep, MD5 hash, matrix multiplication, random sampling, graph traversal and FFT transformation. For the AI motifs, we provide TensorFlow (Abadi et al., 2016) and Pthread implementations, including 2-dimensional convolution, max pooling, average pooling, ReLU activation, sigmoid activation, tanh activation, fully connected (matmul), and element-wise multiplication. We consider the impact of data input from the perspectives of type, source, size, and pattern. Among them, data type includes structured, unstructured, and semi-structured data. Data source indicates the data storage format, including text, sequence, graph, matrix, and image data. Data pattern includes the data distribution, data sparsity, etc. As for data size, we provide big data generators for text, sequence, graph and matrix data to fulfill different size requirements.

4. Characterization

In this section, we evaluate data motifs with various software stacks from the perspectives of both system and architecture behaviors.

4.1. Experiment Setups

We deploy a three-node cluster, with one master node and two slave nodes, connected by a 1 Gb Ethernet network. Each node is equipped with two Intel Xeon E5-2620 V3 (Haswell) processors, and each processor has six physical out-of-order cores. The memory of each node is 64 GB. The operating system, software stack and GCC versions are as follows: CentOS 7.2 (with kernel 4.1.13); JDK 1.8.0_65; Hadoop 2.7.1; Spark 1.5.2; TensorFlow 1.0; GCC 4.8.5. The data motifs implemented with Pthread are compiled with the "-O2" optimization option. The hardware and software details are listed in Table 2. Since Pthread is a multi-thread programming model, we evaluate both the TensorFlow and Pthread implementations of the AI motifs on one node for an apples-to-apples comparison.

Hardware Configurations

| Item | Configuration |
|---|---|
| CPU Type | Intel® Xeon E5-2620 V3 |
| Intel CPU Core | 12 cores @ 2.40 GHz |
| L1 DCache | 12 × 32 KB |
| L1 ICache | 12 × 32 KB |
| L2 Cache | 12 × 256 KB |
| L3 Cache | 15 MB |
| Memory | 64 GB, DDR4 |
| Disk | SATA @ 7200 RPM |
| Ethernet | 1 Gb |
| Hyper-Threading | Disabled |

Table 2. Configuration Details of Xeon E5-2620 V3

4.2. Experiment Methodology

Figure 4. CPU Utilization and I/O Wait of Data Motifs.
Figure 5. I/O Behaviors of Data Motifs.

We evaluate eight big data motifs implemented with Hadoop and Spark, and eight AI data motifs implemented with TensorFlow and Pthread. Note that we use the optimal configuration for each software stack, according to the cluster scale and memory size. The data configuration and selected metrics are as follows.

Data Configuration To evaluate the impact of data input comprehensively, we evaluate the data motifs with three data sizes: Small, Medium, and Large. We choose the Large data size according to the memory capacity of the cluster so as to fully utilize the memory resources, and the other two are chosen for comparison. For the graph motif, Small, Medium, Large is , and -vertex, respectively. For the matrix motif, we use 100, 1K and 10K two-dimensional matrix data with the same distribution and sparsity. For the transform motif, we use 16384, 32768 and 65536 two-dimensional matrix data. For the other big data motifs, we use 1, 10 and 100 GB of Wikipedia text data, respectively. For the AI motifs, we use three configurations in terms of input tensor sizes and channels: (224*224,64), (112*112,128) and (56*56,256). Among them, the first value indicates the dimension of the input tensor and the second value indicates the number of channels; all of them use a batch size of 128. We choose these three configurations because they are widely used in neural network models (Simonyan and Zisserman, 2014). Note that the dimension of all input tensors is 224 for the Large configuration, 112 for the Medium configuration and 56 for the Small configuration. For the Pthread-version AI motifs, we use 1K, 10K, and 100K images from ImageNet (Deng et al., 2009). In the following subsections, we characterize the system and micro-architectural behaviors of the data motifs with the Large data size. In Section 5, we will analyze the impact of data input on these characteristics with all data sizes.
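The relative sizes of the three AI configurations follow directly from the numbers above, as the following sketch shows. The element counts use only the stated dimensions, channels and batch size; the byte estimate additionally assumes 4-byte (float32) elements, which the text does not state.

```python
BATCH_SIZE = 128

# (height, width, channels) for each configuration, as listed in the text.
CONFIGS = {
    "Large": (224, 224, 64),
    "Medium": (112, 112, 128),
    "Small": (56, 56, 256),
}


def elements_per_batch(height, width, channels, batch=BATCH_SIZE):
    """Number of tensor elements in one input batch."""
    return batch * height * width * channels


def batch_bytes(height, width, channels, bytes_per_element=4):
    """Approximate batch size in bytes; 4 bytes per element is an assumption."""
    return elements_per_batch(height, width, channels) * bytes_per_element


for name, shape in CONFIGS.items():
    print(name, elements_per_batch(*shape), "elements,", batch_bytes(*shape), "bytes")
```

Each step from Large to Medium to Small quarters the spatial area while doubling the channel count, so the number of elements per batch halves at each step.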

System and Micro-architecture Metrics We characterize the system and micro-architectural behaviors (Van den Steen et al., 2016) of the data motifs, which are significant for design and optimization (Quinn et al., 2015). For system evaluation, we report the metrics of CPU utilization, I/O Wait, disk I/O bandwidth, and network I/O bandwidth. The system metrics are collected through the proc file system.
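As an illustration of collecting such metrics from the proc file system, the sketch below derives CPU utilization and I/O Wait percentages from two snapshots of the aggregate `cpu` line of `/proc/stat`. The exact definitions used in the paper are not given, so treating utilization as all non-idle, non-iowait time is an assumption, and the function names are illustrative.

```python
# Leading fields of the aggregate "cpu" line in /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def read_cpu_times(path="/proc/stat"):
    """Read the current counters (Linux only); the first line is the aggregate."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def cpu_metrics(before, after):
    """Return (CPU utilization %, I/O Wait %) between two snapshots."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0, 0.0
    busy = total - delta["idle"] - delta["iowait"]  # assumed definition of "utilization"
    return 100.0 * busy / total, 100.0 * delta["iowait"] / total


# Two made-up snapshots, used only to show the arithmetic.
snapshot_a = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
snapshot_b = parse_cpu_line("cpu  180 0 70 850 100 0 0 0 0 0")
utilization, io_wait = cpu_metrics(snapshot_a, snapshot_b)
```

On a Linux machine, calling `read_cpu_times()` before and after a run and passing both results to `cpu_metrics` would give the averages over that run.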

For micro-architectural evaluation, we use the Top-Down analysis method (Yasin, 2014), which classifies the pipeline slots into four categories: retiring, bad speculation, frontend bound and backend bound. Among them, retiring represents useful work, meaning that the issued micro operations (uops) eventually get retired. Bad speculation represents slots in which the pipeline is blocked due to incorrect speculation. Frontend bound represents stalls due to the frontend undersupplying uops to the backend. Backend bound represents stalls due to the backend lacking the resources required to accept new uops (pmu, 2018). We use Perf (per, 2018), a Linux profiling tool, to collect the hardware events, referring to the Intel Developer's Manual (Guide, 2011) and pmu-tools (pmu, 2018).
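The top-level split can be sketched from five raw counters. The formulas below are the level-1 formulas commonly given for the Top-Down method on 4-wide Intel cores; the event names in the comments and the pipeline width of 4 are not stated in this paper and should be read as assumptions, and the counter values in the example are invented.

```python
PIPELINE_WIDTH = 4  # issue slots per cycle; assumed for a Haswell-class core


def top_down_level1(cycles, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles):
    """Split pipeline slots into the four top-level Top-Down categories.

    Assumed event sources:
      cycles              CPU_CLK_UNHALTED.THREAD
      uops_not_delivered  IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         UOPS_ISSUED.ANY
      uops_retired_slots  UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     INT_MISC.RECOVERY_CYCLES
    """
    slots = PIPELINE_WIDTH * cycles
    frontend = uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + PIPELINE_WIDTH * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Invented counter values, used only to show the arithmetic.
breakdown = top_down_level1(cycles=1000, uops_not_delivered=800,
                            uops_issued=2200, uops_retired_slots=2000,
                            recovery_cycles=50)
```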

4.3. System Evaluation

Fig. 4 presents the CPU utilization and I/O Wait of all data motifs. We find that Hadoop motifs have higher CPU utilization than Spark motifs, and suffer from less I/O Wait than Spark motifs do. In particular, Hadoop motifs take 80 percent of the CPU time. The I/O Waits of the AI data motifs are much lower than those of the big data motifs. For deep neural networks, even if the total input data is large, the input layer loads only one batch from disk in every iteration, so the data loaded from disk by the input layer is a very small proportion compared to the intermediate data, and thus introduces few disk I/O requests. Pthread motifs have lower CPU utilization and I/O Wait in general, because they have fewer memory allocation and relocation operations than their counterparts on other stacks. Moreover, because the computation is simple, the data loading time overlaps the processing time. The exception is Pthread Matmul, which has almost 100% CPU utilization because of its high computation complexity and CPU-intensive characteristics. TensorFlow motifs such as AvgPool, Conv, Matmul, Maxpool, and Multiply take most of the CPU time, because these five motifs are CPU-intensive. Nevertheless, we also find that the other AI motifs, such as Relu, Sigmoid, and Tanh, are not that CPU-intensive.

Fig. 5 presents the network bandwidth and disk I/O bandwidth. Most AI motifs (e.g., matmul, relu, pooling, activation) are executed in the hidden layers, and the intermediate states of the hidden layers are stored in memory. That is to say, the hidden layers consume most of the computation and memory resources, while the disk I/O for the input layer is relatively minor. Our evaluation confirms this observation. Meanwhile, as mentioned in Section 4.1, we evaluate both the TensorFlow and Pthread implementations of the AI motifs on one node for an apples-to-apples comparison, so we do not report the I/O behaviors of the AI motifs. We find that, for all big data motifs, the Spark stack has much larger network I/O pressure than the Hadoop stack, because Spark performs more data shuffles and therefore frequently transfers data from one node to another. Five of the eight Spark implementations have smaller disk I/O pressure than their Hadoop counterparts, because Spark targets in-memory computing. The exceptions are Spark Matmul, Spark MD5 and Spark WordCount, which have larger disk I/O pressure than their Hadoop counterparts: their numbers of read sectors are nearly equal, while their numbers of written sectors are much larger.

4.4. Micro-architecture Evaluation

Figure 6. Execution Performance of Data Motifs.

To better understand the data motifs, we analyze their performance and micro-architectural characteristics.

Execution Performance The execution performance indicates the overall running efficiency of the workloads (Kim et al., 2016). We use instruction level parallelism (ILP) and memory level parallelism (MLP) to reflect the execution performance. ILP measures the number of instructions that can be executed simultaneously; here we use the retired instructions per cycle (IPC) to measure it. MLP indicates the degree of parallelism with which memory accesses can be generated and executed (Glew, 1998). MLP is computed by dividing L1D_PEND_MISS.PENDING by L1D_PEND_MISS.PENDING_CYCLES (pmu, 2018). Fig. 6 presents the ILP and MLP of all data motifs. We find that these motifs cover a wide range of ILP and MLP behaviors, reflecting distinct computation and memory access patterns. For example, TensorFlow Multiply performs element-wise multiplications and has high MLP (5.27) but extremely low ILP (0.15). This is because its computation is simple and has few data dependencies, so it generates many concurrent data loads and thus incurs a large number of data cache misses. Max pooling and average pooling also have high MLP. The MLP of average pooling is lower than that of max pooling, because the average computation involves many divide operations and thus suffers from more stalls due to the latency of the divider unit. The software stack changes a workload's computation and memory access patterns, which was also found in previous work (Jia et al., 2014). For example, both Hadoop FFT and Spark FFT are based on the Cooley-Tukey algorithm (Cooley and Tukey, 1965), yet they have different degrees of parallelism. Spark FFT is more memory-intensive and has higher MLP.
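Both metrics are simple ratios of hardware counters, as the sketch below shows. The MLP formula is the one stated above; the counter values in the example are invented and merely chosen to reproduce the 0.15 IPC and 5.27 MLP quoted for TensorFlow Multiply.

```python
def ipc(instructions_retired, cycles):
    """Retired instructions per cycle, used here as the ILP measure."""
    return instructions_retired / cycles


def mlp(l1d_pend_miss_pending, l1d_pend_miss_pending_cycles):
    """MLP = L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES."""
    return l1d_pend_miss_pending / l1d_pend_miss_pending_cycles


# Invented counter values, used only to show the arithmetic.
example_ipc = ipc(instructions_retired=150, cycles=1000)
example_mlp = mlp(l1d_pend_miss_pending=5270, l1d_pend_miss_pending_cycles=1000)
```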

Figure 7. The Uppermost Level Breakdown of Data Motifs.

The Uppermost Level Breakdown Fig. 7 shows the uppermost level breakdown of all the data motifs we evaluated. We find that these motifs have different pipeline bottlenecks. Hadoop motifs suffer from notable stalls due to frontend bound and bad speculation. Moreover, Hadoop motifs show nearly consistent bottlenecks, indicating that the Hadoop stack impacts workload behaviors more than other stacks such as Spark and TensorFlow do. Spark motifs, which mainly compute in memory, suffer from a higher percentage of backend bound than their Hadoop counterparts. Spark Grep, Sample and Sort suffer from more frontend bound, and their percentages of backend bound are smaller than those of the other Spark motifs. The AI data motifs face different bottlenecks on both TensorFlow and Pthread. Conv and Matmul have the highest IPC (about 2.2) and retiring percentages (about 50% on TensorFlow). Max pooling, average pooling, and multiply have extremely low retiring percentages, as explained above. However, activation operations such as ReLU, sigmoid and tanh suffer from more frontend bound than backend bound. For the AI data motifs implemented with Pthread, the main bottleneck is backend bound; they suffer from few frontend and bad speculation stalls.

Figure 8. The Frontend Breakdown of Data Motifs.
Figure 9. The Frontend Latency Breakdown of Data Motifs.

Frontend Bound Frontend bound can be split into frontend latency bound and frontend bandwidth bound. Latency bound means that the frontend delivers no uops to the backend, while bandwidth bound means that it delivers fewer uops than the theoretical maximum. Fig. 8 presents the frontend breakdown of the data motifs. We find that, for almost all motifs that suffer from severe frontend bound, the main cause of the frontend stalls is latency bound.

We further investigate the reasons for the frontend latency bound and the frontend bandwidth bound, respectively. Generally, frontend latency bound is caused by six factors: icache misses, itlb misses, branch resteers, DSB (Decoded Stream Buffer) switches, LCP (Length Changing Prefix), and MS (microcode sequencer) switches. Among them, icache miss and itlb miss are instruction cache misses and instruction TLB misses. Branch resteers are the delays in obtaining the correct instructions, such as the delays due to branch misprediction. LCP measures the stalls when decoding instructions with a length changing prefix. Generally, uops come from three places: the decoded uops cache (DSB), the legacy decode pipeline (MITE) and the microcode sequencer (MS). DSB switches record the stalls caused by switching from the DSB to MITE. MS switches measure the penalty of switching to the MS unit. As for frontend bandwidth bound, there are mainly two reasons: the inefficiency of the MITE pipeline and the inefficient utilization of the DSB cache. Additionally, LSD represents the stalls due to waiting for uops from the loop stream detector (lsd, 2018). Fig. 9 lists the latency and bandwidth bound breakdown of all data motifs. For almost all data motifs, branch resteers are a main reason for the high percentage of frontend bound, except for Spark Matmul and for Relu, Sigmoid and Tanh on TensorFlow. For these three activation functions, nearly 60% of the frontend bound is due to instruction cache misses. On average, big data motifs implemented with Hadoop and Spark suffer from more icache misses than AI data motifs. Moreover, MS switches are another significant factor in frontend latency bound. This is because big data and AI systems use many CISC instructions that cannot be decoded by the default decoder and must instead be decoded by the MS unit, which results in performance penalties.

Figure 10. The Backend Bound Breakdown of Data Motifs.
Figure 11. The Backend Core Bound Breakdown of Data Motifs.

Backend Bound Fig. 10 presents the backend bound breakdown of data motifs, split into backend memory bound and backend core bound. Backend memory bound is mainly caused by data movement delays among different levels of the memory hierarchy. Backend core bound is mainly caused by the lack of hardware resources (e.g. the divider unit) or by port under-utilization due to instruction dependencies and execution unit overloading. We find that more than half of these data motifs suffer from more backend memory bound than core bound. However, for each software stack, there is at least one data motif whose core bound percentage equals or exceeds its memory bound percentage, such as Hadoop WordCount, Spark MD5, TensorFlow Conv and Pthread AvgPool. Fig. 11 shows the core bound breakdown. We find that TensorFlow AvgPool and Hadoop WordCount suffer from significantly long divider unit latency, while Spark MD5 and TensorFlow Conv, which has the highest percentage of backend core bound, mainly suffer from stalls due to port under-utilization. As for backend memory bound, we find that DRAM memory bound is much more severe than level 1, 2, and 3 cache bound for almost all big data and AI motifs, indicating that the memory wall (Wulf and McKee, 1995) still exists and remains a target for optimization.

Figure 12. Linkage Distance of Data Motifs.

5. Impact of Data Input

In this section, we evaluate the impact of data input on system and micro-architecture behaviors from the perspectives of size, source, type, and pattern. For the type and pattern evaluations, we use Sort and FFT as examples, respectively.

5.1. Impact of Data Size

Based on all sixty metrics spanning system and micro-architecture that we evaluated in Section 4, we conduct a coarse-grained similarity analysis using PCA (Principal Component Analysis) (Jolliffe, 1986) and hierarchical clustering (Johnson, 1967) on three data size configurations. Fig. 12 presents the linkage distance of all data motifs, which indicates the similarity of system and micro-architecture behaviors. Note that the smaller the linkage distance, the more similar the behaviors. We find that data motifs with a small data size are more likely to be clustered together. A small data size does not fully utilize the system and hardware resources, so these configurations tend to show similar behaviors. However, a motif that is computation intensive and has high computational complexity is clustered with its small-data-set configuration even when run on the large data set. For example, FFTs with the three data size configurations are clustered together for both the Hadoop and Spark versions. AI motifs with TensorFlow implementations are also greatly affected by the input data size. However, their behaviors are distinct from those of the big data motifs implemented with Hadoop and Spark, with a minimum linkage distance of 6.71.
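To make the notion of linkage distance concrete, here is a small standard-library sketch of agglomerative clustering over metric vectors. It is not the authors' script: the PCA step is omitted for brevity, single linkage is an assumed choice, and the two-metric vectors and their labels are invented for illustration.

```python
import math


def zscore_columns(rows):
    """Normalize each metric (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points, labels):
    """Merge the two closest clusters repeatedly; record each linkage distance."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members = sorted(labels[i] for i in clusters[a] + clusters[b])
        merges.append((members, d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical (IPC, disk I/O MB/s) vectors, for illustration only.
labels = ["fft-small", "fft-large", "sort-small", "sort-large"]
metrics = [[1.0, 10.0], [1.1, 11.0], [0.5, 50.0], [0.4, 90.0]]
merges = single_linkage(zscore_columns(metrics), labels)
for members, distance in merges:
    print(members, round(distance, 3))
```

In this toy input the two FFT configurations merge first at the smallest linkage distance, mirroring the observation that a computation-intensive motif clusters with itself across data sizes.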

Figure 13. Impact of Data Size on I/O Behaviors.

Impact of Data Size on I/O Behaviors We evaluate the impact of data size on I/O behaviors using the fully distributed Hadoop and Spark motif implementations. Using the I/O bandwidth of the Small data size as the baseline, we normalize the I/O bandwidth of the Medium and Large data sizes, as illustrated in Fig. 13. The bold black horizontal line in Fig. 13 marks parity with the small input; a value above the line means a larger I/O bandwidth than with the small input. We do not report the performance data of the AI motifs because disk I/O is negligible in neural network modelling, as illustrated in Subsection 4.3. We find that for almost all data motifs, I/O behaviors are sensitive to the data size. When the data size is large enough, the whole data set cannot be held in memory, so data have to be swapped in and out frequently, which puts great pressure on disk I/O. Moreover, modern big data and AI systems are distributed, with data stored on a distributed file system such as HDFS (Shvachko et al., 2010), so data shuffling or data imbalance generates a large amount of network I/O.
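The normalization used for Fig. 13 amounts to dividing each bandwidth by the Small-input value. A minimal sketch, with hypothetical bandwidth numbers:

```python
def normalize_to_small(bandwidth_by_size):
    """Express each I/O bandwidth relative to the small-input baseline."""
    base = bandwidth_by_size["small"]
    if base <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return {size: value / base for size, value in bandwidth_by_size.items()}


# Hypothetical disk I/O bandwidths in MB/s, for illustration only.
measured = {"small": 20.0, "medium": 50.0, "large": 90.0}
normalized = normalize_to_small(measured)
print(normalized)
```

A normalized value above 1.0 corresponds to a bar above the bold horizontal line in the figure.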

Figure 14. Impact of Data Size on Pipeline Efficiency.

Impact of Data Size on Pipeline Efficiency We further measure the impact of data size on pipeline efficiency. As shown in Fig. 14, as the data size increases, the percentage of frontend bound decreases while the percentage of backend bound increases. For example, with the large input size, Spark Matmul shows a decrease of nearly 20% in frontend bound and an increase of more than 30% in backend bound. As the data size increases, the caches and even main memory are unable to hold all the data, which incurs many data cache misses and results in large memory hierarchy penalties.

5.2. Impact of Data Pattern

(a) System Behavior with Different Patterns.
(b) Micro-architecture Behavior with Different Patterns.
Figure 15. Impact of Data Pattern on Data Motifs.

Data pattern and data distribution impact workload performance significantly (Xie et al., 2018; Yilmaz et al., 2016). To evaluate the impact of data pattern on the motifs, we run the FFT motif as an example with two different patterns, a dense matrix and a sparse matrix. The matrix sparsity is the ratio of zero-valued elements among all matrix elements. With different sparsity, the data access patterns vary and thus lead to different behaviors.

We use two 16384 × 16384 matrices as the input for the FFT motif, one with 10% sparsity and the other with 90% sparsity. Fig. 15 shows the impact of data pattern on the data motifs from the system (Fig. 15(a)) and micro-architecture (Fig. 15(b)) perspectives. We find that with the high-sparsity matrix, the network I/O and disk I/O are nearly half of the values with the dense matrix, while the major page faults per second are almost the same. Spark motifs suffer from more I/O pressure than Hadoop motifs. As for pipeline bottlenecks, sparse data input incurs more frontend stalls but fewer backend stalls.
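The sparsity definition can be illustrated with a small generator. This is a sketch only, not the BigDataBench matrix generator: the function names are hypothetical, sparsity is passed as a fraction, and a 10 × 10 matrix stands in for the much larger inputs used in the paper.

```python
import random


def make_matrix(rows, cols, sparsity, seed=0):
    """Build a matrix in which a `sparsity` fraction of elements is zero."""
    rng = random.Random(seed)
    total = rows * cols
    zeros = round(total * sparsity)
    # Choose exactly `zeros` distinct positions to hold zero values.
    zero_positions = set(rng.sample(range(total), zeros))
    return [[0.0 if r * cols + c in zero_positions else rng.uniform(0.1, 1.0)
             for c in range(cols)]
            for r in range(rows)]


def sparsity_of(matrix):
    """Ratio of zero-valued elements among all matrix elements."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0.0)
    return zeros / total


dense = make_matrix(10, 10, sparsity=0.10)
sparse = make_matrix(10, 10, sparsity=0.90)
print(sparsity_of(dense), sparsity_of(sparse))
```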

5.3. Impact of Data Type and Source

(a) System Behavior with Different Types.
(b) Micro-architecture Behavior with Different Types.
Figure 16. Impact of Data Type and Source on Data Motifs.

Data types and sources are of great significance for read and write efficiency (Eeckhout et al., 2003), considering their storage formats and targeted scenarios, such as support for splittable files and compression levels. To evaluate the impact of data type and source on system and micro-architecture behaviors, we use two different data types for the Sort motif, with the same data size of 10 GB. The two types are unstructured Wikipedia text data and semi-structured sequence data. A Wikipedia text file is laid out in lines, and each line records the content of one article. Sequence files are flat files that consist of key and value pairs, stored in binary format. Fig. 16 shows the impact of data type on data motifs from the system (Fig. 16(a)) and micro-architecture (Fig. 16(b)) aspects. We find that the difference between using the text type and the sequence type ranges from 1.12 times to 7.29 times from the system aspect. With text data, the CPU utilization is lower than with sequence data, which indicates that using sequence data achieves better performance. Moreover, both Hadoop Sort and Spark Sort suffer from more major page faults, which further impact execution performance because pages must be loaded from disk. Note that Fig. 16 reports the number of major page faults per second, and the total number during the run is about 100 to 200. Even with the same data size, the network I/O and disk I/O bandwidths differ greatly. We find that the sequence format has larger I/O bandwidth requirements than the text format. From the micro-architecture aspect (Fig. 16(b)), Sort with different data types shows different percentages of pipeline bottlenecks. With the text format, the backend bound bottleneck is more severe, especially backend memory bound, which indicates that more cycles are wasted waiting for data from cache or memory.
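The difference between the two layouts can be illustrated with a toy encoder. The sketch below is not Hadoop's actual SequenceFile on-disk format; it only contrasts line-oriented text with length-prefixed binary key and value records, and all function names are hypothetical.

```python
import struct


def encode_text(records):
    """Line-oriented layout: one article per line, keys are not stored."""
    return "".join(f"{value}\n" for _, value in records).encode("utf-8")


def encode_key_value(records):
    """Binary layout: each record is two 4-byte lengths, then key and value."""
    out = bytearray()
    for key, value in records:
        k, v = key.encode("utf-8"), value.encode("utf-8")
        out += struct.pack(">II", len(k), len(v)) + k + v
    return bytes(out)


def decode_key_value(blob):
    """Read records back without scanning for line delimiters."""
    records, pos = [], 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + klen].decode("utf-8")
        pos += klen
        value = blob[pos:pos + vlen].decode("utf-8")
        pos += vlen
        records.append((key, value))
    return records


# Hypothetical records, for illustration only.
articles = [("doc-1", "first article body"), ("doc-2", "second article body")]
text_blob = encode_text(articles)
binary_blob = encode_key_value(articles)
print(len(text_blob), len(binary_blob))
```

The reader of the binary layout jumps from record to record using the stored lengths, whereas the text reader must scan for newlines, which is one reason the two formats can stress the system differently for the same logical data.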

6. Related Work

Our big data and AI motifs are inspired by previous successful abstractions in other application scenarios. The set concept in relational algebra (Codd, 1970) abstracted five primitive and fundamental operators, setting off a wave of relational database research. The set abstraction is the basis of relational algebra and the theoretical foundation of databases. Phil Colella (Colella, 2004) identified seven motifs of numerical methods that he thought would be important for the next decade. Based on that, a multidisciplinary group of Berkeley researchers proposed 13 motifs as high-level abstractions of parallel computing, capturing the computation and communication patterns of a large number of applications (Asanovic et al., 2006). The National Research Council proposed seven major tasks in massive data analysis (Council, 2013), which they called giants. These seven giants are macroscopic definitions of problems in massive data analysis from the perspective of mathematics, while our eight classes of motifs are the main time-consuming units of computation in big data and AI workloads.

Application kernels (Bailey et al., 1991; Dongarra et al., 2003) also aim at scaling down the run time of real applications while preserving the main characteristics of the workload. Consisting of the major function of an application, a kernel tries to cover the bottleneck of the real application. However, kernels are still insufficient for understanding complex and diverse big data and AI workloads (Bailey et al., 1991; Lilja, 2005). In addition, kernels mainly focus on CPU and memory behaviors and pay little attention to I/O, which is also important for many real applications, especially in an era of data explosion.

7. Conclusions

In this paper, we answer the question of what the abstractions of time-consuming units of computation in big data and AI workloads are. We identify eight data motifs among a wide variety of big data and AI workloads, including Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistic computations. We find that combinations of one or more data motifs, with different weights in terms of run time, can describe most of the big data and AI workloads we investigated (Gao et al., 2018a). We implement the data motifs for big data and AI separately, including the big data motif implementations using Hadoop, Spark and Pthreads, and the AI data motif implementations using TensorFlow and Pthreads, considering the impact of data type, data source, data size, and data pattern. We release them as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench, publicly available from http://prof.ict.ac.cn/BigDataBench. From the system and micro-architecture perspectives, we comprehensively characterize the behaviors of data motifs and identify their bottlenecks. Further, we measure the impact of data type, data source, data pattern and data size on their behaviors. We find that these data motifs cover a wide performance space from the perspectives of system and micro-architecture behaviors. Moreover, the behavior of each data motif is not only influenced by its algorithm, but also largely affected by the type, source, size, and pattern of the input data. We believe our work is an important step toward not only big data and AI benchmarking, but also domain-specific hardware and software co-design.

8. Acknowledgements

This work is supported by the National Key Research and Development Plan of China (Grant No. 2016YFB1000600 and 2016YFB1000601). The authors are very grateful to anonymous reviewers for their insightful feedback and Dr. Zhen Jia for his valuable suggestions.

References

  • had (2018) 2018. Hadoop. http://hadoop.apache.org/. (2018).
  • lsd (2018) 2018. LSD. https://software.intel.com/en-us/vtune-amplifier-help-front-end-bandwidth-lsd. (2018).
  • per (2018) 2018. Perf tool. https://perf.wiki.kernel.org/index.php/Main_Page. (2018).
  • pmu (2018) 2018. PMU Tools. https://github.com/andikleen/pmu-tools. (2018).
  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning.. In OSDI, Vol. 16. 265–283.
  • Asanovic et al. (2006) Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Yelick Katherine. 2006. The landscape of parallel computing research: A view from Berkeley. Technical Report. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley.
  • Bailey et al. (1991) David H Bailey, Eric Barszcz, John T Barton, David S Browning, Robert L Carter, Leonardo Dagum, Rod A Fatoohi, Paul O Frederickson, Thomas A Lasinski, Rob S Schreiber, H D Simon, V Venkatakrishnan, and S K Weeratunga. 1991. The NAS parallel benchmarks. The International Journal of Supercomputing Applications 5, 3 (1991), 63–73.
  • Barney (2009) Blaise Barney. 2009. POSIX threads programming. Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/pthreads/. (2009).
  • Chen et al. (2014a) Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014a. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM Sigplan Notices 49, 4 (2014), 269–284.
  • Chen et al. (2014b) Yanpei Chen, Francois Raab, and Randy Katz. 2014b. From tpc-c to big data benchmarks: A functional workload model. In Specifying Big Data Benchmarks. Springer, 28–43.
  • Codd (1970) Edgar F Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377–387.
  • Colella (2004) Phillip Colella. 2004. Defining software requirements for scientific computing. (2004).
  • Cooley and Tukey (1965) James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297–301.
  • Council (2013) NR Council. 2013. Frontiers in Massive Data Analysis. The National Academies Press Washington, DC.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
  • Dongarra et al. (2003) Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. 2003. The LINPACK benchmark: past, present and future. Concurrency and Computation: practice and experience 15, 9 (2003), 803–820.
  • Eeckhout et al. (2003) Lieven Eeckhout, Hans Vandierendonck, and Koen De Bosschere. 2003. Quantifying the impact of input data sets on program behavior and its applications. Journal of Instruction-Level Parallelism 5, 1 (2003), 1–33.
  • Ferdman et al. (2012) Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
  • Gao et al. (2018a) Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Zhen Jia, Daoyi Zheng, Chen Zheng, Xiwen He, Hainan Ye, Haibin Wang, and Rui Ren. 2018a. Data Motif-based Proxy Benchmarks for Big Data and AI Workloads. Workload Characterization (IISWC), 2018 IEEE International Symposium on (2018).
  • Gao et al. (2018b) Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Xu Wen, Rui Ren, Chen Zheng, Hainan Ye, Jiahui Dai, Zheng Cao, et al. 2018b. BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite. Under review of IEEE Transaction on Parallel and Distributed Systems (2018).
  • Glew (1998) Andrew Glew. 1998. MLP yes! ILP no. ASPLOS Wild and Crazy Idea Session’98 (1998).
  • Guide (2011) Part Guide. 2011. Intel® 64 and IA-32 Architectures Software Developer's Manual. Volume 3B: System programming Guide, Part 2 (2011).
  • Guinard et al. (2010) Dominique Guinard, Vlad Trifa, and Erik Wilde. 2010. A resource oriented architecture for the web of things. In Internet of Things (IOT), 2010. IEEE, 1–8.
  • Hennessy and Patterson (2018) John Hennessy and David Patterson. 2018. A New Golden Age for Computer Architecture: Domain-specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development. (2018).
  • Jia et al. (2014) Zhen Jia, Jianfeng Zhan, Lei Wang, Rui Han, Sally A McKee, Qiang Yang, Chunjie Luo, and Jingwei Li. 2014. Characterizing and subsetting big data workloads. In IEEE International Symposium on Workload Characterization (IISWC).
  • Johnson (1967) Stephen C Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32, 3 (1967), 241–254.
  • Jolliffe (1986) Ian T Jolliffe. 1986. Principal component analysis and factor analysis. In Principal component analysis. Springer, 115–128.
  • Jouppi et al. (2017) Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 1–12.
  • Kim et al. (2016) Gwangsun Kim, Jiyun Jeong, John Kim, and Mark Stephenson. 2016. Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs. In Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 339–350.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Lilja (2005) David J Lilja. 2005. Measuring computer performance: a practitioner’s guide. Cambridge university press.
  • Lowe (2004) David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
  • Luszczek et al. (2006) Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kepner, Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. 2006. The HPC Challenge (HPCC) benchmark suite. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing. Citeseer, 213.
  • Maier (1983) David Maier. 1983. The theory of relational databases. Vol. 11. Computer science press Rockville.
  • Owens et al. (2008) John D Owens, Mike Houston, David Luebke, Simon Green, John E Stone, and James C Phillips. 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879–899.
  • Quinn et al. (2015) Heather Quinn, William H Robinson, Paolo Rech, Miguel Aguirre, Arno Barnard, Marco Desogus, Luis Entrena, Mario Garcia-Valderas, Steven M Guertin, David Kaeli, et al. 2015. Using benchmarks for radiation testing of microprocessors and FPGAs. IEEE Transactions on Nuclear Science 62, 6 (2015), 2547–2554.
  • Shah et al. (2010) Mehul Shah, Parthasarathy Ranganathan, Jichuan Chang, Niraj Tolia, David Roberts, and Trevor Mudge. 2010. Data dwarfs: Motivating a coverage set for future large data center workloads. In Proc. Workshop Architectural Concerns in Large Datacenters.
  • Shvachko et al. (2010) Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on. Ieee, 1–10.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Van den Steen et al. (2016) Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E Carlson, David Black-Schaffer, Erik Hagersten, and Lieven Eeckhout. 2016. Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans. Comput. 65, 12 (2016), 3537–3551.
  • Wang et al. (2014) Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. Bigdatabench: A big data benchmark suite from internet services. In IEEE International Symposium On High Performance Computer Architecture (HPCA).
  • Wulf and McKee (1995) Wm A Wulf and Sally A McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news 23, 1 (1995), 20–24.
  • Xie et al. (2018) Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on X86 Processors. In 2018 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
  • Yasin (2014) Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on. IEEE, 35–44.
  • Yilmaz et al. (2016) Buse Yilmaz, Bariş Aktemur, MaríA J Garzarán, Sam Kamin, and Furkan Kiraç. 2016. Autotuning runtime specialization for sparse matrix-vector multiplication. ACM Transactions on Architecture and Code Optimization (TACO) 13, 1 (2016), 5.
  • Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. 10–10.

Appendix A Artifact appendix

a.1. Abstract

The artifact contains our big data and AI motif implementations on Hadoop, Spark, Pthreads, and TensorFlow stacks. It can support the characterization results in Chapter four and Chapter five of our PACT 2018 paper Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads. To validate the results, deploy the experiment environment and profile the benchmarks.

a.2. Artifact check-list (meta-information)

  • Program: Data motif implementations

  • Compilation: GCC 4.8.5; Python 2.7.5; Java 1.8.0_65

  • Data set: generated by BigDataBench

  • Run-time environment: CentOS 7.2, Linux Kernel 4.1.13 with Perf tool

  • Hardware: Processor supporting Top-Down analysis, above Sandy Bridge series, and the performance events corresponding to the processor

  • Run-time state: Disable Hyper-Threading

  • Execution: root user or users that can execute sudo without password

  • Output: the system and micro-architecture profiling results

  • Experiment: Deploy the data motifs and corresponding software stacks; run benchmarks; profile using perf; output the results

  • Workflow frameworks used? No

  • Publicly available?: Yes

a.3. Description

a.3.1. How delivered

The data motifs are the micro benchmarks of BigDataBench 4.0—an open source big data and AI benchmark suite. Download link:

All the related files are under the "pact2018" directory; please refer to the README for a detailed description. Note that to obtain accurate performance data, the user should make sure that no other motif is running before running a motif. The running scripts we provide are tailored to our cluster environment (e.g. node IP/hostname and port number); if you download and use them in your own cluster environment, you need to modify the scripts accordingly.

a.3.2. Hardware dependencies

The data motifs can be run on all processors that can deploy the Hadoop, Spark, TensorFlow and Pthreads stacks. However, for Top-Down analysis, due to performance counter limitations, we suggest Intel Xeon processors above the Sandy Bridge series. Users also need to find the performance counters corresponding to their specific processor. We provide profiling scripts for the Xeon E5-2620 V3 (Haswell) processor.

a.3.3. Software dependencies

JDK 1.8.0_65; Hadoop 2.7.1; Spark 1.5.2; TensorFlow 1.0; GCC 4.8.5.

a.3.4. Data sets

We provide data generators for text, sequence, graph, and matrix data. Users can find the data generation method in the README file or BigDataBench user manual. The generation parameter used in our paper for the graph motif is 22 (Small), 24 (Medium), 26 (Large), respectively.

a.4. Installation

Users need to install Hadoop, Spark, GCC and TensorFlow. The installation details can be found in the User Manual of BigDataBench. We provide a "Makefile" for the Pthreads motifs. For all data motifs, we provide running scripts in our package.

a.5. Experiment workflow

Before profiling system and micro-architecture metrics of one motif, users should make sure there is no other motif/workload running.

a.5.1. Data generation

We provide text, graph, matrix, and sequence data generators under the data-generator directory. To generate the large, medium and small data used in our paper, we provide a script "data-generator.sh". Make sure Hadoop is running, because the script uploads the generated data to HDFS. The command to run the script is:

#sh data-generator.sh <format> <datasize>

Note that <format> can be text, seq, graph or matrix, and <datasize> can be large, medium or small.

The generators also support generating other data sizes as needed.

Graph data generation:

#cd $pact2018/data-generator/genGraphData

#./genGraph.sh <log2_vertex>

For example, ./genGraph.sh 26 for 2^26-vertex graph data.
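As a quick check of what the generation parameter means, the following sketch converts the three scale values listed in Section a.3.4 into vertex counts. The dictionary and function names are illustrative only.

```python
# Scale parameters used in the paper for the graph motif (Section a.3.4).
GRAPH_SCALE = {"small": 22, "medium": 24, "large": 26}


def vertex_count(log2_vertex):
    """The generator argument is the base-2 logarithm of the vertex count."""
    return 1 << log2_vertex


sizes = {name: vertex_count(scale) for name, scale in GRAPH_SCALE.items()}
for name, count in sizes.items():
    print(f"{name}: 2^{GRAPH_SCALE[name]} = {count:,} vertices")
```

Each step of two in the parameter therefore quadruples the number of vertices.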

Matrix data generation:

#cd $pact2018/data-generator/genMatrixData

For floating-point data: #./generate-matrix.sh <row_num> <colum_num> <sparsity>

For integer data: #./generate-matrix-int <row_num> <colum_num> <sparsity>

The sparsity parameter is the percentage of elements that are zero.

Text data generation:

#cd $pact2018/data-generator/genTextData

#./genText.sh <size>

Note that the size parameter is the amount of text data to generate, in gigabytes.

Sequence data generation:

Sequence data are converted from the wiki text data, so users should generate the text data first and put it on HDFS; for example, the "wiki-10G" data are on HDFS.

#cd $pact2018/data-generator/genSeqData

#./sort-transfer.sh <size>

a.5.2. Run the workloads.

We provide running scripts for all workloads. During the running process, the profiling scripts are started to sample the system and architecture metrics.

For Hadoop motifs:

1) Under pact2018 directory

2) Start Hadoop: #./start-hadoop.sh

3) Choose one Hadoop motif: #./run-hadoop.sh motif datasize

Note that the datasize parameter can be "large", "medium" or "small", meaning the large, medium or small data size, respectively. For example: #./run-hadoop.sh graph large

For Spark motifs:

1) Under pact2018 directory

2) Start Spark: #./start-spark.sh

3) Choose one Spark motif: #./run-spark.sh motif datasize

Note that the datasize parameter can be "large", "medium" or "small", meaning the large, medium or small data size, respectively. For example: #./run-spark.sh graph large

For TensorFlow motifs:

1) Under pact2018 directory

2) Choose one TensorFlow motif: #./run-tensorflow.sh motif datasize

Note that the datasize parameter can be "large", "medium" or "small", meaning the large, medium or small data size, respectively. For example: #./run-tensorflow.sh relu large

For Pthread motifs:

1) Under pact2018 directory

2) Choose one Pthread motif: #./run-pthread.sh motif datasize

Note that the datasize parameter can be "large", "medium" or "small", meaning the large, medium or small data size, respectively. For example: #./run-pthread.sh relu large
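The four invocations above share one pattern, which a small helper can capture when scripting many runs. This is a hypothetical convenience wrapper, not part of the artifact; it only builds the command line from the script names given above and does not execute anything.

```python
STACKS = ("hadoop", "spark", "tensorflow", "pthread")
SIZES = ("small", "medium", "large")


def run_command(stack, motif, datasize):
    """Build the argument list for one motif run, validating stack and size."""
    if stack not in STACKS:
        raise ValueError(f"unknown stack: {stack}")
    if datasize not in SIZES:
        raise ValueError(f"unknown data size: {datasize}")
    # Script names follow the run-<stack>.sh pattern described above.
    return [f"./run-{stack}.sh", motif, datasize]


command = run_command("spark", "graph", "large")
print(" ".join(command))
```

The resulting list could be passed to a process launcher from within the pact2018 directory, one motif at a time so that no two motifs run concurrently.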

The sampling results of the system and micro-architecture metrics are under the "result" directory. We provide processing scripts for computing the results and plotting the figures. Please refer to the "README" file for details.

a.5.3. Process the metric data and plot the figures

We provide processing scripts and figure plotting scripts to generate the figures used in the paper. Note that the sampling results are saved under the "result" directory when the test finishes.

1) Compute the performance data and save them in an excel file.

#python lsdata.py result result_new 1

Parameter "result" is the input directory that contains the sampling results; parameter "result_new" is the output Excel file name, and the output file is result_new.xls.

2) Plot the figures and save them as png image format

#python plot.py result_new.xls

Parameter "result_new.xls" is the Excel file generated by the first step. After running the command, several png files will be generated. In addition, "pact-AE.txt" is generated for linkage distance analysis.

3) Linkage distance computing

#cd $pact2018/Linkage-Distance

#python hiclust_wiht_newpca.py pact-AE.txt

Parameter "pact-AE.txt" is the text file generated by the second step. After running the command, a png file will be generated under the Linkage-Distance directory, which is used as Figure 12 in our paper.

a.6. Evaluation and expected result

To evaluate the system and micro-architecture performance of data motifs, users need to run those motifs and profile them. The data motifs should show characteristics similar to the figures in Chapter 4 and Chapter 5. Our profiling scripts sample the performance data every second during the whole motif run time, and the performance data may vary slightly from run to run.
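When comparing a reproduction against the published figures, it helps to average the per-second samples of each run and to quantify the run-to-run variation. The sketch below is one possible way to do so; it is not the provided processing script, and the sample values are hypothetical.

```python
from statistics import mean


def run_average(samples):
    """Average the per-second samples collected during one run."""
    return mean(samples)


def relative_spread(run_means):
    """Range of the per-run averages, relative to their overall mean."""
    center = mean(run_means)
    return (max(run_means) - min(run_means)) / center


# Hypothetical per-second IPC samples from three runs of one motif.
runs = [
    [1.9, 2.1, 2.0],
    [2.0, 2.2, 2.1],
    [1.8, 2.0, 1.9],
]
run_means = [run_average(samples) for samples in runs]
spread = relative_spread(run_means)
print(run_means, spread)
```

A small spread suggests that differences from the published numbers come from the platform configuration rather than from measurement noise.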

a.7. Experiment customization

Users can run these data motifs for different benchmarking purposes, e.g. software stack comparison or characterization of different aspects of the system and architecture. The data motifs can also be deployed on different processors and cluster scales.

a.8. Notes

For the artifact evaluation, since every motif needs to run three times to collect dozens of performance events, it may take several weeks to profile all motifs, which is too expensive for the artifact evaluation. We therefore provide the profiling scripts and the profiling data used in our paper, which suit our Haswell processor configuration. Since the software configuration (e.g. Hadoop/Spark configuration) and hardware configuration (e.g. memory capacity, BIOS configuration) may differ, the performance data may be different on another platform.