HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems

With the tremendous growing interest in Big Data systems, analyzing and facilitating their performance improvement becomes increasingly important. Although much research effort has been devoted to improving the performance of Big Data systems, efficiently analyzing and diagnosing performance bottlenecks over these massively distributed systems remains a major challenge. In this paper, we propose a spatio-temporal correlation analysis approach based on the stage characteristic and distribution characteristic of Big Data applications, which can associate multi-level performance data in a fine-grained manner. On the basis of the correlated data, we define some prior rules, select features, and vectorize the corresponding datasets for different performance bottlenecks, such as workload imbalance, data skew, abnormal nodes and outlier metrics. We then utilize data-driven and model-driven algorithms for bottleneck detection and diagnosis. In addition, we design and develop a lightweight, extensible tool, HybridTune, and validate its diagnosis effectiveness with BigDataBench on several benchmark experiments, in which it outperforms state-of-the-art methods. Our experiments show that the accuracy of abnormal/outlier detection reaches about 80%. Furthermore, we report several Spark and Hadoop use cases that demonstrate how HybridTune supports users in carrying out performance analysis and diagnosis efficiently on Spark and Hadoop applications, and our experiences show that HybridTune can help users find performance bottlenecks and provide optimization recommendations.

1 Introduction

The computing industry has recently witnessed an unprecedented increase in the popularity of Big Data. The amount of data in our world is growing faster than ever, and it is estimated that 90% of the global data has been created in the last two years [19]. All fields of our lives now rely heavily on Big Data platforms, directly or indirectly, and Big Data systems such as Hadoop [12], Spark [39], Dryad [20] and Storm [34] that derive ongoing business value have lately become megatrends in enterprises.

Fig. 1: The Performance Diagnosis Approach based on Spatio-temporal Correlation.

There is no doubt that performance optimization of Big Data systems, at any level, attracts great attention from both academia and industry in the current data-driven society. However, developing efficient performance analysis and optimization for Big Data systems remains a big challenge. Big Data systems are often constructed from thousands of distributed computing machines, which means that performance issues may arise in a wide variety of subsystems or heterogeneous node configurations, such as processors, memory, disks and networks. Moreover, the entire software/hardware stacks of Big Data applications are very complicated and include hundreds of adjustable parameters, which makes performance analysis more complex and requires fine-grained performance data collection and multi-level data association. So far, the majority of state-of-the-art performance optimization approaches for Big Data systems have focused on performance analysis [10][16][15][32][9] and Big Data systems tuning [18], etc. Although existing studies devote much effort to the overall performance analysis of Big Data systems and have made substantial progress, they are still limited in locating the root causes of performance bottlenecks in a fine-grained manner with multiple data sources associated during the life cycle of applications. Moreover, purely data-driven diagnosis approaches are promising for relatively simple distributed applications, but they are very time-consuming and difficult to apply to complex Big Data systems. Therefore, we explore and implement HybridTune, a hybrid method that combines spatio-temporal data-driven and model-driven approaches. It not only diagnoses performance bottlenecks in a fine-grained manner based on data characteristics, but also significantly reduces training time by using a prior-knowledge-based model. Concretely, our main contributions are:

  • We propose a spatio-temporal correlation analysis approach based on the stage characteristic and distribution characteristic of Big Data applications, which can associate multi-level performance data in a fine-grained manner. On the basis of the correlated data, we carry out feature selection and vectorization, define some prior knowledge, and then use data-driven and model-driven approaches for performance bottleneck detection and diagnosis.

  • We design and develop a lightweight, extensible tool, HybridTune, with spatio-temporal correlation analysis and a model- and data-driven diagnosis approach for Big Data systems. We then evaluate the diagnosis results of anomaly simulation on distributed systems, and validate the effectiveness of our tool with BigDataBench on several benchmark experiments, in which it outperforms state-of-the-art methods. Our experiments show that the accuracy of abnormal/outlier detection reaches about 80%.

  • Additionally, we introduce several Spark and Hadoop use cases, and we demonstrate how HybridTune supports users in carrying out performance analysis and diagnosis efficiently on Spark and Hadoop applications. Above all, our model-based and data-driven detection and diagnosis methods based on spatio-temporal correlation data significantly help to optimize the performance of applications on Big Data platforms.

The rest of the paper is organized as follows. Section 2 describes the characteristics of Big Data systems. Section 3 gives the details of our performance diagnosis methodology. Section 4 presents the implementation of HybridTune. Section 5 presents the experiments and lists the experimental results. Section 6 introduces some case studies. A brief discussion of related work is presented in Section 7. Finally, concluding thoughts are offered in Section 8.

2 Characteristics of Big Data Systems

Current Big Data systems take two different forms for data processing: (1) batch processing systems, such as Hadoop [12], Spark [39] and Dryad [20]; and (2) stream processing systems, such as Storm [34] and Spark Streaming [40]. In general, these Big Data systems have the following two characteristics.

2.1 Temporal dimension: Stage Characteristics

For Big Data applications, we observe that jobs on different Big Data systems are generally executed in stages or in stage-like processes, and tasks in the same stage are normally executed by the same or similar code on different data partitions. For example, a Hadoop job executes in two stages: Map and Reduce. In the Map phase, several map tasks are executed in parallel to process the corresponding input data; after all map tasks have finished, the intermediate results are transferred to the reduce tasks for further processing. When scheduling jobs, Spark splits stages and partitions data, then generates the DAG of stage dependencies; the tasks in one stage are executed by the same code on different data partitions [39]. A Dryad job contains a set of stages, and each stage consists of an arbitrary set of vertex replicas (each vertex runs on a distinct data partition), while all graph edges of the stage constitute point-to-point communication channels [20]. Similarly, the spout and bolt components of Storm can also be regarded as execution stages [34]. In addition, Spark Streaming decomposes streaming data into a series of short batch jobs based on Spark [40], so Spark Streaming jobs are executed in stages like Spark jobs.

2.2 Spatial dimension: Distribution Characteristics

As Big Data systems are used to process huge amounts of data, a single node cannot complete such big jobs in an acceptable time, so Big Data systems generally adopt large-scale distributed cluster architectures. They mainly utilize distributed file systems or distributed databases to store data, and use parallel programming models or distributed execution methods to process jobs or tasks; that is, the jobs or tasks of Big Data applications are scheduled onto the various nodes in the cluster. The distributed parallel architecture spreads data across multiple servers and can substantially improve data processing speed. We can therefore consider that Big Data applications have distribution characteristics in the spatial dimension.

3 Performance Diagnosis based on Spatio-temporal Correlation

3.1 Spatio-temporal Correlation

Generally speaking, jobs on Big Data systems are divided into several stages, and each stage owns multiple tasks, which can run on multiple distributed nodes. According to the temporal and spatial characteristics of Big Data applications, we propose a spatio-temporal correlation approach that involves with timestamp information and distributed nodes information to process the data in multiple layers.

First, in order to implement the association in the temporal dimension, the time synchronization of the cluster must be guaranteed; here, the Network Time Protocol (NTP) is used to synchronize the clocks in our Big Data systems. Then, combining the runtime information of Big Data applications with the resource utilization on the distributed nodes (such as system-level performance metrics and architecture-level performance metrics), we classify the correlations between applications and resources into three forms, according to the execution information (e.g., start time, finish time, nodes of a task, nodes of a stage and nodes of a job): (1) task-resource correlation, (2) stage-resource correlation and (3) job-resource correlation.

In our methodology, according to the stage characteristics, we mainly use the stage-resource correlation method based on spatio-temporal information for associating performance metrics between different layers.
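
To make the stage-resource correlation concrete, the following minimal Python sketch (our own illustration, not HybridTune's actual implementation; the record layout is an assumption) selects, for each stage and each participating node, the metric samples whose timestamps fall inside the stage's execution window.

# Stage-resource correlation sketch: join stage windows with per-node metric samples.
from collections import defaultdict

def correlate_stage_resources(stages, metric_samples):
    """stages: list of dicts such as {"stage_id": "0", "start": 1456896044,
                                      "finish": 1456896046, "nodes": ["hw073", "hw114"]}.
       metric_samples: dict node -> list of (timestamp, {"cpu_usage": ..., ...}).
       Returns: dict (stage_id, node) -> list of metric dicts sampled during the stage."""
    correlated = defaultdict(list)
    for stage in stages:
        for node in stage["nodes"]:
            for ts, metrics in metric_samples.get(node, []):
                # keep only the samples whose timestamp lies inside the stage window
                if stage["start"] <= ts <= stage["finish"]:
                    correlated[(stage["stage_id"], node)].append(metrics)
    return correlated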

3.2 Feature Selection and Vectorization

From common sense, we know that different performance bottlenecks may lead to different behaviors at the performance data level. So we select the corresponding features and generate vectorized datasets for different performance bottlenecks, such as workload imbalance, data skew, abnormal nodes and outlier metrics (a small sketch of building these datasets follows the list). The vectorized datasets are as follows:

(1) Workload imbalance: We choose the number of tasks on each node to describe the workload behavior of the application, and build the dataset D_task = {(n_k, TN_k) | k = 1, 2, ..., N}. Here, n_k indicates node k, N is the total number of nodes in the cluster, and TN_k indicates the number of tasks on node n_k.

(2) Skew data size: We choose the data size processed by each task on each node, and build the dataset D_size = {(n(t_i), DS(t_i))}. Here, n(t_i) indicates the node on which task t_i executes, and DS(t_i) indicates the data size that task t_i processes.

(3) Uneven data placement: In our implementation, we compare the data locality of different tasks or nodes to determine whether uneven data placement exists, and build the dataset D_locality = {(L(t_i), RT(t_i))}. Here, L(t_i) indicates the data locality of task t_i, and RT(t_i) indicates the runtime of task t_i in the stage.

(4) Abnormal node: We convert the collected performance metrics of each node into a metrics vector V_k, and generate the metrics vector dataset D_node = {V_1, ..., V_N}. Here, V_k = (m(k,1), ..., m(k,p)), where m(k,j) refers to the average value of the j-th collected metric on node k during the stage.

(5) Outlier metrics: We build the matrix dataset D_metrics = {M_1, ..., M_N}, where M_k is a matrix for node k at a stage: its columns correspond to the collected metrics (features) and its rows to the collection times during the stage, that is, each row of the matrix holds the feature values at a particular timestamp during the stage; for example, M_k[t][j] refers to the value of metric j at timestamp t.
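
Continuing the correlation sketch given earlier, the following illustration (with assumed field names, not the tool's code) builds the per-node vectors and matrices of items (4) and (5) from the metric samples correlated to one stage.

# Build V_k (per-metric averages) and M_k (timestamps x metrics) for one stage.
import numpy as np

def vectorize_stage(correlated_samples, metric_names):
    """correlated_samples: dict node -> list of metric dicts sampled during one stage."""
    vectors, matrices = {}, {}
    for node, samples in correlated_samples.items():
        if not samples:
            continue
        rows = [[sample[m] for m in metric_names] for sample in samples]
        matrix = np.array(rows, dtype=float)   # M_k: rows = timestamps, columns = metrics
        matrices[node] = matrix
        vectors[node] = matrix.mean(axis=0)    # V_k: average value of each metric
    return vectors, matrices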

3.3 Bottlenecks Detection and Diagnosis

After the multi-level performance data has been correlated through the spatio-temporal correlation method, we propose automatic bottleneck detection and diagnosis approaches for workload imbalance, data skew, abnormal nodes and outlier metrics. For different types of bottlenecks, we first select different features and vectorize them, and then utilize model-based and data-driven algorithms for detection or diagnosis.

3.3.1 Workload imbalance diagnosis

Occasionally, the number of tasks on certain nodes is much larger or smaller than on others in one stage of a Big Data application, or some slave nodes execute no tasks at all; we regard this uneven distribution of workloads as workload imbalance.

In this section, we use D_task as the input and propose an algorithm to detect whether the task assignment of an application is balanced. Specifically, to quantify workload imbalance in a stage, we first define the following measurement of workload imbalance in stage s. The balance coefficient (BC) refers to the degree of workload imbalance that can be tolerated; AvgTN(s) in Equation 1 indicates the average number of tasks of stage s in the cluster (with N slaves) after removing ultrashort tasks, and TN(s,k) refers to the number of tasks of stage s on node k after removing ultrashort tasks. Since, in some special cases, stages with a number of ultrashort tasks (e.g., failed tasks) running on their nodes greatly affect the judgment of workload imbalance, we decided to eliminate the ultrashort tasks.

AvgTN(s) = (1/N) * sum_{k=1..N} TN(s,k)    (1)

Then, in order to determine whether there is workload imbalance in a stage, we define diff(s,k) = TN(s,k) - AvgTN(s) to indicate the difference between TN(s,k) and AvgTN(s) on node k; if the absolute value |diff(s,k)| is greater than the number of tolerated imbalanced tasks BC * AvgTN(s), we consider that the workload on this node is imbalanced. Moreover, if the number of such imbalanced nodes in a stage is greater than zero, we consider that workload imbalance exists in the stage. In addition, we use TD(s,k) in Equation 2 to represent the tilt degree of the task assignment on node k, which is applied to detect the most imbalanced nodes through ranking.

TD(s,k) = (TN(s,k) - AvgTN(s)) / AvgTN(s)    (2)

Furthermore, assuming the number of stages of job J is S, we define IR(J) to indicate the ratio of imbalanced stages in the job, which is the proportion of unbalanced stages to the total number of stages. If IR(J) > Th_J (here, Th_J is a threshold for the job having workload imbalance, which can be set by users), we consider that the job has workload imbalance. Algorithm 1 describes how to determine workload imbalance in stage s and job J.

Input: TN(s,k), BC, Th_J, N, S
Output: unbalanced stages s, top imbalanced nodes, IR(J)
1:  unbalancedStages = 0;
2:  IR(J) = 0;
3:  for s = 1; s <= S; s++ do
4:     Set imbalancedNodes = 0;
5:     for k = 1; k <= N; k++ do
6:        Calculate AvgTN(s) by Equation 1;
7:        diff(s,k) = TN(s,k) - AvgTN(s); if |diff(s,k)| > BC * AvgTN(s) then imbalancedNodes++;
8:        Calculate TD(s,k) by Equation 2;
9:        Save (n_k, TD(s,k)) into the key-value list KVList_s;
10:     end for
11:     if imbalancedNodes > 0 then
12:        print "Task assignment at stage s is unbalanced";
13:        Sort KVList_s by TD(s,k) from large to small;
14:        Output the top imbalanced nodes;
15:        unbalancedStages++;
16:     end if
17:  end for
18:  IR(J) = unbalancedStages / S;
19:  if IR(J) > Th_J then
20:     Print "Task assignment of job J is unbalanced";
21:  end if
Algorithm 1 Algorithm of Determining Workload Imbalance
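
The following minimal Python sketch (our own illustration, with assumed variable names and illustrative thresholds) mirrors the check of Algorithm 1: a stage is flagged when some node's task count deviates from the stage average by more than BC times the average, and the job is flagged when the fraction of unbalanced stages exceeds Th_J.

# Workload-imbalance check over the per-stage, per-node task counts.
def detect_workload_imbalance(task_counts, bc=0.1, th_j=0.6):
    """task_counts: dict stage_id -> {node: task count after removing ultrashort tasks}."""
    unbalanced_stages = []
    for stage_id, per_node in task_counts.items():
        if not per_node:
            continue
        avg = sum(per_node.values()) / len(per_node)
        if avg == 0:
            continue
        # tilt degree per node, used to rank the most imbalanced nodes
        tilt = {node: (count - avg) / avg for node, count in per_node.items()}
        imbalanced = [n for n, count in per_node.items() if abs(count - avg) > bc * avg]
        if imbalanced:
            top = sorted(imbalanced, key=lambda n: abs(tilt[n]), reverse=True)
            unbalanced_stages.append((stage_id, top))
    ratio = len(unbalanced_stages) / len(task_counts) if task_counts else 0.0
    return ratio > th_j, unbalanced_stages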

3.3.2 Data skew diagnosis

Data skew mainly includes two situations: skew data size and uneven data placement.

Skew data size diagnosis

Input: D_size, Th_D
Output: The tasks or nodes that have skew data size
1:  Calculate the median value of the data sizes, median(DS);
2:  Calculate the average value of the data size that tasks process on node k: AvgDS(k);
3:  if DS(t_i) / median(DS) > Th_D then
4:     Print "the data size of task t_i is skew";
5:  end if
6:  if AvgDS(k) / median(DS) > Th_D then
7:     Print "the node k has data size skew";
8:  end if
Algorithm 2 Skew Data Size Detection Algorithm

From existing experience, a data skew phenomenon exists in a task if the data size it processes is much larger or smaller than that of the other tasks in the stage. A similar phenomenon exists in a node if its average processed data size is far different from that of the other nodes in the stage. Furthermore, data skew of a stage is deemed to exist if data skew is found in any task or node of this stage.

To measure skew data size, we use D_size = {(n(t_i), DS(t_i))} as input, calculate the ratio of each data size DS(t_i) to the median data size median(DS), and then compare the ratio against the threshold Th_D; the result of this comparison tells us whether the data size is skewed. Algorithm 2 describes the skew data size detection algorithm.
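
The following sketch (assumed names, not the tool's code) illustrates this check in Python: a task, or a node's average, is flagged when its processed data size deviates from the stage-wide median by more than a factor Th_D in either direction.

# Skew-data-size detection over (node, data size) pairs of one stage.
from collections import defaultdict
from statistics import mean, median

def detect_data_size_skew(task_sizes, th_d=1.5):
    """task_sizes: list of (node, data_size_in_bytes) pairs, one entry per task."""
    sizes = [size for _, size in task_sizes]
    med = median(sizes)
    skewed_tasks = [(node, size) for node, size in task_sizes
                    if size > th_d * med or size < med / th_d]
    per_node = defaultdict(list)
    for node, size in task_sizes:
        per_node[node].append(size)
    skewed_nodes = [node for node, node_sizes in per_node.items()
                    if mean(node_sizes) > th_d * med or mean(node_sizes) < med / th_d]
    return skewed_tasks, skewed_nodes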

Uneven data placement diagnosis

Input: D_locality = {(L(t_i), RT(t_i))}, the locality weights W_c
Output: The nodes and localities that have uneven data placement
1:  Calculate the mean value AvgRT and the standard deviation of the task runtimes in the stage;
2:  Calculate the distance from each RT(t_i) to AvgRT: d(t_i) = |RT(t_i) - AvgRT|;
3:  for each task t_i do
4:     Summarize the distances d(t_i);
5:     Calculate the mean value of the distances: Avgd;
6:     if d(t_i) > Avgd then
7:        Put t_i into the suspicion group SG;
8:     end if
9:  end for
10: for each task t_i in SG do
11:    if RT(t_i) is classified as an abnormal (much longer) runtime then
12:       Put this t_i into the abnormal-runtime set AR;
13:       Find the corresponding node n_k and locality L(t_i) of this t_i;
14:    end if
15: end for
16: AN(k,c) = 0;
17: for each node n_k do
18:    for each locality category c in the locality categories do
19:       Calculate AN(k,c), the number of abnormal-runtime tasks with locality c on node n_k;
20:       Calculate R(k,c) by Equation 3;
21:       if R(k,c) > 0 then
22:          Output R(k,c) and its corresponding node n_k and locality c, which has uneven data placement.
23:       end if
24:    end for
25: end for
Algorithm 3 Uneven Data Placement Detection based on the Euclidean Distance Outlier Algorithm

Data placement is another critical factor for task runtime and workload imbalance. Because the cluster hardware and workloads differ in a distributed cluster environment, the partitioned data may be placed unevenly, and task runtimes can then be very different. In order to find uneven data placement, we focus on the data locality of Big Data systems.

At the very beginning, we obtain D_locality, and then we classify the task runtimes into two categories, normal and abnormal, using an outlier detection algorithm based on Euclidean distance; the abnormal runtimes are much longer than the normal ones. Next, we determine whether the data placement leads to the abnormal runtimes, by assigning different weights to the distinct locality categories according to the locality priority. Here, c and W_c are used to refer to the categories of locality and the weights of the localities, respectively; for example, the weights of non-local categories (such as RACK_LOCAL and ANY) can be set to 1 and 2, while some weights are set to 0 (a weight of 0 means that these localities are not supposed to cause uneven data placement problems). Meanwhile, we calculate AN(k,c), which refers to the number of abnormal-runtime tasks with locality c on node k. In addition, we define R(k,c) in Equation 3, which indicates the ratio of uneven data placement on a node. When R(k,c) is bigger than 0, uneven data placement occurs. The uneven data placement detection algorithm is based on Euclidean distance, and the whole procedure is demonstrated in Algorithm 3.

R(k,c) = (W_c * AN(k,c)) / TN_k    (3)

3.3.3 Abnormal node diagnosis

Execution behaviors of various tasks at the same stage show striking similarity when these tasks run on a homogeneous cluster; thus, the characteristics of the nodes in a homogeneous cluster at one stage are supposed to be analogous. When a node shows significantly different characteristics compared to other nodes at the same stage, the node is regarded as an abnormal node with a potential bottleneck. In this section, we compute the cosine similarity between nodes to check for abnormal machines.

First of all, we obtain D_node, and then we calculate the cosine similarity between the metrics vector on node k (V_k) and on node l (V_l), as shown in Equation 4. The closer the cosine value is to 1, the smaller the angle between the two vectors and the more similar the nodes.

Sim(V_k, V_l) = (V_k · V_l) / (||V_k|| * ||V_l||)    (4)

For the sake of detecting abnormal nodes, we abandon the pairwise comparison method, which lacks intuitive results; instead, we measure the average similarity AvgSim(k) between each node and all of the remaining nodes, as shown in Equation 5. If AvgSim(k) of node k is smaller than a specified similarity threshold Th_sim, then node k is regarded as an abnormal node. Here, S_k refers to the set of slave nodes without node k.

AvgSim(k) = (1 / (N - 1)) * sum_{l in S_k} Sim(V_k, V_l)    (5)
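
A minimal numpy sketch of this check is given below (our own illustration; the input layout and the threshold value are assumptions): each node is summarized by its vector of per-metric averages for the stage, and a node whose average cosine similarity to all other nodes falls below the threshold is flagged.

# Abnormal-node detection by average pairwise cosine similarity.
import numpy as np

def detect_abnormal_nodes(node_vectors, th_sim=0.6):
    """node_vectors: dict node -> 1-D numpy array of averaged metrics for one stage."""
    names = list(node_vectors)
    vecs = np.stack([node_vectors[name] for name in names]).astype(float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.where(norms == 0, 1.0, norms)   # guard against all-zero vectors
    sim = unit @ unit.T                              # pairwise cosine similarities
    n = len(names)
    avg_sim = (sim.sum(axis=1) - 1.0) / (n - 1)      # exclude the self-similarity (= 1)
    return {names[i]: float(avg_sim[i]) for i in range(n) if avg_sim[i] < th_sim}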

3.3.4 Outlier metrics diagnosis

Generally, if there are abnormal nodes during a stage, then, observed at the micro level, individual metrics on the abnormal nodes are usually in abnormal states; even if some nodes are only subject to interference, the interfered metrics will also behave differently. We therefore regard the metrics whose behavior differs greatly from the behavior of the same metrics on other nodes as outlier metrics. In this section, we compare the differences between the principal component metrics on each node in the cluster, and try to find the root cause of performance bottlenecks by observing the anomalies of these metrics.

Principal Component Analysis

According to our observations, not all performance metrics are closely associated with performance anomalies, for some metrics remain stable even when an outlier appears. Furthermore, different applications and stages are sensitive to different metrics; hence we use principal component analysis (PCA) [36] for relevant metrics selection. In general terms, PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is usually less than or equal to the number of original variables, and the first principal component accounts for the largest possible variability in the data. Before using PCA, we first construct the dataset D_metrics. For each matrix M_k, the principal components are the eigenvectors of its covariance matrix [17].

In addition, the cumulative contribution rate of PCA is used in our evaluation for selecting and determining an appropriate dimension of the eigenspace. The cumulative contribution rate of the top m principal components is eta(m) = sum_{i=1..m} lambda_i / sum_{i=1..p} lambda_i, where lambda_i is the i-th largest eigenvalue of the covariance matrix and p is the total number of metrics.
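
The following minimal sketch (our own illustration) shows this selection rule: keep the smallest number of principal components whose eigenvalues account for at least the chosen cumulative contribution rate (0.95 in our evaluation).

# Choose the eigenspace dimension by the cumulative contribution rate.
import numpy as np

def select_principal_components(metric_matrix, rate=0.95):
    """metric_matrix: rows = collection timestamps of a stage, columns = metrics."""
    centered = metric_matrix - metric_matrix.mean(axis=0)
    cov = np.cov(centered, rowvar=False)            # covariance matrix of the metrics
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the matrix is symmetric
    order = np.argsort(eigvals)[::-1]               # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(cumulative, rate)) + 1  # smallest m reaching the rate
    return eigvecs[:, :m], eigvals[:m]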

Time Series Transformation

We use X(k,j,t), t = 1, ..., T, to represent the time series of principal component metric j during a stage on node k, where T is the length of the time span. Obviously, there is a metric value for every time point. To accomplish the data set reduction for outlier detection, we introduce two different methods of time-frequency transformation for the input data; the details of the two transformations are described as follows:

(1) Mean Value. Comparing the average value of a performance metric on different nodes is a typical approach for time series transformation. If there are substantial differences in the average value of one performance metric between certain nodes and the other nodes, then we believe that this performance metric is a potential key metric. The calculation method is shown in Formula 6.

Mean(k,j) = (1/T) * sum_{t=1..T} X(k,j,t)    (6)

(2) Fast Fourier Transform. The phase difference between sequences in the time domain can lead to poor results when comparing the similarity of the original signals in the time domain; thus, transforming the original data from the time domain to the frequency domain is an ideal way to eliminate the phase-difference problem. The Fast Fourier Transform (FFT) is often utilized to transform original data from the time-space domain to the frequency domain and vice versa, and is an efficient method for computing the Discrete Fourier Transform (DFT) of a sequence. Formula 7 is the DFT equation, where X_r is the r-th coefficient of the DFT, x_t denotes the t-th sample of the time series, which consists of n samples, and i = sqrt(-1) [8]. Moreover, an FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors. As a result, the FFT reduces the complexity of computing the DFT from O(n^2), which arises if one simply applies the definition of the DFT, to O(n log n), where n is the data size.

X_r = sum_{t=0..n-1} x_t * e^(-i*2*pi*r*t/n),  r = 0, 1, ..., n-1    (7)

By using the above two transformations, for all slaves of a stage, we get a reduced data set D_mean or D_fft, which is used for outlier detection. In the subsequent experiments, we will further compare the detection results of the two transformations.

Input: Time series X(k,j,t) of the principal component metrics
Output: Reduced data set D_mean or D_fft
1:  for k = 1; k <= N; k++ do
2:     if the Mean-Value transformation is selected then
3:        Calculate Mean(k,j) by Formula 6 and output it into D_mean;
4:     end if
5:     if the FFT transformation is selected then
6:        Calculate the FFT coefficients of X(k,j,t) by Formula 7 and output them into D_fft;
7:     end if
8:  end for
Algorithm 4 Time Series Transformation
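
A minimal numpy sketch of the two reductions (with an assumed data layout) is given below: per-node mean values, or the magnitudes of the first few DFT coefficients, which ignore phase differences between nodes.

# Reduce one metric's per-node time series before outlier detection.
import numpy as np

def reduce_time_series(series_per_node, method="mean", n_coeffs=8):
    """series_per_node: dict node -> 1-D array of one metric's samples over a stage."""
    reduced = {}
    for node, samples in series_per_node.items():
        x = np.asarray(samples, dtype=float)
        if method == "mean":
            reduced[node] = np.array([x.mean()])
        else:  # "fft": keep the magnitudes of the first few DFT coefficients
            coeffs = np.fft.rfft(x)
            reduced[node] = np.abs(coeffs[:n_coeffs])
    return reduced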
Normalization

Different performance metrics in a cluster normally have different scales and units. For instance, the units of metrics such as cpu_usage and mem_usage are percentages (%), with values between zero and one, whereas the units of metrics such as diskR_band and netS_band may be MB/s, which is quite different from a percentage. To adjust metrics measured on different scales at the same stage to a notionally common scale, normalization [2] is applied in the data preprocessing; with its help, the data can be processed with consistent statistical properties.

In this section, we use linear Min-Max normalization to convert the original metrics into values ranging from 0 to 1. Formula 8 is the transformation expression, where x is a sample in D_mean or D_fft, x_max is the maximum value of the samples, and x_min is the minimum value of the samples. However, the disadvantage of this normalization method is that x_max and x_min might have to be redefined when extra new data is added.

x' = (x - x_min) / (x_max - x_min)    (8)
Outlier detection
Input: Normalized D_mean or D_fft
Output: Outlier metrics
1:  if the orders of magnitude of the samples differ significantly then
2:     Apply a logarithmic conversion to the input data set;
3:     Call the magnitude-based outlier algorithm:
4:     (1) Find the center of mass (the median);
5:     (2) Calculate the distance from each point to the center of mass;
6:     if the distance is greater than the mean distance then
7:        Add the point into the suspicion group SG;
8:     end if
9:     (3) Compute the distance from each point in SG to the center of mass;
10:    if the distance is greater than dmin then
11:       This point in SG is counted as an outlier.
12:    end if
13: else
14:    Call the distance-based outlier algorithm:
15:    (1) Select the maximum and the minimum values as the current points of class A and class B;
16:    (2) Calculate the distances dA and dB from each point to the two current points;
17:    if dA < dB then
18:       Assign the point to class A;
19:    else
20:       Assign the point to class B;
21:    end if
22:    if class B is larger than class A then
23:       (3) Compute the distance from each point a in A to class B (the representative point of class B);
24:       if the distance is greater than dmin then
25:          This point a is counted as an outlier.
26:       end if
27:    else
28:       (3) Compute the distance from each point b in B to class A (the representative point of class A);
29:       if the distance is greater than dmin then
30:          This point b is counted as an outlier.
31:       end if
32:    end if
33: end if
Algorithm 5 An Outlier Detection Algorithm Based on Distance and Magnitude

In statistics, an outlier is an observation point that is distant from other observations [3]. In this section, we propose an unsupervised method combining distance-based and magnitude-based algorithms for outlier detection, to distinguish the metrics that do not fit any expected pattern in the dataset or do not show the expected similarity with the other metrics.

Our distance-based outlier model borrows ideas from the distribution-based approaches and is also suitable for situations where the observed distribution does not fit any standard distribution [23]. Specifically, an object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of all data objects in T lies at a distance greater than D from O. We use the term DB(p, D)-outlier as shorthand for a distance-based outlier detected using parameters p and D. Of course, the choice of the parameters p and D, as well as validity checking (i.e., deciding whether each DB(p, D)-outlier is a real outlier), requires expert knowledge.

Even though our distance-based outlier algorithm with appropriate parameter settings is able to detect most of the outliers in the dataset, some outliers can still be missed. For instance, the normalized mean value of a metric on each node is [hw073: 0.006838, hw106: 0.15604399, hw114: 0.17810599], yet no outlier would be reported when setting D to 0.5 and p to 1; actually, we could consider 0.006838 to be an outlier here. To make our outlier detection model still work in this case, we first apply a logarithm (e.g., log10) to the original data to obtain its order of magnitude. For example, the orders of magnitude on these nodes, [hw073: -2, hw106: 0, hw114: 0], show a significant disparity, so we can suppose that hw073 is a potential outlier; the remaining two nodes are then analyzed through the distance-based detection algorithm.

Algorithm 5 gives the detailed pseudo-code of the outlier detection algorithm based on distance and magnitude. In this algorithm, we predefine the parameter p with a default value of 1, while dmin is adjustable. In addition, we use two methods to compute the representative point of class A or B: one computes the max/min value of the larger class, and the other computes the median value of the larger class. In the subsequent experiments, we will compare the results of outlier detection by the max/min value method and the median value method with different dmin values for the larger class.
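
The sketch below shows our reading of this combined check in Python (variable names and the exact branch conditions are assumptions, not the tool's code): values whose orders of magnitude differ strongly are flagged first via a log10 transform; otherwise the points are split into two classes seeded by the maximum and minimum values, and points of the smaller class farther than dmin from the larger class's representative (its median here) are reported as outliers.

# Combined magnitude- and distance-based outlier check on normalized per-node values.
import math

def detect_outliers(values, d_min=0.5):
    """values: dict name -> normalized scalar (e.g., a normalized per-node mean)."""
    mags = {k: math.floor(math.log10(v)) + 1 if v > 0 else 0 for k, v in values.items()}
    if max(mags.values()) - min(mags.values()) >= 2:   # magnitude branch (assumed rule)
        low = min(mags.values())
        return [k for k, m in mags.items() if m == low]
    # distance branch: seed class A with the maximum value, class B with the minimum value
    hi_seed, lo_seed = max(values.values()), min(values.values())
    cls_a = {k: v for k, v in values.items() if abs(v - hi_seed) < abs(v - lo_seed)}
    cls_b = {k: v for k, v in values.items() if k not in cls_a}
    small, large = (cls_a, cls_b) if len(cls_a) < len(cls_b) else (cls_b, cls_a)
    representative = sorted(large.values())[len(large) // 2]   # median of the larger class
    return [k for k, v in small.items() if abs(v - representative) > d_min]

For the example above, detect_outliers({"hw073": 0.006838, "hw106": 0.15604399, "hw114": 0.17810599}) reports hw073 through the magnitude branch.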

4 HybridTune Implementation

Fig. 2: The workflow of HybridTune.

Based on our general performance diagnosis approach, we have implemented HybridTune, a scalable, lightweight, model-based and data-driven performance diagnosis tool utilizing spatio-temporal correlation. In this section, we describe the implementation of HybridTune; its workflow is shown in Fig. 2.

4.1 Data Collection

We use the data collector of BDTune [31] to gather performance information and application logs from the software stack of Big Data systems at different levels. Specifically, the data collector can collect architecture-level metrics, system-level metrics and application logs. The hardware Performance Monitoring Unit (PMU) and Perf [4] are used for sampling the architecture-level metrics, which include the instruction ratio, instructions per cycle (IPC), cache misses, translation lookaside buffer (TLB) misses, etc. Then, we use Hmonitor [31] to collect raw data from the /proc filesystem, which provides key parameters of system performance (e.g., CPU usage, memory usage, disk I/O bandwidth, network bandwidth). Furthermore, we use log collection tools to collect the application logs (e.g., the history job logs of Spark and Hadoop). Table I lists the detailed collected information.

The format of Application history job logs:
{"Event":"SparkListenerTaskEnd", "Stage ID":0, "Stage Attempt ID":0, "Task Type":"ShuffleMapTask", "Task End Reason":{"Reason":"Success"}, "Task Info":{"Task ID":2, "Index":2, "Attempt":0, "Launch Time":1456896044081, "Executor ID":"0", "Host":"hw073", "Locality":"PROCESS_LOCAL", "Speculative":false, "Getting Result Time":0, "Finish Time":1456896045955, "Failed":false, "Accumulables":[{"ID":1, "Name":"peakExecutionMemory", "Update":"920", "Value":"920", "Internal":true}]}, "Task Metrics":{"Host Name":"hw073", "Executor Deserialize Time":1548, "Executor Run Time":147, "Result Size":1094, "JVM GC Time":0, "Result Serialization Time":21, "Memory Bytes Spilled":0, "Disk Bytes Spilled":0, "Shuffle Write Metrics":{"Shuffle Bytes Written":26, "Shuffle Write Time":4201900, "Shuffle Records Written":1}}}
The format of Raw Architecture-Level Metrics:
timestamp cycle ins L2 miss L2 refe L3 miss L3 refe DTLB miss ITLB miss L1I miss L1I hit MLP MUL ins DIV ins FP ins LOAD ins STORE ins BR ins BR miss unc read unc write
The format of Raw System-Level Metrics:
timestamp usr nice sys idle iowait irq softirq intr ctx procs running blocked mem total free buffers cached swap cached active inactive swaptotal swap free pgin pgout pgfault pgmajfault active conn passive conn rbytes rpackets rerrs rdrop sbytes spackets serrs sdrop read read merged read sectors read time write write merged write sectors write time progress io io time io time weighted
TABLE I: The format of Application Logs, Raw Architecture-level Metrics and System-level Metrics.

4.2 Data Preprocessing

Efficient data preprocessing is very important for performance analysis, since log files and performance data with non-uniform formats are collected from different nodes. Therefore, we parse the performance data and application logs, unify the data formats, preprocess the raw data, and load the data into our MySQL database.

4.2.1 Data Parsing

In order to deal with the different log formats of applications, we establish various log parsing templates compatible with the different application logs. In our implementation, we collect the history job logs of Hadoop and Spark, which are in JSON format and record various information about a job's run. We then parse and extract useful application data from these history job logs (a small parsing sketch follows the list below), for example:

  • Runtime information: The submission time, completion time and runtime of jobs, stages and tasks.

  • Dataflow information: The data flow information between nodes in each stage of jobs, includes the reading and writing data, reading and writing time, input and output data of tasks, and so on.

  • Application configuration information: The configuration parameter information of Hadoop/Spark, etc.

  • Job Runtime Parameters: Job-level parameters and task-level parameters, for example, the "counters" information of Hadoop and the "task_metrics" information of Spark.
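
The following minimal sketch (not HybridTune's parser) extracts runtime and dataflow fields from a Spark history log, whose events are newline-delimited JSON objects such as the SparkListenerTaskEnd record shown in Table I.

# Parse task-level runtime and dataflow information from a Spark history job log.
import json

def parse_spark_history(path):
    tasks = []
    with open(path) as log:
        for line in log:
            event = json.loads(line)
            if event.get("Event") != "SparkListenerTaskEnd":
                continue
            info = event["Task Info"]
            metrics = event.get("Task Metrics", {})
            tasks.append({
                "stage_id": event["Stage ID"],
                "task_id": info["Task ID"],
                "host": info["Host"],
                "locality": info["Locality"],
                "launch_time": info["Launch Time"],    # epoch milliseconds
                "finish_time": info["Finish Time"],
                "runtime_ms": info["Finish Time"] - info["Launch Time"],
                "shuffle_bytes_written": metrics.get("Shuffle Write Metrics", {})
                                                .get("Shuffle Bytes Written", 0),
            })
    return tasks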

Simultaneously, we use the collected raw metrics to calculate the selected performance metrics, which are shown in Table II. Different performance metrics have different calculation methods; for example, cpu_usage can be calculated from the raw metrics usr, nice, sys, idle, iowait, irq and softirq [1] (a sketch is given after Table II). In this section, the calculation formulas are not described in detail.

Layer Metrics Description
System level cpu_usage CPU utilizations
mem_usage Memory usage
ioWaitRatio Percentage of CPU time spent by IO wait
weighted_io Average weighted disk io time
diskR_band Disk Read Bandwidth
diskW_band Disk Write Bandwidth
netS_band Network Send Bandwidth
netR_band Network Receive Bandwidth
Architecture level IPC Instructions Per Cycle
L2_MPKI Misses Per Kilo Instructions of L2 Cache
L3_MPKI Misses Per Kilo Instructions of L3 Cache
L1I_MPKI Misses Per Kilo Instructions of L1I Cache
ITLB_MPKI Misses Per Kilo Instructions of ITLB
DTLB_MPKI Misses Per Kilo Instructions of DTLB
MUL_Ratio MUL operations' percentage
DIV_Ratio DIV operations' percentage
FP_Ratio Floating point operations’ percentage
LOAD_Ratio Ratio of LOAD Operation
STORE_Ratio Store operations’ percentage
BR_Ratio Branch operations’ percentage
TABLE II: The Selected Performance Metrics.
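
As a hedged example of one such derivation (the exact formula used by HybridTune is not spelled out here), cpu_usage can be computed from two consecutive samples of the raw /proc-style counters listed in Table I, following the usual definition of CPU utilization as non-idle time over total time between samples.

# Derive cpu_usage from two consecutive samples of cumulative CPU time counters.
def cpu_usage(prev, curr):
    """prev, curr: dicts of cumulative jiffies, e.g. {"usr": ..., "nice": ..., "sys": ...,
       "idle": ..., "iowait": ..., "irq": ..., "softirq": ...}."""
    fields = ("usr", "nice", "sys", "idle", "iowait", "irq", "softirq")
    delta = {f: curr[f] - prev[f] for f in fields}
    total = sum(delta.values())
    idle = delta["idle"] + delta["iowait"]   # idle and IO-wait time count as not busy
    return (total - idle) / total if total else 0.0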

4.2.2 Data Storage

In addition, we design a tagging mechanism and propose an incremental table approach to match the scalability needs of data aggregation and storage. Specifically, we set corresponding labels for the tables of different applications; for instance, one label value indicates that the tables store the parsed Hadoop log contents, while another indicates that they store the parsed Spark logs. Moreover, we provide public table interfaces and unique table interfaces for different application logs, because the contents of application logs are not always the same. For example, Hadoop and Spark share several public table interfaces, while each of them also has its own unique table interfaces. When collecting Storm logs and storing the parsed data into MySQL, the data preprocessor only needs to add a log parsing template for Storm applications; it then creates the unique tables of Storm, parses the data, preprocesses the raw data, loads the data into the existing public tables and its unique tables, and sets the corresponding label.

4.3 Data & Model Driven Performance Diagnosis

After data preprocessing, the performance diagnosis model undertakes the specific analysis and diagnosis work. Our performance diagnosis model is equipped with a plug-in mechanism, which enables the analysis engine to adapt to different application occasions and different diagnosis methods. Among these plug-ins, some are universal (e.g., statistical analysis of performance metrics), while others are application-specific (e.g., critical path computation of different jobs).

Specifically, the workflow of the performance diagnosis method comprises three main steps: (1) spatio-temporal correlation; (2) feature selection and vectorization with the data-driven approach, or definition of prior knowledge with the model-driven approach; and (3) bottleneck diagnosis. The details are described in Section 3.

5 Evaluations

5.1 Experiment Settings

The Hadoop cluster used in our experiments consists of one master machine and six slave machines; the ResourceManager and NameNode modules are deployed on the master node, while the NodeManager and DataNode modules run only on the slave nodes. Furthermore, we deploy our Spark cluster on the Hadoop YARN framework: the Master module runs on the master machine, and the Workers execute on the slave machines. In our cluster, we use the NTP (Network Time Protocol) service to ensure clock synchronization across nodes. Table III details the server configuration of our cluster. In addition, evaluations of the impact of configurations on system performance are not included in this paper, thus the machines in our cluster are homogeneous, with the same machine configurations and cluster configuration parameters.

Processor Intel(R) Xeon(R) CPU E5645@2.40GHz
Disk 8 Seagate Constellation ES (SATA 6Gb/s)- ST1000NM0011 [1.00 TB]
Memory 32GB per server
Network Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
Kernel Linux Kernel 3.11.10
TABLE III: Server configurations

5.2 Anomaly Simulation

To further determine whether workload imbalance, straggler nodes, data skew and abnormal machine states exist, and to evaluate the effectiveness of our automatic diagnosis tool for performance bottleneck detection, we simulate anomalies using the following methods:

(1) Reduce the computing power of some nodes, for example by disabling a core in a multi-core machine. However, this does not work for certain cores, such as core 0, and disabling too many cores would result in a system crash.

(2) Make the disk storage imbalanced, e.g., by filling the disk space on some nodes with data.

(3) Mix in interference workloads. In our experiment, the Linux stress testing tool Stress [5] is utilized to impose extra load on CPU, memory, IO and disk.

(4) Cache flusher adjustment. We use a cache flusher to load a certain volume of data, sized according to the last-level cache, in order to inject cache anomalies.

5.3 Anomaly detection Evaluation Results

In this section, over 90 programs on 342 stages are tested and executed. We also try a number of workload executions, e.g., WordCount, Grep, Sort and K-means from BigDataBench, and FPGrowth and PrefixSpan from Spark MLlib [26]. Because bottlenecks at the application level (such as workload imbalance and data skew) may be related more to the users' subjective view, we use manually set thresholds for them, for example BC = 0.1, Th_J = 0.6 and Th_D = 1.5 [31]. Here, we mainly evaluate the threshold selection for abnormal nodes and outlier metrics, as well as the effect of outlier detection.

5.3.1 Abnormal node determine

To determine whether node k is an abnormal node, we compare AvgSim(k) (Formula 5 in Section 3.3.3) with the predefined threshold Th_sim. If the similarity value is smaller than the threshold, the node is judged to be an abnormal node. As for how to set the size of the threshold Th_sim, we utilize Formula 9 to measure the proportion of real abnormal nodes among the abnormal nodes detected with a given threshold Th_sim, and the results are shown in Figure 3.

R_abn = (number of real abnormal nodes among the detected ones) / (number of detected abnormal nodes)    (9)
Fig. 3: The results of R_abn.

5.3.2 Selection of principal components

The results of outlier metrics detection are often affected by the dimension of the principal component metrics. To determine an appropriate dimension for the eigenspace, we use the cumulative contribution rate to help select the principal components. First, we define PC_i as the i-th principal component; then we calculate the ratio w(i,j) of a certain performance metric j to principal component PC_i, based on the loadings (eigenvector components) e(i,j), as shown in Formula 10. We discover that each principal component is always dominated by a few particular performance metrics.

w(i,j) = |e(i,j)| / sum_{j'=1..p} |e(i,j')|    (10)

Figure 4 shows the average eigenvector values, the cumulative contribution rate of the performance metrics, and the contribution rate eta_i of each principal component; here eta_i refers to the proportion of variance explained by a single principal component, which is given in Formula 11. Figure 4 also shows the different principal component selections when setting the cumulative contribution rate to 0.9, 0.95 and 0.99. If we choose 0.9 as the cumulative contribution rate, the principal components involve metrics from PC1 to PC10, and among them only the PC9 and PC10 metrics belong to the architecture-level metrics. However, if the cumulative contribution rate equals 0.95, a few more system-level and architecture-level metrics are selected as principal components. Furthermore, when the cumulative contribution rate is 0.99, almost all metrics are contained in the principal components, which makes the cumulative contribution rate meaningless. Above all, we choose 0.95 as the cumulative contribution rate for our principal component selection.

eta_i = lambda_i / sum_{j=1..p} lambda_j    (11)
Fig. 4: The principal component results by PCA analysis.
Fig. 5: The effectiveness of outlier detection when using Mean-Value transformation.
Fig. 6: The effectiveness of outlier detection when using FFT transformation.

5.3.3 The effect of outlier detection

We evaluate the effectiveness of outlier metrics detection through three indicators: Precision, Recall and F-measure [35]. Precision measures the exactness of outlier detection: the higher the precision, the lower the rate of false alarms. Recall indicates the completeness of detection: the higher the recall, the lower the false negative rate. Nevertheless, neither precision alone nor recall alone is capable of evaluating the effectiveness of an anomaly detection method; therefore, we introduce the F-measure (the harmonic mean of Precision and Recall) to measure the outlier metrics detection.

Precision = TP / (TP + FP)    (12)
Recall = TP / (TP + FN)    (13)
F-measure = 2 * Precision * Recall / (Precision + Recall)    (14)

Here, TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.

Figure 5 and Figure 6 show similar results for the outlier metrics detection with the Mean-Value transformation and the FFT transformation, respectively. If we utilize the median value to represent the larger class, then Precision is higher than Recall: as shown in Figure 5 and Figure 6, when dmin equals 0.5 or 0.6, the Precision reaches more than 92%, while the Recall is slightly lower, 67% and 70% respectively. In addition, if the max/min value is used to represent the larger class, Recall is in contrast higher than Precision; the Recall values in Figure 5 and Figure 6 are both over 84% no matter what dmin is. In other words, more outliers can be detected and the false negative rate is much lower. Contrary to the Recall, the Precision is a little low: if dmin is 0.4, 0.5 or 0.6, the Precision ranges from 62% to 81%, which means that many normal metrics are misjudged as outliers, resulting in a higher false positive rate. Additionally, when the max/min value represents the larger class and dmin is 0.6, the F-measure is close to 83%. However, if dmin is set to 0.5, the F-measure is about 80% for both the max/min value and the median value, and setting dmin to 0.4 would further decrease the F-measure.

6 Case Studies

From [31] we know that the causes of performance bottlenecks can be classified into four categories: improper configuration, data skew, abnormal nodes and intra-node resource interference. In this section, based on our detection and diagnosis methods, we share our experiences in tuning and diagnosing the performance of Spark and Hadoop applications, revisiting the three cases reported in [31] and adding a case for a Hadoop job.

6.1 Case-1: Uneven Data Placement

From [31] we know that the S-WordCount job's stage spark stage app-20160630230531-0000 0 has a straggler outlier node hw114 when Th_D = 1.5, and workload imbalance when BC = 0.1. In this paper, we give the automatic diagnostic results in Table IV.

In contrast to other nodes, the data locality of the tasks on hw114 is "ANY", and the average similarity between hw114 and the other slave nodes is 23.57%. This is because uneven data placement exists on hw114: every task on node hw114 needs to read data from other nodes rather than locally, so the task runtimes are relatively longer. We therefore decided to utilize the HDFS (Hadoop Distributed File System) balancer to optimize the data distribution, after which the completion time of this S-WordCount job is reduced from 218 seconds to 167 seconds, approximately 23.21%.

In addition, we also give the automatic diagnostic results of Hadoop's mapStage job 1493084522519 0014 in Table V. We can see that this map stage of Hadoop has no obvious straggler outlier node, but workload imbalance exists. In fact, we checked the task assignments of hw106, hw114, hw062 and hw073, which are 228, 159, 44 and 23 respectively. The tasks assigned to hw073 and hw062 are obviously significantly fewer than those on the other two nodes, and the reason is that the localities of hw073 and hw062 are "RACK_LOCAL", while the localities of hw106 and hw114 are "NODE_LOCAL".

Stage id: spark stage app-20160630230531-0000 0
Detected straggle outlier node: hw114
Detected workload imbalance: hw114
— Data skew diagnosis:
      Skew data size : Null
      Uneven data placement : hw114 [ANY:0.06875]
— Abnormal node diagnosis:
      Similarity analysis : Similarity ([’hw089’, ’hw062’, ’hw073’,
      ’hw103’, ’hw114’, ’hw106’], other nodes): [0.8048, 0.7838,
      0.8242, 0.7870, 0.2359, 0.8171]
      Detected abnormal node: hw114
— Outlier metrics diagnosis:
      Mode: [Mean-Value, median, contribution rate=0.95, dmin=0.5]:
hw114:(mem_usage,ioWaitRatio,diskR_band,netS_band,netR_band)
TABLE IV: The automatic diagnostic results of Spark job for case-1 .
Stage id: mapStage job 1493084522519 0014
Detected straggle outlier node: Null
Detected workload imbalance: hw106, hw073, hw062, hw114
— Data skew diagnosis:
      Skew data size : Null
      Uneven data placement : hw073 [RACK_LOCAL:0.09469],
      hw062 [RACK_LOCAL:0.00379]
— Abnormal node diagnosis:
      Similarity analysis : Similarity ([’hw062’, ’hw073’,’hw114’,
      ’hw106’], other nodes): [0.8571, 0.8143,0.8784, 0.8807]
      Detected abnormal node: Null
— Outlier metrics diagnosis:
      Mode: [Mean-Value, median, contribution rate=0.95, dmin=0.5]: Null
TABLE V: The automatic diagnostic results of Hadoop job for case-1.

6.2 Case-2: Abnormal node

Stage id: spark stage app-20160719212517-0001 2
Detected straggle outlier node: hw089
Detected workload imbalance: hw089
— Data skew diagnosis:
      Skew data size : Null
      Uneven data placement : Null
— Abnormal node diagnosis:
      Similarity analysis : Similarity ([’hw089’, ’hw062’, ’hw073’,
      ’hw103’, ’hw114’, ’hw106’], other nodes): [0.1198, 0.7667,0.8017
      0.7774,0.7995, 0.7974]
      Detected abnormal node: hw089
— Outlier metrics diagnosis:
      Mode: [Mean-Value, median, contribution rate=0.95, dmin=0.5]:
       hw089:(cpu_usage, ioWaitRatio,weighted_io)
TABLE VI: The automatic diagnostic results for case-2.

We also give the automatic diagnostic results of the case spark stage app-20160719212517-0001 2 (reported in the paper [31]) in Table VI. From the automatic diagnostic results we know that hw089 is a straggler outlier node with workload imbalance, and that it is an abnormal node whose average similarity to the other nodes is about 11.98%. Nevertheless, these bottleneck problems are not mainly caused by data skew. We do, however, find some outlier metrics, in particular one abnormal metric: the average weighted_io of hw089 calculated by the Mean-Value method is -4177890.23, which is contrary to common sense. The reason is that the io_time_weighted counter is 4294936240 at 2016-07-19 22:35:49 but 258900 at 2016-07-19 22:35:50, i.e., the counter wraps around near 2^32 and produces a negative difference. To diagnose the root cause, we further inspected the system logs, and found that the disk of hw089 had experienced a high temperature alarm and a raised Raw Read Error Rate [31].

As we can see from case-2, it is necessary to analyze the correlations between the outlier metrics, for some outlier metrics may be caused by other metrics; in case-2, the abnormal metric weighted_io led to the outlier metrics cpu_usage and ioWaitRatio. Furthermore, in order to locate the root cause of an abnormal metric, diagnosis based on system or RAS logs is also needed.

6.3 Case-3: Intra-Node Resource Interference

Stage id: spark stage app-20160703145107-0001 0
Detected straggle outlier node: hw062,hw106
Detected workload imbalance:
— Data Skew diagnosis:
      Skew data size : Null
      Uneven data placement : Null
— Abnormal node diagnosis:
      Similarity analysis : Similarity ([’hw089’, ’hw062’, ’hw073’,
      ’hw103’, ’hw114’, ’hw106’], other nodes): [0.9593, 0.9255,0.9228
      0.9437,0.9513, 0.9432]
      Detected abnormal node: Null
— Outlier metrics diagnosis:
      Mode: [FFT, median, contribution rate=0.95, dmin=0.5]:
       hw062:(L3_MPKI);    hw106:(L3_MPKI)
TABLE VII: The automatic diagnostic results for case-3.

The automatic diagnostic results of the case spark stage app-20160703145107-0001 0 (reported in the paper [31]) are given in Table VII. From the automatic diagnostic results we know that there exist two straggler outlier nodes, hw062 and hw106, which are not caused by data skew or an abnormal node. However, the automatic diagnosis tool of BDTune finds that the average L3_MPKI on these two nodes is larger than on the other nodes, while the node similarity of all nodes in the cluster is about 93.3%.

7 Related work

Performance analysis. There have been many prior studies on building tools to analyze the performance of MapReduce applications. SONATA [16] proposes a correlation-based performance analysis approach for full-system MapReduce optimization; it correlates different phases, tasks and resources to identify critical outliers and recommends optimization suggestions based on embedded rules, i.e., it uses a purely model-based method. HiTune [10] describes a dataflow-driven performance analysis approach that reconstructs the high-level, dataflow-based, distributed and dynamic execution process of each Big Data application. Mochi [32] is a visual, log-analysis based debugging tool that correlates Hadoop's behavior in space, time and volume, and extracts a causal, unified control- and data-flow model of Hadoop across the nodes of a cluster.

Besides the above tools for analyzing MapReduce applications, tools for other platforms have also been proposed. Kay et al. [28] use blocked time analysis to quantify the performance bottlenecks in the Spark framework, and Microsoft uses Artemis [9] to analyze Dryad applications; Artemis provides a plug-in mechanism that applies statistical and machine learning algorithms.

Performance anomaly detection and diagnosis. In general, anomaly detection is an essential part of performance diagnosis for Big Data systems, and anomaly detection techniques can be broadly classified into two groups: data-driven and model-based. Data-driven methods include nearest-neighbor based methods such as distance-based detection [24], k-nearest neighbors [30] and the local outlier factor [7], as well as k-means clustering [37]. In particular, a number of node comparison methods have been adopted for anomaly detection in large-scale systems [38]. For example, Kahuna [33] aims to diagnose performance in MapReduce systems based on the hypothesis that nodes exhibit peer-similarity under fault-free conditions and that some faults result in peer-dissimilarity. Ganesha [29] is a black-box diagnosis technique that examines OS-level metrics to detect and diagnose faults in MapReduce systems, especially faults that manifest asymmetrically at nodes. Eagle [17] is a framework for anomaly detection at eBay, which uses density estimation and PCA algorithms for user behavior analysis. Kasick et al. [21] developed anomaly detection mechanisms in distributed environments by comparing system metrics among nodes. Z. Lan et al. [38] present a practical and scalable anomaly detection method for large-scale systems, based on hierarchical grouping, non-parametric clustering and two-phase majority voting.

Representative model-based techniques include rule-based methods [13][14], support vector machine (SVM) based methods [22], probability models [25], Bayesian network based methods [11], etc. For example, Hadoop Vaidya [13] is a rule-based performance diagnostic tool for MapReduce jobs; although it can provide recommendations based on the analysis of runtime statistics, it cannot facilitate full-system optimization. CloudDiag [27] can efficiently pinpoint fine-grained causes of performance problems through a black-box tracing mechanism and without any domain-specific knowledge. Mantri [6] is a system that monitors tasks and culls outliers based on their causes, and delivers effective mitigation of outliers in MapReduce networks.

Moreover, purely data-driven diagnosis approaches are promising for relatively simple distributed applications, but they are very time-consuming and difficult to apply to complex Big Data systems; model-driven approaches require more detailed prior knowledge to achieve good accuracy and are also difficult to adapt to Big Data scale. Distinguished from the above works, HybridTune is a lightweight and extensible tool that uses a hybrid method combining data-driven and model-driven diagnosis approaches. It provides fine-grained spatio-temporal correlation analysis and diagnosis for different types of bottlenecks. Thanks to the stage-based, multi-level performance data correlation, it can also be easily extended to semi-realtime detection and can improve the timeliness of diagnosis.

8 Conclusion

In this paper, we propose a spatio-temporal correlation analysis approach based on the stage characteristic and distribution characteristic of Big Data applications, which can associate multi-level performance data in a fine-grained manner. On the basis of the correlated data, we define some rules and build suitable datasets through feature selection and vectorization for different performance bottlenecks, such as workload imbalance, data skew, abnormal nodes and outlier metrics. We then utilize model-based and data-driven algorithms for bottleneck detection and diagnosis. In addition, we design and develop a lightweight, extensible tool, HybridTune, and validate its diagnosis effectiveness with BigDataBench on several benchmark experiments, in which it outperforms state-of-the-art methods. Our experiments show that the accuracy of abnormal/outlier detection reaches about 80%. Furthermore, we report several Spark and Hadoop use cases that demonstrate how HybridTune supports users in carrying out performance analysis and diagnosis efficiently on Spark and Hadoop applications. We can see that our model-based and data-driven detection and diagnosis methods based on spatio-temporal correlation data can pinpoint performance bottlenecks and provide performance optimization recommendations for Big Data applications.

Acknowledgments

References

  • [1] Cpu time. https://en.wikipedia.org/wiki/CPU_time.
  • [2] Normalization(statistics). https://en.wikipedia.org/wiki/Normalization_(statistics).
  • [3] Outlier. https://en.wikipedia.org/wiki/Outlier.
  • [4] perf. https://perf.wiki.kernel.org/index.php/Main_Page.
  • [5] Stress tool. http://weather.ou.edu/~apw/projects/stress/.
  • [6] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using mantri. In the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2010.
  • [7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.
  • [8] W.T. Cochran, J.W. Cooley, D.L. Favin, H.D. Helms, R.A. Kaenel, W.W. Langa, G.C. Maling, D.E. Nelson, C.M. Rader, and P.D. Welch. What is the fast fourier transform? Proceedings of the IEEE, 55, 1967.
  • [9] G.F. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt. Hunting for problems with artemis. In Workshop on Analysis of System Logs, 2008.
  • [10] J. Dai, J. Huang, S. Huang, B. Huang, and Y. Liu. Hitune: dataflow-based performance analysis for big data cloud. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, 2011.
  • [11] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In The 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 220–229, 2007.
  • [12] Apache Software Foundation. Apache hadoop. http://hadoop.apache.org/.
  • [13] Apache Software Foundation. Hadoop vaidya. http://hadoop.apache.org/docs/r1.2.1/vaidya.html.
  • [14] X. Fu, R. Ren, S. A. McKeez, J. Zhan, and N. Sun. Digging deeper into cluster system logs for failure prediction and root cause diagnosis. In IEEE International Conference on Cluster Computing(Cluster), pages 103–112, 2014.
  • [15] E. Garduno, S. P. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. Theia: visual signatures for problem diagnosis in large hadoop clusters. In 26th Large Installation System Administration Conference (LISA), 2012.
  • [16] Q. Guo, Y. Li, T. Liu, K. Wang, G. Chen, X. Bao, and W. Tang. Correlation-based performance analysis for full-system mapreduce optimization. In IEEE International Conference on Big Data, 2013.
  • [17] C. Gupta, R. Sinha, and Y. Zhang. Eagle: User profile-based anomaly detection for securing hadoop clusters. In IEEE International Conference on Big Data (Big Data), 2015.
  • [18] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F.B. Cetin, and S. Babu. Starfish: A selftuning system for big data analytics. In 5th Biennial Conference on Innovative Data Systems Research, 2011.
  • [19] IBM. What is Big Data. http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html, 2012. [Online; accessed 13-Jan-2014].
  • [20] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), 2007.
  • [21] M. P. Kasick, J. Tan, R. Gandhi, and P. Narasimhan. Black-box problem diagnosis in parallel file systems. In Proc. 8th USENIX Conf. File Storage Technol, 2010.
  • [22] L. Khan, M. Awad, and B. Thuraisingham. A new intrusion detection system using support vector machines and hierarchical clustering. The International Journal on Very Large Data Bases, 16:507–521, 2007.
  • [23] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In International Conference on Very Large Data Bases(VLDB), 1998.
  • [24] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In 24rd Int. Conf. Very Large Data Bases, pages 392 C–403, 1998.
  • [25] S. Lee and K. G. Shin. Probabilistic diagnosis of multiprocessor systems. ACM Computing Surveys (CSUR), 26:121–139, 1994.
  • [26] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine Learning in Apache Spark, pages 1–7. 2016.
  • [27] H. Mi, H. Wang, Y. Zhou, M. R. Lyu, and H. Cai. Fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems. IEEE Transactions on Parallel and Distributed Systems, 24:1245–1255, 2013.
  • [28] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B. Chun. Making sense of performance in data analytics frameworks. In the 12th USENIX Symposium on Networked Systems Design and Implementation(NSDI), 2015.
  • [29] X. Pan, J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Ganesha: Blackbox diagnosis of mapreduce systems. ACM SIGMETRICS Performance Evaluation Review, 37, 2009.
  • [30] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. ACM SIGMOD Int. Conf. Manage. Data, pages 427–438, 2000.
  • [31] R. Ren, Z. Jia, L. Wang, J. Zhan, and T. Y. Bdtune: Hierarchical correlation-based performance analysis and rule-based diagnosis for big data systems. In IEEE International Conference on Big Data (Big Data), pages 555–562, 2016.
  • [32] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Mochi: Visual log-analysis based tools for debugging hadoop. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2009.
  • [33] J. Tan, X. Pan, E. Marinelli, S. Kavulya, R. Gandhi, and P. Narasimhan. Kahuna: Problem diagnosis for mapreduce-based cloud computing environments. In IEEE Network Operations and Management Symposium(NOMS), 2010.
  • [34] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J.M. Patel, S. Kulkarni, J. Jackson, K. Gade, M.Fu, J.Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm @twitter. In ACM Conference on Management of Data(SIGMOD), 2014.
  • [35] C. Wang, V. Talwar, K. Schwan, and P. Ranganathan. Online detection of utility cloud anomalies using metric distributions. In IEEE Network Operations and Management Symposium (NOMS), pages 96–103, 2010.
  • [36] Wikipedia. Principal component analysis. https://en.wikipedia.org/wiki/Principal_component_analysis. [Online].
  • [37] D. Yu, G. Sheikholeslami, and A. Zhang. Findout: Finding outliers in very large datasets. Knowledge and Information Systems, 4:387–412, 2002.
  • [38] L. Yu and Z. Lan. A scalable, non-parametric method for detecting performance anomaly in large scale computing. IEEE Transactions on Parallel and Distributed Systems, 27:1902–1914, 2016.
  • [39] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
  • [40] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In ACM Symposium on Operating Systems Principles (SOSP), 2013.