Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster

In warehouse-scale cloud datacenters, co-locating online services and offline batch jobs is an effective approach to improving datacenter utilization. To better facilitate the understanding of interactions among co-located workloads and their real-world operational demands, Alibaba recently released a cluster usage and co-located workload dataset, which is the first publicly available dataset with precise information about the category of each job. In this paper, we perform a deep analysis of the released Alibaba workload dataset from the perspective of anomaly analysis and diagnosis. Through data preprocessing, node similarity analysis based on Dynamic Time Warping (DTW), co-located workload characteristics analysis and anomaly analysis based on iForest, we reveal several insights, including: (1) The performance discrepancy among machines in Alibaba's production cluster is relatively large, because the distribution and resource utilization of co-located workloads are not balanced. For instance, the resource utilization (especially memory utilization) of batch jobs fluctuates and is not as stable as that of online containers, since online containers are long-running, memory-demanding jobs while most batch jobs are short jobs. (2) Based on the distribution of co-located workload instance numbers, the machines can be classified into 8 workload distribution categories, and most machine resource utilization curves within the same category exhibit similar patterns. (3) In addition to system failures, unreasonable scheduling and workload imbalance are the main causes of anomalies in Alibaba's cluster.

1. Introduction

With the popularity of internet services, cloud datacenters, which contain thousands of machines, have become essential infrastructure. However, there is an IRU-QoS curse dilemma: it is difficult to improve resource utilization (IRU) and guarantee QoS at the same time in the cloud (Liu and Yu, 2018) (Ren et al., 2017). On one hand, in order to guarantee the service quality of internet services, datacenter management systems usually reserve resources, which reduces resource utilization. For example, Gartner and McKinsey reported that global server utilization is very low, only 6% to 12% (Lu et al., 2017). Google reported that the CPU utilization of 20,000 servers in a typical datacenter for online services averaged about 30% during January to March 2013 (Barroso et al., 2009). On the other hand, co-locating online services and offline batch jobs for resource sharing is an efficient approach to improving datacenter utilization, even though it also introduces unpredictable performance variability (Ren et al., 2017). For instance, Alibaba deploys batch jobs and latency-critical online services on the same machines. They use Sigma (Ali, 2018) to schedule online service containers for production jobs, and the Fuxi scheduler (Zhang et al., 2014) to manage batch workloads. To better facilitate the understanding of interactions among the co-located workloads and their real-world operational demands, Alibaba first released a co-located trace dataset (https://github.com/alibaba/clusterdata) in Aug 2017.

For Alibaba’s production cluster traces, recent studies (Lu et al., 2017) (Cheng et al., 2018) (Liu and Yu, 2018) have analyzed the characteristics from the perspectives of imbalance phenomena, co-located workloads (how the co-located workloads interact and impact each other), and the elasticity and plasticity of the semi-containerized cloud. However, these works do not further analyze abnormal nodes in the cluster. Discovering cluster anomalies quickly is very important, as it helps to locate bottlenecks, troubleshoot problems, avoid failures and improve utilization.

In this paper, we perform a deep analysis of the released Alibaba trace dataset (Ali, 2017) from the perspective of anomaly analysis and diagnosis. We first perform raw data preprocessing, including data supplementing, filtering, correlation and aggregation, and finally generate the container-level, batch-level and server-level resource usage data. Then, based on these summary data, we conduct in-depth analysis from the aspects of node similarity, workload characteristics and distribution, and anomalies. From the above analysis, our key findings are summarized as follows:

The performance discrepancy among machines in Alibaba’s co-located workload cluster is relatively large. The purpose of workload co-location is to let batch jobs dynamically use the resources that online services cannot fully utilize. Unavoidably, deploying multiple applications to share resources on the same node causes contention and performance skew. In Alibaba’s cluster, the distribution of co-located workloads is not balanced. Since online containers are long-running, memory-demanding jobs and most batch jobs are short jobs, the resource utilization (especially memory utilization) of batch jobs fluctuates and is not as stable as that of online containers. As a result, the performance fluctuation between different nodes can be high.

Generally, the machine resource utilization curves within the same workload distribution category exhibit similar patterns. Based on the co-located workload distributions, the machines of Alibaba’s cluster can be classified into 8 categories. In particular, the CPU usage and memory usage of most machines belonging to the same category follow similar patterns, while the disk usage of different nodes may vary greatly.

Unreasonable scheduling and workload imbalance are the main causes of anomalies in Alibaba’s cluster. Undoubtedly, system errors or failures make nodes abnormal or unavailable. In addition, unreasonable scheduling strategies of the cluster management systems may result in uneven workload distribution, and resource contention or interference among co-located workloads may also cause abnormal resource utilization. Since the Alibaba cluster co-locates workloads, abnormal phenomena caused by imbalanced workloads and utilization are more common.

2. Background and Methodology

2.1. Trace Overview

In Aug 2017, Alibaba released a publicly accessible trace referred to as the “Alibaba Cluster Trace”. The trace covers a production cluster of about 1.3k machines that run both online services and offline batch jobs over a 12-hour period. The dataset includes six files: server_event.csv, server_usage.csv, batch_instance.csv, batch_task.csv, container_event.csv and container_usage.csv, which can be classified into two categories: 1) resource data, and 2) workload data.

2.1.1. Resource Data

In the Alibaba cluster, Alibaba CMS (cluster management system) provides a practice of semi-containerized co-location: online services run in containers and batch jobs directly run on physical servers (Liu and Yu, 2018). So the dataset includes the resource utilization data on physical machines and containers.

Physical Machine Resource Usage

The resource utilization information of physical machine includes two files: server_event.csv and server_usage.csv.

The file server_event.csv reflects the normalized physical capacity of each machine and the event type (sch, 2017). It gives three dimensions of physical capacity: CPU cores, memory and disk, and each dimension is normalized independently. The three event types are add, softerror and harderror. In total, there are 1313 64-core machines in the Alibaba Cluster Trace, whose machine ids range from 1 to 1313.

The file server_usage.csv reflects the total resource utilization of all workloads (batch tasks, online container instances and the workloads of the operating system) on the physical machines. It records the CPU usage, memory usage, disk usage and the average Linux CPU load over 1/5/15 minutes during the period from 39600s to 82500s (timestamps in the trace are in seconds relative to the start of the trace period; a timestamp of 0 represents an event that occurred before the trace period (Ali, 2017)), and most records are spaced 300s apart. However, the data in server_usage.csv is partially missing; for example, it only records the resource utilization data of 1310 machines.

Container Resource Usage

The resource utilization information of online container also includes two files: container_event.csv and container_usage.csv.

The file container_event.csv records the created online containers and their requested resources, including the assigned CPU cores, memory and disk space. Since the instance is the smallest scheduling unit and runs in a lightweight virtual machine (a Linux container, LXC), each instance can be seen as a container; it can also be regarded as an online service job. From container_event.csv we see that there is just one event type for container instances, which is Create. That is, an online container instance always exists after being created unless it is killed, so it can be considered a long-running workload.

The file container_usage.csv gives the actual resource utilization information of online container instances, such as CPU usage, memory usage, disk usage, average CPU load and cache misses. Most container resource utilization data is also collected from 39600s to 82500s, and the measurement interval is about 300s.

2.1.2. Workload Data

There are two files, batch_task.csv and batch_instance.csv, that describe the batch workloads. In general, a batch job contains multiple tasks, and different tasks execute different computing logic according to the data dependencies. In addition, a task may be executed through multiple instances, which execute exactly the same binary with the same resource request but with different input data (Lu et al., 2017). The file batch_task.csv describes the task execution information of batch jobs, such as the task status and the CPU and memory resources that the tasks require.

The file batch_instance.csv gives the information of batch instances, the smallest scheduling unit of batch workloads. A batch instance may fail due to machine failures or network problems, and each record in this file corresponds to one attempted run. The start and end timestamps can be 0 for some instances: for example, both timestamps are zero when the instance is in the ready or waiting status, and the start time is non-zero but the end time is zero when the instance is in the failed status.

2.2. Our Methodology

Based on the Alibaba Cluster Trace, researchers can study the workload characteristics, analyze the cluster status, design new algorithms to assign workloads, and help to optimize the scheduling strategies between online services and batch jobs to improve throughput while maintaining acceptable service quality. For large-scale clusters, anomaly discovery and diagnosis is also very important, as it helps to locate bottlenecks, troubleshoot problems, avoid failures and improve utilization. Performing anomaly analysis on trace datasets is a common and effective way to do this.

In this paper, we perform a deep analysis of the released Alibaba trace dataset from the distinctive perspective of anomaly analysis and diagnosis. The analysis method is shown in Figure 1. We first perform raw data preprocessing, including data supplementing, filtering, correlation and aggregation, and finally generate the container-level, batch-level and server-level resource usage data. Then, based on these summary data, we conduct in-depth analysis from the aspects of node similarity, workload characteristics and distribution, and anomalies.

Figure 1. The analysis methodology.

2.3. Terminology

To analyze the machine conditions and discover anomalies in the Alibaba cluster, we correlate the multiple files and define the following symbols:

  • : The machine id in the cluster, which ranges from 1 to 1313.

  • : The container instance, whose start time and end time are and , respectively.

  • : The batch task instance, whose start time and end time are and , respectively.

  • : The recording timestamp.

  • : The time interval, here, .

  • : The number of CPU cores that the machine has.

  • : The number of CPU cores requested by the container instance.

  • : The number of CPU cores requested by the batch task instance.

  • : The memory requested by the container instance.

  • : The percentage of the container instance's requested CPUs that is used.

  • : The number of CPU cores that the container instance uses on the machine.

  • : The number of CPU cores that the batch task instance uses on the machine.

In the subsequent analysis, we will summarize the trace data during every time interval, so we also define the following symbols to describe the resource utilization during the time interval :

  • : The set of online container instances on the machine whose life cycles intersect with the time interval.

  • : The set of batch task instances on the machine whose life cycles intersect with the time interval.

  • : The number of container instances running on the machine.

  • : The number of batch task instances running on the machine.

  • : The estimated number of CPU cores used by the container instance or batch task.

  • : The proportion of used CPU resources in the requested CPU resources of the container.

  • : The proportion of used memory in the requested memory resources of the container.

  • : The actual CPU usage of the container instance running on the machine.

  • : The actual CPU usage of the batch task instance running on the machine.

  • : The actual memory usage of the container instance running on the machine.

  • : The actual memory usage of the batch task instance running on the machine.

  • : The real occupation runtime of the batch task instance running on the machine.

  • : The total number of CPU cores that the container instance uses on the machine.

  • : The total number of CPU cores that the batch task uses on the machine.

  • : The runtime of the batch task instance running on the machine.

  • : The total CPU usage of all container instances running on the machine.

  • : The total CPU usage of all batch task instances running on the machine.

  • : The total memory usage of all container instances running on the machine.

  • : The total memory usage of all batch task instances running on the machine.

  • : The average CPU usage of machine .

  • : The average memory usage of machine .

  • : The average disk usage of machine .

3. Data Preprocessing

3.1. Data Supplementing

We find that some files have missing data. For example, there is no resource data for three machines (machine ids 149, 602 and 930) in the file server_usage.csv. The file samples the resource usage of each machine every 300s from 39600s to 82500s, i.e., 144 resource utilization records should be sampled per machine. In fact, we find that on 335 machines the number of recorded resource utilization records is less than 144, which means the resource data of some machines is partially missing, too.
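As a concrete illustration of how such gaps can be found, the following pandas sketch counts records per machine in server_usage.csv; the column names are our assumptions for illustration, not the raw trace schema:

```python
import pandas as pd

# Load server_usage.csv; the column names below are assumed, not taken from the trace schema.
usage = pd.read_csv("server_usage.csv",
                    names=["timestamp", "machine_id", "cpu_util", "mem_util",
                           "disk_util", "load1", "load5", "load15"])

EXPECTED_RECORDS = 144  # one sample every 300s from 39600s to 82500s

counts = usage.groupby("machine_id").size()
incomplete = counts[counts < EXPECTED_RECORDS]          # machines with missing samples
absent = set(range(1, 1314)) - set(counts.index)        # machines with no records at all

print(f"{len(incomplete)} machines have fewer than {EXPECTED_RECORDS} records")
print(f"machines entirely missing from server_usage.csv: {sorted(absent)}")
```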

So we perform data supplementing. For the missing machines 149, 602 and 930, all resource data is filled with 0. Afterwards, linear interpolation is used to replenish the remaining missing records; linear interpolation constructs new data points within the range of a discrete set of known data points (Wikipedia, [n. d.]). Supposing a run of values is missing between two existing values, each missing point is estimated from the straight line connecting those two neighboring values. The detailed procedure is described in Algorithm 1.

1:  Find the nearest existing values before and after the gap
2:  Calculate the gap length between them
3:  Calculate the slope (rake ratio) between them
4:  Initialize the interpolated value with the preceding existing value
5:  for each missing value do
6:      Add the slope to the current value to obtain the interpolated value
7:      Insert the interpolated value into the raw data
8:  end for
Algorithm 1 Linear interpolation method.
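A minimal Python sketch of this interpolation step is shown below; it assumes each per-machine usage series is held as a 1-D NumPy array with NaN marking the missing samples, which is our representation choice rather than something defined by the trace:

```python
import numpy as np

def fill_linear(series: np.ndarray) -> np.ndarray:
    """Fill NaN gaps in a 1-D resource-usage series by linear interpolation."""
    series = series.astype(float).copy()
    idx = np.arange(len(series))
    known = ~np.isnan(series)
    # np.interp draws a straight line between the surrounding known points
    # for every missing index, mirroring Algorithm 1.
    series[~known] = np.interp(idx[~known], idx[known], series[known])
    return series

# Example: a CPU-usage series sampled every 300s with two missing points.
cpu = np.array([0.21, 0.23, np.nan, np.nan, 0.31, 0.30])
print(fill_linear(cpu))  # the two NaNs become ~0.2567 and ~0.2833 (evenly spaced)
```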

3.2. Data Filtering

Some files also contain aberrant data, which needs to be deleted. For example, container_event.csv has 11102 records, while the number of online container instances we calculated is just 11089. Through in-depth analysis, we find that some online container instances are duplicated and have two requested memory values, as shown in Table 1. We can see that, at the same time, there are multiple containers on the same node; if the requested memory of one container is greater than 0.9, the total requested memory of all containers may exceed the machine's memory, which is obviously unreasonable. So we remove the anomalous records whose requested memory is greater than 0.9.

1681 0.0424093 / 0.999963 56 10
2160 0.0424093 / 1 1038 9
2878 0.0424093 / 1 1112 8
3384 0.0848187 / 0.999963 102 8
4467 0.0424093 / 1 331 10
5470 0.0424093 / 0.999963 866 9
6330 0.0424093 / 0.999963 95 8
6549 0.0848187 / 1.00001 1134 9
6639 0.0424093 / 0.999963 19 9
7663 0.0424093 / 1 323 10
7915 0.0424093 / 0.999963 69 12
8476 0.0424093 / 1 323 10
10772 0.0424093 / 0.999963 85 12
Table 1. The data with abnormal requested memory in container_event.csv.
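This filtering step can be expressed with pandas as in the following sketch; the column names are our assumptions for illustration, not the raw trace schema:

```python
import pandas as pd

# Column names below are assumptions for illustration.
cols = ["timestamp", "event", "instance_id", "machine_id",
        "cpu_request", "mem_request", "disk_request"]
events = pd.read_csv("container_event.csv", names=cols)

# Drop the duplicated records whose normalized requested memory exceeds 0.9,
# since such requests would let the per-machine total exceed physical memory.
clean = events[events["mem_request"] <= 0.9]
print(len(events), "->", len(clean), "records after filtering")
```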

3.3. Data Correlation, Aggregation and Generation

In order to compare the resource utilization of online container services, batch job workloads and servers, we aggregate the container-level, batch-level and server-level resource usage statistics by machine id and recording interval, respectively.

3.3.1. Generating container-level resource usage data

The file container_usage.csv samples the resource usage of each container every 300s. So at every time interval, we aggregate all the container-level resource usage statistics by machine id, based on the container-to-machine mapping recorded in container_event.csv (Lu et al., 2017), and generate the container instance data sets. Then, the CPU usage and memory usage occupied by all containers during every interval can be calculated by Algorithm 2.

0:  Input: the container instance records on the machine within the interval
0:  Output: the total CPU usage and memory usage of all containers in the interval
1:  Select the set of online container instances on the machine within the interval
2:  Count the number of container instances
3:  for each online container instance in the set do
4:      Calculate its CPU usage:
5:      CPU usage = requested CPU cores * used CPU percentage
6:      Calculate its memory usage:
7:      memory usage = requested memory * used memory proportion
8:  end for
9:  Sum the CPU usage of all container instances
10:  Sum the memory usage of all container instances
Algorithm 2 Calculating the CPU usage and memory usage of all containers during every interval.
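The aggregation can be sketched with pandas as follows; the column names (instance_id, machine_id, timestamp, cpu_used, mem_used) are our assumptions rather than the raw trace schema:

```python
import pandas as pd

usage = pd.read_csv("container_usage.csv")           # per-container samples every ~300s
events = pd.read_csv("container_event.csv")          # instance -> machine mapping

# Attach the hosting machine to every container sample, then sum per machine
# and per 300s recording interval.
samples = usage.merge(events[["instance_id", "machine_id"]], on="instance_id")
samples["interval"] = (samples["timestamp"] // 300) * 300

container_level = (samples
                   .groupby(["machine_id", "interval"])[["cpu_used", "mem_used"]]
                   .sum()
                   .rename(columns={"cpu_used": "containers_cpu",
                                    "mem_used": "containers_mem"}))
print(container_level.head())
```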

3.3.2. Generating batch-level resource usage data

Cheng et al. (Cheng et al., 2018) calculated the batch job workload resource usage by subtracting the usage of containers from the overall usage of the cluster. However, we think their calculation method is not accurate enough, because besides the resources used by containers and batch tasks, some resources are occupied by OS operations on the machines. So we generate the batch-level resource usage data based on the actual occupation time of batch task instances.

The file batch_instance.csv records the start time, end time and location (machine) of all batch task instances. For each time interval, according to the positions of a batch task's start time and end time, there are four situations, as shown in Figure 2. So we can calculate the actual occupation time of batch task instances during every time interval according to formula (1).

Figure 2. Four situations for the positions of batch tasks' start and end time.

$$ T_{occ} = \min(t_{end},\ t + \Delta t) - \max(t_{start},\ t) \qquad (1) $$

So we derive the CPU usage occupied by all batch tasks based on the task execution time in every time interval. The CPU usage and memory usage occupied by all batch tasks during every interval can be calculated by Algorithm 3.

3.3.3. Generating server-level resource usage data

Similarly, based on the file server_usage.csv, we calculate the average resource utilization for each time interval and each machine, which includes the average CPU usage, memory usage and disk usage.

After generating the above data, a series of analyses can be performed on the basis of the server-level, container-level and batch-level resource usage data, such as node similarity analysis, co-located workload characteristics analysis, and anomaly analysis.

0:  Input: the batch instance records on the machine within the interval
0:  Output: the total CPU usage and memory usage of all batch tasks in the interval
1:  Select the set of batch instances within the interval
2:  Count the number of batch instances
3:  for each batch instance in the set do
4:      Determine which of the four situations in Figure 2 its start and end times fall into
5:      Compute its occupation time within the interval according to formula (1)
6:      Scale its CPU usage and memory usage by the occupation time
7:  end for
8:  Sum the scaled CPU usage of all batch instances
9:  Sum the scaled memory usage of all batch instances
Algorithm 3 Calculating the CPU usage and memory usage of all batch tasks during every interval.
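One possible reading of Algorithm 3 is sketched below in Python; the column names (start_time, end_time, cpu_used, mem_used) and the choice to scale usage by the fraction of the 300s interval each instance occupies are our assumptions for illustration:

```python
import pandas as pd

def batch_usage_in_interval(instances: pd.DataFrame, t0: int, dt: int = 300) -> pd.Series:
    """Approximate batch CPU/memory usage on one machine during [t0, t0+dt)."""
    # Overlap of each instance's lifetime with the interval, as in formula (1);
    # instances that do not intersect the interval contribute zero.
    overlap = (instances["end_time"].clip(upper=t0 + dt)
               - instances["start_time"].clip(lower=t0)).clip(lower=0)
    ratio = overlap / dt
    return pd.Series({
        "batch_cpu": (instances["cpu_used"] * ratio).sum(),
        "batch_mem": (instances["mem_used"] * ratio).sum(),
    })
```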

4. Node Similarity Analysis

Node similarity analysis can be used to discover performance differences between nodes in the cluster and helps to understand the stability of the cluster. In this section, we apply Dynamic Time Warping (DTW) (DTW, 2018) to measure the similarity between server-level resource utilization series.

4.1. Node similarity Analysis based on DTW

4.1.1. Calculating DTW value between two time series

Dynamic Time Warping (DTW) is a distance measure that compares two time series after optimally aligning them. Suppose there are two time series $Q$ and $C$, of length $n$ and $m$ respectively, where:

$$ Q = q_1, q_2, \ldots, q_i, \ldots, q_n \qquad (2) $$
$$ C = c_1, c_2, \ldots, c_j, \ldots, c_m \qquad (3) $$

Then we construct an $n \times m$ matrix whose $(i, j)$ element contains the distance $d(q_i, c_j)$ between the two points $q_i$ and $c_j$ (typically the Euclidean distance is used, so $d(q_i, c_j) = (q_i - c_j)^2$). Each matrix element $(i, j)$ corresponds to the alignment between the points $q_i$ and $c_j$. A warping path $W$ is a contiguous set of matrix elements that defines a mapping between $Q$ and $C$. The $k$-th element of $W$ is defined as $w_k = (i, j)_k$, and:

$$ W = w_1, w_2, \ldots, w_k, \ldots, w_K, \qquad \max(m, n) \le K < m + n - 1 \qquad (4) $$

In addition, we are interested in the path which minimizes the warping cost:

$$ DTW(Q, C) = \min \left\{ \frac{\sqrt{\sum_{k=1}^{K} w_k}}{K} \right\} \qquad (5) $$

The $K$ in the denominator is used to compensate for the fact that warping paths may have different lengths. This path can be found very efficiently using dynamic programming, where we define the cumulative distance $\gamma(i, j)$ as the distance $d(q_i, c_j)$ found in the current cell plus the minimum of the cumulative distances of the adjacent elements (Chu et al., 2002):

$$ \gamma(i, j) = d(q_i, c_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\} \qquad (6) $$
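Equation (6) can be implemented directly with dynamic programming; the following Python sketch is a straightforward (quadratic-time) illustration rather than an optimized DTW library:

```python
import numpy as np

def dtw_distance(q: np.ndarray, c: np.ndarray) -> float:
    """Cumulative DTW distance between two 1-D series, following equation (6)."""
    n, m = len(q), len(c)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2              # squared Euclidean distance
            gamma[i, j] = d + min(gamma[i - 1, j - 1],  # match
                                  gamma[i - 1, j],      # insertion
                                  gamma[i, j - 1])      # deletion
    return float(gamma[n, m])
```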
Figure 3. Sorted DTW values.
Figure 4. DTW ranges.
Figure 5. Resource utilization of the selected standard curves.

4.1.2. Calculating node similarity based on DTW values

During the tracing interval, the resource utilization on each machine forms a resource utilization curve. Based on the supplemented server_usage.csv, we combine the three curves of CPU usage, memory usage and disk usage into one resource utilization curve per machine. All the resource utilization curves constitute a data set. Then, we try to find a standard curve by random sampling, and calculate the DTW values between all other resource curves and the standard curve, which are taken as the node similarity.

0:  Input: the data set of resource utilization curves
0:  Output: the DTW values of all nodes
1:  Randomly extract rows from the data set as the sample set
2:  for each pair of rows in the sample set do
3:      Calculate the DTW value of the pair
4:      Put the calculated DTW value into an array
5:  end for
6:  Take the median value of the array as the standard value
7:  Randomly select a row as the standard curve
8:  for each row in the data set do
9:      Calculate the DTW value between the row and the standard curve
10:  end for
Algorithm 4 Calculating the DTW value of the cluster nodes.
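Building on the dtw_distance helper above, Algorithm 4 might be sketched as follows; the layout of the curve matrix (one row per machine, the three usage series concatenated) and the sample size are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def node_similarity(curves: np.ndarray, n_samples: int = 20):
    """curves: one row per machine, each row the concatenated CPU/memory/disk series."""
    # Steps 1-6: sample rows, compute pairwise DTW values, take the median as the standard value.
    sample_idx = rng.choice(len(curves), size=n_samples, replace=False)
    pair_values = [dtw_distance(curves[i], curves[j])
                   for a, i in enumerate(sample_idx)
                   for j in sample_idx[a + 1:]]
    standard_value = float(np.median(pair_values))

    # Steps 7-10: pick a standard curve and compute every node's DTW distance to it.
    standard_row = rng.choice(len(curves))
    dtw_values = np.array([dtw_distance(curves[standard_row], row) for row in curves])
    return standard_value, standard_row, dtw_values
```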

4.2. The results of node similarity

In the experiments, we calculate the DTW values between the resource utilization curves of all machines and the selected standard curves. For instance, we select 5 curves as the standard curves: standard curves 16, 19, 23, 28 and 36 (here, standard curve 16 denotes the resource utilization curve on machine 16). We then plot the sorted DTW values between all machines and the standard curves in Figure 3. We can see that standard curve 23 is slightly different from the other standard curves, so we think the resource utilization curve on machine 23 may not be suitable as the standard curve.

In addition, based on the average DTW value of standard curves 16, 19, 28 and 36, we calculate that the standard value of DTW is about 1.72, and plot the proportion of different DTW ranges in Figure 4. There are 478 nodes whose DTW values are in the range of 1 to 2, which is the largest proportion. About 50% of the nodes have a DTW value greater than 2, and 7% of the nodes have a DTW value greater than 5. By manually analyzing the resource utilization curves, we find that when the DTW value is greater than 3, there may be a big gap between the resource curve and the selected standard curve. Assuming that 3 is the DTW threshold for judging abnormal nodes, 46% of the nodes would be classified as abnormal. That is, the performance of different nodes is divergent and volatile.

Summary. The performance discrepancy of the machines in Alibaba’s co-located workload cluster is relatively large.


Figure 6. The CPU utilization of online containers, batch tasks and servers.
Figure 7. The memory utilization of online containers, batch tasks and servers.

5. Co-located Workloads Characteristics

5.1. Resource Utilization of Co-located Workloads

The CPU usage and memory usage of online containers, batch tasks and servers are shown in Figure 6 and Figure 7. From these two figures, we see that the resource utilization of the servers is slightly higher than the sum of the online containers' and batch tasks' resource utilization; there is no doubt that the OS takes up some resources. We also observe some spikes in these figures, which implies that some machines may have suddenly high resource utilization at certain times. The ranges of machines 132 to 151 and 418 to 553 lack online containers' resource utilization, which implies that these machine regions host batch jobs only.

(a) CPU usage.
(b) Memory usage.
Figure 8. Box-and-whisker plots showing the CPU and memory usage distributions.

Figure 8 shows box-and-whisker plots of the CPU usage and memory usage distributions. We observe that on the same machine, the aggregated CPU usage of online containers is lower than that of batch tasks, while the aggregated memory usage of online containers is higher than that of batch tasks. This implies that the online container services (long-running jobs) are more memory-demanding.

(a) CPU usage.
(b) Memory usage.
Figure 9. The resource usage heatmap of online containers.
(a) CPU usage.
(b) Memory usage.
Figure 10. The resource usage heatmap of batch tasks.

We also plot the resource usage heatmaps of online containers and batch tasks in Figure 9 and Figure 10. Figure 9 also shows that there are no running online containers in the ranges of machines 132 to 151 and 418 to 553. During the tracing interval, the resource utilization (CPU usage and memory usage) of online containers is relatively stable. Figure 10 shows that there are no running batch tasks from 52800s (14.7h) onwards in some machine regions, such as machines 95 to 127, 275 to 296, 753 to 760, and 830 to 906. Since most batch tasks are short jobs, their resource utilization is not as stable as that of long-running jobs; in particular, the memory usage fluctuates.

Summary. Online container instances and batch tasks do not run on all machines in the cluster. Since online containers are long-running, memory-demanding jobs, their memory usage is relatively stable, while the memory usage of batch jobs fluctuates because most batch tasks are short jobs.

5.2. Distribution Characteristics of Co-located Workloads

(a) Type 1.
(b) Type 2.
(c) Type 3.
(d) Type 4.
(e) Type 5.
(f) Type 6.
(g) Type 7.
(h) Type 8.
Figure 11. Categories of co-located workload distribution.

In Figure 12, we give box-and-whisker plots of the numbers of online containers and batch tasks during every time interval. We observe that most batch task numbers are in the range of 35 to 71, and most online container numbers are in the range of 7 to 10.

(a) Online containers.
(b) Batch tasks.
Figure 12. The box-and-whisker plots about number of online container and batch tasks.

Based on the numbers of batch tasks and online containers on each machine, we classify the distribution of the co-located workloads. First, the non-zero values of the batch task and container numbers are mapped to 1, and the zero values remain unchanged. Second, for each machine, we combine all the mapped batch task numbers and container numbers to form a (143+143)-dimensional vector (the number of recording intervals is 143); that is, it generates a 1313*286 matrix. Finally, the Kmeans (k-m, 2018) algorithm is applied to the generated matrix for classification. All machines in the Alibaba cluster can be classified into 8 workload distribution categories, and the number of machines belonging to each category is shown in Table 2 (a sketch of this clustering step follows Table 2). In detail, the 8 workload distribution categories are:

  • Type 1: Online containers and batch tasks always run co-located on the machines, as shown in Figure 11 (a).

  • Type 2: No workloads run on the machines, as shown in Figure 11 (b).

  • Type 3: Only batch tasks run, as shown in Figure 11 (c).

  • Type 4: Only online container instances run, as shown in Figure 11 (d).

  • Type 5: Batch tasks run only during the first few hours of tracing, as shown in Figure 11 (e).

  • Type 6: Online containers and batch tasks are co-located on the machines, but no batch tasks run during the last few hours of tracing, as shown in Figure 11 (f).

  • Type 7: Online containers and batch tasks are co-located on the machines, but no batch tasks run during a short period of the tracing, as shown in Figure 11 (g).

  • Type 8: Online containers and batch tasks are co-located on the machines, but no batch tasks run during the first few hours of tracing, as shown in Figure 11 (h).

Type 1 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8
956 9 170 11 2 155 9 1
Table 2. The machine number of workload distribution categories.
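A minimal sketch of this classification step with scikit-learn; the pre-computed count matrices batch_counts and container_counts (shape 1313 x 143) are our assumptions about how the aggregated data is stored:

```python
import numpy as np
from sklearn.cluster import KMeans

# batch_counts and container_counts: arrays of shape (1313, 143) holding the
# number of batch tasks / online containers per machine and recording interval.
batch_counts = np.load("batch_counts.npy")          # assumed pre-computed
container_counts = np.load("container_counts.npy")  # assumed pre-computed

# Map non-zero counts to 1 and concatenate into a 1313 x 286 indicator matrix.
features = np.hstack([(batch_counts > 0).astype(int),
                      (container_counts > 0).astype(int)])

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_        # workload distribution category per machine
print(np.bincount(labels))     # machines per category, cf. Table 2
```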
Figure 13. Resource utilization of batch tasks in machine 149.

From Table 2 we see that 72.8% of the nodes host co-located workloads and belong to Type 1. The resource utilization curves of these nodes are shown in Figure 14: the CPU usage, memory usage and disk usage are in the approximate ranges of 20%-30%, 50%-60%, and 40%-60%, respectively. There are no running workloads on the 9 nodes belonging to Type 2, whose machine ids are 372, 478, 481, 550, 602, 924, 930, 983 and 1075. The resource utilization curves of these nodes are shown in Figure 15: the CPU usage is very low on these nodes, about 1%, the average memory usage is 9.6%, and the average disk usage is high, at 30.92%. In addition, we find that machine 149, which lacks recorded resource data, does not belong to Type 2, because some batch jobs actually run on it; the resource utilization of batch tasks on machine 149 is shown in Figure 13. There are 170 nodes belonging to Type 3, including: 66, 132-151, 237, 265, 390, 418-549, 551-553, 973, 982, 987, 1004, 1008, 1028, 1029, 1043, 1055, 1057, 1058, 1081, 1083, whose average CPU usage, memory usage and disk usage are 17.44%, 29.55% and 43.32%, respectively. From Figure 18 we observe that, excluding some peak memory usages, the memory usage and CPU usage curves of different machines are similar, and the disk usage on the same machine is relatively stable with little change. There are 11 nodes that only have online containers (Type 4), including: 161, 171, 556, 763, 791, 800, 851, 943, 949, 1069, 1113. On these machines, the patterns of resource utilization are not obvious; the average CPU usage, memory usage and disk usage are 12.06%, 36.2% and 33.46%, respectively, and the memory usage is higher because the online container services require more memory. There are just 2 nodes belonging to Type 5: 401 and 689. From Figure 17 we see that the average CPU usage and memory usage are 22% and 28.5% during the batch task execution period, while the disk usage is almost stable at 42%. There are 150 nodes belonging to Type 6, including: 88-127, 275-296, 683, 723, 753-760, 830-850, 852-906, 965, 986, 993, 1079, 1096, whose average CPU usage, memory usage and disk usage are 21.29%, 39.88% and 45.31%. And there are 9 nodes belonging to Type 7, including: 619-626 and 794, whose average CPU usage, memory usage and disk usage are 24.74%, 47.43% and 50.04%. From Figures 19 and 20 we also observe that the CPU usage and memory usage curves of different machines are similar, and the disk usage on the same machine is likewise relatively stable with little change. There is only one machine, 618, belonging to Type 8, whose average CPU usage, memory usage and disk usage are 19.58%, 29.66% and 56.58%, as shown in Figure 21.

Summary. Based on the numbers of batch tasks and online containers (called the co-located workload distributions), the machines of Alibaba’s cluster can be classified into 8 workload distribution categories. In addition, for most categories, the CPU usage and memory usage of machines belonging to the same category have similar patterns, while the disk usage of different nodes may vary greatly.

Figure 14. Resource utilization of Type 1.
Figure 15. Resource utilization of Type 2.
Figure 16. Resource utilization of Type 4.
Figure 17. Resource utilization of Type 5.
Figure 18. Resource utilization of Type 3.
Figure 19. Resource utilization of Type 6.
Figure 20. Resource utilization of Type 7.
Figure 21. Resource utilization of Type 8.

6. Anomaly Analysis

Through node similarity based on DTW values (Section 4) or the co-located workload characteristics (Section 5), we can discover abnormal nodes from different perspectives. However, when the standard curve is not selected well, the DTW-based anomaly detection results may be poor, and anomaly analysis based on the co-located workload characteristics is only a qualitative analysis. So, based on the generated associated data, we utilize Isolation Forest (iForest) (Liu et al., 2008) to filter out the outliers, and then analyze the anomalies based on the co-located workload characteristics and machine states.

6.1. Anomaly Discovery based on iForest

We choose 5 dimensions to build the machine-resources matrix. Then we apply the Isolation Forest (iForest) (Liu et al., 2008) algorithm to this machine-resources matrix and output the anomaly scores. iForest (Liu et al., 2008) is a fast, ensemble-based anomaly detection method with linear time complexity and high precision. The smaller a machine's anomaly score is, the higher the probability that it is an abnormal node. The distribution of the machines' anomaly scores is shown in Figure 22. 81 machines have anomaly scores less than 0. We also list the top 25 abnormal nodes in Table 3.
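A sketch of this step using scikit-learn's IsolationForest; the per-machine feature table and its five columns are assumed to have been derived from the aggregated data above, and the concrete column choice here is illustrative rather than the paper's exact five dimensions:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# machine_features: one row per machine with five summary columns; the file and
# column names are assumptions for illustration.
machine_features = pd.read_csv("machine_features.csv", index_col="machine_id")

iforest = IsolationForest(n_estimators=100, random_state=0).fit(machine_features)
scores = iforest.decision_function(machine_features)   # smaller = more anomalous

ranking = pd.Series(scores, index=machine_features.index).sort_values()
print("machines with negative scores:", int((ranking < 0).sum()))
print(ranking.head(25))                                 # cf. Table 3
```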

Figure 22. The anomaly score.
Rank Machine id Anomaly score Category Cause
1 602 -0.170951862 Type 2 No workloads
2 930 -0.170951862 Type 2 Frequent softerror
3 1075 -0.152726265 Type 2 Frequent softerror
4 550 -0.152597894 Type 2 No workloads
5 372 -0.152429127 Type 2 Frequent softerror
6 478 -0.152156505 Type 2 No workloads
7 983 -0.150834127 Type 2 No workloads
8 924 -0.150572048 Type 2 No workloads
9 676 -0.14511185 Type 1 Heavier online services
10 481 -0.142787057 Type 2 No workload
11 679 -0.139451001 Type 1 Heavier online services
12 851 -0.122341159 Type 4 No batch jobs
13 673 -0.119792183 Type 1 Heavier online services
14 993 -0.110451407 Type 6 Unbalanced batch tasks
15 618 -0.092764675 Type 8 Unbalanced batch tasks
16 556 -0.083327088 Type 4 No batch jobs
17 689 -0.082675027 Type 5 Softerror, unbalanced workloads
18 401 -0.082649176 Type 5 Softerror, unbalanced workloads
19 275 -0.078791916 Type 6 Unbalanced batch tasks
20 763 -0.077354718 Type 4 No batch jobs
21 149 -0.072409203 Type 3 No online services
22 1039 -0.072036834 Type 1 Unbalanced workloads with lighter online services
23 800 -0.066261211 Type 4 No batch jobs
24 1069 -0.064646667 Type 4 No batch jobs
25 949 -0.062686912 Type 4 No batch jobs
Table 3. The top 25 abnormal nodes.
Figure 23. The softerror machine.

6.2. Cause Analysis

We analyze the causes of the anomalies. One reason for anomalies is system errors or failures; the softerror status of machines is shown in Figure 23:

(1) Frequent softerrors can make machines unavailable, such as machines 930, 1075 and 372, which have no running jobs.

(2) Softerrors at certain times may cause machine exceptions, which can affect the scheduling and execution of jobs. For example, no online services run on machines 689 and 401, and batch tasks run only during the first few hours of tracing. By checking the machine status, we find that machine 689 has softerrors at timestamps 50623s, 52005s and 52219s, and no batch tasks run after 50400s; machine 401 has softerrors at timestamps 49854s, 50018s, 51325s and 51515s, and no batch tasks run after 49800s either. The reason may be that the cluster management system is unable to continue scheduling and executing new jobs on these machines due to system failures.

The other reason for anomalies is unbalanced scheduling, which results in workload imbalance:

(1) Due to the uneven numbers of online container instances and batch jobs, the imbalance of the co-located workload distribution is an obvious reason for abnormal nodes. For example, some non-faulty machines also belong to Type 2 and have no running jobs; the possible reason is that no tasks are assigned to these nodes under the scheduling policies. On some machines, there are only batch jobs (Type 3) or only online containers (Type 4), with a skewed workload distribution.

(2) Skew in co-located workload resource utilization also results in some abnormal nodes. For instance, four of the abnormal nodes belong to Type 1: machines 673, 676 and 679 have heavier online services (high memory usage), as their numbers of online container instances are 17, 19 and 18, respectively; machine 1039 has a skew between the batch task and online container numbers, as its average number of batch tasks is 71 while its number of online containers is 1.

Summary. In addition to system failures, unreasonable scheduling and workload imbalance are the main causes of anomalies in Alibaba's cluster.

7. Related work

Cluster trace studies. In 2011, Google open-sourced a publicly available cluster trace (goo, [n. d.]), a 29-day trace of over 25 million tasks across 12,500 heterogeneous machines, and several works have analyzed the Google trace from different perspectives. Reiss et al. (Reiss et al., 2012) studied the heterogeneity and dynamicity properties of the Google workloads. Zhang et al. focused on characterizing run-time task resource usages of CPU, memory and disk (Zhang et al., 2011). Liu et al. focused on the frequency and pattern of machine maintenance events, job- and task-level workload behaviors, and how the overall cluster resources are utilized (Liu and Cho, 2012). Di et al. focused on the loads of jobs and machines, and compared the differences between a Google datacenter and other Grid/HPC systems (Di et al., 2012). Different from the Google trace, the Alibaba trace released in 2017 contains information about the co-located container and batch workloads. Lu et al. (Lu et al., 2017) characterized the Alibaba trace to reveal imbalance phenomena in the cloud, such as spatial imbalance, temporal imbalance, and imbalanced resource demands and utilization. Cheng et al. (Cheng et al., 2018) focused on providing a unique and microscopic view of how the co-located workloads interact and impact each other. Liu et al. (Liu and Yu, 2018) revealed that the resource allocation of the Alibaba semi-containerized co-location cluster achieves high elasticity and plasticity. In addition, some works focus on reliability analysis based on cluster traces, such as mining failure patterns (Fu et al., 2012) and failure prediction (Watanabe et al., 2012). Our study focuses on a unique view of node performance differences and anomalies in a co-located workload cluster.

Cluster anomaly analysis. A number of node comparison methods have been adopted for anomaly detection in large-scale clusters (Yu and Lan, 2016). For example, many works use cosine similarity to calculate node similarity in a cluster (Ren et al., 2016). Kahuna (Tan et al., 2010) aims to diagnose performance problems based on node similarity, supposing that nodes exhibit peer-similarity under fault-free conditions and that some faults result in peer-dissimilarity. Kasick et al. (Kasick et al., 2010) developed anomaly detection mechanisms in distributed environments by comparing system metrics among nodes. Eagle (Gupta et al., 2015) is a framework for anomaly detection at eBay, which uses density estimation and PCA algorithms for user behavior analysis.

8. Conclusion

Co-locating online services and offline batch jobs is an effective approach to improving overall resource utilization. However, it also greatly increases the complexity of datacenter resource management. Based on the preprocessed Alibaba co-located workload dataset, we conducted an in-depth analysis from the aspects of node similarity, workload characteristics and distribution, and anomalies. Our analysis reveals several insights: the performance discrepancy among machines in Alibaba’s production cluster is relatively large, because the distribution and resource utilization of co-located workloads are not balanced. For example, the resource utilization (especially memory utilization) of batch jobs fluctuates and is not as stable as that of online containers, because online containers are long-running, memory-demanding jobs while most batch jobs are short jobs. Meanwhile, based on the distribution of co-located workload instance numbers, the machines can be classified into 8 workload distribution categories, and most machine resource utilization curves within the same category exhibit similar patterns. We also use the iForest algorithm to detect abnormal nodes, and find that unreasonable scheduling and workload imbalance are the main causes of anomalies in Alibaba's production cluster.

Acknowledgment

We are very grateful to anonymous reviewers.

References

  • goo ([n. d.]) [n. d.]. Google cluster workload traces. https://github.com/google/cluster-data. ([n. d.]). [Online].
  • Ali (2017) 2017. Alibaba trace. https://github.com/alibaba/clusterdata. (2017). [Online].
  • sch (2017) 2017. The schema description of Alibaba clusterdata. https://github.com/alibaba/clusterdata/blob/master/trace_201708.md. (2017). [Online].
  • DTW (2018) 2018. Dynamic time warping. https://en.wikipedia.org/wiki/Dynamic_time_warping. (2018). [Online].
  • k-m (2018) 2018. k-means clustering. https://en.wikipedia.org/wiki/K-means_clustering. (2018). [Online].
  • Ali (2018) 2018. Maximizing CPU Resource Utilization on Alibaba's Servers. https://102.alibaba.com/detail/?id=61. (2018). [Online].
  • Barroso et al. (2009) L. Barroso, J. Clidaras, and U. Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-scale Machines. (2009).
  • Cheng et al. (2018) Y. Cheng, Z. Chai, and A. Anwar. 2018. Characterizing Co-located Datacenter Workloads: An Alibaba Case Study. arXiv preprint, https://arxiv.org/abs/1808.02919.
  • Chu et al. (2002) S. Chu, E. Keogh, D. Hart, and M. Pazzani. 2002. Iterative Deepening Dynamic Time Warping for Time Series. In Proceedings of SIAM International Conference on Data Mining.
  • Di et al. (2012) S. Di, D. Kondo, and W. Cirne. 2012. Characterization and comparison of cloud versus grid workloads. In IEEE International Conference on Cluster Computing(CLUSTER).
  • Fu et al. (2012) X. Fu, R. Ren, J. Zhan, W. Zhou, Z. Jia, and G. Lu. 2012. LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems. In IEEE 31st Symposium on Reliable Distributed Systems (SRDS).
  • Gupta et al. (2015) C. Gupta, R. Sinha, and Y. Zhang. 2015. Eagle: User profile-based anomaly detection for securing Hadoop clusters. In IEEE International Conference on Big Data (Big Data).
  • Kasick et al. (2010) M. P. Kasick, J. Tan, R. Gandhi, and P. Narasimhan. 2010. Black-box problem diagnosis in parallel file systems. In Proc. 8th USENIX Conf. File Storage Technol.
  • Liu et al. (2008) F. T. Liu, K. M. Ting, and Z. H. Zhou. 2008. Isolation Forest. In Proceedings of the IEEE International Conference on Data Mining.
  • Liu and Yu (2018) Q. Liu and Z. Yu. 2018. The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace. In Proceedings of ACM Symposium on Cloud Computing (SOCC).
  • Liu and Cho (2012) Z. Liu and S. Cho. 2012. Characterizing machines and workloads on a google cluster. In 41st International Conference on Parallel Processing Workshops.
  • Lu et al. (2017) C. Lu, K. Ye, G. Xu, C. Xu, and T. Bai. 2017. Imbalance in the cloud: An analysis on Alibaba cluster trace. In IEEE International Conference on Big Data (Big Data).
  • Reiss et al. (2012) C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In The Third ACM Symposium on Cloud Computing(SoCC).
  • Ren et al. (2016) R. Ren, Z. Jia, L. Wang, J. Zhan, and T. Y. 2016. BDTune: Hierarchical correlation-based performance analysis and rule-based diagnosis for big data systems. In IEEE International Conference on Big Data (Big Data). 555–562.
  • Ren et al. (2017) R. Ren, J. Ma, X. Sui, and Y. Bao. 2017. A Distributed Deadline Propagation Approach to Reduce Long-Tail in Datacenters. Journal of Computer Research and Development 54(7) (2017).
  • Tan et al. (2010) J. Tan, X. Pan, E. Marinelli, S. Kavulya, R. Gandhi, and P. Narasimhan. 2010. Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments. In IEEE Network Operations and Management Symposium(NOMS).
  • Watanabe et al. (2012) Y. Watanabe, H. Otsuka, M. Sonoda, S. Kikuchi, and Y. Matsumoto. 2012. Online failure prediction in cloud datacenters by real-time message pattern learning. In International Conference on Cloud Computing Technology and Science. 504–511.
  • Wikipedia ([n. d.]) Wikipedia. [n. d.]. Interpolation. https://en.wikipedia.org/wiki/Interpolation. ([n. d.]). [Online].
  • Yu and Lan (2016) L. Yu and Z. Lan. 2016. A Scalable, Non-Parametric Method for Detecting Performance Anomaly in Large Scale Computing. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 27 (2016), 1902–1914.
  • Zhang et al. (2011) Q. Zhang, J. L. Hellerstein, and R. Boutaba. 2011. Characterizing task usage shapes in google compute clusters. In Large Scale Distributed Systems and Middleware Workshop(LADIS).
  • Zhang et al. (2014) Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. 2014. Fuxi: A fault-tolerant resource management and job scheduling system at internet scale. In Proceedings of the VLDB Endowment.