Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster

11/14/2018
by Rui Ren, et al.

In warehouse-scale cloud datacenters, co-locating online services and offline batch jobs is an efficient approach to improving datacenter utilization. To facilitate the understanding of interactions among co-located workloads and their real-world operational demands, Alibaba recently released a cluster usage and co-located workload dataset, which is the first publicly available dataset with precise information about the category of each job. In this paper, we perform an in-depth analysis of the released Alibaba workload dataset from the perspective of anomaly analysis and diagnosis. Through data preprocessing, node similarity analysis based on Dynamic Time Warping (DTW), characteristics analysis of co-located workloads, and anomaly analysis based on iForest, we reveal several insights: (1) The performance discrepancy among machines in Alibaba's production cluster is relatively large, because the distribution and resource utilization of co-located workloads are not balanced. For instance, the resource utilization (especially memory utilization) of batch jobs fluctuates and is not as stable as that of online containers; the reason is that online containers are long-running, more memory-demanding jobs, whereas most batch jobs are short. (2) Based on the distribution of co-located workload instance numbers, the machines can be classified into 8 workload distribution categories, and within the same category most machine resource utilization curves follow similar patterns. (3) In addition to system failures, unreasonable scheduling and workload imbalance are the main causes of anomalies in Alibaba's cluster.
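To make the DTW-based node similarity step more concrete, the following is a minimal sketch of the classic dynamic-programming form of DTW applied to two utilization series. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference cost, and the sample utilization values are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses the textbook O(len(a) * len(b)) recurrence with an
    absolute-difference local cost (an assumption; the paper's
    exact cost function is not given in the abstract).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical CPU utilization samples (percent) for three machines.
machine_a = [20, 25, 60, 62, 30]
machine_b = [20, 25, 25, 60, 62, 30]  # same shape as machine_a, shifted in time
machine_c = [80, 82, 85, 81, 79]      # consistently high utilization

print(dtw_distance(machine_a, machine_b))
print(dtw_distance(machine_a, machine_c))
```

Because DTW allows one series to stretch against the other, two machines whose utilization curves have the same shape but are offset in time get a small distance, while machines with different load levels remain far apart.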
