CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

11/05/2021
by Yingying Zhang, et al.

As Alibaba's business expands worldwide across various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructure of Alibaba Cloud. However, root cause analysis in these platforms is non-trivial because of their complicated system architecture. In this paper, we propose a root cause analysis framework called CloudRCA, which makes use of heterogeneous multi-source data, including Key Performance Indicators (KPIs), logs, and topology, and extracts important features via state-of-the-art anomaly detection and log analysis techniques. The engineered features are then used in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to infer root causes with high accuracy and efficiency. An ablation study and comprehensive experimental comparisons demonstrate that, compared to existing frameworks, CloudRCA 1) consistently achieves a higher F1-score across different cloud systems; 2) can handle novel types of root causes thanks to the hierarchical structure of KHBN; 3) is more robust to algorithmic configurations; and 4) scales more favorably with data and feature sizes. Experiments also show that a cross-platform transfer learning mechanism can be adopted to further improve accuracy by more than 10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud and deployed on three typical cloud computing platforms: MaxCompute, Realtime Compute, and Hologres. Over the past twelve months, it has reduced the time Site Reliability Engineers (SREs) spend resolving failures by more than 20% and has significantly improved service reliability.


research
05/26/2021

Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications: A Survey

The momentum gained by microservices and cloud-native software architect...
research
03/01/2021

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Availability issues of industrial microservice systems (e.g., drop of su...
research
02/17/2021

FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud

With the promise of reliability in cloud, more enterprises are migrating...
research
08/13/2018

Simple Root Cause Analysis by Separable Likelihoods

Root Cause Analysis for Anomalies is challenging because of the trade-of...
research
04/12/2020

BugDoc: Algorithms to Debug Computational Processes

Data analysis for scientific experiments and enterprises, large-scale si...
research
09/11/2023

PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis

In recent years, the transition to cloud-based platforms in the IT secto...
research
08/01/2021

Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings

For large-scale distributed systems, it's crucial to efficiently diagnos...
