MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

by Dewei Liu, et al.

Availability issues of industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use service dependency graph based analysis techniques to automatically locate root causes. However, these approaches are limited by their inaccurate detection of service anomalies and their inefficient traversal of the service dependency graph. In this paper, we propose a highly efficient root cause localization approach for availability issues of microservice systems, called MicroHECL. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for the detection of different types of service anomalies (i.e., performance, reliability, traffic). To improve efficiency, we adopt a pruning strategy to eliminate irrelevant service calls in anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used in Alibaba and achieves a top-3 hit ratio of 68%, with root cause localization time reduced from 30 minutes to 5 minutes.
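The pipeline described in the abstract (walk the call graph along anomalous calls, prune the rest, rank the resulting candidates by correlation) can be illustrated with a minimal sketch. This is not the authors' implementation: the service names, the metric series, the rule "a chain's last service is a candidate", and the helpers `pearson`, `candidate_root_causes`, and `rank_candidates` are all hypothetical, and anomaly detection is assumed to have already produced a set of anomalous call edges.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def candidate_root_causes(call_graph, anomalous_edges, entry):
    """Follow only anomalous calls from `entry`; services where a chain ends are candidates."""
    candidates, seen, stack = set(), {entry}, [entry]
    while stack:
        service = stack.pop()
        # Pruning (assumed rule): calls not flagged as anomalous are never expanded.
        next_hops = [callee for callee in call_graph.get(service, [])
                     if (service, callee) in anomalous_edges]
        if not next_hops and service != entry:
            candidates.add(service)
        for callee in next_hops:
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return candidates


def rank_candidates(candidates, service_metrics, business_metric, top_k=3):
    """Rank candidates by |correlation| between their metric and the business metric."""
    scored = [(abs(pearson(service_metrics[s], business_metric)), s) for s in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [service for _, service in scored[:top_k]]


# Hypothetical data: caller -> callees, plus edges already flagged as anomalous.
call_graph = {"gateway": ["order", "search"],
              "order": ["payment", "inventory"],
              "payment": ["db"]}
anomalous_edges = {("gateway", "order"), ("order", "payment"), ("order", "inventory")}

business_metric = [100, 98, 60, 40, 35]            # e.g. orders placed per minute
service_metrics = {"payment": [20, 22, 80, 120, 130],   # e.g. latency in ms
                   "inventory": [15, 14, 16, 15, 17]}

candidates = candidate_root_causes(call_graph, anomalous_edges, "gateway")
ranking = rank_candidates(candidates, service_metrics, business_metric)
print(ranking)
```

In this toy setup the `search` and `db` branches are never visited because their calls are not flagged as anomalous, which is the kind of saving a pruning strategy aims for. The paper's actual detection models and pruning criteria are more involved than this sketch.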





Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications: A Survey

The momentum gained by microservices and cloud-native software architect...

Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings

For large-scale distributed systems, it's crucial to efficiently diagnos...

CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis

In large-scale online services, crucial metrics, a.k.a., key performance...

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

As business of Alibaba expands across the world among various industries...

RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk

Failures and anomalies in large-scale software systems are unavoidable i...

Simple Root Cause Analysis by Separable Likelihoods

Root Cause Analysis for Anomalies is challenging because of the trade-of...

DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services

Large scale cloud services use Key Performance Indicators (KPIs) for tra...