Incremental Causal Graph Learning for Online Unsupervised Root Cause Analysis

05/18/2023
by Dongjie Wang, et al.

The task of root cause analysis (RCA) is to identify the root causes of system faults and failures by analyzing system monitoring data. Efficient RCA can greatly accelerate recovery from system failures and mitigate system damage and financial loss. However, previous research has mostly focused on offline RCA algorithms, which often require the RCA process to be initiated manually, need a significant amount of time and data to train a robust model, and must be retrained from scratch for each new system fault. In this paper, we propose CORAL, a novel online RCA framework that automatically triggers the RCA process and incrementally updates the RCA model. CORAL consists of three components: Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component detects system state transitions automatically and in near real time. To achieve this, we develop an online trigger point detection approach based on multivariate singular spectrum analysis and cumulative sum statistics. To update the RCA model efficiently, we propose an incremental disentangled causal graph learning approach that decouples state-invariant information from state-dependent information. CORAL then applies a random walk with restarts to the updated causal graph to identify root causes accurately. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets, together with case studies, demonstrate the effectiveness and superiority of the proposed framework.
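Two of the building blocks named in the abstract, cumulative sum statistics and a random walk with restarts, can be illustrated with a minimal sketch. This is not CORAL's implementation: the sketch omits multivariate singular spectrum analysis and the causal graph learning step entirely, applies a plain one-sided CUSUM to a single metric, and assumes the walker starts at a KPI node and moves from effects to their candidate causes. The function names, the toy metric values, the graph, and all parameters (`slack`, `threshold`, `restart_prob`) are hypothetical and chosen only for illustration.

```python
def cusum_trigger(series, baseline_mean, slack, threshold):
    """Return the first index where a one-sided CUSUM statistic
    exceeds `threshold`, or None if it never does."""
    s = 0.0
    for t, x in enumerate(series):
        # Accumulate deviations above the baseline, minus a slack term;
        # the statistic is clipped at zero so normal noise does not build up.
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > threshold:
            return t
    return None


def random_walk_with_restart(edges, start, restart_prob=0.3, iters=200):
    """Approximate the stationary visiting probabilities of a walker that
    follows weighted edges and jumps back to `start` with `restart_prob`.

    edges: dict mapping node -> {neighbor: weight}; the walker moves from
    node to neighbor (here: from an effect to a candidate cause).
    """
    nodes = sorted(set(edges) | {m for nbrs in edges.values() for m in nbrs} | {start})
    score = {n: 0.0 for n in nodes}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            nbrs = edges.get(n, {})
            total = sum(nbrs.values())
            if total == 0:
                # Assumption: a node with no outgoing edges sends the walker
                # back to the start node, so probability mass is conserved.
                nxt[start] += (1 - restart_prob) * score[n]
            else:
                for m, w in nbrs.items():
                    nxt[m] += (1 - restart_prob) * score[n] * w / total
        # Restart step: total mass is 1, so restart_prob of it returns to start.
        nxt[start] += restart_prob
        score = nxt
    return score


# Hypothetical metric that shifts upward at index 4.
metric = [0.1, -0.2, 0.0, 0.1, 2.0, 2.2, 1.9]
trigger_index = cusum_trigger(metric, baseline_mean=0.0, slack=0.5, threshold=2.0)

# Hypothetical effect -> cause graph: latency is the KPI node.
effect_to_cause = {
    "latency": {"cpu": 0.8, "net": 0.2},
    "cpu": {"disk": 1.0},
}
scores = random_walk_with_restart(effect_to_cause, start="latency")
ranked = sorted((n for n in scores if n != "latency"),
                key=lambda n: scores[n], reverse=True)
print(trigger_index, ranked)
```

In this toy setup the CUSUM statistic crosses the threshold shortly after the shift, and the ranking orders candidate causes by how often the walker visits them. An online variant would re-run the ranking each time the graph is updated and stop once both the graph and the ranked list stop changing, which mirrors the termination criterion stated in the abstract.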


research
02/03/2023

Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization

In this paper, we propose REASON, a novel framework that enables the aut...
research
06/13/2022

Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Fault diagnosis is critical in many domains, as faults may lead to safet...
research
05/07/2021

An Influence-based Approach for Root Cause Alarm Discovery in Telecom Networks

Alarm root cause analysis is a significant component in the day-to-day t...
research
06/20/2023

PyRCA: A Library for Metric-based Root Cause Analysis

We introduce PyRCA, an open-source Python machine learning library of Ro...
research
02/23/2022

NetRCA: An Effective Network Fault Cause Localization Algorithm

Localizing the root cause of network faults is crucial to network operat...
research
05/13/2021

DataExposer: Exposing Disconnect between Data and Systems

As data is a central component of many modern systems, the cause of a sy...
research
01/31/2023

BALANCE: Bayesian Linear Attribution for Root Cause Localization

Root Cause Analysis (RCA) plays an indispensable role in distributed dat...
