Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

06/13/2022
by Mingjie Li, et al.

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In online service systems, operators rely on enormous amounts of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save considerable time in failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, namely a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). For application to online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. Its performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25
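
To make the core idea concrete, the following is a minimal sketch of checking whether a metric's distribution, conditioned on its parents, has changed between a fault-free period and a faulty period. It is an illustration under stated assumptions, not the authors' implementation: it assumes linear relationships between a metric and its parents, and the function name `intervention_scores`, the three-metric chain `a -> b -> c`, and the synthetic data are all hypothetical.

```python
import numpy as np


def intervention_scores(normal, faulty, parents):
    """Score each metric by how far it deviates from what its parents predict.

    normal, faulty: dict of metric name -> 1-D array of samples.
    parents: dict of metric name -> list of parent metric names
             (a hypothetical, hand-written causal graph).
    """
    scores = {}
    for metric, pa in parents.items():
        def design(data):
            # Intercept column plus one column per parent metric.
            n = len(data[metric])
            cols = [np.ones(n)] + [np.asarray(data[p], dtype=float) for p in pa]
            return np.column_stack(cols)

        # Fit metric ~ parents on fault-free data only (linearity is assumed).
        x_normal = design(normal)
        y_normal = np.asarray(normal[metric], dtype=float)
        coef, *_ = np.linalg.lstsq(x_normal, y_normal, rcond=None)
        sigma = (y_normal - x_normal @ coef).std() + 1e-9

        # Large standardized residuals in the faulty period suggest that
        # the conditional distribution given the parents has changed.
        y_faulty = np.asarray(faulty[metric], dtype=float)
        resid = y_faulty - design(faulty) @ coef
        scores[metric] = float(np.max(np.abs(resid)) / sigma)
    return scores


# Synthetic example: chain a -> b -> c, with the fault injected at b.
rng = np.random.default_rng(0)
graph = {"a": [], "b": ["a"], "c": ["b"]}


def simulate(n, shift_b):
    a = rng.normal(0.0, 1.0, n)
    b = 2.0 * a + shift_b + rng.normal(0.0, 0.1, n)
    c = b + rng.normal(0.0, 0.1, n)
    return {"a": a, "b": b, "c": c}


normal_data = simulate(200, shift_b=0.0)
faulty_data = simulate(50, shift_b=5.0)

scores = intervention_scores(normal_data, faulty_data, graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch, `c` also shifts during the fault, but only because its parent `b` shifted; its behavior given `b` is unchanged, so it should score far lower than `b`. Conditioning on parents is what separates a root cause indicator from metrics that merely propagate the anomaly.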

research
09/06/2022

CausalRCA: Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications

For microservice applications with detected performance anomalies, local...
research
05/18/2023

Incremental Causal Graph Learning for Online Unsupervised Root Cause Analysis

The task of root cause analysis (RCA) is to identify the root causes of ...
research
03/16/2020

Software-Based Monitoring and Analysis of a USB Host Controller Subject to Electrostatic Discharge

Observing, understanding, and mitigating the effects of failure in embed...
research
04/08/2020

DAG With Omitted Objects Displayed (DAGWOOD): A framework for revealing causal assumptions in DAGs

Directed acyclic graphs (DAGs) are frequently used in epidemiology as a ...
research
04/07/2020

DiagNet: towards a generic, Internet-scale root cause analysis solution

Diagnosing problems in Internet-scale services remains particularly diff...
research
03/25/2020

NVMe and PCIe SSD Monitoring in Hyperscale Data Centers

With low latency, high throughput and enterprise-grade reliability, SSDs...
research
10/16/2016

Fault Detection Engine in Intelligent Predictive Analytics Platform for DCIM

With the advancement of huge data generation and data handling capabilit...
