CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis

03/30/2022
by Shifu Yan et al.

In large-scale online services, crucial metrics, also known as key performance indicators (KPIs), are monitored periodically to check the services' running status. KPIs are generally aggregated along multiple dimensions and derived through complex calculations over fundamental metrics from the raw data. Once abnormal KPI values are observed, root cause analysis (RCA) can be applied to identify the reasons for the anomalies so that troubleshooting can proceed quickly. Several automatic RCA techniques have recently been proposed to localize the dimensions (or combinations of dimensions) that explain the anomalies. However, their analyses are limited to the data of the abnormal metric and ignore the data of other metrics that may also be related to the anomalies, leading to imprecise or even incorrect root causes. To this end, we propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components: 1) relationship modeling, which uses a graph neural network (GNN) to model, from historical data, the unknown complex calculations among metrics and the aggregation functions among dimensions; 2) root cause localization, which adopts a genetic algorithm to efficiently and effectively dive into the raw data and localize the abnormal dimension(s) once KPI anomalies are detected. Experiments on synthetic datasets, public datasets, and an online production environment demonstrate that CMMD outperforms the baselines. Currently, CMMD is running as an online service in Microsoft Azure.
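The cross-metric idea can be illustrated with a toy sketch. It is not the CMMD implementation: a known formula (failure rate = failures / requests) stands in for the GNN-learned relationship, an exhaustive search over dimension combinations stands in for the genetic algorithm, and the data, dimension names, and helper functions are all hypothetical. Each candidate is scored by how much of the KPI gap closes when the fundamental metrics of its matching leaves are restored to their expected values.

```python
from itertools import product

# Hypothetical leaf-level data: (region, device) -> (failures, requests).
DIMENSIONS = ("region", "device")
expected = {
    ("us", "web"): (10, 1000),
    ("us", "app"): (10, 1000),
    ("eu", "web"): (10, 1000),
    ("eu", "app"): (10, 1000),
}
# Observed values: failures spike only where region == "eu".
observed = {
    ("us", "web"): (10, 1000),
    ("us", "app"): (10, 1000),
    ("eu", "web"): (110, 1000),
    ("eu", "app"): (110, 1000),
}


def kpi(leaves):
    """Derived KPI (assumed known here): sum(failures) / sum(requests)."""
    failures = sum(f for f, _ in leaves.values())
    requests = sum(r for _, r in leaves.values())
    return failures / requests


def matches(key, candidate):
    """A candidate maps a dimension index to the value it requires."""
    return all(key[i] == v for i, v in candidate.items())


def explained_fraction(candidate):
    """Restore the matching leaves to expected values and measure
    what fraction of the KPI gap disappears."""
    repaired = {
        k: (expected[k] if matches(k, candidate) else observed[k])
        for k in observed
    }
    gap = kpi(observed) - kpi(expected)
    return (kpi(observed) - kpi(repaired)) / gap


def candidates():
    """Enumerate every non-empty combination of dimension values
    (None means the dimension is left unconstrained)."""
    values = [
        sorted({k[i] for k in observed}) + [None]
        for i in range(len(DIMENSIONS))
    ]
    for combo in product(*values):
        cand = {i: v for i, v in enumerate(combo) if v is not None}
        if cand:
            yield cand


def localize():
    """Pick the candidate explaining the most; prefer fewer dimensions on ties."""
    best = max(
        candidates(),
        key=lambda c: (round(explained_fraction(c), 6), -len(c)),
    )
    return {DIMENSIONS[i]: v for i, v in best.items()}


print(localize())
```

Because the score is computed on the derived KPI after repairing the fundamental metrics, a candidate is credited for anomalies in any metric that feeds the KPI, not only the one being monitored. Exhaustive enumeration grows exponentially with the number of dimensions, which is the kind of search space a heuristic such as a genetic algorithm is meant to handle.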


research
05/05/2023

Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems

Localizing root causes for multi-dimensional data is critical to ensure ...
research
09/06/2022

CausalRCA: Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications

For microservice applications with detected performance anomalies, local...
research
05/20/2022

RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk

Failures and anomalies in large-scale software systems are unavoidable i...
research
08/20/2023

Demystifying the Performance of Data Transfers in High-Performance Research Networks

High-speed research networks are built to meet the ever-increasing needs...
research
03/01/2021

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Availability issues of industrial microservice systems (e.g., drop of su...
research
04/07/2020

DiagNet: towards a generic, Internet-scale root cause analysis solution

Diagnosing problems in Internet-scale services remains particularly diff...
research
07/18/2022

PerfCE: Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis

Debugging performance anomalies in real-world databases is challenging. ...
