Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps

04/21/2022
by Amrita Saha, et al.

Root Cause Analysis (RCA) of a service-disrupting incident is one of the most critical and complex tasks in IT processes, especially for cloud industry leaders like Salesforce. An RCA investigation typically draws on data sources such as application error logs or service call traces. A rich source of root cause information is also hidden in the natural language documentation of past incident investigations written by domain experts. This documentation is generally termed Problem Review Board (PRB) data, and it constitutes a core component of IT Incident Management. Because PRBs are raw and unstructured, however, the root cause knowledge they contain is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely available data source to build an Incident Causation Analysis (ICA) engine, which uses state-of-the-art neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple yet effective retrieval-based RCA for new incidents: given an incident symptom, an Information Retrieval system searches and ranks past incidents and detects likely root causes from them. In this work, we present ICA and the downstream Incident Search and Retrieval-based RCA pipeline, built at Salesforce over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through quantitative benchmarks, qualitative analysis, domain expert validation, and real incident case studies after deployment.
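To make the retrieval-based RCA idea concrete, the following is a minimal sketch and not the ICA pipeline itself. It assumes past incidents have already been reduced to (symptom, root cause) pairs, and it swaps the neural NLP components described above for a plain bag-of-words cosine similarity. The incident records, the `PAST_INCIDENTS` name, and the `rank_root_causes` helper are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical past incidents: (symptom text, documented root cause).
# These records are invented for illustration, not taken from PRB data.
PAST_INCIDENTS = [
    ("login requests timing out for many users", "database connection pool exhausted"),
    ("email delivery delayed across region", "message queue backlog after deploy"),
    ("login page returns errors after release", "misconfigured authentication service"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_root_causes(symptom, incidents, top_k=2):
    """Rank past incidents by similarity to the new symptom.

    Returns up to top_k (score, root_cause) pairs, best match first,
    dropping incidents that share no tokens with the symptom.
    """
    query = Counter(tokenize(symptom))
    scored = [
        (cosine(query, Counter(tokenize(past_symptom))), root_cause)
        for past_symptom, root_cause in incidents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), cause) for score, cause in scored[:top_k] if score > 0]


if __name__ == "__main__":
    for score, cause in rank_root_causes(
        "users cannot login, requests timing out", PAST_INCIDENTS
    ):
        print(score, cause)
```

In this toy setup, the root cause of the most similar past incident is surfaced first. A real system along the lines described above would presumably replace the token-overlap score with learned representations and retrieve from the structured knowledge graph rather than from a flat list.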


