Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents

05/25/2023
by Yinfang Chen, et al.

Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods rely on manual investigation of data sources such as logs and traces, which is often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system that uses a large language model (LLM) to automate RCA of cloud incidents. RCACopilot matches each incoming incident to a handler based on its alert type, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot on a real-world dataset consisting of a year's worth of incidents from serviceX in companyX. Our evaluation shows that RCACopilot achieves RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in successful use at companyX for over four years.
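
The four stages named in the abstract (handler matching by alert type, diagnostic aggregation, root cause category prediction, explanation) can be pictured with a minimal Python sketch. This is not the authors' implementation: every name here (`Incident`, `HANDLERS`, `disk_handler`, `default_handler`, `build_prompt`, `predict_root_cause`) is hypothetical, the diagnostic strings are placeholders, and the language model is passed in as a plain callable rather than tied to any real API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    incident_id: str
    alert_type: str
    diagnostics: List[str] = field(default_factory=list)


Handler = Callable[[Incident], List[str]]


def disk_handler(incident: Incident) -> List[str]:
    # Placeholder for a real collection step (logs, traces, metrics).
    return [f"{incident.incident_id}: disk usage report (placeholder)"]


def default_handler(incident: Incident) -> List[str]:
    # Fallback used when no handler is registered for the alert type.
    return [f"{incident.incident_id}: generic health snapshot (placeholder)"]


# Stage 1: handlers are looked up by alert type.
HANDLERS: Dict[str, Handler] = {"DiskFull": disk_handler}


def collect_diagnostics(incident: Incident) -> Incident:
    # Stage 2: run the matched handler and attach what it gathered.
    handler = HANDLERS.get(incident.alert_type, default_handler)
    incident.diagnostics = handler(incident)
    return incident


def build_prompt(incident: Incident) -> str:
    # Assemble the text that would be sent to the language model.
    lines = "\n".join(incident.diagnostics)
    return (
        f"Alert type: {incident.alert_type}\n"
        f"Diagnostics:\n{lines}\n"
        "Give the root cause category and a short explanation."
    )


def predict_root_cause(incident: Incident, llm: Callable[[str], str]) -> str:
    # Stages 3 and 4: the model returns a category plus a narrative.
    return llm(build_prompt(incident))
```

Passing the model as a callable keeps the sketch testable with a stub, for example `predict_root_cause(collect_diagnostics(Incident("INC-1", "DiskFull")), lambda prompt: "stub answer")`.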


