Assess and Summarize: Improve Outage Understanding with Large Language Models

05/29/2023
by Pengxiang Jin, et al.

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Whenever a cloud outage affects the applications and services hosted on the cloud, users can experience slow response times, connection issues, or total service disruption, resulting in significant negative business impact. Outages usually comprise several concurrent events and source causes, so understanding the context of an outage is a challenging yet crucial first step toward mitigating and resolving it. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study of how on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help engineers with this task. Oasis automatically assesses the impact scope of outages and produces human-readable summaries. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced at Microsoft over three years ago and is now widely adopted, while the outage summarization component has been introduced recently. In this article, we present the results of an empirical evaluation carried out on 18 real-world cloud systems, as well as a human-based evaluation with outage owners. The results show that Oasis can effectively and efficiently summarize outages, and led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams.

research
01/10/2023

Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

Incident management for cloud services is a complex process involving se...
research
04/13/2022

Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems

Alerts are crucial for requesting prompt human intervention upon cloud a...
research
05/19/2023

Evaluating The Impact Of Cloud-Based Microservices Architecture On Application Performance

The study assesses the impact of cloud-based microservices architectures...
research
08/23/2023

Evaluation of Faithfulness Using the Longest Supported Subsequence

As increasingly sophisticated language models emerge, their trustworthin...
research
06/29/2021

Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

Identifying the failure modes of cloud computing systems is a difficult ...
research
06/02/2023

SuperFlow: Performance Testing for Serverless Computing

Serverless computing is an emerging cloud computing paradigm that allows...
research
02/05/2020

Component-aware Orchestration of Cloud-based Enterprise Applications, from TOSCA to Docker and Kubernetes

Enterprise IT is currently facing the challenge of coordinating the mana...
