Graph-based Incident Aggregation for Large-Scale Online Service Systems

08/27/2021
by   Zhuangbin Chen, et al.

As online service systems grow in complexity and volume, the way service incidents are managed significantly affects company revenue and user trust. Because failures cascade, a single cloud failure often triggers an overwhelming number of incidents from dependent services and devices. For efficient incident management, related incidents should be aggregated quickly to narrow down the problem scope. To this end, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique incident type in an unsupervised and unified manner, simultaneously encoding the topological and temporal correlations among incidents, so it can be readily used for online incident aggregation. To learn these correlations more accurately, we try to recover the complete scope of each failure's cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The framework is evaluated on real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, the framework has been successfully deployed in industrial practice.
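To make the idea of online aggregation over per-type representation vectors more concrete, here is a minimal sketch. It is not GRLIA's actual algorithm: the abstract does not specify the grouping rule, so this sketch assumes a simple greedy scheme that puts an incident into an existing group when it arrives within a time window of that group's latest incident and their type vectors have high cosine similarity. The names `Incident`, `aggregate`, `sim_threshold`, and `window`, as well as the hand-written vectors, are hypothetical and chosen only for illustration; in the paper the vectors are learned.

```python
import math
from dataclasses import dataclass


@dataclass
class Incident:
    incident_type: str   # hypothetical type label
    timestamp: float     # seconds since some reference point


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate(incidents, embeddings, sim_threshold=0.8, window=600.0):
    """Greedy grouping (an assumed rule, not the paper's): an incident joins
    the first group whose most recent member is close in time and similar in
    type representation; otherwise it starts a new group."""
    groups = []
    for inc in sorted(incidents, key=lambda i: i.timestamp):
        vec = embeddings[inc.incident_type]
        for group in groups:
            last = group[-1]
            close_in_time = inc.timestamp - last.timestamp <= window
            similar = cosine(vec, embeddings[last.incident_type]) >= sim_threshold
            if close_in_time and similar:
                group.append(inc)
                break
        else:
            # No existing group matched, so open a new one.
            groups.append([inc])
    return groups


# Hand-made vectors standing in for learned representations (assumption).
embeddings = {
    "disk_full": [1.0, 0.0],
    "vm_down": [0.9, 0.1],
    "dns_error": [0.0, 1.0],
}
incidents = [
    Incident("disk_full", 0.0),
    Incident("vm_down", 60.0),
    Incident("dns_error", 120.0),
    Incident("disk_full", 5000.0),
]
groups = aggregate(incidents, embeddings)
for g in groups:
    print([(i.incident_type, i.timestamp) for i in g])
```

In this toy input, the first two incidents end up together because their vectors are nearly parallel and they occur a minute apart, while the late `disk_full` incident falls outside the window and starts its own group. Both the threshold and the window would need tuning against real incident data.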

