Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

03/09/2021
by Dominik Scheinert, et al.

Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning to identify and localize anomalies in such systems supports human experts and enables fast mitigation. However, because system components are inter-dependent, anomalies affect not only their origin but also propagate through the distributed system. Taking this into account, we present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies and placement as edges to improve the identification and localization of anomalies. Given a series of metric KPIs, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. In our experiments, we simulate a distributed cloud application deployment and synthetically inject anomalies. The evaluation shows generally good prediction performance for Arvalus and reveals the advantage of D-Arvalus, which incorporates information about system component dependencies.
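
To make the graph view in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy deployment with hypothetical component names, treats dependency and placement edges as undirected, uses a plain neighbour mean as a stand-in for a learned graph transformation, and assumes localization amounts to reading off which nodes were assigned a non-normal class.

```python
# Illustrative sketch only: component names, edges, KPI values and labels
# are hypothetical and are not taken from the paper.

components = ["load_balancer", "web_server", "database"]
dependency_edges = [("load_balancer", "web_server"), ("web_server", "database")]
placement_edges = [("web_server", "database")]  # assumed to share a host


def build_adjacency(nodes, *edge_lists):
    """Build an undirected adjacency map from one or more edge lists."""
    adjacency = {node: set() for node in nodes}
    for edges in edge_lists:
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def aggregate(adjacency, features):
    """Average each node's KPI vector with those of its neighbours.

    This is a generic mean aggregation, used here as a stand-in for a
    learned graph layer; it is not the transformation used by D-Arvalus.
    """
    mixed = {}
    for node, neighbours in adjacency.items():
        group = [features[node]] + [features[n] for n in neighbours]
        mixed[node] = [sum(values) / len(group) for values in zip(*group)]
    return mixed


def localize(node_predictions, normal_label="normal"):
    """Return only the nodes whose predicted class is not the normal one."""
    return {
        node: label
        for node, label in node_predictions.items()
        if label != normal_label
    }


adjacency = build_adjacency(components, dependency_edges, placement_edges)
kpis = {"load_balancer": [0.2], "web_server": [0.9], "database": [0.4]}
mixed_kpis = aggregate(adjacency, kpis)
anomalous = localize(
    {"load_balancer": "normal", "web_server": "cpu_stress", "database": "normal"}
)
```

Here `anomalous` holds only the component flagged with an anomaly class, while `mixed_kpis` shows how a node's representation can pick up information from the components it depends on or is placed with.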


