DiagNet: towards a generic, Internet-scale root cause analysis solution

04/07/2020
by Loïck Bonniot, et al.

Diagnosing problems in Internet-scale services remains particularly difficult and costly for both content providers and ISPs. Because the Internet is decentralized, the cause of such problems might lie anywhere between an end-user's device and the service datacenters. Further, the set of possible problems and causes is not known in advance, making it impossible in practice to train a classifier with all combinations of problems, causes and locations. In this paper, we explore how different machine learning techniques can be used for Internet-scale root cause analysis using measurements taken from end-user devices. We show how to build generic models that (i) are agnostic to the underlying network topology, (ii) do not require defining the full set of possible causes during training, and (iii) can be quickly adapted to diagnose new services. Our solution, DiagNet, adapts concepts from image processing research to handle network and system metrics. We evaluate DiagNet with a multi-cloud deployment of online services with injected faults and emulated clients with automated browsers. We demonstrate promising root cause analysis capabilities, with a recall of 73.9%, including for causes introduced only at inference time.


