Flock: Accurate network fault localization at scale

05/05/2023
by Vipul Harsh, et al.

Inferring the root cause of failures among thousands of components in a datacenter network is challenging, especially for "gray" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and system that achieves both high accuracy and speed at datacenter scale. Flock uses a probabilistic graphical model (PGM) to achieve high accuracy, coupled with new techniques that dramatically accelerate inference in discrete-valued Bayesian PGMs. Large-scale simulations and experiments in a hardware testbed show that Flock speeds up inference by more than 10,000x compared to past PGM methods and improves accuracy over the best previous datacenter fault localization approaches, reducing inference error by 1.19-11x on the same input telemetry and by 1.2-55x after incorporating passive telemetry. We also prove that Flock's inference is optimal in restricted settings.
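
To make the idea of PGM-based localization from end-to-end measurements concrete, here is a minimal toy sketch in Python. It is not Flock's algorithm: it enumerates every candidate set of failed links by brute force, which is exactly the kind of inference that does not scale. The drop-rate model, the parameters `P_GOOD`, `P_BAD` and `PRIOR`, the four-link topology, and the function names are all illustrative assumptions, not taken from the paper.

```python
from itertools import combinations
from math import log

# All numeric parameters are illustrative assumptions, not values from the paper.
P_GOOD = 1e-4  # per-packet drop rate on a path with no failed link
P_BAD = 1e-2   # per-packet drop rate on a path with at least one failed link
PRIOR = 1e-3   # prior probability that any given link has failed


def log_posterior(hypothesis, flows, links):
    """Unnormalized log posterior of a set of failed links given flow observations."""
    k = len(hypothesis)
    # Independent Bernoulli prior over link failures.
    score = k * log(PRIOR) + (len(links) - k) * log(1 - PRIOR)
    for path, sent, dropped in flows:
        # A flow sees the "bad" drop rate if its path crosses any failed link.
        p = P_BAD if hypothesis & path else P_GOOD
        score += dropped * log(p) + (sent - dropped) * log(1 - p)
    return score


def localize(flows, links, max_failures=2):
    """Return the most probable set of failed links, by exhaustive search."""
    candidates = (
        frozenset(c)
        for k in range(max_failures + 1)
        for c in combinations(sorted(links), k)
    )
    return max(candidates, key=lambda h: log_posterior(h, flows, links))


# Hypothetical telemetry: (links on the flow's path, packets sent, packets dropped).
links = {"A", "B", "C", "D"}
flows = [
    (frozenset({"A", "B"}), 10000, 95),
    (frozenset({"B", "C"}), 10000, 110),
    (frozenset({"A", "D"}), 10000, 1),
    (frozenset({"C", "D"}), 10000, 2),
]
print(sorted(localize(flows, links)))
```

In this toy input, only the flows that cross link `B` show elevated loss, so the search picks `{B}`. The number of candidate sets grows combinatorially with the number of links and with `max_failures`, which is why faster inference matters at datacenter scale.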

research
07/10/2021

Variability Fault Localization: A Benchmark

Software fault localization is one of the most expensive, tedious, and t...
research
07/23/2018

Fault Localization for Declarative Models in Alloy

Fault localization is a popular research topic and many techniques have ...
research
11/18/2019

Configuration-dependent Fault Localization

In a buggy configurable system, configuration-dependent bugs cause the f...
research
07/19/2022

Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems

Fault localization is challenging in an online service system due to its...
research
03/27/2018

An Empirical Study of Fault Localization Families and Their Combinations

The performance of fault localization techniques is critical to their ad...
research
12/21/2017

Fault Localization in Large-Scale Network Policy Deployment

The recent advances in network management automation and Software-Define...
research
04/12/2020

BugDoc: Algorithms to Debug Computational Processes

Data analysis for scientific experiments and enterprises, large-scale si...
