Scalable Statistical Root Cause Analysis on App Telemetry

10/20/2020
by Vijayaraghavan Murali, et al.

Despite engineering workflows that aim to prevent buggy code from being deployed, bugs still make their way into the Facebook app. When symptoms of these bugs, such as user-submitted reports and automatically captured crashes, are reported, finding their root causes is an important step in resolving them. However, at Facebook's scale of billions of users, a single bug can manifest as several different symptoms depending on the user and execution environments in which the software is deployed. Root cause analysis (RCA) therefore requires tedious manual investigation and domain expertise to extract the common patterns observed in groups of reports and use them for debugging. In this paper, we propose Minesweeper, a technique for RCA that moves towards automatically identifying the root cause of bugs from their symptoms. The method is based on two key aspects: (i) a scalable algorithm to efficiently mine patterns from telemetric information that is collected along with the reports, and (ii) statistical notions of precision and recall of patterns that help point towards root causes. We evaluate Minesweeper on its scalability and its effectiveness in finding root causes from symptoms on real-world bug and crash reports from Facebook's apps. Our evaluation demonstrates that Minesweeper can perform RCA for tens of thousands of reports in less than 3 minutes, and is more than 85
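
As a rough illustration of what "precision and recall of patterns" could mean in this setting, the sketch below scores a candidate pattern against two groups of reports. It is not the paper's algorithm: the abstract does not give its definitions, so the code assumes the standard ones (precision is the share of pattern-matching reports that come from the buggy group; recall is the share of buggy-group reports that match the pattern). The function name `precision_recall`, the set-of-strings report encoding, and all sample data are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumption: each report is modeled as a set of telemetry items
# (events, device attributes). The real telemetry format is not
# described in the abstract.
Report = FrozenSet[str]


def precision_recall(
    pattern: Report,
    test_reports: List[Report],
    control_reports: List[Report],
) -> Tuple[float, float]:
    """Score a pattern using the standard precision/recall definitions.

    test_reports: reports that exhibit the symptom under investigation.
    control_reports: reports that do not exhibit it.
    A report matches when it contains every item of the pattern.
    """
    test_hits = sum(1 for r in test_reports if pattern <= r)
    control_hits = sum(1 for r in control_reports if pattern <= r)
    total_hits = test_hits + control_hits

    # Guard against empty inputs instead of dividing by zero.
    precision = test_hits / total_hits if total_hits else 0.0
    recall = test_hits / len(test_reports) if test_reports else 0.0
    return precision, recall


# Hypothetical sample data, for illustration only.
test_reports = [
    frozenset({"os=android_10", "event=open_camera", "event=tap_share"}),
    frozenset({"os=android_10", "event=open_camera"}),
    frozenset({"os=android_9", "event=open_camera", "event=tap_share"}),
    frozenset({"os=android_10", "event=scroll_feed"}),
]
control_reports = [
    frozenset({"os=android_10", "event=scroll_feed"}),
    frozenset({"os=android_9", "event=open_camera"}),
    frozenset({"os=android_10", "event=tap_share"}),
    frozenset({"os=android_9", "event=scroll_feed"}),
]

pattern = frozenset({"event=open_camera"})
precision, recall = precision_recall(pattern, test_reports, control_reports)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed definitions, a pattern that scores high on both measures appears in most symptomatic reports and rarely elsewhere, which is the kind of signal that would point an engineer towards a root cause.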


research
03/03/2021

Root cause prediction based on bug reports

This paper proposes a supervised machine learning approach for predictin...
research
03/09/2023

RCABench: Open Benchmarking Platform for Root Cause Analysis

Fuzzing has contributed to automatically identifying bugs and vulnerabil...
research
11/12/2019

Debugging Crashes using Continuous Contrast Set Mining

Facebook operates a family of services used by over two billion people d...
research
05/13/2021

DataExposer: Exposing Disconnect between Data and Systems

As data is a central component of many modern systems, the cause of a sy...
research
05/13/2022

Automatic Root Cause Quantification for Missing Edges in JavaScript Call Graphs (Extended Version)

Building sound and precise static call graphs for real-world JavaScript ...
research
02/08/2021

Feature Engineering for Scalable Application-Level Post-Silicon Debugging

We present systematic and efficient solutions for both observability enh...
research
10/10/2018

On the Refinement of Spreadsheet Smells by means of Structure Information

Spreadsheet users are often unaware of the risks imposed by poorly desig...
