Scalable Statistical Root Cause Analysis on App Telemetry

by Vijayaraghavan Murali, et al.

Despite engineering workflows that aim to prevent buggy code from being deployed, bugs still make their way into the Facebook app. When symptoms of these bugs, such as user-submitted reports and automatically captured crashes, are reported, finding their root causes is an important step in resolving them. However, at Facebook's scale of billions of users, a single bug can manifest as several different symptoms, depending on the various user and execution environments in which the software is deployed. Root cause analysis (RCA) therefore requires tedious manual investigation and domain expertise to extract common patterns observed in groups of reports and to use them for debugging. In this paper, we propose Minesweeper, a technique for RCA that moves towards automatically identifying the root cause of bugs from their symptoms. The method is based on two key aspects: (i) a scalable algorithm to efficiently mine patterns from telemetric information that is collected along with the reports, and (ii) statistical notions of precision and recall of patterns that help point towards root causes. We evaluate Minesweeper on its scalability and effectiveness in finding root causes from symptoms on real-world bug and crash reports from Facebook's apps. Our evaluation demonstrates that Minesweeper can perform RCA for tens of thousands of reports in less than 3 minutes, and is more than 85
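The abstract names precision and recall of patterns as the statistical signal but does not define them here. The sketch below assumes one common reading: a pattern is a set of telemetry attribute=value items; its recall is the share of symptom reports that contain it, and its precision is the share of all reports containing it that are symptom reports. The brute-force enumeration is a stand-in for illustration only and is not Minesweeper's scalable mining algorithm; the function names, the attribute=value encoding, and the thresholds are all hypothetical.

```python
from itertools import combinations


def support(pattern, reports):
    """Count the reports that contain every item of the pattern."""
    return sum(1 for r in reports if pattern <= r)


def precision_recall(pattern, symptom, baseline):
    """Assumed definitions (hypothetical, for illustration):
    precision = matching symptom reports / all matching reports
    recall    = matching symptom reports / all symptom reports
    """
    hits = support(pattern, symptom)
    total = hits + support(pattern, baseline)
    precision = hits / total if total else 0.0
    recall = hits / len(symptom) if symptom else 0.0
    return precision, recall


def mine_patterns(symptom, baseline, min_recall=0.5, max_size=2):
    """Naive enumeration (not the paper's algorithm): try every item set of
    up to max_size items seen in symptom reports, keep those with recall of
    at least min_recall, and rank them by (precision, recall), best first."""
    items = sorted(set().union(*symptom))
    results = []
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            p, r = precision_recall(pattern, symptom, baseline)
            if r >= min_recall:
                results.append((pattern, p, r))
    results.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return results
```

For example, if every symptom report carries both `os=android` and `version=301` while no baseline report does, that pair scores a precision and recall of 1.0 and ranks first, whereas `os=android` alone has full recall but lower precision because it also appears in unrelated reports. Ranking by both measures is what lets a narrow, distinguishing pattern beat a merely common one.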







Root cause prediction based on bug reports

This paper proposes a supervised machine learning approach for predictin...

Debugging Crashes using Continuous Contrast Set Mining

Facebook operates a family of services used by over two billion people d...

DataExposer: Exposing Disconnect between Data and Systems

As data is a central component of many modern systems, the cause of a sy...

Feature Engineering for Scalable Application-Level Post-Silicon Debugging

We present systematic and efficient solutions for both observability enh...

On the Refinement of Spreadsheet Smells by means of Structure Information

Spreadsheet users are often unaware of the risks imposed by poorly desig...

Automatic Detection and Diagnosis of Biased Online Experiments

We have seen a massive growth of online experiments at LinkedIn, and in ...

KabOOM: Unsupervised Crash Categorization through Timeseries Fingerprinting

Modern mobile applications include instrumentation that sample internal ...