Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

11/01/2019
by Fred Lin, et al.

Root cause analysis in a large-scale production environment is challenging due to the complexity of services running across global data centers. Because such systems are distributed, the various hardware, software, and tooling logs are often maintained separately, which makes it difficult to review them jointly when detecting issues. Scale is a further challenge: there can easily be millions of entities, each with hundreds of features. In this paper we present a fast dimensional analysis framework that automates root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e., groups of feature values, that identify groups of samples with sufficient support for the target failures, using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them to structured logs, we select the item-sets that are most unique to the target failures based on lift. Using a large-scale real-time database, we propose pre- and post-processing techniques and parallelism to further speed up the analysis. We have successfully rolled out this approach for root cause investigation in a large-scale infrastructure. We also present the setup and results from multiple production use cases in this paper.
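To make the support and lift selection concrete, here is a minimal sketch on a toy structured log. It is not the authors' implementation: it enumerates item-sets by brute force instead of using Apriori or FP-Growth, and the function name `mine_itemsets`, the thresholds, the feature names (`kernel`, `rack`), and the data are all hypothetical. Support is assumed here to mean the fraction of failing samples that contain the item-set; lift is the standard P(failure | item-set) / P(failure).

```python
from collections import Counter
from itertools import combinations


def mine_itemsets(rows, is_failure, min_support=0.5, min_lift=1.5, max_len=2):
    """Return (itemset, support, lift) tuples, highest lift first.

    rows:       list of dicts mapping feature name -> string value (assumed)
    is_failure: list of bools, True where the sample hit the target failure
    """
    n = len(rows)
    n_fail = sum(is_failure)
    if n == 0 or n_fail == 0:
        return []

    all_counts = Counter()   # samples containing the item-set
    fail_counts = Counter()  # failing samples containing the item-set
    for row, failed in zip(rows, is_failure):
        items = sorted(row.items())
        # Brute-force enumeration; Apriori or FP-Growth would prune this.
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                all_counts[combo] += 1
                if failed:
                    fail_counts[combo] += 1

    results = []
    for itemset, n_xy in fail_counts.items():
        support = n_xy / n_fail  # assumed definition, see lead-in
        if support < min_support:
            continue
        confidence = n_xy / all_counts[itemset]  # P(failure | item-set)
        lift = confidence / (n_fail / n)         # divided by P(failure)
        if lift >= min_lift:
            results.append((itemset, support, lift))

    results.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return results


# Hypothetical toy log: eight hosts, three of which failed.
rows = [
    {"kernel": "5.2", "rack": "A"},
    {"kernel": "5.2", "rack": "B"},
    {"kernel": "5.2", "rack": "A"},
    {"kernel": "4.9", "rack": "A"},
    {"kernel": "4.9", "rack": "B"},
    {"kernel": "4.9", "rack": "A"},
    {"kernel": "5.2", "rack": "B"},
    {"kernel": "4.9", "rack": "B"},
]
is_failure = [True, True, True, False, False, False, False, False]

results = mine_itemsets(rows, is_failure)
for itemset, support, lift in results:
    print(itemset, round(support, 3), round(lift, 3))
```

On this toy data, only item-sets involving `kernel=5.2` pass both thresholds. `rack=A` alone appears in two of the three failures but is nearly as common among healthy hosts, so its lift stays below the cutoff.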


research
10/11/2019

DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services

Large scale cloud services use Key Performance Indicators (KPIs) for tra...
research
07/02/2022

Accelerating System-Level Debug Using Rule Learning and Subgroup Discovery Techniques

We propose a root-causing procedure for accelerating system-level debug ...
research
01/09/2023

Making Sense of Failure Logs in an Industrial DevOps Environment

Processing and reviewing nightly test execution failure logs for large i...
research
06/05/2020

Root Cause Analysis in Lithium-Ion Battery Production with FMEA-Based Large-Scale Bayesian Network

The production of lithium-ion battery cells is characterized by a high d...
research
02/22/2021

Silent Data Corruptions at Scale

Silent Data Corruption (SDC) can have negative impact on large-scale inf...
research
08/01/2021

Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings

For large-scale distributed systems, it's crucial to efficiently diagnos...
