DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services

10/11/2019
by   Chetan Bansal, et al.

Large-scale cloud services use Key Performance Indicators (KPIs) to track and monitor performance. Customer agreements usually include Service Level Objectives (SLOs) that are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. To reduce customer impact, it is critical to minimize the time and manual effort spent diagnosing and triaging such issues. Large volumes of logs and mixed attribute types (categorical and continuous) make diagnosis non-trivial, whether automated or manual. In this paper, we present the design, implementation, and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically find the root cause of performance issues and triage them. We present the lessons and results from case studies on two large-scale cloud services at Microsoft, where DeCaf successfully diagnosed 10 known and 31 unknown issues. DeCaf also automatically triages the identified issues by leveraging historical data. Our key insights are that, for any such diagnosis tool to be effective in practice, it should a) scale to large volumes of service logs and attributes, b) support different types of KPIs and ranking functions, and c) be integrated into the DevOps processes.
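To make the idea of mining service logs for patterns that explain a KPI regression more concrete, here is a minimal sketch. It is not DeCaf's algorithm, which the abstract does not specify. It assumes a hypothetical log format of dictionaries with categorical attributes and a boolean `failed` KPI, and a hypothetical ranking function (excess failures over the baseline rate). Continuous attributes, which the abstract also mentions, are not handled here.

```python
from collections import defaultdict

def rank_patterns(rows, kpi_key="failed", min_support=2):
    """Rank attribute=value predicates by excess failures over the baseline.

    Illustrative only: the log schema, the `failed` KPI, and the scoring
    function are assumptions, not taken from the paper.
    """
    total = len(rows)
    if total == 0:
        return []
    base_rate = sum(1 for r in rows if r[kpi_key]) / total

    # predicate -> [rows matched, failing rows matched]
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        for attr, value in row.items():
            if attr == kpi_key:
                continue
            entry = counts[(attr, value)]
            entry[0] += 1
            entry[1] += 1 if row[kpi_key] else 0

    ranked = []
    for pred, (matched, failures) in counts.items():
        if matched < min_support:
            continue  # skip predicates seen too rarely to be meaningful
        rate = failures / matched
        score = (rate - base_rate) * matched  # failures beyond what baseline predicts
        ranked.append((pred, matched, rate, score))
    ranked.sort(key=lambda item: item[3], reverse=True)
    return ranked

# Hypothetical service log rows.
LOGS = [
    {"region": "west", "build": "1.2", "failed": True},
    {"region": "west", "build": "1.2", "failed": True},
    {"region": "west", "build": "1.1", "failed": False},
    {"region": "east", "build": "1.2", "failed": True},
    {"region": "east", "build": "1.1", "failed": False},
    {"region": "east", "build": "1.1", "failed": False},
]

ranked = rank_patterns(LOGS)
for (attr, value), matched, rate, score in ranked:
    print(f"{attr}={value}: matched={matched} failure_rate={rate:.2f} score={score:.2f}")
```

In this toy data the predicate `build=1.2` ranks first, because every request on that build failed while the overall failure rate is 50%. Swapping in a different scoring function is a one-line change, which is one way to read the abstract's point about supporting different KPIs and ranking functions.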


research
11/01/2019

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challengi...
research
08/20/2021

AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems

Service reliability is one of the key challenges that cloud providers ha...
research
08/27/2021

Graph-based Incident Aggregation for Large-Scale Online Service Systems

As online service systems continue to grow in terms of complexity and vo...
research
04/21/2022

Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps

Root Cause Analysis (RCA) of any service-disrupting incident is one of t...
research
11/13/2013

Impact of Limpware on HDFS: A Probabilistic Estimation

With the advent of cloud computing, thousands of machines are connected ...
research
01/10/2022

Using Online Customer Reviews to Classify, Predict, and Learn about Domestic Robot Failures

There is a knowledge gap regarding which types of failures robots underg...
