Sage: Using Unsupervised Learning for Scalable Performance Debugging in Microservices

01/01/2021
by Yu Gan, et al.

Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity that microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine, we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.
