Seer: Leveraging Big Data to Navigate the Increasing Complexity of Cloud Debugging

04/24/2018
by   Yu Gan, et al.

Performance unpredictability in cloud services leads to poor user experience and degraded availability, and has revenue ramifications. Detecting performance degradation a posteriori helps the system take corrective action, but does not avoid the QoS violations. Detecting QoS violations after the fact is even more detrimental when a service consists of hundreds of thousands of loosely-coupled microservices, since performance hiccups can quickly propagate across the dependency graph of microservices. In this work we focus on anticipating QoS violations in cloud settings, to mitigate performance unpredictability before it occurs. We propose Seer, a cloud runtime that leverages the massive amount of tracing data cloud systems collect over time and a set of practical learning techniques to signal upcoming QoS violations, as well as to identify the microservice(s) causing them. Once an imminent QoS violation is detected, Seer uses machine-level hardware events to determine the cause of the QoS violation, and adjusts the resource allocations to prevent it. In local clusters with ten 40-core servers and 200-instance clusters on GCE running diverse cloud microservices, we show that Seer correctly anticipates QoS violations 91% of the time, and attributes the violation to the correct microservice in 89% of cases. Finally, Seer detects QoS violations early enough for a corrective action to almost always be applied successfully.

research
05/02/2019

Leveraging Deep Learning to Improve the Performance Predictability of Cloud Microservices

Performance unpredictability is a major roadblock towards cloud adoption...
research
01/01/2021

Sage: Using Unsupervised Learning for Scalable Performance Debugging in Microservices

Cloud applications are increasingly shifting from large monolithic servi...
research
12/12/2021

Sage: Leveraging ML to Diagnose Unpredictable Performance in Cloud Microservices

Cloud applications are increasingly shifting from large monolithic servi...
research
04/12/2018

Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency

Cloud multi-tenancy is typically constrained to a single interactive ser...
research
11/26/2019

Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach

Latency-critical services have been widely deployed in cloud environment...
research
10/12/2022

Building Heterogeneous Cloud System for Machine Learning Inference

Online inference is becoming a key service product for many businesses, ...
research
07/18/2019

Approximate Solution Approach and Performability Evaluation of Large Scale Beowulf Clusters

Beowulf clusters are very popular and deployed worldwide in support of s...
