The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

02/11/2022
by Lei Zhang, et al.

Today's distributed tracing frameworks only trace a small fraction of all requests. For application developers troubleshooting rare edge-cases, the tracing framework is unlikely to capture a relevant trace at all, because it cannot know which requests will be problematic until after the fact. Application developers thus depend heavily on luck. In this paper, we remove the dependence on luck for any edge-case where symptoms can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight and always-on distributed tracing system, Hindsight, where each constituent node acts analogously to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Hindsight implements a retroactive sampling abstraction: when the symptoms of a problem are detected, Hindsight retrieves and persists coherent trace data from all relevant nodes that serviced the request. Developers using Hindsight receive the exact edge-case traces they desire; by comparison, existing sampling-based tracing systems depend wholly on serendipity. Our experimental evaluation shows that Hindsight successfully collects edge-case symptomatic requests in real-world use cases. Hindsight adds only nanosecond-level overhead to generate trace data, can handle GB/s of data per node, transparently integrates with existing distributed tracing systems, and persists full, detailed traces when an edge-case problem is detected.
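
To make the dash-cam analogy concrete, the retroactive sampling idea can be sketched in a few lines of Python. This is an illustrative toy, not Hindsight's implementation or API: the names `Node`, `Coordinator`, `handle_request`, the buffer capacity, and the 500 ms latency threshold are all assumptions introduced here. Each node always records recent trace events into a bounded in-memory buffer, and data is persisted only when a symptom (here, high latency) is detected after the fact.

```python
from collections import deque


class Node:
    """Hypothetical node that always records recent trace events in memory."""

    def __init__(self, name, capacity=1024):
        self.name = name
        # Bounded buffer: once full, the oldest events are silently overwritten,
        # much like a dash-cam looping over old footage.
        self.recent = deque(maxlen=capacity)

    def record(self, trace_id, event):
        self.recent.append((trace_id, event))

    def collect(self, trace_id):
        # Return whatever this node still holds for the given trace.
        return [event for tid, event in self.recent if tid == trace_id]


class Coordinator:
    """Hypothetical collector that persists a trace only when triggered."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.persisted = {}  # trace_id -> {node name: [events]}

    def trigger(self, trace_id):
        # Simplification: ask every node and keep those that saw the request.
        gathered = {n.name: n.collect(trace_id) for n in self.nodes}
        self.persisted[trace_id] = {k: v for k, v in gathered.items() if v}
        return self.persisted[trace_id]


def handle_request(trace_id, latency_ms, nodes, coordinator, threshold_ms=500):
    # Every request generates trace data on each node it touches.
    for node in nodes:
        node.record(trace_id, f"{node.name} handled {trace_id}")
    # The symptom is only known after the request completes.
    if latency_ms > threshold_ms:
        coordinator.trigger(trace_id)
```

In this sketch a fast request leaves nothing behind once its events age out of the buffers, while a slow request is persisted with data from every node that serviced it. The buffer capacity bounds how far back in time a trigger can reach.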


