Data stream fusion for accurate quantile tracking and analysis

01/17/2021
by Massimo Cafaro, et al.

UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees that cover the full range of quantiles independently of the input distribution, and it greatly improves on the accuracy of DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) using UDDSKETCH data summaries: the input summaries are fused into a new summary that describes the union of the streams (or datasets) they processed, whilst preserving both the error and size guarantees provided by UDDSKETCH. This property of sketches, known as mergeability, enables parallel and distributed processing. We prove that UDDSKETCH is fully mergeable and introduce a parallel version of UDDSKETCH suitable for message-passing architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms parallel DDSKETCH in overall accuracy when determining quantiles.
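
To make the idea of mergeability concrete, here is a minimal Python sketch of a UDDSKETCH-style summary. It is an illustration under stated assumptions, not the authors' implementation: it handles positive values only, and the class name `MergeableSketch`, its methods, and the defaults `alpha=0.01` and `max_buckets=64` are hypothetical. It assumes the bucketing scheme described in the DDSKETCH and UDDSKETCH papers (logarithmic buckets with growth factor gamma = (1 + alpha) / (1 - alpha), and a uniform collapse that merges adjacent bucket pairs, which squares gamma). The merge step assumes both summaries were created with the same initial parameters.

```python
import math
from collections import Counter


def _halve(buckets):
    # Uniform collapse step: buckets 2j-1 and 2j are merged into bucket j.
    out = Counter()
    for i, c in buckets.items():
        out[-(-i // 2)] += c  # integer ceil(i / 2)
    return out


class MergeableSketch:
    """Toy UDDSketch-style summary for positive values (illustrative only)."""

    def __init__(self, alpha=0.01, max_buckets=64):
        self.gamma0 = (1 + alpha) / (1 - alpha)  # initial bucket growth factor
        self.max_buckets = max_buckets           # size bound (assumed >= 2)
        self.collapses = 0                       # uniform collapses applied so far
        self.buckets = Counter()                 # bucket index -> item count
        self.count = 0

    @property
    def gamma(self):
        # Each collapse squares the growth factor.
        return self.gamma0 ** (2 ** self.collapses)

    @property
    def alpha(self):
        # Relative-error bound implied by the current gamma.
        return (self.gamma - 1) / (self.gamma + 1)

    def _collapse(self):
        self.buckets = _halve(self.buckets)
        self.collapses += 1

    def _shrink(self):
        # Restore the size bound by collapsing as often as needed.
        while len(self.buckets) > self.max_buckets:
            self._collapse()

    def add(self, x):
        if x <= 0:
            raise ValueError("this toy sketch only accepts positive values")
        # Base-level index, then one ceil-halving per collapse already applied.
        i = math.ceil(math.log(x) / math.log(self.gamma0))
        for _ in range(self.collapses):
            i = -(-i // 2)
        self.buckets[i] += 1
        self.count += 1
        self._shrink()

    def merge(self, other):
        # Assumes `other` was created with the same alpha and max_buckets.
        theirs, k = other.buckets, other.collapses
        while k < self.collapses:        # coarsen a copy of the other summary
            theirs, k = _halve(theirs), k + 1
        while self.collapses < k:        # or coarsen this summary
            self._collapse()
        self.buckets.update(theirs)      # Counter.update adds the counts
        self.count += other.count
        self._shrink()

    def quantile(self, q):
        rank = math.floor(q * (self.count - 1))
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                g = self.gamma
                return 2 * g ** i / (g + 1)  # representative value of bucket i
        raise ValueError("empty sketch")
```

In this toy version, two workers could each summarize their own partition with `add`, after which one calls `a.merge(b)` and queries `a.quantile(0.5)` for the median of the combined data. Because every bucket index is derived from the same base-level index, the merged summary holds the same buckets as a single summary built over the whole input.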


