Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries

03/06/2018
by Edward Gan, et al.

Interactive analytics increasingly involves querying for quantiles over specific sub-populations and time windows of high-cardinality datasets. Data processing engines such as Druid and Spark use mergeable summaries to estimate quantiles on these large datasets, but summary merge times are a bottleneck during high-cardinality aggregation. We show how a compact and efficiently mergeable quantile sketch can support aggregation workloads. This data structure, which we refer to as the moments sketch, has a small memory footprint (200 bytes) and supports computationally efficient merges (50 ns) by tracking only a set of summary statistics, notably the sample moments. We demonstrate how we can efficiently and practically estimate quantiles using the method of moments and the maximum entropy principle, and show how the use of a cascade further improves query time for threshold predicates. Empirical evaluation on real-world datasets shows that the moments sketch can achieve less than 1 percent error with 40 times less merge overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7 times and the Druid engine by up to 60 times.
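
The merge step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `MomentsSketch`, the parameter `k`, and the exact set of tracked statistics (count, minimum, maximum, and the first `k` power sums) are assumptions made for illustration, and the maximum entropy quantile solver is omitted entirely. The point it shows is that merging two summaries reduces to a handful of additions and comparisons, independent of how many values each summary has absorbed.

```python
import math


class MomentsSketch:
    """Illustrative mergeable summary (hypothetical names, not the paper's code)."""

    def __init__(self, k=4):
        self.k = k                      # number of power sums tracked (assumed parameter)
        self.count = 0
        self.lo = math.inf              # running minimum
        self.hi = -math.inf             # running maximum
        self.power_sums = [0.0] * k     # power_sums[i] holds the sum of x**(i + 1)

    def add(self, x):
        """Fold a single value into the summary."""
        self.count += 1
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        p = 1.0
        for i in range(self.k):
            p *= x
            self.power_sums[i] += p

    def merge(self, other):
        """Combine another summary into this one using O(k) arithmetic."""
        if self.k != other.k:
            raise ValueError("sketches must track the same number of moments")
        self.count += other.count
        self.lo = min(self.lo, other.lo)
        self.hi = max(self.hi, other.hi)
        for i in range(self.k):
            self.power_sums[i] += other.power_sums[i]
        return self

    def moments(self):
        """Return the sample moments E[x], E[x**2], ..., E[x**k]."""
        if self.count == 0:
            raise ValueError("empty sketch has no moments")
        return [s / self.count for s in self.power_sums]


# Usage: summarize two partitions separately, then merge the summaries.
left, right = MomentsSketch(), MomentsSketch()
for v in [1.0, 2.0, 3.0]:
    left.add(v)
for v in [4.0, 5.0]:
    right.add(v)
merged = left.merge(right)
print(merged.count, merged.lo, merged.hi, merged.moments()[0])
```

Because every field combines by addition, `min`, or `max`, the merged summary matches the one that would result from adding all values to a single sketch, which is what makes this kind of structure suitable for pre-aggregated, high-cardinality workloads.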

research
02/08/2020

Storyboard: Optimizing Precomputed Summaries for Aggregation

An emerging class of data systems partition their data and precompute ap...
research
08/28/2019

DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees

Summary statistics such as the mean and variance are easily maintained f...
research
08/09/2022

Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams

Today's large-scale services (e.g., video streaming platforms, data cent...
research
02/09/2016

Graphical Model Sketch

Structured high-cardinality data arises in many domains, and poses a maj...
research
12/21/2018

Fast post-hoc method for updating moments of large datasets

Moments of large datasets utilise the mean of the dataset; consequently,...
research
05/23/2022

HyperLogLogLog: Cardinality Estimation With One Log More

We present HyperLogLogLog, a practical compression of the HyperLogLog sk...
research
06/01/2019

Approximate Quantiles for Datacenter Telemetry Monitoring

Datacenter systems require efficient troubleshooting and effective resou...
