Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams

08/09/2022
by   Antonis Manousis, et al.

Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations, together with the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines two ideas: a “sketch of sketches”, which avoids the overhead of monitoring exponentially many subpopulations, and universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while ensuring interactive estimation times.
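To make the “sketch of sketches” idea concrete, the following is a minimal illustrative sketch in Python, not Hydra's implementation. It rests on several assumptions that the abstract does not confirm: the outer structure is a small hashed grid indexed by subpopulation key, each cell holds a plain count-min sketch (swapped in here for the universal sketch the abstract mentions, purely for brevity), and all names (`subpopulations`, `CountMin`, `SketchOfSketches`, the record fields) are hypothetical. It also shows where the combinatorial explosion comes from: a record with d dimensions belongs to 2^d - 1 non-empty subpopulations.

```python
import hashlib
from itertools import combinations


def subpopulations(record, dims):
    """Yield every non-empty subset of (dimension, value) pairs of a record.

    A record with d dimensions yields 2**d - 1 subpopulation keys.
    Keys are ordered by dimension name so updates and queries agree.
    """
    ordered = sorted(dims)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield tuple((d, record[d]) for d in combo)


def _bucket(key, seed, mod):
    """Deterministic hash of (seed, key) into the range [0, mod)."""
    digest = hashlib.blake2b(repr((seed, key)).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % mod


class CountMin:
    """Plain count-min sketch; a stand-in for a universal sketch (assumption)."""

    def __init__(self, rows=3, cols=64):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def add(self, item, count=1):
        for r in range(self.rows):
            self.table[r][_bucket(item, r, self.cols)] += count

    def estimate(self, item):
        return min(self.table[r][_bucket(item, r, self.cols)]
                   for r in range(self.rows))


class SketchOfSketches:
    """Fixed grid of inner sketches shared by all subpopulations.

    Memory is rows * cols inner sketches regardless of how many
    subpopulations appear; hash collisions can only inflate estimates.
    """

    def __init__(self, rows=3, cols=128):
        self.rows, self.cols = rows, cols
        self.cells = [[CountMin() for _ in range(cols)] for _ in range(rows)]

    def update(self, record, dims, metric):
        value = record[metric]
        for key in subpopulations(record, dims):
            for r in range(self.rows):
                self.cells[r][_bucket(key, ("outer", r), self.cols)].add(value)

    def frequency(self, subpop, value):
        """Estimate how often `value` occurred within one subpopulation."""
        key = tuple(sorted(subpop))
        return min(
            self.cells[r][_bucket(key, ("outer", r), self.cols)].estimate(value)
            for r in range(self.rows)
        )


# Hypothetical usage: three dimensions, one metric.
dims = ["city", "device", "isp"]
sos = SketchOfSketches()
sos.update({"city": "NYC", "device": "tv", "isp": "A", "latency": 10}, dims, "latency")
sos.update({"city": "NYC", "device": "phone", "isp": "B", "latency": 10}, dims, "latency")
nyc_count = sos.frequency([("city", "NYC")], 10)
```

Because both layers take a minimum over counters that only ever grow, an estimate from this toy structure never falls below the true count; it can exceed it when subpopulations or values collide. Estimating other statistics (entropy, distinct counts, and so on) from the same cells is what the universal sketch is for, and that part is not reproduced here.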


research
05/13/2021

CrossRoI: Cross-camera Region of Interest Optimization for Efficient Real Time Video Analytics at Scale

Video cameras are pervasively deployed in city scale for public good or ...
research
03/06/2018

Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries

Interactive analytics increasingly involves querying for quantiles over ...
research
08/19/2022

Quancurrent: A Concurrent Quantiles Sketch

Sketches are a family of streaming algorithms widely used in the world o...
research
08/28/2019

DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees

Summary statistics such as the mean and variance are easily maintained f...
research
03/04/2022

Improving Tug-of-War sketch using Control-Variates method

Computing space-efficient summary, or a.k.a. sketches, of large data, is...
research
04/26/2022

Scheduling of Sensor Transmissions Based on Value of Information for Summary Statistics

The optimization of Value of Information (VoI) in sensor networks integr...
research
11/11/2020

Sketch and Scale: Geo-distributed tSNE and UMAP

Running machine learning analytics over geographically distributed datas...
