Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

08/10/2020
by Stephen Macke, et al.

Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how to accelerate queries involving aggregates on a real dataset with speedups of up to 124x over traditional AQP-with-guarantees and more than 1000x over exact methods.
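
To make the two categories of CI concrete, the following minimal sketch compares the half-width of an asymptotic (CLT-based) interval with that of a conservative Hoeffding interval for a sample mean. It is an illustration only, not the technique developed in the paper: the function names, the synthetic data, the declared value range of [0, 1000], and the assumption of i.i.d. sampling are all hypothetical choices made for this example.

```python
import math
import random
import statistics


def clt_half_width(sample, z=1.96):
    """Asymptotic (CLT-based) half-width at ~95% confidence.

    Tight, but the coverage guarantee only holds in the limit of large n.
    """
    n = len(sample)
    return z * statistics.stdev(sample) / math.sqrt(n)


def hoeffding_half_width(n, lo, hi, delta=0.05):
    """Hoeffding half-width at confidence 1 - delta.

    Valid for any n (assuming i.i.d. draws bounded in [lo, hi]), but it
    depends only on the declared range, not on the observed spread.
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


random.seed(0)

# Hypothetical data: values concentrated in [0, 10], while the column's
# declared range is assumed to be [0, 1000].
sample = [random.uniform(0.0, 10.0) for _ in range(10_000)]

clt = clt_half_width(sample)
hoef = hoeffding_half_width(len(sample), lo=0.0, hi=1000.0)

print(f"CLT half-width:       {clt:.4f}")
print(f"Hoeffding half-width: {hoef:.4f}")
```

Under these assumptions the Hoeffding interval is far wider than the CLT interval, because its width is driven by the full declared range even though no sampled value comes near the upper bound. This is the sense in which rigorous intervals can be too loose to be useful, while the tighter CLT interval carries only an asymptotic guarantee.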
