Sharp Frequency Bounds for Sample-Based Queries

08/14/2022
by Eric Bax, et al.

A data sketch algorithm scans a big data set and collects a small amount of data (the sketch) that can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of the big data set and use that sample to infer the frequencies of items in the big data set that meet various criteria. This paper shows how to infer probably approximately correct (PAC) bounds for those frequencies efficiently, and precisely enough that each frequency bound is either sharp or off by only one, which is the best possible result without exact computation.
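To make the setting concrete, here is a minimal sketch of how frequency bounds can be read off a fixed-size sample drawn without replacement. It inverts hypergeometric tail probabilities by brute force. It is not the efficient method the paper develops, and the names `interval_prob` and `frequency_bounds`, along with the parameters `N` (data set size), `n` (sample size), `k` (matching items in the sample), and `delta` (allowed failure probability, split evenly between the two sides), are assumptions chosen for illustration.

```python
from math import comb


def interval_prob(N, M, n, lo, hi):
    """P[lo <= X <= hi], where X counts matching items in a sample of
    size n drawn without replacement from N items, M of which match."""
    lo, hi = max(lo, 0), min(hi, n)
    ways = sum(comb(M, i) * comb(N - M, n - i) for i in range(lo, hi + 1))
    return ways / comb(N, n)


def frequency_bounds(N, n, k, delta):
    """Brute-force two-sided bounds on M, the number of matching items in
    the full data set, after seeing k matches in a sample of size n.

    Illustrative assumption: a value of M is kept unless it makes the
    observed count k fall in a tail of probability at most delta / 2.
    """
    # M cannot be below k, and the n - k sampled non-matches cap it above.
    candidates = range(k, N - (n - k) + 1)
    # Lower bound: smallest M for which seeing k or more is still plausible.
    lower = min(M for M in candidates
                if interval_prob(N, M, n, k, n) > delta / 2)
    # Upper bound: largest M for which seeing k or fewer is still plausible.
    upper = max(M for M in candidates
                if interval_prob(N, M, n, 0, k) > delta / 2)
    return lower, upper


if __name__ == "__main__":
    N, n, k, delta = 1000, 50, 10, 0.05
    lower, upper = frequency_bounds(N, n, k, delta)
    print(f"With probability at least {1 - delta}, the count is in [{lower}, {upper}]")
```

The linear scan over every candidate count costs time proportional to the data set size, so it is only practical for small `N`. Avoiding that cost while keeping the bounds sharp or within one of sharp is the problem the abstract describes.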


