MISS: Finding Optimal Sample Sizes for Approximate Analytics

07/29/2018
by Xuebin Su, et al.

Sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, one must find the minimal sample size for a query under given error constraints. This problem, called Sample Size Optimization (SSO), is essential yet unsolved in the general case. Ideally, a solution to the SSO problem achieves statistical accuracy, computational efficiency and broad applicability all at the same time. Existing approaches either make idealistic assumptions about the statistical properties of the query or disregard those properties completely, which may result in overemphasizing one of the three goals while neglecting the others. To overcome these limitations, we first carefully examine the statistical properties shared by common analytical queries. Based on these properties, we propose a linear model, called the error model, that describes the relationship between sample sizes and the approximation errors of a query. We then propose a Model-guided Iterative Sample Selection (MISS) framework to solve the SSO problem in general. Based on the MISS framework, we propose a concrete algorithm, called L^2Miss, to find optimal sample sizes under the L^2 norm error metric, and we extend L^2Miss to handle other error metrics. Finally, we show theoretically and empirically that the L^2Miss algorithm and its extensions achieve satisfactory accuracy and efficiency for a considerably wide range of analytical queries.
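
The idea of a model-guided, iterative search for a sample size can be illustrated with a small sketch. This is not the L^2Miss algorithm from the paper, and several choices are assumptions made only for illustration: the query is a plain mean, the error is a bootstrap estimate of its standard error, and the linear error model is fitted in log-log space, i.e. log(error) = a + b * log(n). The abstract says only that the model is linear. All function names and parameters (`bootstrap_error`, `fit_log_linear`, `find_sample_size`, `init_sizes`, `n_boot`) are hypothetical.

```python
import math
import random
import statistics


def bootstrap_error(sample, n_boot=100, rng=None):
    """Bootstrap estimate of the standard error of the sample mean."""
    rng = rng or random.Random(0)
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.pstdev(means)


def fit_log_linear(sizes, errors):
    """Least-squares fit of log(error) = a + b * log(size); returns (a, b).

    Assumes at least two distinct sizes and strictly positive errors.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def find_sample_size(data, target_error, init_sizes=(100, 200, 400),
                     max_iter=10, seed=0):
    """Iterate: fit the error model, predict a size, sample, re-estimate."""
    rng = random.Random(seed)
    sizes, errors = [], []
    # Seed the model with a few small probe samples.
    for n in init_sizes:
        sizes.append(n)
        errors.append(bootstrap_error(rng.sample(data, n), rng=rng))
    for _ in range(max_iter):
        a, b = fit_log_linear(sizes, errors)
        if b >= 0:
            # Fitted error does not shrink with n: fall back to doubling.
            n = 2 * max(sizes)
        else:
            # Invert the model: the n at which predicted error == target.
            n = math.ceil(math.exp((math.log(target_error) - a) / b))
        n = max(10, min(len(data), n))  # keep n in a usable range
        err = bootstrap_error(rng.sample(data, n), rng=rng)
        sizes.append(n)
        errors.append(err)
        if err <= target_error or n == len(data):
            break
    return sizes[-1], errors[-1]


if __name__ == "__main__":
    gen = random.Random(42)
    population = [gen.gauss(0.0, 1.0) for _ in range(20000)]
    print(find_sample_size(population, target_error=0.05))
```

Each probe refines the fitted model, so later predictions rely on more observations than the first one. The loop stops once the estimated error meets the target, the whole dataset has been used, or the iteration budget runs out; in the last case the returned error may still exceed the target.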

research
01/08/2021

Approximate Query Processing for Group-By Queries based on Conditional Generative Models

The Group-By query is an important kind of query, which is common and wi...
research
08/16/2020

DeepSampling: Selectivity Estimation with Predicted Error and Response Time

The rapid growth of spatial data urges the research community to find ef...
research
08/12/2020

Sampling Based Approximate Skyline Calculation on Big Data

The existing algorithms for processing skyline queries cannot adapt to b...
research
03/05/2020

LAQP: Learning-based Approximate Query Processing

Querying on big data is a challenging task due to the rapid growth of da...
research
01/02/2019

Approximate Computation for Big Data Analytics

Over the past a few years, research and development has made significant...
research
10/16/2019

Similarity Driven Approximation for Text Analytics

Text analytics has become an important part of business intelligence as ...
research
11/06/2017

An Iterative Scheme for Leverage-based Approximate Aggregation

Currently data explosion poses great challenges to approximate aggregati...
