Random Sampling for Group-By Queries

09/05/2019
by Trong Duc Nguyen, et al.

Random sampling is widely used in approximate query processing on large databases because it can significantly reduce resource usage and response times at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes and then, after filtering through a predicate, aggregate within each group. The challenge with group-by queries is that a sampling method cannot optimize the quality of a single answer (e.g., the mean of the selected data) in isolation; it must simultaneously optimize the quality of a set of answers, one per group. We present CVOPT, a query- and data-driven sampling framework for sets of queries that return multiple answers, such as group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm of the coefficients of variation (CVs) of the different answers, and constructs a stratified sample that provably optimizes this metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can also be tuned to cases where partial information about the query workload is known, such as a data warehouse where queries are scheduled to run periodically. Our experimental results show that CVOPT outperforms the current state of the art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are 5 times smaller than those of competing approaches under the same budget.
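To make the idea of a CV-driven stratified sample concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a single group-by attribute, a single aggregate (the per-group mean), the l2 norm of the CVs as the metric, and no finite-population correction. Under those assumptions the CV of the estimated mean of group i with s_i sampled rows is roughly (sigma_i / |mu_i|) / sqrt(s_i), and minimizing the sum of squared CVs subject to a total budget gives s_i proportional to sigma_i / |mu_i|. The function names `cv_allocation` and `stratified_sample` are hypothetical and chosen only for this illustration.

```python
import random
import statistics
from collections import defaultdict


def group_rows(rows):
    """Collect (group_key, value) pairs into {group_key: [values]}."""
    by_group = defaultdict(list)
    for key, value in rows:
        by_group[key].append(value)
    return by_group


def cv_allocation(rows, budget):
    """Split a sample budget across groups in proportion to each group's CV.

    Assumption (not from the source): one aggregate (mean), l2 norm of CVs,
    no finite-population correction. Rounding and the floor of one row per
    group mean the allocated total can differ slightly from the budget.
    """
    by_group = group_rows(rows)
    cvs = {}
    for key, values in by_group.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # A group with zero mean has an undefined CV; treat it as 0 here.
        cvs[key] = std / abs(mean) if mean != 0 else 0.0
    total_cv = sum(cvs.values())
    alloc = {}
    for key, values in by_group.items():
        if total_cv > 0:
            share = budget * cvs[key] / total_cv
        else:
            share = budget / len(by_group)  # all groups constant: split evenly
        # Keep at least one row per group and never exceed the group size.
        alloc[key] = max(1, min(len(values), round(share)))
    return alloc


def stratified_sample(rows, alloc, seed=0):
    """Draw alloc[key] rows uniformly without replacement from each group."""
    rng = random.Random(seed)
    by_group = group_rows(rows)
    return {key: rng.sample(values, alloc[key]) for key, values in by_group.items()}


# Two groups with the same mean (10) but different spread.
rows = [("a", v) for v in (9, 11, 9, 11)] + [("b", v) for v in (7, 13, 7, 13)]
alloc = cv_allocation(rows, budget=4)
print(alloc)
print(stratified_sample(rows, alloc))
```

In this toy input, group "b" has three times the CV of group "a", so it receives three times as many sampled rows. A uniform sample would instead split the budget by group frequency and ignore the difference in variability.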


research
01/08/2021

Approximate Query Processing for Group-By Queries based on Conditional Generative Models

The Group-By query is an important kind of query, which is common and wi...
research
03/08/2022

Aggregate Queries on Knowledge Graphs: Fast Approximation with Semantic-aware Sampling

A knowledge graph (KG) manages large-scale and real-world facts as a big...
research
10/16/2019

Similarity Driven Approximation for Text Analytics

Text analytics has become an important part of business intelligence as ...
research
07/10/2021

NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond

Range aggregate queries (RAQs) are an integral part of many real-world a...
research
11/24/2009

Group-based Query Learning for rapid diagnosis in time-critical situations

In query learning, the goal is to identify an unknown object while minim...
research
03/29/2021

Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing

Sample-based approximate query processing (AQP) suffers from many pitfal...
research
12/18/2022

GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions

In data-driven systems, data exploration is imperative for making real-t...
