Approximate Partition Selection for Big-Data Workloads using Summary Statistics

08/24/2020
by   Kexin Rong, et al.

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, since it often requires reading many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining the partial answers in a weighted manner, without modifying the data layout. We show how to perform this query processing efficiently using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel ways of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that, to achieve the same relative error as uniform partition sampling, our techniques read 2.7× to 70× fewer partitions, and the statistics stored per partition require less than 100KB.
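To make the idea of "reading a subset of the data partitions and combining partial answers in a weighted manner" concrete, here is a minimal sketch of the uniform partition sampling baseline named in the abstract, applied to a SUM query. It is not the authors' statistics-driven method: the function names (`combine_partial_answers`, `uniform_partition_sum`) and the in-memory representation of partitions as lists of numbers are assumptions made for illustration only.

```python
import random
from typing import List, Sequence


def combine_partial_answers(partials: Sequence[float],
                            weights: Sequence[float]) -> float:
    """Combine per-partition partial answers into one weighted estimate."""
    if len(partials) != len(weights):
        raise ValueError("need exactly one weight per partial answer")
    return sum(w * a for w, a in zip(weights, partials))


def uniform_partition_sum(partitions: List[Sequence[float]],
                          k: int,
                          rng: random.Random) -> float:
    """Estimate the total SUM by reading only k of the n partitions.

    Partitions are chosen uniformly at random without replacement, and
    each partial sum is weighted by n / k, which makes the estimate
    unbiased for SUM. (Hypothetical baseline, not the paper's method.)
    """
    n = len(partitions)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of partitions")
    chosen = rng.sample(range(n), k)                   # partitions to read
    partials = [sum(partitions[i]) for i in chosen]    # partial answers
    weights = [n / k] * k                              # uniform weights
    return combine_partial_answers(partials, weights)
```

In this sketch every sampled partition gets the same weight. According to the abstract, the paper's approach instead uses pre-computed summary statistics to choose both which partitions to read and how to weight them; that selection logic is not reproduced here.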

research
03/29/2021

Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing

Sample-based approximate query processing (AQP) suffers from many pitfal...
research
07/22/2020

R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets

The rapid growth of big spatial data urged the research community to dev...
research
05/14/2019

Query Processing on Large Graphs: Approaches To Scalability and Response Time Trade Offs

With the advent of social networks and the web, the graph sizes have gro...
research
06/29/2018

Comparing Graph Clusterings: Set partition measures vs. Graph-aware measures

In this paper, we propose a family of graph partition similarity measure...
research
01/02/2023

Bent Partitions, Vectorial Dual-Bent Functions and Partial Difference Sets

It is known that partial spreads is a class of bent partitions. In <cit....
research
05/16/2021

Lexicographic Enumeration of Set Partitions

In this report, we summarize the set partition enumeration problems and ...
research
12/18/2022

GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions

In data-driven systems, data exploration is imperative for making real-t...
