Qd-tree: Learning Data Layouts for Big Data Analytics

04/22/2020 ∙ by Zongheng Yang, et al.

Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of how best to assign records to data blocks on storage remains open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques cannot optimize for an important metric: the number of blocks accessed by a query. This metric directly determines the I/O cost, and therefore the performance, of most analytical queries. Further, these techniques cannot exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and present two algorithms for constructing it, one greedy and one based on deep reinforcement learning. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.
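To make the "number of blocks accessed" metric concrete, the following is a minimal sketch of a routing tree, not the construction algorithms the abstract refers to. It assumes that each internal node holds a single cut of the form `column < threshold`, that each leaf corresponds to one storage block, and that queries are conjunctions of half-open ranges. The names `Node`, `route_record`, and `route_query`, as well as the `age` and `salary` columns, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal node (assumed form): records with record[column] < threshold
    # go left, all others go right. Leaves carry only a block_id.
    column: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    block_id: Optional[int] = None


def route_record(node, record):
    """Return the id of the single block (leaf) that stores this record."""
    while node.block_id is None:
        node = node.left if record[node.column] < node.threshold else node.right
    return node.block_id


def route_query(node, ranges):
    """Return the ids of all blocks a range query may need to read.

    ranges maps column -> (low, high), a half-open interval [low, high).
    Columns missing from ranges are treated as unconstrained.
    """
    if node.block_id is not None:
        return {node.block_id}
    low, high = ranges.get(node.column, (float("-inf"), float("inf")))
    blocks = set()
    if low < node.threshold:   # query overlaps the "< threshold" side
        blocks |= route_query(node.left, ranges)
    if high > node.threshold:  # query overlaps the ">= threshold" side
        blocks |= route_query(node.right, ranges)
    return blocks


# Hypothetical three-block layout: cut on age first, then on salary.
tree = Node(
    "age", 30,
    left=Node(block_id=0),
    right=Node("salary", 50000, left=Node(block_id=1), right=Node(block_id=2)),
)

stored_in = route_record(tree, {"age": 40, "salary": 40000})
touched = route_query(tree, {"age": (35, 60), "salary": (60000, 80000)})
print(stored_in, sorted(touched), len(touched))
```

In this sketch, the size of the set returned by `route_query` is the blocks-accessed count for that query. A layout whose cuts match the workload's predicates keeps that count small, which is the quantity the abstract identifies as driving I/O cost.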


