Extensible Data Skipping

09/17/2020
by Paula Ta-Shma, et al.

Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial, logs) and for queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products and services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, each requiring only around 30 lines of additional code. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries that are rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings at very low development cost.
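To make the basic idea concrete, the following is a minimal sketch of min/max data skipping in plain Python. It is not the framework's API: the names `MinMax`, `build_min_max`, `may_contain_greater_than`, and `select_objects`, as well as the object names and values, are hypothetical and chosen only for illustration. It assumes one numeric column per object and a query predicate of the form `col > threshold`.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class MinMax:
    """Hypothetical metadata type: min and max of one column in one object."""
    lo: float
    hi: float


def build_min_max(values: Sequence[float]) -> MinMax:
    # Index creation: summarize the column values of a single object.
    return MinMax(min(values), max(values))


def may_contain_greater_than(meta: MinMax, threshold: float) -> bool:
    # An object can be skipped only if no row could satisfy `col > threshold`,
    # that is, when its maximum does not exceed the threshold.
    return meta.hi > threshold


def select_objects(index: Dict[str, MinMax], threshold: float) -> List[str]:
    # Query time: consult only the metadata, never the data itself.
    return [name for name, meta in index.items()
            if may_contain_greater_than(meta, threshold)]


# Illustrative data: three objects holding values of one numeric column.
objects = {
    "part-0.parquet": [1.0, 4.0, 7.0],
    "part-1.parquet": [12.0, 15.0, 19.0],
    "part-2.parquet": [8.0, 9.0, 25.0],
}
index = {name: build_min_max(vals) for name, vals in objects.items()}
to_read = select_objects(index, threshold=10.0)
print(to_read)
```

In this sketch, `part-0.parquet` is never read because its maximum is below the threshold. The metadata check may keep objects that turn out to hold no matching rows, but it must never drop one that does. Under the same assumptions, a different metadata type (for example, a bounding box for geospatial data) would supply its own summary and its own "may contain" test.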


