Extensible Data Skipping

09/17/2020
by Paula Ta-Shma, et al.

Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products/services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, each requiring only around 30 lines of additional code. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries that are rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings with very low development cost.
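
To make the core idea concrete, the following is a minimal sketch of min/max data skipping in plain Python. It is not the paper's API or its Spark integration: the names `MinMax`, `may_contain_greater_than`, and `objects_to_scan`, the object names, and the `temp` column are all hypothetical, chosen only to illustrate how per-object metadata lets a query avoid reading objects that cannot contain matching rows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MinMax:
    """Hypothetical per-object metadata: min and max of one column."""
    lo: float
    hi: float

    def may_contain_greater_than(self, threshold: float) -> bool:
        # An object can hold a row with value > threshold only if its max does.
        return self.hi > threshold


def objects_to_scan(index: Dict[str, MinMax], threshold: float) -> List[str]:
    """Return the objects that cannot be skipped for `temp > threshold`."""
    return [name for name, md in index.items()
            if md.may_contain_greater_than(threshold)]


# Toy centralized metadata store (assumed layout): object name -> min/max of `temp`.
index = {
    "part-0.parquet": MinMax(10.0, 25.0),
    "part-1.parquet": MinMax(30.0, 48.0),
    "part-2.parquet": MinMax(-5.0, 12.0),
}

# For `temp > 28`, only part-1 needs to be read; the other two are skipped.
candidates = objects_to_scan(index, 28.0)
```

The filter is conservative: it may keep an object that turns out to have no matching rows, but it never drops one that does. An extensible framework in the spirit of the abstract would let developers plug in other metadata types (for example, a bounding box for geospatial data) behind the same kind of "may contain" check.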

