Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

04/29/2020
by Tianyu Li, et al.

The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.
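
To make the idea of a relaxed, update-friendly block and its later conversion to a read-optimized layout more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `HotBlock`, `ColdBlock`, and `freeze` are hypothetical, and the validity information is kept as a list of booleans rather than the packed bitmap Arrow uses. The cold layout follows the general shape of Arrow's variable-length columns: an offsets array with one more entry than there are rows, plus a single contiguous data buffer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HotBlock:
    """Update-friendly block (hypothetical): one independent slot per row."""
    slots: List[Optional[bytes]]

    def update(self, row: int, value: Optional[bytes]) -> None:
        # An update replaces a single slot; no other row's bytes move.
        self.slots[row] = value


@dataclass
class ColdBlock:
    """Read-optimized block in an Arrow-style variable-length layout."""
    validity: List[bool]  # True where the row is non-null
    offsets: List[int]    # number of rows + 1 entries, indexing into `data`
    data: bytes           # all non-null values stored back to back

    def get(self, row: int) -> Optional[bytes]:
        if not self.validity[row]:
            return None
        return self.data[self.offsets[row]:self.offsets[row + 1]]


def freeze(block: HotBlock) -> ColdBlock:
    """Compact a hot block into contiguous buffers (assumed transformation step)."""
    validity: List[bool] = []
    offsets: List[int] = [0]
    chunks: List[bytes] = []
    for value in block.slots:
        validity.append(value is not None)
        if value is not None:
            chunks.append(value)
        # Null rows contribute zero bytes, so their offset repeats.
        offsets.append(offsets[-1] + (len(value) if value is not None else 0))
    return ColdBlock(validity, offsets, b"".join(chunks))


hot = HotBlock([b"alice", None, b"bob"])
hot.update(2, b"robert")   # cheap while the block is hot
cold = freeze(hot)         # one pass once the block is cold
```

The design point the sketch illustrates is the trade-off the abstract describes: the hot form lets a transaction change one value without shifting its neighbours, while the cold form is a pair of flat buffers that an analytical reader can scan without following per-row indirection.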
