An Empirical Evaluation of Columnar Storage Formats

04/11/2023
by Xinyu Zeng, et al.

Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions that are advantageous on modern hardware and with real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends.
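To make the first of these design decisions concrete, the following is a minimal sketch of dictionary encoding for a single column. It is an illustration only, not the Parquet or ORC implementation: the function names `dictionary_encode` and `dictionary_decode` and the sample column are assumptions introduced here, and real formats additionally compress the resulting integer codes.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Replace each value with a small integer code into a dictionary of distinct values.

    Hypothetical helper for illustration; codes are assigned in first-seen order.
    """
    dictionary: List[Hashable] = []       # distinct values, position == code
    index_of: Dict[Hashable, int] = {}    # value -> code lookup
    codes: List[int] = []
    for value in values:
        code = index_of.get(value)
        if code is None:
            code = len(dictionary)
            index_of[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary: Sequence[Hashable], codes: Sequence[int]) -> List[Hashable]:
    """Rebuild the original column by looking each code up in the dictionary."""
    return [dictionary[code] for code in codes]


# Sample column with few distinct values (made-up data for illustration).
column = ["us", "de", "us", "us", "fr", "de"]
dictionary, codes = dictionary_encode(column)
restored = dictionary_decode(dictionary, codes)
```

Decoding is a plain array lookup per value, which is one reason the approach suits columns with few distinct values: the repeated strings are stored once and the column itself becomes a sequence of small integers.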


