Spatial Parquet: A Column File Format for Geospatial Data Lakes [Extended Version]

09/05/2022
by Majid Saeedan, et al.

Modern data analytics applications favor column-storage formats because encoding and compression make them more storage-efficient. Parquet is the most popular columnar file format and provides several of these benefits out of the box. However, Parquet does not readily support geospatial data. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial data types in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized for geospatial coordinates stored in floating-point format. Third, it adds a light-weight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three, even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the light-weight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format.
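
The abstract does not spell out how FP-delta works, so the following is only a generic illustration of why delta-encoding floating-point coordinates can be lossless and compact. It is not the paper's algorithm. The sketch assumes that each 64-bit float is reinterpreted as an integer bit pattern and that consecutive patterns are differenced. The function names (`encode`, `decode`, `zigzag`) are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    # Reinterpret the IEEE 754 double as a signed 64-bit integer (no rounding).
    return struct.unpack("<q", struct.pack("<d", x))[0]

def bits_to_float(b: int) -> float:
    # Inverse of float_to_bits.
    return struct.unpack("<d", struct.pack("<q", b))[0]

def zigzag(n: int) -> int:
    # Map signed integers to non-negative ones so small magnitudes stay small.
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(z: int) -> int:
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

def encode(values):
    # Assumed layout: first bit pattern stored raw, then zigzagged deltas.
    bits = [float_to_bits(v) for v in values]
    if not bits:
        return []
    deltas = [zigzag(b - a) for a, b in zip(bits, bits[1:])]
    return [bits[0]] + deltas

def decode(encoded):
    if not encoded:
        return []
    bits = [encoded[0]]
    for z in encoded[1:]:
        bits.append(bits[-1] + unzigzag(z))
    return [bits_to_float(b) for b in bits]

# Nearby longitudes share sign, exponent, and high mantissa bits,
# so the deltas need far fewer than 64 bits each.
coords = [-117.3961, -117.3959, -117.3962, -117.3955]
encoded = encode(coords)
restored = decode(encoded)
delta_bits = [d.bit_length() for d in encoded[1:]]
```

Because the deltas are computed on exact bit patterns rather than on rounded decimal values, decoding reproduces the input floats exactly. A real encoder would then pack the small deltas into fewer bits.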


research
05/18/2021

LEA: A Learned Encoding Advisor for Column Stores

Data warehouses organize data in a columnar format to enable faster scan...
research
01/17/2021

Real-Time LSM-Trees for HTAP Workloads

Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildf...
research
09/08/2023

Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Redundant Data

Compressed Sparse Column (CSC) and Coordinate (COO) are popular compress...
research
04/29/2020

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

The proliferation of modern data processing tools has given rise to open...
research
01/30/2019

A study for Image compression using Re-Pair algorithm

The compression is an important topic in computer science which allows w...
research
01/25/2021

Towards an Open Format for Scalable System Telemetry

A data representation for system behavior telemetry for scalable big dat...
research
10/25/2018

Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

Electrical energy consumption has been an ongoing research area since th...
