Towards an Arrow-native Storage System

05/20/2021
by Jayjeet Chakraborty, et al.

With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck as it struggles to keep up with feeding data in and out of these fast devices. As a result, data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires extensive understanding of their internals. Previous approaches re-implemented the functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for each different storage system. In this paper, we introduce a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with minimal modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging the scale-out and availability properties of distributed storage systems. We present one example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of our implementation and discuss key results.
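To make the offloading idea concrete, the following is a minimal toy sketch in plain Python, not the paper's implementation. It does not use Ceph, Apache Arrow, or Parquet; `StorageNode`, `client_side_scan`, `offloaded_scan`, and the "cells transferred" metric are hypothetical names introduced only for illustration. It contrasts a client that reads every row and filters locally with a client that asks each storage node to filter and project before returning data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


@dataclass
class StorageNode:
    """Hypothetical storage node holding one object, modeled as a list of rows."""
    rows: List[Row]

    def read_all(self) -> List[Row]:
        # Plain read: every row and every column is returned to the client.
        return list(self.rows)

    def scan(self, predicate: Predicate, columns: List[str]) -> List[Row]:
        # Offloaded scan: filter and project on the node, return only the result.
        return [{c: r[c] for c in columns} for r in self.rows if predicate(r)]


def cells(rows: List[Row]) -> int:
    """Crude stand-in for transferred data volume: number of values sent."""
    return sum(len(r) for r in rows)


def client_side_scan(nodes: List[StorageNode], predicate: Predicate,
                     columns: List[str]) -> Tuple[List[Row], int]:
    """Fetch everything, then filter and project on the single client."""
    result: List[Row] = []
    transferred = 0
    for node in nodes:
        rows = node.read_all()
        transferred += cells(rows)
        result.extend({c: r[c] for c in columns} for r in rows if predicate(r))
    return result, transferred


def offloaded_scan(nodes: List[StorageNode], predicate: Predicate,
                   columns: List[str]) -> Tuple[List[Row], int]:
    """Push the filter and projection to each node; only matches are sent back."""
    result: List[Row] = []
    transferred = 0
    for node in nodes:
        rows = node.scan(predicate, columns)
        transferred += cells(rows)
        result.extend(rows)
    return result, transferred


# Three nodes with four rows each; each row has three columns.
nodes = [
    StorageNode([{"id": i, "x": i * 2, "y": i % 3} for i in range(start, start + 4)])
    for start in (0, 4, 8)
]
wanted = lambda r: r["id"] >= 9

local_rows, local_cells = client_side_scan(nodes, wanted, ["id", "x"])
remote_rows, remote_cells = offloaded_scan(nodes, wanted, ["id", "x"])
print(local_rows == remote_rows, local_cells, remote_cells)
```

Both paths return the same rows, but in the offloaded path the filtering work is spread across the nodes and only matching values cross the simulated network. A real system would additionally have to decode the file format on the storage side, which is the part the paper proposes to handle by embedding existing access libraries rather than re-implementing them.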


