Skyhook: Towards an Arrow-Native Storage System

04/12/2022
by Jayjeet Chakraborty, et al.

With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck as it tries to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches re-implemented the functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for each different storage system. This paper introduces a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with no modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging the scale-out and availability properties of distributed storage systems. We present Skyhook, an example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of Skyhook and discuss key results.
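To make the offloading idea concrete, the following is a minimal toy sketch in plain Python. It is not Skyhook's API and uses neither Ceph nor Arrow: `StorageNode`, `read_all`, `scan`, and the row layout are all hypothetical names invented for illustration. It only models the difference between shipping every row to the client and applying the filter and projection on the storage node first, counting how many values cross the simulated network in each case.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class StorageNode:
    """Hypothetical stand-in for one storage server holding a data fragment."""
    rows: List[Row]
    cells_sent: int = 0  # number of values shipped to the client so far

    def read_all(self) -> List[Row]:
        # Plain read: every column of every row crosses the network.
        self.cells_sent += sum(len(r) for r in self.rows)
        return [dict(r) for r in self.rows]

    def scan(self, predicate: Callable[[Row], bool], columns: List[str]) -> List[Row]:
        # Offloaded scan: filter and project on the node before sending.
        out = [{c: r[c] for c in columns} for r in self.rows if predicate(r)]
        self.cells_sent += sum(len(r) for r in out)
        return out


def make_nodes(n_nodes: int = 4, rows_per_node: int = 1000) -> List[StorageNode]:
    """Build identical synthetic fragments (assumed data, for illustration only)."""
    nodes = []
    for n in range(n_nodes):
        base = n * rows_per_node
        rows = [{"id": base + i, "value": (base + i) % 100, "pad": 0}
                for i in range(rows_per_node)]
        nodes.append(StorageNode(rows))
    return nodes


def client_side_query(nodes: List[StorageNode]) -> List[Row]:
    # The client pulls everything, then filters and projects locally.
    result = []
    for node in nodes:
        for r in node.read_all():
            if r["value"] < 5:
                result.append({"id": r["id"]})
    return result


def offloaded_query(nodes: List[StorageNode]) -> List[Row]:
    # The same query, with filter and projection pushed to each node.
    result = []
    for node in nodes:
        result.extend(node.scan(lambda r: r["value"] < 5, ["id"]))
    return result
```

Running both queries on fresh nodes returns the same rows, but the offloaded variant transfers only the selected `id` values. In a real system the work done inside `scan` would also be spread across the storage servers' CPUs, which is the scale-out benefit the abstract refers to.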

