Mapping Datasets to Object Storage System

07/03/2020
by   Xiaowei, et al.
0

Access libraries such as ROOT and HDF5 allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph's extensible object model, avoiding re-implementation or even modifications of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset mapping techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems and 3) fully leveraging of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct storage server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations.

READ FULL TEXT

page 3

page 5

research
05/20/2021

Towards an Arrow-native Storage System

With the ever-increasing dataset sizes, several file formats like Parque...
research
04/12/2022

Skyhook: Towards an Arrow-Native Storage System

With the ever-increasing dataset sizes, several file formats such as Par...
research
10/16/2019

UMap: Enabling Application-driven Optimizations for Page Management

Leadership supercomputers feature a diversity of storage, from node-loca...
research
04/07/2022

RNTuple performance: Status and Outlook

Upcoming HEP experiments, e.g. at the HL-LHC, are expected to increase t...
research
07/04/2017

Sequential Checking: Reallocation-Free Data-Distribution Algorithm for Scale-out Storage

Using tape or optical devices for scale-out storage is one option for st...
research
05/06/2020

Fast Mapping onto Census Blocks

Pandemic measures such as social distancing and contact tracing can be e...
research
05/08/2018

Round-Hashing for Data Storage: Distributed Servers and External-Memory Tables

This paper proposes round-hashing, which is suitable for data storage on...

Please sign up or login with your details

Forgot password? Click here to reset