User-Defined Functions for HDF5

09/24/2021
by Lucas C. Villa Real et al.

Scientific datasets are known for their challenging storage demands and for the processing pipelines that transform their information. Those processing tasks include filtering, cleansing, aggregation, normalization, and data format translation – all of which generate even more data. In this paper, we present an infrastructure for the HDF5 file format that enables dataset values to be populated on the fly: task-related scripts can be attached to HDF5 files and execute only when the dataset is read by an application. We provide details on the software architecture that supports user-defined functions (UDFs) and how it integrates with hardware accelerators and computational storage. Moreover, we describe the built-in security model that limits the system resources a UDF can access. Finally, we present several use cases that show how UDFs can be used to extend scientific datasets in ways that go beyond the original scope of this work.
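
The idea of a dataset whose values are produced only when it is read can be illustrated with a small, purely conceptual Python sketch. It does not use HDF5 or the paper's actual API: the `Container` and `UDFDataset` classes, the `attach_udf` method, and the `celsius_to_kelvin` function are all hypothetical names invented for this illustration, and a plain dictionary stands in for the file.

```python
from typing import Callable, Dict, List, Union


class UDFDataset:
    """Hypothetical dataset entry that stores a function instead of values."""

    def __init__(self, udf: Callable[["Container", int], List[float]], length: int):
        self.udf = udf
        self.length = length
        self.calls = 0  # counts how many times the function has run

    def read(self, container: "Container") -> List[float]:
        # The attached function runs here, at read time, not at attach time.
        self.calls += 1
        return self.udf(container, self.length)


class Container:
    """Hypothetical in-memory stand-in for a file holding named datasets."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Union[List[float], UDFDataset]] = {}

    def create_dataset(self, name: str, values: List[float]) -> None:
        # A regular dataset: the values are stored as given.
        self._datasets[name] = list(values)

    def attach_udf(self, name: str, udf, length: int) -> None:
        # A UDF-backed dataset: only the function and the output size are stored.
        self._datasets[name] = UDFDataset(udf, length)

    def read(self, name: str) -> List[float]:
        entry = self._datasets[name]
        if isinstance(entry, UDFDataset):
            return entry.read(self)
        return list(entry)


def celsius_to_kelvin(container: Container, length: int) -> List[float]:
    # Example transformation: derive one dataset from another on demand.
    source = container.read("temperature_c")
    return [c + 273.15 for c in source[:length]]


store = Container()
store.create_dataset("temperature_c", [0.0, 25.0, 100.0])
store.attach_udf("temperature_k", celsius_to_kelvin, length=3)

# Nothing has been computed yet; the conversion runs on this read.
kelvin = store.read("temperature_k")
```

In this sketch the derived values are never stored, so the container holds only the source data plus the function, which mirrors the storage-saving motivation described in the abstract. Sandboxing, accelerator offload, and the real file layout are outside its scope.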
