PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

08/02/2023
by Runzhou Han, et al.

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot meet these needs because their provenance models and/or system implementations are incompatible. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on this first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for precisely describing scientific data and the associated I/O operations and environments. Moreover, we build a prototype of PROV-IO+ to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework supports both containerized and non-containerized workflows on different HPC platforms, with flexibility in selecting various classes of provenance. Our experiments with realistic workflows show that PROV-IO+ can address the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.
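To make the idea of I/O-centric provenance concrete, the following is a minimal sketch in Python. It is not the PROV-IO+ model or API: the `ProvenanceLog` class, its `record` and `origins` methods, and the file and program names are all hypothetical, chosen only to illustrate how recording one entry per I/O operation can answer an "origins of data products" question.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Toy in-memory store of I/O provenance records (hypothetical, not the PROV-IO+ API)."""
    records: list = field(default_factory=list)

    def record(self, program, operation, path):
        # One record per I/O operation: which program did what to which data object.
        self.records.append({"program": program, "operation": operation, "path": path})

    def origins(self, path, _seen=None):
        """Return the set of paths that `path` was transitively derived from."""
        seen = set() if _seen is None else _seen
        # Programs that wrote this data object.
        writers = {r["program"] for r in self.records
                   if r["operation"] == "write" and r["path"] == path}
        # Everything those programs read is an upstream origin; follow it recursively.
        for r in self.records:
            if r["operation"] == "read" and r["program"] in writers and r["path"] not in seen:
                seen.add(r["path"])
                self.origins(r["path"], seen)
        return seen


# Hypothetical two-stage workflow: preprocess a raw file, then train on the result.
log = ProvenanceLog()
log.record("preprocess", "read", "raw.h5")
log.record("preprocess", "write", "clean.h5")
log.record("train", "read", "clean.h5")
log.record("train", "write", "model.ckpt")

print(sorted(log.origins("model.ckpt")))
```

The sketch treats every file a program reads as an input to every file it writes, which is a deliberate simplification; a real framework would also have to capture the execution environment and keep tracking overhead low.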
