Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

08/17/2023
by Renan Souza, et al.

Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, adherence to the FAIR principles, reproducibility, and user steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. Using observability, it intercepts dataflows in the background without requiring instrumentation and integrates domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis that integrates data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead when running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.


research
03/31/2023

Workflows Community Summit 2022: A Roadmap Revolution

Scientific workflows have become integral tools in broad scientific comp...
research
05/11/2021

Distributed In-memory Data Management for Workflow Executions

Complex scientific experiments from various domains are typically modele...
research
05/24/2021

Challenges of Translating HPC codes to Workflows for Heterogeneous and Dynamic Environments

In this paper we would like to share our experience for transforming a p...
research
01/11/2018

BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments

Advances in sequencing techniques have led to exponential growth in biol...
research
05/27/2021

RADICAL-Pilot and Parsl: Executing Heterogeneous Workflows on HPC Platforms

Executing scientific workflows with heterogeneous tasks on HPC platforms...
research
08/04/2022

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep learning has been postulated as a solution for numerous problems in...
research
02/24/2011

Scientific Visualization in Astronomy: Towards the Petascale Astronomy Era

Astronomy is entering a new era of discovery, coincident with the establ...
