Astronomical Pipeline Provenance: A Use Case Evaluation

09/22/2021
by Michael A. C. Johnson, et al.

This decade, astronomy is undergoing a paradigm shift to handle data from next-generation observatories such as the Square Kilometre Array (SKA) and the Vera C. Rubin Observatory (LSST). Producing real-time data streams of up to 10 TB/s and data products on the order of 600 PB/year, the SKA will be the world's largest civil data-producing machine and demands novel solutions for storing and analysing these data volumes. Because this real-time processing relies on complex, automated pipelines, its provenance is key to establishing confidence in the system, its final data products, and ultimately its scientific results. This paper aims to lay the foundation for an automated provenance generation tool for astronomical/data-processing pipelines. We therefore present a use case analysis specific to the needs of astronomy, which addresses the issues of trust and reproducibility as well as further use cases of interest to astronomers. This analysis then serves as the basis for discussing the requirements, challenges, and opportunities involved in designing both the tool and the associated provenance model.
