Data Pallets: Containerizing Storage For Reproducibility and Traceability

11/07/2018
by Jay Lofstead et al.

Trusting simulation output is crucial for Sandia's mission objectives. We rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may have high-consequence results, still require the strongest levels of trust so that their results can serve as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems that aid both in automating simulation and modeling execution and in determining exactly how some output was created, so that conclusions can be drawn from the data. Current workflow and provenance systems all operate at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete. The introduction of container technology is a first step towards encapsulating and tracking the artifacts used in creating data and the resulting insights, but its current implementation focuses solely on making it easy to deploy an application in an isolated "sandbox" and on maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage. This project explores extending the container concept to include storage as a new container type we call data pallets. Data pallets are potentially writeable, automatically generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it.
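
The linkage described in the last sentence can be pictured as a small manifest stored alongside the data. The following is only an illustrative, user-level sketch and not the project's mechanism, which the abstract describes as system-generated from IO activity. The function names, the manifest fields, and the use of SHA-256 digests are all assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_pallet_manifest(
    app_container_id: str, input_deck: Path, outputs: Sequence[Path]
) -> dict:
    """Build a hypothetical manifest tying outputs to their origin.

    The field names are illustrative assumptions, not a format
    defined by the data pallets project.
    """
    return {
        # Identifier of the application container that produced the data.
        "application_container": app_container_id,
        # The input deck is recorded by name and content digest.
        "input_deck": {
            "name": input_deck.name,
            "sha256": sha256_of(input_deck),
        },
        # Each output file written during the run, with its digest.
        "outputs": [
            {"name": p.name, "sha256": sha256_of(p)} for p in outputs
        ],
    }
```

With a record of this kind, a reader holding an output file could check its digest against the manifest and trace it back to the recorded application container and input deck.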
