Simulation and evaluation of cloud storage caching for data intensive science

05/07/2021
by Tobias Wegner, et al.

A common task in scientific computing is the derivation of data: a workflow extracts the most important information from large input data and stores it in smaller derived data objects, which can then be used for further analysis tasks. Typically, these workflows use distributed storage and computing resources. A straightforward storage configuration combines low-cost tape storage with higher-cost disk storage: the large, infrequently accessed input data is stored on tape, and the smaller, frequently accessed derived data is stored on disk. In the best case, the large input data is accessed only rarely and in a well-planned pattern. In practice, however, the data often has to be processed continuously and unpredictably, which can significantly reduce tape storage performance. A common countermeasure is to store copies of the large input data on disk storage. This contribution evaluates an approach that uses cloud storage resources as a flexible cache or buffer, depending on the computational workflow. The proposed model is elaborated for the case of continuously processed data. For the evaluation, a simulation was developed that can be used to evaluate models related to storage and network resources. We show that using commercial cloud storage can reduce the on-premises disk storage requirements while maintaining an equal throughput of jobs. Moreover, the key metrics of the model are discussed, and an approach is described that uses the simulation to assist in deciding whether to use commercial cloud storage. The goal is to investigate approaches and propose new evaluation methods to overcome future data challenges.
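To make the tiering idea concrete, the following is a minimal toy sketch, not the simulation developed in the paper. It assumes equally sized files, least-recently-used eviction, and a cloud tier that simply receives files evicted from disk. The function name `simulate` and the parameters `disk_slots` and `cloud_slots` are hypothetical and chosen for illustration only. Latency, bandwidth, and cost are ignored entirely.

```python
from collections import Counter, OrderedDict


def simulate(requests, disk_slots, cloud_slots):
    """Count which tier serves each request in a toy tape/disk/cloud model.

    Assumptions (illustrative only, not taken from the paper):
    - every file has the same size, so capacity is a number of slots;
    - disk and cloud both evict the least recently used file;
    - a file evicted from disk is moved to the cloud tier, if one exists;
    - anything found in neither tier is recalled from tape.
    """
    disk = OrderedDict()   # on-premises disk, most recently used last
    cloud = OrderedDict()  # cloud cache, most recently inserted last
    served = Counter()

    for name in requests:
        if name in disk:
            disk.move_to_end(name)  # refresh LRU position
            served["disk"] += 1
            continue

        if name in cloud:
            del cloud[name]         # promote from cloud back to disk
            served["cloud"] += 1
        else:
            served["tape"] += 1     # slow path: recall from tape

        disk[name] = True
        if len(disk) > disk_slots:
            evicted, _ = disk.popitem(last=False)  # drop LRU disk file
            if cloud_slots > 0:
                cloud[evicted] = True
                if len(cloud) > cloud_slots:
                    cloud.popitem(last=False)      # drop oldest cloud file

    return served


if __name__ == "__main__":
    # A small cyclic access pattern over three input files.
    pattern = ["a", "b", "c", "a", "b", "c"]
    for disk_slots, cloud_slots in [(2, 0), (2, 2), (3, 0)]:
        result = simulate(pattern, disk_slots, cloud_slots)
        print(disk_slots, cloud_slots, dict(result))
```

In this toy pattern, two disk slots without a cloud tier send every request to tape, whereas two disk slots plus two cloud slots need the same three tape recalls as three disk slots alone. That mirrors, in a very simplified form, the claim that cloud storage can substitute for on-premises disk capacity; whether it does so at equal job throughput depends on network and cost factors this sketch does not model.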
