Effectiveness and predictability of in-network storage cache for scientific workflows

07/20/2023
by Caitlin Sim, et al.

Large scientific collaborations often have multiple scientists accessing the same set of files while doing different analyses, which creates repeated accesses to large amounts of shared data located far away. These data accesses have long latency due to distance and occupy the limited bandwidth available over the wide-area network. To reduce wide-area network traffic and data access latency, regional data storage caches have been installed as a new networking service. To study the effectiveness of such a cache system in scientific applications, we examine the Southern California Petabyte Scale Cache for a high-energy physics experiment. By examining about 3TB of operational logs, we show that this cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on the wide-area network by 12.3TB (or 35.4%). The reduction in traffic volume (35.4%) is less than the reduction in file counts (67.6%) because larger files are less likely to be reused. Due to this difference in data access patterns, the cache system has implemented a policy to avoid evicting smaller files when processing larger files. We also build a machine learning model to study the predictability of the cache behavior. Tests show that this model is able to accurately predict the cache accesses, cache misses, and network throughput, making the model useful for future studies on resource provisioning and planning.
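The gap between the reduction in file counts and the reduction in traffic volume can be illustrated with a small calculation. The sketch below is not the authors' analysis code; the record layout (file size, hit flag) and the sample values are made up for illustration. It computes the share of requests served from cache and the share of bytes served from cache, and shows that the byte share drops when the misses are dominated by large files.

```python
def cache_reduction(accesses):
    """Return (file_hit_rate, byte_hit_rate) for an iterable of
    (size, was_hit) records. The record format is an assumption
    made for this example, not the format of the real logs."""
    total_files = hit_files = 0
    total_bytes = hit_bytes = 0
    for size, was_hit in accesses:
        total_files += 1
        total_bytes += size
        if was_hit:
            hit_files += 1
            hit_bytes += size
    if total_files == 0 or total_bytes == 0:
        raise ValueError("need at least one access with non-zero size")
    return hit_files / total_files, hit_bytes / total_bytes


# Hypothetical accesses: small files are reused (hits), large ones are not.
sample = [(1, True), (1, True), (1, True), (2, True), (10, False), (10, False)]
file_rate, byte_rate = cache_reduction(sample)
print(f"requests served from cache: {file_rate:.1%}")
print(f"bytes served from cache:    {byte_rate:.1%}")
```

In this toy input, four of six requests are hits, yet those hits cover only 5 of 25 size units, so the byte-level reduction is much smaller than the request-level reduction.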


research
05/11/2022

Access Trends of In-network Cache for Scientific Data

Scientific collaborations are increasingly relying on large volumes of d...
research
05/11/2022

Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches

The XRootD system is used to transfer, store, and cache large datasets f...
research
10/19/2020

Enabling High-Capacity, Latency-Tolerant, and Highly-Concurrent GPU Register Files via Software/Hardware Cooperation

Graphics Processing Units (GPUs) employ large register files to accommod...
research
05/02/2022

A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

In High Energy Physics (HEP), experimentalists generate large volumes of...
research
05/01/2023

Analyzing Transatlantic Network Traffic over Scientific Data Caches

Large scientific collaborations often share huge volumes of data around ...
research
04/03/2021

Self-adjusting Advertisement of Cache Indicators with Bandwidth Constraints

Cache advertisements reduce the access cost by allowing users to skip th...
research
05/03/2021

Analyzing scientific data sharing patterns for in-network data caching

The volume of data moving through a network increases with new scientifi...
