Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches

05/11/2022
by Julian Bellavita, et al.

The XRootD system is used to transfer, store, and cache large datasets from high-energy physics (HEP). In this study, we focus on its capability as a distributed on-demand storage cache. By exploring a large set of daily log files collected between 2020 and 2021, we seek to understand the data access patterns that might inform future cache design. Our study begins with a set of summary statistics on file read operations, file lifetimes, and file transfers. We observe that the number of read operations on each file remains nearly constant, while the average size of a read operation grows over time. Furthermore, files tend to remain open and in use for a consistent length of time. Based on this comprehensive study of the cache access statistics, we developed a cache simulator to explore the behavior of caches of different sizes. Within a certain size range, we find that increasing the XRootD cache size improves the cache hit rate, yielding faster overall file access. In particular, we find that increasing the cache size from 40TB to 56TB could raise the hit rate from 0.62 to 0.89, a significant gain in cache effectiveness for modest cost.
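To illustrate the kind of experiment the abstract describes, the following is a minimal sketch of a trace-driven cache simulator that reports the hit rate for a given cache size. It is not the authors' simulator: the abstract does not state the eviction policy, so least-recently-used (LRU) eviction is assumed here, and the function name `simulate_lru` and the `(file_id, size_bytes)` trace format are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity_bytes):
    """Replay (file_id, size_bytes) accesses through a whole-file LRU cache.

    Returns the fraction of accesses served from the cache (the hit rate).
    LRU eviction is an assumption made for illustration only.
    """
    cache = OrderedDict()  # file_id -> size, least recently used first
    used = 0               # bytes currently held in the cache
    hits = 0
    total = 0

    for file_id, size in accesses:
        total += 1
        if file_id in cache:
            hits += 1
            cache.move_to_end(file_id)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file cannot fit at all; count as a miss, do not cache
        # Evict least recently used files until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[file_id] = size
        used += size

    return hits / total if total else 0.0


# Toy trace (made-up data): three 4-byte files accessed in a repeating pattern.
trace = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
small = simulate_lru(trace, 8)   # room for two files
large = simulate_lru(trace, 12)  # room for all three files
```

On this toy trace the larger cache yields a higher hit rate because no file is evicted before it is reused, which is the same qualitative effect the study reports when growing the cache within a certain size range.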


