A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

05/02/2022
by Sunwoo Lee, et al.

In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many small files, which creates a data management challenge for scientists. To facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this translation process can consume significant time and resources, and if it is performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present a case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. The case study focuses on data from the NOvA detector, a large-scale HEP experiment that generates many terabytes of data. The lessons learned inform the handling of similar datasets, thus expanding community knowledge about this common data management task.

