A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

by Sunwoo Lee, et al.

In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, helps us better understand the fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this translation process can consume significant time and resources, and if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. We focus on data from the NOvA detector, a large-scale HEP experiment generating many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.




