Big Data Staging with MPI-IO for Interactive X-ray Science

02/14/2020
by Justin M. Wozniak, et al.

New techniques in X-ray scattering science experiments produce large data sets that can require millions of high-performance processing hours per week for analysis. In such applications, data is typically moved from X-ray detectors to a large parallel file system shared by all nodes of a petascale supercomputer and is then read repeatedly as different science application tasks proceed. However, this straightforward implementation causes significant contention in the file system. We propose an alternative approach in which data is instead staged into and cached in compute node memory for extended periods, during which various processing tasks may access it efficiently. We describe such a big data staging framework, based on MPI-IO and the Swift parallel scripting language. We discuss a range of large-scale data management issues involved in X-ray scattering science and measure the performance benefits of the new staging framework for high-energy diffraction microscopy, an important emerging application in data-intensive X-ray scattering. We show that our framework accelerates scientific processing turnaround from three months to under 10 minutes, and that our I/O technique reduces input overheads by a factor of 5 on 8K Blue Gene/Q nodes.
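
The difference between repeated file-system reads and read-once staging can be illustrated with a small single-process sketch. This is not the authors' framework and uses neither MPI-IO nor Swift; it is only an analogy under stated assumptions. The names `CountingStore`, `run_direct`, `run_staged`, and `demo`, the file names, and the three stand-in "analysis tasks" are all hypothetical. The store class counts how often the shared storage is touched, so the two access patterns can be compared directly.

```python
import os
import tempfile


class CountingStore:
    """Hypothetical stand-in for a shared file system; counts read calls."""

    def __init__(self, directory):
        self.directory = directory
        self.reads = 0

    def read(self, name):
        self.reads += 1
        with open(os.path.join(self.directory, name), "rb") as handle:
            return handle.read()


def run_direct(store, names, tasks):
    """Baseline pattern: every task reads every input from the shared store."""
    return [task(store.read(name)) for task in tasks for name in names]


def run_staged(store, names, tasks):
    """Staged pattern: read each input once into memory, then reuse it."""
    cache = {name: store.read(name) for name in names}  # the staging step
    return [task(cache[name]) for task in tasks for name in names]


def demo():
    """Return (direct read count, staged read count, results identical?)."""
    tasks = [len, sum, max]  # assumed placeholders for real analysis tasks
    with tempfile.TemporaryDirectory() as directory:
        names = ["frame0.bin", "frame1.bin"]  # hypothetical detector frames
        for index, name in enumerate(names):
            with open(os.path.join(directory, name), "wb") as handle:
                handle.write(bytes([index + 1] * 4))
        direct = CountingStore(directory)
        staged = CountingStore(directory)
        direct_results = run_direct(direct, names, tasks)
        staged_results = run_staged(staged, names, tasks)
    return direct.reads, staged.reads, direct_results == staged_results
```

In this toy setting the direct pattern touches storage once per task per file, while the staged pattern touches it once per file regardless of how many tasks follow. The sketch says nothing about the contention and collective I/O behavior at scale that the abstract measures.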

