Analyzing astronomical data with Apache Spark

04/20/2018
by Julien Peloton, et al.

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have so far proved successful in industry, but its main use is often limited to naively structured data. We show how to manage more complex binary data structures, such as those handled in astrophysics experiments, within a distributed environment. To this end, we first designed and implemented spark-fits, a Spark connector to handle sets of arbitrarily large FITS files. The user interface is such that a simple file "drag-and-drop" to a cluster takes full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present the methodologies for measuring and tuning the performance bottlenecks for the workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData, located at Université Paris Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC and widely used in the scientific community.
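To illustrate why binary formats such as FITS need a dedicated connector in a distributed setting, here is a minimal sketch in plain Python. It relies only on two facts from the FITS standard: files are organised in 2880-byte blocks, and a header is a series of 80-character cards terminated by an `END` card. A distributed reader must parse that header to learn the row size (`NAXIS1`) and row count (`NAXIS2`) of a binary table before it can cut the data into chunks that never break a row. This is not the spark-fits implementation; the function names `make_header`, `parse_header`, and `row_aligned_splits` are hypothetical and exist only for this illustration.

```python
BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def make_header(cards):
    """Build a block-padded header from (keyword, integer) pairs (toy helper)."""
    lines = [f"{key:<8}= {value:>20}".ljust(CARD) for key, value in cards]
    lines.append("END".ljust(CARD))
    raw = "".join(lines).encode("ascii")
    return raw + b" " * ((-len(raw)) % BLOCK)


def parse_header(buf):
    """Return (header dict, byte offset where the data unit starts)."""
    header = {}
    for pos in range(0, len(buf) - CARD + 1, CARD):
        card = buf[pos:pos + CARD].decode("ascii")
        key = card[:8].strip()
        if key == "END":
            # The data unit starts at the next 2880-byte boundary.
            data_start = -(-(pos + CARD) // BLOCK) * BLOCK
            return header, data_start
        if card[8:10] == "= ":
            # Drop any trailing "/ comment" and surrounding blanks.
            header[key] = card[10:].split("/")[0].strip()
    raise ValueError("no END card found in header")


def row_aligned_splits(data_start, row_bytes, n_rows, n_splits):
    """Cut a table into (start, end) byte ranges that never break a row."""
    rows_per_split = max(1, -(-n_rows // n_splits))  # ceiling division
    splits = []
    for first in range(0, n_rows, rows_per_split):
        last = min(first + rows_per_split, n_rows)
        splits.append((data_start + first * row_bytes,
                       data_start + last * row_bytes))
    return splits


# Toy table: 10 rows of 12 bytes each, shared between 3 workers.
raw_header = make_header([("BITPIX", 8), ("NAXIS", 2),
                          ("NAXIS1", 12), ("NAXIS2", 10)])
hdr, start = parse_header(raw_header)
splits = row_aligned_splits(start, int(hdr["NAXIS1"]), int(hdr["NAXIS2"]), 3)
print(splits)  # [(2880, 2928), (2928, 2976), (2976, 3000)]
```

Each byte range could then be handed to a separate worker, which is the property that generic line-oriented or fixed-size splitting does not guarantee for binary tables.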


research
12/28/2022

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

The paradigm of big data is characterized by the need to collect and pro...
research
01/25/2020

GeoRocket: A scalable and cloud-based data store for big geospatial files

We present GeoRocket, a software for the management of very large geospa...
research
04/30/2018

Performance Evaluation of an Algorithm-based Asynchronous Checkpoint-Restart Fault Tolerant Application Using Mixed MPI/GPI-2

One of the hardest challenges of the current Big Data landscape is the l...
research
05/08/2020

High Performance Cluster Computing for MapReduce

MapReduce is a technique used to vastly improve distributed processing o...
research
03/26/2019

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Apache Hive is an open-source relational database system for analytic bi...
research
09/14/2020

Performance Evaluation of Linear Regression Algorithm in Cluster Environment

Cluster computing was introduced to replace the superiority of super com...
research
03/22/2023

How does SSD Cluster Perform for Distributed File Systems: An Empirical Study

As the capacity of Solid-State Drives (SSDs) is constantly being optimis...
