A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

04/26/2019
by George K. Thiruvathukal, et al.

As dataset sizes increase, data analysis tasks in high performance computing (HPC) depend increasingly on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and data sharing are becoming performance bottlenecks. Cloud computing employs a data processing paradigm typically built on a loosely connected group of low-cost computing nodes that does not rely on shared storage or shared memory. Apache Spark is a popular engine for large-scale data analysis in the cloud, which we have successfully deployed via job submission scripts on production clusters. In this paper, we describe common parallel analysis dataflows for both Message Passing Interface (MPI) and cloud-based applications. We developed a benchmark to measure the performance characteristics of these tasks on both types of systems, specifically comparing MPI/C-based analyses with Spark. The benchmark is a data processing pipeline representative of a typical analytics framework implemented using map-reduce. In the case of Spark, we also consider whether language plays a role by writing tests in both Python and Scala, a language built on the Java Virtual Machine (JVM). We include performance results from two large systems at Argonne National Laboratory, including Theta, a Cray XC40 supercomputer on which our experiments run on 65,536 cores (1,024 nodes with 64 cores each). The results of our experiments are discussed in the context of their applicability to future HPC architectures. Beyond understanding performance, our work demonstrates that technologies such as Spark, while typically aimed at multi-tenant cloud-based environments, show promise for data analysis needs in a traditional clustering/supercomputing environment.
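To make the map-reduce dataflow mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' benchmark and uses neither Spark nor MPI; the function names (`make_partitions`, `map_phase`, `reduce_phase`, `mean_via_map_reduce`) and the choice of computing a mean are assumptions made purely for illustration. Each partition stands in for the block of data held by one node, the map step produces a local partial result, and the reduce step combines partials with an associative operation.

```python
from functools import reduce


def make_partitions(values, num_partitions):
    """Split a list into contiguous chunks (hypothetical stand-ins for per-node data blocks)."""
    size = max(1, -(-len(values) // num_partitions))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_phase(partition):
    """Map step: compute a local partial result (count, sum) for one partition."""
    return (len(partition), sum(partition))


def reduce_phase(a, b):
    """Reduce step: merge two partial results; associative, so grouping does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def mean_via_map_reduce(values, num_partitions=4):
    """Compute the mean of `values` by mapping over partitions and reducing the partials."""
    partials = map(map_phase, make_partitions(values, num_partitions))
    count, total = reduce(reduce_phase, partials, (0, 0))
    return total / count if count else float("nan")
```

In a Spark or MPI setting the same shape would typically appear as a per-partition computation followed by a collective or shuffle-based reduction; the sketch above only shows the logical structure on a single process.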

