Evaluation of pilot jobs for Apache Spark applications on HPC clusters

05/29/2019
by Valérie Hayot-Sasson, et al.

Big Data has become prominent in many scientific fields and, as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters to parallelize their pipelines, many popular Big Data frameworks, such as Hadoop and Apache Spark, were primarily designed to run on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to nonexistent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs introduce an extra layer of scheduling that complicates debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy for deploying Apache Spark applications on HPC clusters.

research
03/24/2021

Towards Accommodating Real-time Jobs on HPC Platforms

Increasing data volumes in scientific experiments necessitate the use of...
research
07/04/2022

Sea: A lightweight data-placement library for Big Data scientific computing

The recent influx of open scientific data has contributed to the transit...
research
07/06/2023

Applying Process Mining on Scientific Workflows: a Case Study

Computer-based scientific experiments are becoming increasingly data-int...
research
01/23/2018

Task-parallel Analysis of Molecular Dynamics Trajectories

Different frameworks for implementing parallel data analytics applicatio...
research
05/03/2018

Why do Users Kill HPC Jobs?

Given the cost of HPC clusters, making best use of them is crucial to im...
research
11/04/2018

Exploring the Relation Between Two Levels of Scheduling Using a Novel Simulation Approach

Modern high performance computing (HPC) systems exhibit a rapid growth i...
research
09/12/2021

Hybrid Workload Scheduling on HPC Systems

Traditionally, on-demand, rigid, and malleable applications have been sc...
