MaRe: Container-Based Parallel Computing with Data Locality

08/07/2018
by Marco Capuccini, et al.

Application containers are emerging as key components in scientific processing, as they can improve the reproducibility and standardization of in-silico analysis. Chaining software tools in processing pipelines is a common practice in scientific applications and, as application containers gain momentum, workflow systems are starting to support this emerging technology. Nevertheless, workflow systems fall short when it comes to data-intensive analysis, as they do not provide locality-aware scheduling for parallel workloads. To this end, Big Data cluster-computing frameworks, such as Apache Spark, represent a natural choice. However, even though these frameworks excel at parallelizing code blocks, they do not provide any support for parallelizing containerized tools. Here we introduce MaRe, which extends Apache Spark, providing an easy way to parallelize container-based analytics with transparent management of data locality. MaRe is Docker-compliant, and it can be used as a standalone solution as well as a workflow system add-on. We demonstrate MaRe on two data-intensive applications, in virtual drug screening and in predictive toxicology, showing good scalability. MaRe is generally applicable and available as open source: https://github.com/mcapuccini/MaRe


