A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines

07/30/2019
by Mathieu Dugré, et al.

In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engine has emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with Python APIs, Apache Spark and Dask, for their runtime performance in processing neuroimaging pipelines. Our evaluation uses two synthetic pipelines processing the 81 GB BigBrain image, and a real pipeline processing anatomical data from more than 1,000 subjects. We benchmark these pipelines using various combinations of task durations, data sizes, and numbers of workers, deployed on an 8-node (8 cores each) compute cluster in Compute Canada's Arbutus cloud. We evaluate PySpark's RDD API against Dask's Bag, Delayed, and Futures APIs. Results show that, despite slight differences between Spark and Dask, both engines perform comparably. However, Dask pipelines risk being limited by Python's global interpreter lock (GIL), depending on task type and cluster configuration. In all cases, the major limiting factor was data transfer. While either engine is suitable for neuroimaging pipelines, more effort is needed to reduce data transfer time.
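The GIL concern mentioned above can be illustrated with a minimal sketch that uses only the Python standard library, not Dask or Spark, and is not the benchmark from the paper. The function `cpu_bound_task` and the workload sizes are hypothetical stand-ins for a compute-heavy pipeline step. On a standard CPython build with the GIL enabled, the threaded run of pure-Python, CPU-bound work is typically not much faster than the sequential run, because only one thread executes Python bytecode at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_task(n):
    """Hypothetical stand-in for a compute-heavy pipeline step (pure Python)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(sizes):
    # Run every task one after another in the main thread.
    return [cpu_bound_task(n) for n in sizes]


def run_threaded(sizes, workers=4):
    # Run the same tasks in a thread pool; the threads share one interpreter,
    # so pure-Python work contends for the GIL.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cpu_bound_task, sizes))


def timed(fn, *args):
    # Return the function result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


sizes = [200_000] * 8  # assumed workload, chosen only for illustration
seq_result, seq_time = timed(run_sequential, sizes)
thr_result, thr_time = timed(run_threaded, sizes)
print(f"sequential: {seq_time:.3f}s, threaded: {thr_time:.3f}s")
```

Tasks that release the GIL (for example I/O, or numerical code running inside compiled extensions) behave differently, which is consistent with the abstract's statement that the effect depends on task type and cluster configuration.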


