Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

03/28/2018
by Bilal Akil, et al.

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms aim not only to improve performance through better in-memory processing but, in particular, to provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms, Apache Hadoop MapReduce, Apache Spark, and Apache Flink, from a usability perspective. We report on the design, execution, and results of a usability study with a cohort of master's students, who learned and worked with all three platforms to solve different use cases set in a data science context. Our findings show that participants preferred Spark and Flink over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make big data platforms more or less effective for users in data science.

