Micro-level Modularity of Computation-intensive Programs in Big Data Platforms: A Case Study with Image Data

10/20/2019
by Amit Kumar Mondal, et al.

With the rapid advancement of Big Data platforms such as Hadoop, Spark, and Dataflow, many tools are being developed to provide end users with an interactive environment for large-scale data analysis (e.g., IQmulus). However, these platforms remain challenging to use: developers, for example, find it difficult to build interactive and reusable data analytic tools on top of them. One approach to better support interactivity and reusability is micro-level modularization of computation-intensive tasks, which splits data operations into independent, composable modules. However, modularizing data- and computation-intensive tasks into independent components differs from traditional programming, e.g., in how large-scale data is accessed, how data flow among components is controlled, and how computation logic is structured. In this paper, we present a case study on modularizing real-world computation-intensive tasks that investigates the impact of modularization on processing large-scale image data. To that end, we synthesize image data-processing patterns and propose a unified modular model for the effective implementation of computation-intensive tasks on data-parallel frameworks, considering reproducibility, reusability, and customization. We present various insights into using the modularity model, based on our experimental results from running image processing tasks on Spark and Hadoop clusters.
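As a rough illustration of what splitting data operations into independent, composable modules can look like, the following is a minimal sketch in plain Python. It is not the paper's modular model and does not use Spark or Hadoop; the names `invert`, `threshold`, `compose`, and `run_pipeline` are hypothetical, and the per-image loop only stands in for a data-parallel map.

```python
from functools import reduce
from typing import Callable, List

# Assumption for illustration: a grayscale image is a list of rows of 8-bit pixels.
Image = List[List[int]]
# A module is any function that takes one image and returns a new image.
Module = Callable[[Image], Image]


def invert(image: Image) -> Image:
    """Module: invert 8-bit pixel intensities."""
    return [[255 - p for p in row] for row in image]


def threshold(cutoff: int) -> Module:
    """Build a module that binarizes an image at the given cutoff."""
    def apply(image: Image) -> Image:
        return [[255 if p >= cutoff else 0 for p in row] for row in image]
    return apply


def compose(*modules: Module) -> Module:
    """Chain modules left to right into a single pipeline module."""
    return lambda image: reduce(lambda acc, m: m(acc), modules, image)


def run_pipeline(images: List[Image], pipeline: Module) -> List[Image]:
    """Apply the pipeline to each image independently.

    This sequential loop is a stand-in for a parallel map on a cluster.
    """
    return [pipeline(img) for img in images]


images = [[[0, 100], [200, 255]]]
pipeline = compose(invert, threshold(128))
result = run_pipeline(images, pipeline)
print(result)
```

Because each module only maps an image to an image, modules can be reordered, swapped, or reused across pipelines without changing the others, which is the property the abstract associates with reusability and customization.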
