Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

12/28/2022
by Duarte M. Nascimento, et al.

The big data paradigm is characterized by the need to collect and process data sets of great volume that arrive at great velocity and in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the required configurations are made, Spark can also provide quality attributes such as scalability, fault tolerance, and security. However, these benefits come at the cost of complexity, high memory requirements, and additional processing latency. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluate the performance of big data processing with Spark versus Unicage in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets ranged from 64 GB to 8192 GB in volume. The results show that Unicage scripts outperform Spark for search workloads like grep and select, but that the distributed storage and resource management abstractions of the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.
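
The distinction between search workloads and workloads with inter-record dependencies can be illustrated with a small sketch. This is not the Spark or Unicage implementation evaluated in the paper; it is a minimal single-process Python illustration, and the function names and the in-memory `partitions` list are assumptions made for the example. It shows that a grep-style filter can be run on each partition independently and the results concatenated, whereas a sort needs a global merge step across partitions to produce a correct output.

```python
import heapq


def grep_partition(lines, needle):
    # Each record is tested on its own; no other partition is consulted.
    return [line for line in lines if needle in line]


def distributed_grep(partitions, needle):
    # Concatenating the per-partition matches is already the final answer.
    matches = []
    for part in partitions:
        matches.extend(grep_partition(part, needle))
    return matches


def naive_sort(partitions):
    # Sorting each partition locally and concatenating is NOT a global sort.
    out = []
    for part in partitions:
        out.extend(sorted(part))
    return out


def distributed_sort(partitions):
    # A correct result needs a step that compares records across partitions,
    # here a k-way merge of the locally sorted runs.
    local_runs = [sorted(part) for part in partitions]
    return list(heapq.merge(*local_runs))


# Hypothetical two-partition input, standing in for a data set split across nodes.
partitions = [["d ok", "a error"], ["c error", "b ok"]]

grep_result = distributed_grep(partitions, "error")
naive_result = naive_sort(partitions)
sorted_result = distributed_sort(partitions)
```

In this sketch `naive_result` is out of order across the partition boundary, while `sorted_result` is globally ordered. That cross-partition coordination is the kind of work the abstract attributes to the distributed storage and resource management layers.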


