Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

04/27/2018
by Zhengyu Yang, et al.

In the era of big data and cloud computing, user applications generate large amounts of data that must be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform this processing at scale. Specifically, Spark uses distributed memory to cache intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks when implementing iterative machine learning and data mining algorithms, because it avoids repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left to the programmer's discretion, and the LRU policy is used to evict RDDs when the cache is full. However, when the objective is to minimize total work, LRU is inadequate and can lead to arbitrarily suboptimal caching decisions. In this paper, we design an adaptive algorithm for multi-stage big data processing platforms that determines which intermediate datasets are most valuable to keep in memory because they can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying the nodes in a directed acyclic graph (DAG) of computations whose outputs should persist in memory. Our experimental results show that the proposed cache optimization solution can improve the performance of machine learning applications on Spark, decreasing the total work to recompute RDDs by 12
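To make the "total work" objective concrete, the following is a minimal sketch in plain Python, not the paper's algorithm and not Spark code. It assumes a small hypothetical DAG in which each node has a made-up computation cost, ignores memory capacity and eviction entirely, and only measures how much work a sequence of jobs performs when a chosen set of node outputs is kept in memory. All names (`DAG`, `lineage`, `job_work`, `total_work`) are illustrative.

```python
# Hypothetical computation DAG: node -> (cost to compute, list of parent nodes).
# Costs and node names are invented for illustration.
DAG = {
    "raw": (10, []),
    "parsed": (5, ["raw"]),
    "features": (8, ["parsed"]),
    "model_a": (3, ["features"]),
    "model_b": (4, ["features"]),
}


def lineage(target, dag):
    """Return the target together with all of its ancestors in the DAG."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node][1])
    return seen


def job_work(target, cached, dag):
    """Work needed to produce `target` when outputs in `cached` are in memory."""
    visited = set()

    def visit(node):
        # A cached output costs nothing; a node is computed at most once per job.
        if node in cached or node in visited:
            return 0
        visited.add(node)
        cost, parents = dag[node]
        return cost + sum(visit(p) for p in parents)

    return visit(target)


def total_work(jobs, persist, dag=DAG):
    """Total work over a job sequence if nodes in `persist` stay cached once computed.

    Assumption: the cache is unbounded, so nothing is ever evicted.
    """
    cached = set()
    total = 0
    for target in jobs:
        total += job_work(target, cached, dag)
        cached |= lineage(target, dag) & persist
    return total


jobs = ["model_a", "model_b", "model_a"]
print(total_work(jobs, set()))         # 79: everything is recomputed for each job
print(total_work(jobs, {"raw"}))       # 59: only the root input is reused
print(total_work(jobs, {"features"}))  # 33: the shared intermediate result is reused
```

In this toy setting, persisting the shared intermediate node saves far more work than persisting the raw input, which is the kind of distinction a recency-based policy such as LRU does not take into account. A realistic treatment would also need to model cache capacity and dataset sizes, which this sketch deliberately leaves out.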

