Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

07/05/2022
by Hani Al-Sayeh, et al.

Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. In practice, this is a tedious and hard task for end users, who are typically not aware of cluster specifications, workload semantics, and sizes of intermediate data. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on a variety of iterative, real-world machine learning applications. With sample runs costing on average 4.6% of the cost of the optimal runs, Blink selects the optimal cluster size in 15 out of 16 cases, saving up to 47.4% in costs.
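
The idea of predicting cached dataset sizes from lightweight sample runs and then choosing a cluster size can be illustrated with a minimal sketch. It is not Blink's actual model: the abstract does not say how sizes are predicted, so the linear extrapolation below is an assumption, as is reading "optimal" as the smallest cluster whose cache memory holds the predicted dataset. All function names and parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_cached_size_gb(
    sample_fractions: Sequence[float],
    sample_cached_gb: Sequence[float],
    full_fraction: float = 1.0,
) -> float:
    """Extrapolate the cached dataset size of the full run.

    Assumption: cached size grows linearly with the input fraction.
    """
    slope, intercept = fit_line(sample_fractions, sample_cached_gb)
    return slope * full_fraction + intercept


def pick_cluster_size(
    predicted_cached_gb: float,
    cache_memory_per_node_gb: float,
    max_nodes: int,
) -> Optional[int]:
    """Return the smallest node count whose total cache memory fits the
    predicted dataset, or None if even max_nodes is not enough."""
    for nodes in range(1, max_nodes + 1):
        if nodes * cache_memory_per_node_gb >= predicted_cached_gb:
            return nodes
    return None


if __name__ == "__main__":
    # Hypothetical measurements from three tiny sample runs.
    fractions = [0.001, 0.002, 0.003]
    cached_gb = [0.1, 0.2, 0.3]
    predicted = predict_cached_size_gb(fractions, cached_gb)
    print(pick_cluster_size(predicted, cache_memory_per_node_gb=16.0, max_nodes=20))
```

Scanning node counts from the smallest upward keeps the sketch simple; a real cost model would also have to weigh execution time and price per node, which the abstract does not detail.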

research
11/08/2022

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Selecting appropriate computational resources for data processing jobs o...
research
04/27/2018

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

In the era of big data and cloud computing, large amounts of data are ge...
research
06/28/2022

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable ...
research
08/22/2023

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Selecting the right resources for big data analytics jobs is hard becaus...
research
02/20/2017

Hemingway: Modeling Distributed Optimization Algorithms

Distributed optimization algorithms are widely used in many industrial m...
research
12/12/2021

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their ...
research
11/30/2022

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Modern Deep Learning (DL) models have grown to sizes requiring massive c...
