ConEx: Efficient Exploration of Big-Data System Configurations for Better Performance

10/17/2019
by Rahul Krishna, et al.

Configuration space complexity makes big-data software systems hard to configure well. Consider Hadoop: with over nine hundred parameters, developers often just use the default configurations provided with Hadoop distributions. The opportunity costs in lost performance are significant. Popular learning-based approaches to auto-tuning software do not scale well for big-data systems because of the high cost of collecting training data. We present a new method based on a combination of Evolutionary Markov Chain Monte Carlo (EMCMC) sampling and cost reduction techniques to cost-effectively find better-performing configurations for big data systems. For cost reduction, we developed and experimentally validated two approaches: using scaled-up big data jobs as proxies for the objective function for larger jobs, and using a dynamic job similarity measure to infer that results obtained for one kind of big data problem will work well for similar problems. Our experimental results suggest that our approach can significantly and cost-effectively improve the performance of big data systems, and that it outperforms competing approaches based on random sampling, basic genetic algorithms (GA), and predictive model learning.
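
To make the sampling idea in the abstract concrete, here is a minimal sketch of an evolutionary search with Metropolis-style acceptance over a small discrete configuration space. It is an illustration under stated assumptions, not the authors' algorithm: the parameter names in `SPACE`, the `toy_runtime` objective (a stand-in for actually running a job), and all tuning constants are hypothetical. Because the crossover-plus-mutation proposal is not symmetric, this is a simplified heuristic rather than a strict MCMC sampler.

```python
import math
import random

# Hypothetical configuration space: parameter name -> allowed values.
# These names are illustrative only and are not real Hadoop parameters.
SPACE = {
    "sort_buffer_mb": [50, 100, 200, 400],
    "reduce_tasks": [1, 2, 4, 8, 16],
    "compress_output": [False, True],
}


def toy_runtime(config):
    """Stand-in for running a job and measuring its runtime (lower is better)."""
    runtime = 100.0
    runtime += abs(config["sort_buffer_mb"] - 200) * 0.1
    runtime += abs(config["reduce_tasks"] - 8) * 2.0
    runtime -= 5.0 if config["compress_output"] else 0.0
    return runtime


def mutate(config, rng):
    """Propose a neighbor by resampling one randomly chosen parameter."""
    child = dict(config)
    key = rng.choice(sorted(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child


def crossover(a, b, rng):
    """Build a child by taking each parameter from one of two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def emcmc_search(objective, generations=30, pop_size=8, temperature=5.0, seed=0):
    """Evolve a population of chains; accept proposals with a Metropolis rule."""
    rng = random.Random(seed)
    population = [
        {k: rng.choice(values) for k, values in SPACE.items()}
        for _ in range(pop_size)
    ]
    costs = [objective(c) for c in population]
    best_index = costs.index(min(costs))
    best, best_cost = dict(population[best_index]), costs[best_index]

    for _ in range(generations):
        for i in range(pop_size):
            mate = population[rng.randrange(pop_size)]
            candidate = mutate(crossover(population[i], mate, rng), rng)
            cand_cost = objective(candidate)
            delta = cand_cost - costs[i]
            # Always accept improvements; accept worse candidates with
            # probability exp(-delta / temperature) to keep exploring.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                population[i], costs[i] = candidate, cand_cost
            if cand_cost < best_cost:
                best, best_cost = dict(candidate), cand_cost
    return best, best_cost


if __name__ == "__main__":
    config, runtime = emcmc_search(toy_runtime)
    print(config, runtime)
```

In a real setting each call to the objective would be an expensive job run, which is why the abstract pairs sampling with cost reduction; a cheaper proxy objective could be passed in place of `toy_runtime` here.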

research
11/04/2021

Auto Tuning of Hadoop and Spark parameters

Data of the order of terabytes, petabytes, or beyond is known as Big Dat...
research
01/06/2021

Bridging BAD Islands: Declarative Data Sharing at Scale

In many Big Data applications today, information needs to be actively sh...
research
08/22/2023

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Selecting the right resources for big data analytics jobs is hard becaus...
research
06/25/2019

Fast Data: Moving beyond from Big Data's map-reduce

Big Data may not be the solution many are looking for. The latest rise o...
research
12/05/2016

Support vector regression model for BigData systems

Nowadays Big Data are becoming more and more important. Many sectors of ...
research
07/29/2022

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In today's modern era of Big data, computationally efficient and scalabl...
research
10/17/2018

A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

With the emergence of the big data age, the issue of how to obtain valua...
