Machine Learning for Performance Prediction of Spark Cloud Applications

08/27/2021
by   Alexandre Maros, et al.
0

Big data applications and analytics are employed in many sectors for a variety of goals: improving customers satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), providing black box solutions to model the relationship between application performance and system configuration without requiring in-detail knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of today's most widely used frameworks for big data analysis. We compare our approach with Ernest (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating on bigger data set sizes. Results show that our models match or exceed Ernest's performance, sometimes enabling us to reduce the prediction error from 126-187 5-19

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/18/2020

ContainerStress: Autonomous Cloud-Node Scoping Framework for Big-Data ML Use Cases

Deploying big-data Machine Learning (ML) services in a cloud environment...
research
12/05/2016

Support vector regression model for BigData systems

Nowadays Big Data are becoming more and more important. Many sectors of ...
research
05/23/2020

Benchmarking and Performance Modelling of MapReduce Communication Pattern

Understanding and predicting the performance of big data applications ru...
research
12/07/2020

SpotTune: Leveraging Transient Resources for Cost-efficient Hyper-parameter Tuning in the Public Cloud

Hyper-parameter tuning (HPT) is crucial for many machine learning (ML) a...
research
12/10/2022

Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics

Graph processing applications are severely bottlenecked by memory system...
research
10/12/2021

Scalable machine learning in the R language using a summarization matrix

Big data analytics generally rely on parallel processing in large comput...
research
04/04/2023

Predicting the Performance-Cost Trade-off of Applications Across Multiple Systems

In modern computing environments, users may have multiple systems access...

Please sign up or login with your details

Forgot password? Click here to reset