Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning

05/30/2018
by Carl Witt, et al.

In many domains, the previous decade was characterized by increasing data volumes and growing complexity of computational workloads, creating new demands for highly data-parallel computing in distributed systems. Operating these systems effectively, e.g., for scheduling and resource allocation, is challenging when the performance of jobs and tasks under varying resource configurations is uncertain. We survey predictive performance modeling (PPM) approaches that estimate performance metrics such as execution duration, required memory, or wait times of future jobs and tasks based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of performance variation, predicted performance metrics, required training data, use cases, and the underlying prediction techniques. We conclude by identifying several open problems and pressing research needs in the field.
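To make the idea of predicting a metric from past observations concrete, the following is a minimal sketch, not a method taken from the survey. It assumes that only two externally observable quantities are available per past task (input size and measured runtime) and that runtime grows roughly linearly with input size. The observations in `history` and the helper names `fit_line` and `predict` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Estimate the metric for a future task with feature value x."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical past observations: (input size in GB, runtime in seconds).
# The workload itself is never inspected; only these values are recorded.
history = [(1.0, 12.0), (2.0, 22.0), (4.0, 42.0), (8.0, 82.0)]

model = fit_line([size for size, _ in history],
                 [runtime for _, runtime in history])

# Runtime estimate for a future task with a 16 GB input.
estimate = predict(model, 16.0)
print(f"estimated runtime: {estimate:.1f} s")
```

A linear model is chosen here only for brevity; the same fit-then-predict structure applies when a different regression technique or a different target metric, such as peak memory, is substituted.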

