Online Collection and Forecasting of Resource Utilization in Large-Scale Distributed Systems

05/22/2019
by Tiffany Tuor, et al.

Large-scale distributed computing systems often contain thousands of nodes (machines). Monitoring the condition of these nodes is important for system management, but it can be extremely resource-demanding, because it requires collecting local measurements at every node and continually sending them to a central controller. It is also often useful to forecast future system conditions for purposes such as resource planning and allocation or anomaly detection. Running one forecasting model per node, however, is usually too resource-consuming and may neglect correlations in the observed metrics across nodes. In this paper, we propose a mechanism for collecting and forecasting the resource utilization of machines in a distributed computing system in a scalable manner. We present an algorithm that allows each local node to decide when to transmit its most recent measurement to the central node, so that the transmission frequency is kept below a given constraint value. Based on the measurements received from local nodes, the central node summarizes the received data into a small number of clusters. Since the cluster partitioning can change over time, we also present a method to capture the evolution of clusters and their centroids. To reduce the amount of computation, time-series forecasting models are trained on the time-varying centroid of each cluster and used to forecast the future resource utilization of a group of local nodes. The effectiveness of the proposed approach is confirmed by extensive experiments on multiple real-world datasets.
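To make the idea of a per-node transmission decision under a frequency constraint concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simple rule: a node sends its latest measurement only if the value has drifted from the last transmitted one by more than a threshold, and only if sending would keep the running fraction of transmitting time steps at or below a budget. The class name `BudgetedSender` and the parameters `budget` and `threshold` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetedSender:
    """Hypothetical local-node policy (an assumption, not the paper's method)."""

    budget: float                      # max allowed fraction of steps with a send
    threshold: float = 0.0             # drift needed to trigger a send (assumed)
    last_sent: Optional[float] = None  # last value transmitted to the central node
    steps: int = 0                     # number of measurements observed so far
    sent: int = 0                      # number of transmissions so far

    def observe(self, value: float) -> bool:
        """Record one measurement and return True if it should be transmitted."""
        self.steps += 1
        # Sending now must keep the running transmission frequency within budget.
        within_budget = (self.sent + 1) / self.steps <= self.budget
        # Send only if nothing was sent yet or the value moved far enough.
        drifted = (
            self.last_sent is None
            or abs(value - self.last_sent) > self.threshold
        )
        if within_budget and drifted:
            self.last_sent = value
            self.sent += 1
            return True
        return False


if __name__ == "__main__":
    node = BudgetedSender(budget=0.25, threshold=0.1)
    readings = [0.1, 0.5, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4]
    print([node.observe(v) for v in readings])
    print(node.sent / node.steps)  # stays at or below the budget
```

With a budget of 0.25, this sketch transmits at most once every four steps on average. The central node would then work with the last value received from each node, which is the input the clustering and forecasting stages described in the abstract operate on.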

