Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems

08/15/2023
by Jinyang Liu, et al.

Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose to infer functional clusters of instances, i.e., groups of instances with similar functionalities. We first conduct a pilot study on a large-scale cloud system, Huawei Cloud, which demonstrates that instances with similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions instances into coarse-grained chunks based on communication patterns. Within each chunk, Prism then groups instances with similar resource usage patterns to produce fine-grained functional clusters. This design reduces noise in the data and allows Prism to process a massive number of instances efficiently. We evaluate Prism on two datasets collected from the real-world production environment of Huawei Cloud. Our experiments show that Prism achieves a v-measure of 0.95, surpassing existing state-of-the-art solutions. Additionally, we illustrate how Prism can be integrated into monitoring systems to enhance cloud reliability through two real-world use cases.
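The coarse-to-fine strategy described above can be illustrated with a minimal Python sketch. The abstract does not specify the algorithms used at either stage, so everything here is an assumption made for illustration: the coarse step is approximated by connected components of a communication graph, the fine step by a greedy distance threshold over resource usage vectors, and all names (`coarse_chunks`, `fine_clusters`, `coarse_to_fine`, `threshold`) and the sample data are hypothetical rather than taken from Prism.

```python
from collections import defaultdict
from math import dist


def coarse_chunks(instances, edges):
    """Coarse step (assumed): connected components of the communication graph."""
    parent = {i: i for i in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    chunks = defaultdict(list)
    for i in instances:
        chunks[find(i)].append(i)
    return list(chunks.values())


def fine_clusters(chunk, usage, threshold):
    """Fine step (assumed): join the first cluster whose seed is close enough."""
    clusters = []  # the first element of each cluster acts as its seed
    for inst in chunk:
        for cluster in clusters:
            if dist(usage[inst], usage[cluster[0]]) <= threshold:
                cluster.append(inst)
                break
        else:
            clusters.append([inst])  # no close seed found: start a new cluster
    return clusters


def coarse_to_fine(instances, edges, usage, threshold=0.2):
    """Run the fine step independently inside every coarse chunk."""
    result = []
    for chunk in coarse_chunks(instances, edges):
        result.extend(fine_clusters(chunk, usage, threshold))
    return result


# Hypothetical data: who talks to whom, and (CPU, memory) utilization per instance.
instances = ["web-1", "web-2", "db-1", "db-2", "batch-1"]
edges = [("web-1", "db-1"), ("web-2", "db-2"), ("web-1", "web-2")]
usage = {
    "web-1": (0.30, 0.20),
    "web-2": (0.32, 0.22),
    "db-1": (0.80, 0.70),
    "db-2": (0.78, 0.72),
    "batch-1": (0.31, 0.21),
}

clusters = coarse_to_fine(instances, edges, usage)
print(clusters)
# [['web-1', 'web-2'], ['db-1', 'db-2'], ['batch-1']]
```

In this toy input, `batch-1` has nearly the same resource usage as the web instances but never communicates with them, so the coarse step keeps it in its own chunk. Each fine-grained comparison also only ranges over one chunk, which is one way such a design could keep the work manageable on large instance counts.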


