Diffusion-based Time Series Data Imputation for Microsoft 365

08/03/2023
by Fangkai Yang, et al.

Reliability is critical for large-scale cloud systems such as Microsoft 365. Cloud failures, such as disk and node failures, threaten service reliability and lead to online service interruptions and economic loss. Existing work focuses on predicting cloud failures so that action can be taken proactively before they happen. However, these approaches suffer from poor data quality, such as missing data, in both model training and prediction, which limits their performance. In this paper, we focus on improving data quality through data imputation with the proposed Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently based on the observed data. Our experiments and application practice show that our model improves the performance of the downstream failure prediction task.

research
07/05/2022

Data Integrity Error Localization in Networked Systems with Missing Data

Most recent network failure diagnosis systems focused on data center net...
research
10/15/2022

Failure Analysis of Big Cloud Service Providers Prior to and During Covid-19 Period

Cloud services are important for societal function such as healthcare, c...
research
07/02/2019

Sample Adaptive Multiple Kernel Learning for Failure Prediction of Railway Points

Railway points are among the key components of railway infrastructure. A...
research
10/06/2021

Cloud Failure Prediction with Hierarchical Temporal Memory: An Empirical Assessment

Hierarchical Temporal Memory (HTM) is an unsupervised learning algorithm...
research
11/21/2022

First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

Dynamic random access memory failures are a threat to the reliability of...
research
11/05/2020

Prediction of Future Failures for Heterogeneous Reliability Field Data

This article introduces methods for constructing prediction bounds or in...
research
08/20/2021

AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems

Service reliability is one of the key challenges that cloud providers ha...
