Highly Efficient Memory Failure Prediction using Mcelog-based Data Mining and Machine Learning

04/24/2021
by Chengdong Yao, et al.

In data centers, unexpected downtime caused by memory failures can degrade the stability of individual servers and even of the entire information technology infrastructure, which harms the business. Accurately predicting memory failures in advance has therefore become one of the most important problems to study in the data center. Predicting memory failures in a production system, however, requires solving technical problems such as heavy data noise and an extreme imbalance between positive and negative samples, while also ensuring the long-term stability of the algorithm. This paper compares and summarizes some commonly used techniques and the improvements they can bring. Our proposed single model placed 14th in the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th PAKDD conference. It takes only 30 minutes to pass the online test, while most other contestants' solutions need more than 3 hours. The code has been open-sourced at https://www.github.com/ycd2016/acaioc2.
