AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn

06/03/2023
by Zhentao Xu, et al.

Data-driven companies use AI models extensively to develop products and intelligent business solutions, making the health of these models crucial for business success. Model monitoring and alerting in industry pose unique challenges, including the lack of a clear definition of model health metrics, label sparsity, and fast model iteration that results in short-lived models and features. As a product, a monitoring system must also be scalable, generalizable, and explainable. To tackle these challenges, we propose AlerTiger, a deep-learning-based MLOps model monitoring system that helps AI teams across the company monitor their AI models' health by detecting anomalies in the models' input features and output scores over time. The system consists of four major steps: model statistics generation, deep-learning-based anomaly detection, anomaly post-processing, and user alerting. Our solution generates three categories of statistics to indicate AI model health, offers a two-stage deep anomaly detection solution that addresses label sparsity and generalizes to monitoring new models, and provides holistic reports for actionable alerts. This approach has been deployed to most of LinkedIn's production AI models for over a year and has identified several model issues whose fixes later led to significant business metric gains.
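To make the four-step flow concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is not the AlerTiger implementation: the abstract does not specify the three statistic categories or the two-stage deep model, so this sketch assumes a single daily mean as the statistic and swaps in a simple rolling z-score detector as a stand-in for the deep-learning stage. All function names, parameters, and the alert format are hypothetical.

```python
from statistics import mean, stdev


def generate_statistics(daily_values):
    """Step 1 (assumed): reduce each day's raw feature values to one statistic, the mean."""
    return [mean(day) for day in daily_values]


def detect_anomalies(series, window=5, threshold=3.0):
    """Step 2 (stand-in): rolling z-score detector, NOT the paper's deep model.

    A point is flagged when it deviates from the mean of the previous
    `window` points by more than `threshold` standard deviations.
    """
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history to judge this point
            continue
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            flags.append(value != mu)
        else:
            flags.append(abs(value - mu) / sigma > threshold)
    return flags


def post_process(flags):
    """Step 3 (assumed): merge consecutive flagged days into (start, end) intervals."""
    intervals = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(flags) - 1))
    return intervals


def build_alert(model_name, feature_name, intervals):
    """Step 4 (assumed): format a readable alert, or return None if nothing fired."""
    if not intervals:
        return None
    spans = ", ".join(f"day {s}" if s == e else f"days {s}-{e}" for s, e in intervals)
    return f"[{model_name}] feature '{feature_name}' anomalous on {spans}"


# Hypothetical data: eight days of values for one input feature, with a spike on day 6.
daily_values = [
    [1.0, 1.0], [1.2, 1.0], [0.8, 1.0], [1.0, 1.0],
    [1.2, 1.0], [0.8, 1.0], [5.0, 5.0], [1.0, 1.0],
]
stats = generate_statistics(daily_values)
flags = detect_anomalies(stats)
intervals = post_process(flags)
alert = build_alert("example_model", "example_feature", intervals)
```

Grouping flagged days into intervals is one simple way to keep a sustained shift from producing one alert per day; the actual post-processing in the system may differ.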
