Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

06/04/2023
by Dezhan Tu, et al.

Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings, keeping data up to date so that ML models can be re-trained regularly and BI dashboards refreshed frequently. However, data quality (DQ) issues often creep into recurring pipelines because of upstream schema and data drift over time. Because modern enterprises operate thousands of recurring pipelines, data engineers today must spend substantial effort manually monitoring and resolving DQ issues as part of their DataOps and MLOps practices. Given the high human cost of managing large-scale pipeline operations, it is imperative to automate as much of this work as possible. In this work, we propose Auto-Validate-by-History (AVH), which automatically detects DQ issues in recurring pipelines by leveraging rich statistics from historical executions. We formalize this as an optimization problem and develop constant-factor approximation algorithms with provable precision guarantees. Extensive evaluations using 2000 production data pipelines at Microsoft demonstrate the effectiveness and efficiency of AVH.
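
To make the idea of validating a new run against historical statistics concrete, here is a minimal sketch in Python. It is not the AVH algorithm: it swaps in a plain k-sigma range check on a single metric, and the names `fit_constraint`, `validate`, the threshold `k`, and the row counts are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def fit_constraint(history, k=3.0):
    """Derive a [low, high] range for one metric from past runs.

    Assumption: a simple k-sigma rule stands in for the optimization
    described in the abstract; AVH itself is not implemented here.
    """
    mu = mean(history)
    sigma = pstdev(history)
    return (mu - k * sigma, mu + k * sigma)


def validate(value, constraint):
    """Return True if the new run's metric falls inside the learned range."""
    low, high = constraint
    return low <= value <= high


# Hypothetical row counts observed in ten past daily runs of one pipeline.
row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015, 1000, 985, 1010]
constraint = fit_constraint(row_counts)

print(validate(1008, constraint))  # a typical day passes the check
print(validate(120, constraint))   # a sharp drop in rows is flagged
```

A fixed `k` trades false alarms against missed issues by hand; the abstract's framing as an optimization problem with precision guarantees suggests that choice is made automatically in AVH, which this sketch does not attempt.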

research
04/10/2021

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Complex data pipelines are increasingly common in diverse applications s...
research
03/10/2023

Moving Fast With Broken Data

Machine learning (ML) models in production pipelines are frequently retr...
research
06/25/2021

Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search

Recent work has made significant progress in helping users to automate s...
research
03/30/2021

Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

Machine learning (ML) is now commonplace, powering data-driven applicati...
research
06/21/2023

Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph

Business Intelligence (BI) is crucial in modern enterprises and billion-...
research
02/09/2023

REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines

Nowadays, machine learning (ML) plays a vital role in many aspects of ou...
research
03/19/2022

METL: a modern ETL pipeline with a dynamic mapping matrix

Modern ETL streaming pipelines extract data from various sources and for...
