Picket: Self-supervised Data Diagnostics for ML Pipelines

06/08/2020
by Zifan Liu, et al.

Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inference. We present Picket, a first-of-its-kind system that enables data diagnostics for machine learning pipelines over tabular data. Picket safeguards against data corruptions that degrade a model during either training or deployment. For the training stage, Picket identifies erroneous training examples that can result in a biased model. For the deployment stage, Picket flags corrupted query points whose noise would cause a trained machine learning model to make incorrect predictions. Picket is built around a novel self-supervised deep learning model for mixed-type tabular data. Learning this model is fully unsupervised, which minimizes the burden of deployment, and Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data under different corruption models, including systematic and adversarial noise. We show that Picket offers consistently accurate diagnostics during both training and deployment of various models, ranging from SVMs to neural networks, and outperforms competing methods of data quality validation in machine learning pipelines.
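
The abstract does not describe Picket's architecture, so the following is only a rough illustration of the general idea of self-supervised diagnostics on tabular data, not the authors' method. It assumes purely numeric columns and swaps in a linear least-squares predictor for the deep model: each column is predicted from the remaining columns, and a row is scored by its worst standardized prediction error. The function name `reconstruction_scores` and the synthetic data are hypothetical.

```python
import numpy as np


def reconstruction_scores(reference, rows):
    """Score each row by how badly any one cell is predicted from the other cells.

    Illustrative stand-in only: a linear model per column, numeric data only.
    """
    mean = reference.mean(axis=0)
    std = reference.std(axis=0) + 1e-12
    z_ref = (reference - mean) / std
    z_rows = (rows - mean) / std
    n_cols = reference.shape[1]
    scores = np.zeros(len(rows))
    for j in range(n_cols):
        others = [k for k in range(n_cols) if k != j]
        # Pretext task: predict column j from the remaining columns (with a bias term).
        design_ref = np.column_stack([z_ref[:, others], np.ones(len(z_ref))])
        weights, *_ = np.linalg.lstsq(design_ref, z_ref[:, j], rcond=None)
        # Typical error size on the reference data, used to standardize new errors.
        ref_scale = (z_ref[:, j] - design_ref @ weights).std() + 1e-12
        design_rows = np.column_stack([z_rows[:, others], np.ones(len(z_rows))])
        errors = np.abs(z_rows[:, j] - design_rows @ weights) / ref_scale
        # Keep the worst standardized error seen across columns for each row.
        scores = np.maximum(scores, errors)
    return scores


def make_rows(n, rng):
    """Synthetic table with three strongly correlated numeric columns (assumed data)."""
    x = rng.normal(size=n)
    return np.column_stack([
        x,
        2.0 * x + 0.1 * rng.normal(size=n),
        -x + 0.1 * rng.normal(size=n),
    ])


rng = np.random.default_rng(0)
reference = make_rows(500, rng)   # data assumed to be clean
clean = make_rows(5, rng)         # unseen clean query points
corrupt = clean.copy()
corrupt[:, 1] += 5.0              # inject a large error into one cell per row

clean_scores = reconstruction_scores(reference, clean)
corrupt_scores = reconstruction_scores(reference, corrupt)
print("clean  :", np.round(clean_scores, 2))
print("corrupt:", np.round(corrupt_scores, 2))
```

In this toy setting the corrupted rows score far above the clean ones, so a simple threshold on the score would separate them. A system like the one described would additionally need to handle categorical and text attributes and corruption in the training data itself, which this sketch does not attempt.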

research
02/21/2022

Toward more generalized Malicious URL Detection Models

This paper reveals a data bias issue that can severely affect the perfor...
research
11/06/2020

Underspecification Presents Challenges for Credibility in Modern Machine Learning

ML models often exhibit unexpectedly poor behavior when they are deploye...
research
06/12/2020

dagger: A Python Framework for Reproducible Machine Learning Experiment Orchestration

Many research directions in machine learning, particularly in deep learn...
research
08/24/2018

Unknown Examples & Machine Learning Model Generalization

Over the past decades, researchers and ML practitioners have come up wit...
research
10/21/2021

Self-Supervised Visual Representation Learning Using Lightweight Architectures

In self-supervised learning, a model is trained to solve a pretext task,...
research
03/10/2023

Moving Fast With Broken Data

Machine learning (ML) models in production pipelines are frequently retr...
research
08/11/2021

Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems

The industrial machine learning pipeline requires iterating on model fea...
