A Unified Framework for Task-Driven Data Quality Management

06/10/2021
by Tianhao Wang, et al.

High-quality data is critical for training performant Machine Learning (ML) models, which highlights the importance of Data Quality Management (DQM). Existing DQM schemes often cannot satisfactorily improve ML performance because, by design, they are oblivious to downstream ML tasks. They also cannot handle various data quality issues (especially those caused by adversarial attacks) and apply only to certain types of ML models. Recently, data valuation approaches (e.g., based on the Shapley value) have been leveraged to perform DQM; yet, empirical studies have observed that their performance varies considerably with the underlying data and training process. In this paper, we propose a task-driven, multi-purpose, model-agnostic DQM framework, DataSifter, which is optimized towards a given downstream ML task, capable of effectively removing data points with various defects, and applicable to diverse models. Specifically, we formulate DQM as an optimization problem and devise a scalable algorithm to solve it. Furthermore, we propose a theoretical framework for comparing the worst-case performance of different DQM strategies. Remarkably, our results show that the popular strategy based on the Shapley value may end up choosing the worst data subset in certain practical scenarios. Our evaluation shows that DataSifter matches, and most often significantly improves on, state-of-the-art performance over a wide range of DQM tasks, including backdoor detection, poisoning detection, noisy or mislabeled data detection, data summarization, and data debiasing.
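For readers unfamiliar with the Shapley-value baseline mentioned in the abstract, the following is a minimal sketch of exact Shapley data valuation on a toy problem. It is not DataSifter and not the authors' code; the function names `shapley_values` and `toy_utility`, and the additive utility itself, are hypothetical choices made purely for illustration. In practice the utility would be something like the validation accuracy of a model trained on the chosen subset, and exact enumeration would be replaced by an approximation because it is exponential in the number of points.

```python
from itertools import combinations
from math import comb


def shapley_values(n, utility):
    """Exact Shapley value of each of n data points under `utility`.

    `utility` maps a frozenset of point indices to a score. The cost is
    exponential in n, so this is only usable for tiny illustrations.
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of point i to coalition s.
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def toy_utility(subset):
    """Hypothetical additive utility: points 0 and 1 help, point 2 hurts."""
    gains = {0: 1.0, 1: 1.0, 2: -1.0}
    return sum(gains[i] for i in subset)


values = shapley_values(3, toy_utility)
# Rank points from lowest to highest value; the lowest are removal candidates.
ranking = sorted(range(3), key=lambda i: values[i])
```

Because this toy utility is additive, each point's Shapley value equals its own gain, so the point with the negative gain is ranked first for removal. The abstract's claim concerns utilities where such a ranking can mislead, which this simple example does not attempt to reproduce.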

