Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

04/16/2023
by Yongchan Kwon, et al.

Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been regarded as infeasible. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are 10^6 samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has a solid theoretical interpretation in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
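
The idea of valuing data with an out-of-bag estimate can be illustrated with a minimal sketch. The abstract does not state the exact scoring rule, so the code below assumes one plausible reading: each point's value is the fraction of weak learners that classify it correctly, counting only learners whose bootstrap sample did not contain that point. The nearest-centroid weak learner and the names `fit_centroids`, `predict_centroids`, and `data_oob_values` are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Hypothetical weak learner: one mean vector (centroid) per class."""
    centroids = np.full((n_classes, X.shape[1]), np.nan)
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            centroids[c] = X[mask].mean(axis=0)
    return centroids

def predict_centroids(centroids, X):
    """Assign each row of X to the class with the nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # A class missing from the bootstrap sample has a NaN centroid; never pick it.
    d = np.where(np.isnan(d), np.inf, d)
    return d.argmin(axis=1)

def data_oob_values(X, y, n_learners=200, seed=0):
    """Out-of-bag data values under the assumed scoring rule (correctness)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_classes = int(y.max()) + 1
    score_sum = np.zeros(n)   # correct out-of-bag predictions per point
    oob_count = np.zeros(n)   # number of learners for which the point was out-of-bag
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, with replacement
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                   # points not drawn are out-of-bag
        if not oob.any():
            continue
        model = fit_centroids(X[idx], y[idx], n_classes)
        pred = predict_centroids(model, X[oob])
        score_sum[oob] += (pred == y[oob])
        oob_count[oob] += 1
    # Points that were never out-of-bag receive NaN instead of a value.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(oob_count > 0, score_sum / oob_count, np.nan)

# Synthetic demo: two Gaussian blobs with 20 labels flipped on purpose.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
flipped = rng.choice(200, size=20, replace=False)
y[flipped] = 1 - y[flipped]
values = data_oob_values(X, y)
```

In this toy setting the flipped points should tend to receive lower values than the clean ones, which is the behaviour a mislabeled-data detector relies on. Each weak learner is trained once and then reused to score every point it left out, so no model is retrained per data point.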


