A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"

04/09/2023
by Jiachen T. Wang, et al.

Data valuation is a growing research field that studies the influence of individual data points on machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding procedure for computing the Data Shapley of KNN classifiers and regressors under the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality-sensitive hashing (LSH). Our experimental results demonstrate that soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.
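
For orientation, the sketch below illustrates the original hard-label recursion of Jia et al. (2019) for an unweighted KNN classifier and a single test point. It is not the soft-label variant proposed in this note, whose utility function is not given in the abstract. The function name `knn_shapley_single`, its arguments, and the choice of Euclidean distance are assumptions made for illustration.

```python
import numpy as np

def knn_shapley_single(X_train, y_train, x_test, y_test, K):
    """Hard-label KNN Shapley values for one test point (illustrative sketch).

    Assumes the utility of a subset is the fraction of its K nearest
    neighbors (divided by K) whose label equals y_test, as in the
    unweighted classifier setting of Jia et al. (2019).
    """
    N = len(y_train)
    # Sort training points by Euclidean distance to the test point (assumed metric).
    dist = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dist, kind="stable")
    match = (y_train[order] == y_test).astype(float)

    s = np.zeros(N)
    # Base case: the farthest training point.
    s[N - 1] = match[N - 1] / N
    # Recurse from the second-farthest point back to the nearest one.
    for i in range(N - 2, -1, -1):
        rank = i + 1  # 1-indexed rank of the point in the sorted order
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, rank) / rank

    # Map the values back to the original training-set order.
    values = np.zeros(N)
    values[order] = s
    return values

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1, 0, 1, 0])
values = knn_shapley_single(X_train, y_train, np.array([0.1]), 1, K=2)
```

The dominant cost is the sort, so one test point takes O(N log N) time. With several test points, the per-point values would be averaged. In this toy example the values sum to the utility of the full training set, as the efficiency property of the Shapley value requires.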

research
08/22/2019

Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

Given a data set D containing millions of data points and a data consume...
research
10/31/2022

Learning New Tasks from a Few Examples with Soft-Label Prototypes

It has been experimentally demonstrated that humans are able to learn in...
research
11/17/2019

An Empirical and Comparative Analysis of Data Valuation with Scalable Algorithms

This paper focuses on valuating training data for supervised learning ta...
research
02/22/2023

A Note on "Towards Efficient Data Valuation Based on the Shapley Value"

The Shapley value (SV) has emerged as a promising method for data valuat...
research
01/12/2021

Locality Sensitive Hashing for Efficient Similar Polygon Retrieval

Locality Sensitive Hashing (LSH) is an effective method of indexing a se...
research
09/06/2013

Convergence of Nearest Neighbor Pattern Classification with Selective Sampling

In the panoply of pattern classification techniques, few enjoy the intui...
research
03/08/2017

Leveraging Sparsity for Efficient Submodular Data Summarization

The facility location problem is widely used for summarizing large datas...
