LAVA: Data Valuation without Pre-Specified Learning Algorithms

04/28/2023
by   Hoang Anh Just, et al.

Traditionally, data valuation is posed as the problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the computed data values depend on many design choices of the underlying learning algorithm. This dependence is undesirable for many uses of data valuation, such as setting priorities over different data sources during data acquisition or informing pricing mechanisms in a data marketplace: in these scenarios, data must be valued before the actual analysis, when the learning algorithm has not yet been chosen. A further side effect of this dependence is that assessing the value of an individual point requires re-running the learning algorithm with and without that point, which incurs a large computational burden. This work leapfrogs these limits by introducing a new framework that values training data in a way that is oblivious to the downstream learning algorithm. (1) We develop a proxy for the validation performance associated with a training set, based on a non-conventional class-wise Wasserstein distance between the training and validation sets, and show that this distance characterizes an upper bound on the validation performance of any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data points via a sensitivity analysis of the class-wise Wasserstein distance; importantly, these values are obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate the new data valuation framework on various use cases related to detecting low-quality data and show that, surprisingly, its learning-agnostic nature enables a significant improvement over state-of-the-art performance while being orders of magnitude faster.
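To make point (2) concrete, here is a minimal sketch (not the authors' implementation) of how dual variables from an off-the-shelf solver can yield per-point values for free: we solve the discrete optimal-transport LP between uniform empirical distributions on training and validation features using SciPy's HiGHS backend, then read the training-side dual potentials from the solver output. The function names (`ot_dual_values`, `calibrated_values`), the plain Euclidean ground cost, and the exact sign convention of the duals are illustrative assumptions; the paper's class-wise distance additionally incorporates label information.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def ot_dual_values(train_feats, val_feats):
    """Solve the discrete OT linear program between uniform distributions
    on train/val features; return (OT cost, dual potentials on train points)."""
    n, m = len(train_feats), len(val_feats)
    C = cdist(train_feats, val_feats)            # Euclidean ground cost matrix
    # Variables: transport plan P flattened row-major (n*m entries).
    # Equality constraints: row sums = 1/n, column sums = 1/m.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0         # mass leaving train point i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                  # mass arriving at val point j
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    # HiGHS exposes the dual solution of the equality constraints; the first
    # n entries are the potentials attached to the training points.
    return res.fun, res.eqlin.marginals[:n]

def calibrated_values(duals):
    """Calibrated gradients in the spirit of LAVA: sensitivity of the distance
    to shifting probability mass onto point i (sign convention assumed here)."""
    n = len(duals)
    total = duals.sum()
    return np.array([f - (total - f) / (n - 1) for f in duals])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train, val = rng.normal(size=(8, 2)), rng.normal(size=(5, 2))
    cost, duals = ot_dual_values(train, val)
    print(cost, calibrated_values(duals))
```

Note that a single LP solve produces both the distance and all n per-point sensitivities at once, which is the source of the "for free" claim: no retraining or leave-one-out re-computation is involved.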

