A Distributional Framework for Data Valuation

02/27/2020
by Amirata Ghorbani, et al.

Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for statistical aspects of the data and does not give a way to reason about points outside the data set. To address these limitations, we propose a novel framework – distributional Shapley – where the value of a point is defined in the context of an underlying data distribution. We prove that distributional Shapley has several desirable statistical properties; for example, the values are stable under perturbations to the data points themselves and to the underlying data distribution. We leverage these properties to develop a new algorithm for estimating values from data, which comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms for computing the (non-distributional) data Shapley values. We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
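To make the idea concrete, here is a minimal Monte Carlo sketch, not the paper's algorithm. It assumes the distributional value of a point z is the expected marginal gain U(S ∪ {z}) − U(S), where the cardinality k is drawn uniformly from {1, ..., m} and S is k − 1 points drawn i.i.d. from the data distribution. The toy utility, the standard normal data distribution, and all function names are hypothetical choices made for illustration only.

```python
import random
from statistics import fmean


def utility(points, target=0.0):
    """Toy utility (assumption): how close the sample mean is to a known target.

    Returns a score in (0, 1]; the empty set scores 0 by convention.
    """
    if not points:
        return 0.0
    return 1.0 / (1.0 + (fmean(points) - target) ** 2)


def distributional_shapley(z, sample_point, utility_fn, m, num_iters=2000, seed=0):
    """Monte Carlo estimate of the expected marginal contribution of z.

    Each iteration draws a cardinality k uniformly from {1, ..., m}, draws
    k - 1 points i.i.d. via sample_point, and records the change in utility
    when z is added to them.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_iters):
        k = rng.randint(1, m)  # inclusive on both ends
        subset = [sample_point(rng) for _ in range(k - 1)]
        total += utility_fn(subset + [z]) - utility_fn(subset)
    return total / num_iters


def sample_standard_normal(rng):
    """Hypothetical data distribution: standard normal scalars."""
    return rng.gauss(0.0, 1.0)


# A point near the distribution's mean versus a far outlier.
typical = distributional_shapley(0.0, sample_standard_normal, utility, m=20)
outlier = distributional_shapley(5.0, sample_standard_normal, utility, m=20)
print(typical, outlier)
```

Because the estimator only needs a sampler for the distribution, the same call can score a point that never appeared in any fixed data set, which is the property the abstract highlights. In this toy setting the point at the mean should receive a higher value than the outlier.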


