Data Shapley: Equitable Valuation of Data for Machine Learning

04/05/2019
by Amirata Ghorbani, et al.

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what constitutes an equitable valuation of an individual's data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor's performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, data Shapley has several other benefits, as extensive experiments across biomedical, image, and synthetic data demonstrate: 1) it is more powerful than the popular leave-one-out or leverage scores in providing insight into which data are more valuable for a given learning task; 2) data with low Shapley values effectively capture outliers and corruptions; 3) data with high Shapley values inform what type of new data to acquire to improve the predictor.
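
To make the Monte Carlo idea concrete, the following is a minimal sketch of plain permutation sampling for Shapley values. It is not the authors' implementation: it averages each training point's marginal contribution to a utility score over random orderings of the data. The function name `monte_carlo_shapley`, the toy 1-nearest-neighbor utility, the tiny dataset, and the choice of 0.0 as the utility of the empty set are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n: int,
    utility: Callable[[Sequence[int]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions
    over random permutations of the n training indices."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        subset: List[int] = []
        prev = utility(subset)
        for i in perm:
            subset.append(i)
            curr = utility(subset)
            totals[i] += curr - prev  # marginal contribution of point i
            prev = curr
    return [t / num_permutations for t in totals]


# Hypothetical toy data: 1-D features, binary labels.
# The last training point sits in the class-0 cluster but is labeled 1.
train_x = [0.0, 0.2, 0.1, 1.0, 1.2, 0.15]
train_y = [0, 0, 0, 1, 1, 1]
val_x = [0.05, 0.12, 0.18, 1.1, 0.9]
val_y = [0, 0, 0, 1, 1]


def knn_accuracy(subset: Sequence[int]) -> float:
    """Utility: validation accuracy of a 1-nearest-neighbor predictor
    trained on the given subset (assumed to be 0.0 for the empty set)."""
    if not subset:
        return 0.0
    correct = 0
    for x, y in zip(val_x, val_y):
        # Break distance ties by index so the result ignores subset order.
        nearest = min(subset, key=lambda j: (abs(x - train_x[j]), j))
        correct += int(train_y[nearest] == y)
    return correct / len(val_x)


values = monte_carlo_shapley(len(train_x), knn_accuracy)
for idx, v in enumerate(values):
    print(f"point {idx}: estimated value {v:+.3f}")
```

Because the marginal contributions within each permutation telescope, the estimates always sum to the utility of the full training set minus the utility of the empty set, which gives a quick sanity check. Each permutation costs n utility evaluations (here, n retrainings), which is why cheaper approximations matter for large models.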

