Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

06/20/2022
by   Jinkun Lin, et al.
0

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions (k ≪ N datapoints have large AME), our estimator requires only O(klog N) randomized submodel trainings, improving upon the best prior Shapley value estimators.

READ FULL TEXT

page 9

page 29

page 30

page 33

page 34

research
04/28/2021

Finding High-Value Training Data Subset through Differentiable Convex Programming

Finding valuable training data points for deep neural networks has been ...
research
03/18/2021

Decision Theoretic Bootstrapping

The design and testing of supervised machine learning models combine two...
research
05/17/2020

Robust subset selection

The best subset selection (or "best subsets") estimator is a classic too...
research
02/09/2022

Agree to Disagree: Diversity through Disagreement for Better Transferability

Gradient-based learning algorithms have an implicit simplicity bias whic...
research
12/07/2020

Explainable Artificial Intelligence: How Subsets of the Training Data Affect a Prediction

There is an increasing interest in and demand for interpretations and ex...
research
08/08/2019

Robust Causal Inference for Incremental Return on Ad Spend with Randomized Geo Experiments

Evaluating the incremental return on ad spend (iROAS) of a prospective o...
research
08/06/2019

Semiparametric Wavelet-based JPEG IV Estimator for endogenously truncated data

A new and an enriched JPEG algorithm is provided for identifying redunda...

Please sign up or login with your details

Forgot password? Click here to reset