Data Valuation Without Training of a Model

01/03/2023
by Nohyun Ki, et al.

Many recent works on understanding deep learning attempt to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap when the instance is removed from the dataset. Such approaches reveal the characteristics and importance of individual instances, which can provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often demands a high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and we also provide applications of the score in analyzing datasets and diagnosing training dynamics.
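The score's exact definition is not reproduced on this page. As a rough, non-authoritative sketch of the idea, the snippet below assumes the complexity-gap score is a leave-one-out gap in an NTK-based data-dependent complexity measure y^T (H^inf)^{-1} y for a two-layer ReLU network; the precise formulation, normalization, and efficient computation are given in the paper, and the function names here are illustrative only.

```python
import numpy as np

def ntk_gram(X):
    """Infinite-width NTK-style Gram matrix for a two-layer ReLU network.

    Assumes rows of X are unit-norm inputs;
    H_ij = (x_i . x_j) * (pi - arccos(x_i . x_j)) / (2 * pi).
    """
    S = np.clip(X @ X.T, -1.0, 1.0)            # pairwise inner products
    return S * (np.pi - np.arccos(S)) / (2 * np.pi)

def complexity(H, y, reg=1e-6):
    """Data-dependent complexity y^T H^{-1} y (small ridge term for stability)."""
    n = len(y)
    return float(y @ np.linalg.solve(H + reg * np.eye(n), y))

def complexity_gap_scores(X, y):
    """Leave-one-out gap in the complexity measure for each training instance
    (assumed form of the score; see the paper for the exact definition)."""
    H = ntk_gram(X)
    full = complexity(H, y)
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        scores[i] = full - complexity(H[np.ix_(keep, keep)], y[keep])
    return scores

# Toy usage: unit-normalized inputs with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
print(complexity_gap_scores(X, y)[:5])
```

Note that this brute-force leave-one-out loop is only for illustration; no model is trained, since the Gram matrix depends on the data alone.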


Related research

02/08/2020 · Exploring the Memorization-Generalization Continuum in Deep Learning
05/19/2021 · When Deep Classifiers Agree: Analyzing Correlations between Learning Order and Image Statistics
07/22/2020 · InstanceFlow: Visualizing the Evolution of Classifier Confusion on the Instance Level
05/19/2022 · Dataset Pruning: Reducing Training Data by Examining Generalization Influence
09/17/2020 · Finding Influential Instances for Distantly Supervised Relation Extraction
03/23/2022 · An Empirical Study of Memorization in NLP
06/18/2019 · Information matrices and generalization
