Dataset Pruning: Reducing Training Data by Examining Generalization Influence

by Shuo Yang et al.

The great success of deep learning heavily relies on increasingly large training data, which comes at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples in CIFAR-10, halves the convergence time, and incurs only a 1.3% test accuracy drop, which is superior to previous score-based sample selection methods.





1 Introduction

The great advances in deep learning over the past decades have been powered by ever-bigger models crunching ever-bigger amounts of data. However, this success comes at the price of huge computational and infrastructural costs for network training, network inference, hyper-parameter tuning, and model architecture search. While many research efforts seek to reduce the network inference cost by pruning redundant parameters networkpruningblalock2020state; networkpruningliu2018rethinking; networkpruningmolchanov2019importance, scant attention has been paid to the data redundancy problem, which is crucial for the efficiency of network training and parameter tuning.

Massive training data has become a standard paradigm for deep learning, but it also ties the success of deep models to specialized equipment and infrastructure. For example, a single training run of EfficientNet xie2020self on the JFT-300M dataset sun2017revisiting takes 12,300 TPU days and consumes an enormous amount of energy. Moreover, hyper-parameter and network architecture tuning further multiplies the computation and energy cost. An effective way to deal with this is to construct an informative subset of the original large-scale training data as a proxy training dataset.

Previous literature tries to sort and select a fraction of training data according to a scalar score computed based on some criterion, such as the distance to the class center welling2009herding; rebuffi2017icarl; castro2018end; belouadah2020scail, the distance to other selected examples wolf2011kcenter; sener2018active, the forgetting score toneva2018empirical, and the gradient norm paul2021deep. However, these methods are (a) heuristic and lack theoretically guaranteed generalization, and (b) discard the influence direction, i.e., the norm of the averaged gradient vector of two high-gradient-norm samples could be zero if the directions of these two samples' gradients are opposite.

To go beyond these limitations, a natural question to ask is: how much does a particular combination of training examples contribute to the model's generalization? However, simply evaluating the test performance drop caused by removing each possible subset is intractable, because it requires re-training the model $2^n$ times for a dataset of size $n$. Therefore, the key challenges of dataset pruning are: (1) how to efficiently estimate the generalization influence of all possible subsets without iteratively re-training the model, and (2) how to identify the smallest subset of the original training data with a strictly constrained generalization gap.

In this work, we present an optimization-based dataset pruning method that can prune the largest possible subset from the entire training dataset with (a) a theoretically guaranteed generalization gap and (b) consideration of the joint influence of all collected data. Specifically, we define the parameter influence of a training example as the model's parameter change caused by omitting the example from training. The parameter influence can be linearly approximated without re-training the model via the Influence Function influencefunction. Then, we formulate a constrained discrete optimization problem with the objective of maximizing the number of collected samples, and a constraint penalizing the network parameter change caused by removing the collected subset to stay within a given threshold $\epsilon$. We then conduct extensive theoretical and empirical analysis to show that the generalization gap of the proposed dataset pruning can be upper bounded by the pre-defined threshold $\epsilon$. Superior to previous score-based sample selection methods, our proposed method prunes 40% of the training examples in CIFAR-10 cifar10, halves the convergence time, and incurs only a 1.3% test accuracy drop. Before delving into details, we summarize our contributions below:


  • This paper proposes a dataset pruning method, which extends the sample selection problem from a cardinality constraint to a generalization constraint. Specifically, previous sample selection methods find a data subset with a fixed size budget and try to maximize its performance, while dataset pruning tries to identify the smallest subset that satisfies the expected generalization ability.

  • This paper proposes to leverage the influence function to approximate the network parameter change caused by omitting each individual training example. Then an optimization-based method is proposed to identify the largest subset that satisfies the expected parameter change.

  • We prove that the generalization gap caused by removing the identified subset can be upper-bounded by the network parameter change that was strictly constrained during the dataset pruning procedure. The observed empirical results are substantially consistent with our theoretical expectation.

  • The experimental results on dataset pruning and neural architecture search (NAS) demonstrate that the proposed dataset pruning method is extremely effective on improving the network training and architecture search efficiency.

The rest of this paper is organized as follows. In Section 2, we briefly review existing sample selection and dataset condensation research and discuss the major differences between our proposed dataset pruning and previous methods. In Section 3, we present the formal definition of the dataset pruning problem. In Section 4 and Section 5, our optimization-based dataset pruning method and its generalization bound are introduced. In Section 6, we conduct extensive experiments to verify the validity of the theoretical results and the effectiveness of dataset pruning. Finally, in Section 7, we conclude the paper.

2 Related Works

Dataset pruning is orthogonal to few-shot learning maml; matching; snell2017prototypical; closer; DeepEMD; yang2021free; yang2021bridging; he2021partimagenet; zhang2022graph; yang2021single. Few-shot learning aims at improving the performance given limited training data, while dataset pruning aims at reducing the training data without hurting the performance much.

Dataset pruning is closely related to data selection methods, which try to identify the most representative training samples. Classical data selection methods agarwal2004approximating; har2004coresets; feldman2013turning; yang2021noise; wang2022reliable focus on clustering problems. Recently, more and more data selection methods have been proposed in continual learning rebuffi2017icarl; toneva2019empirical; castro2018end; aljundi2019gradient; yang2021objects and active learning sener2017active to identify which examples need to be stored or labeled. These data selection methods typically rely on a pre-defined criterion to compute a scalar score for each training example, e.g. compactness rebuffi2017icarl; castro2018end; yang2019ada, diversity sener2018active; aljundi2019gradient, and forgetfulness toneva2019empirical, and then rank and select the training data according to the computed score. However, these methods are heuristic and lack a generalization guarantee, and they also discard the influence interactions between the collected samples. Our proposed dataset pruning method overcomes these shortcomings.

Another line of work on reducing training data is dataset distillation wang2018dataset; such2020generative; sucholutsky2019softlabel; bohdal2020flexible; nguyen2021dataset1; nguyen2021dataset2; cazenavette2022dataset or dataset condensation zhao2021DC; zhao2021DSA; zhao2021distribution; wang2022cafe. This series of works focuses on synthesizing a small but informative dataset as an alternative to the original large dataset. In particular, Dataset Distillation wang2018dataset optimizes the synthetic samples by directly minimizing, on the real training data, the classification loss of neural networks trained on the synthetic data. Later, Sucholutsky et al. sucholutsky2019softlabel proposed to simultaneously optimize the synthetic images and soft labels, Bohdal et al. bohdal2020flexible proposed to simplify dataset distillation by only learning the soft labels for randomly selected real images, Such et al. such2020generative proposed to leverage generative models to produce the synthetic data, Nguyen et al. nguyen2021dataset1; nguyen2021dataset2 proposed to formulate dataset distillation as kernel ridge regression and induce the synthetic images on infinitely wide neural networks, Cazenavette et al. cazenavette2022dataset proposed to learn the synthetic images by matching training trajectories, and Zhao et al. zhao2021DC; zhao2021DSA; zhao2021distribution proposed to learn the synthetic images by matching gradients and features. However, due to computational limitations, these methods usually only synthesize an extremely small number of examples (e.g. 50 images per class), and the resulting performance is far from satisfactory. Therefore, the performance of dataset distillation and dataset pruning is not directly comparable.

Our method is inspired by the Influence Function influencefunction in statistical machine learning. If removing a training example from the training dataset does not damage generalization, the example has a small influence on the expected test loss. Earlier works focused on studying the influence of removing training points from linear models, and later works extended this to more general models hampel2011robust; cook1986assessment; thomas1990assessing; chatterjee1986influential; wei1998generalized. Liu et al. liu2014efficient used influence functions to study model robustness and to do fast cross-validation in kernel methods. Kabra et al. kabra2015understanding defined a different notion of influence that is specialized to finite hypothesis classes. Koh et al. influencefunction studied the influence of up-weighting an example on the model's parameters. In our work, we start by analyzing the influence of removing a selected subset on the model's parameters, and then we show the generalization guarantee of the proposed method.

3 Problem Definition

Consider a large-scale dataset $\mathcal{D} = \{z_1, \dots, z_n\}$ containing $n$ training points $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. The goal of dataset pruning is to identify a set of redundant training samples from $\mathcal{D}$, as large as possible, and remove them to reduce the training cost. The identified redundant subset $\hat{\mathcal{D}} \subset \mathcal{D}$ should have a minimal impact on the learned model, i.e. the test performances of the models learned on the training sets before and after pruning should be very close, as described below:

$$\mathbb{E}_{z \sim P}\left[\ell(z, \hat{\theta})\right] \simeq \mathbb{E}_{z \sim P}\left[\ell\big(z, \hat{\theta}_{-\hat{\mathcal{D}}}\big)\right], \tag{1}$$

where $P$ is the data distribution, $\ell$ is the loss function, and $\hat{\theta}$ and $\hat{\theta}_{-\hat{\mathcal{D}}}$ are the empirical risk minimizers on the training set before and after pruning, respectively, i.e., $\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{z_i \in \mathcal{D}} \ell(z_i, \theta)$ and $\hat{\theta}_{-\hat{\mathcal{D}}} = \arg\min_{\theta} \frac{1}{n - |\hat{\mathcal{D}}|}\sum_{z_i \in \mathcal{D} \setminus \hat{\mathcal{D}}} \ell(z_i, \theta)$. Considering that a neural network is a locally smooth function rifai2012generative; goodfellow2014explaining; zhao2021DC, similar weights ($\hat{\theta} \approx \hat{\theta}_{-\hat{\mathcal{D}}}$) imply similar mappings in a local neighborhood and thus similar generalization performance. Therefore, we can achieve Eq. (1) by obtaining a $\hat{\theta}_{-\hat{\mathcal{D}}}$ that is very close to $\hat{\theta}$ (the distance between $\hat{\theta}_{-\hat{\mathcal{D}}}$ and $\hat{\theta}$ is smaller than a given very small value $\epsilon$). To this end, we first define the dataset pruning problem from the perspective of model parameter change; we will later provide theoretical evidence in Section 5 that the generalization gap in Eq. (1) can be upper bounded by the parameter change.

Definition 1 ($\epsilon$-redundant subset). Given a dataset $\mathcal{D} = \{z_1, \dots, z_n\}$ containing $n$ training points $z_i = (x_i, y_i)$, consider a subset $\hat{\mathcal{D}} \subset \mathcal{D}$ and the minimizers $\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{z_i \in \mathcal{D}} \ell(z_i, \theta)$ and $\hat{\theta}_{-\hat{\mathcal{D}}} = \arg\min_{\theta} \frac{1}{n - |\hat{\mathcal{D}}|}\sum_{z_i \in \mathcal{D} \setminus \hat{\mathcal{D}}} \ell(z_i, \theta)$. We say $\hat{\mathcal{D}}$ is an $\epsilon$-redundant subset of $\mathcal{D}$ if $\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\|_2 \le \epsilon$, and we then write $\hat{\mathcal{D}}$ as $\hat{\mathcal{D}}_{\epsilon}$.

Dataset Pruning: Given a dataset $\mathcal{D}$, dataset pruning aims at finding its largest $\epsilon$-redundant subset $\hat{\mathcal{D}}_{\epsilon}^{\max} = \arg\max_{\hat{\mathcal{D}}_{\epsilon}} |\hat{\mathcal{D}}_{\epsilon}|$, so that the pruned dataset can be constructed as $\mathcal{D} \setminus \hat{\mathcal{D}}_{\epsilon}^{\max}$.

4 Method

To achieve the goal of dataset pruning, we need to evaluate the model's parameter change caused by removing each possible subset of $\mathcal{D}$. However, it is intractable to re-train the model $2^n$ times to obtain all $\hat{\theta}_{-\hat{\mathcal{D}}}$ for a dataset of size $n$. In this section, we propose to efficiently approximate $\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}$ without re-training the model.

4.1 Parameter Influence Estimation

We start by studying the model parameter change caused by removing a single training sample $z$ from the training set $\mathcal{D}$. The change can be formally written as $\hat{\theta}_{-z} - \hat{\theta}$, where $\hat{\theta}_{-z} = \arg\min_{\theta} \frac{1}{n-1}\sum_{z_i \ne z} \ell(z_i, \theta)$. Estimating the parameter change for each training example by re-training the model $n$ times is also unacceptably time-consuming, because $n$ is usually on the order of tens or even hundreds of thousands.

Alternatively, the research on Influence Functions cook1977detection; cook1980characterizations; cook1986assessment; cook1982residuals; influencefunction provides an accurate and fast estimation of the parameter change caused by re-weighting an example during training. Suppose a training example $z$ is up-weighted by a small $\delta$ during training, so the empirical risk minimizer becomes $\hat{\theta}_{\delta, z} = \arg\min_{\theta} \frac{1}{n}\sum_{z_i \in \mathcal{D}} \ell(z_i, \theta) + \delta\,\ell(z, \theta)$. Setting $\delta = -\frac{1}{n}$ is equivalent to removing the training example $z$. Then, the influence of up-weighting $z$ on the parameters is given by

$$\mathcal{I}_{\text{param}}(z) \triangleq \left.\frac{d\hat{\theta}_{\delta, z}}{d\delta}\right|_{\delta=0} = -H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z, \hat{\theta}), \tag{2}$$

where $H_{\hat{\theta}} = \frac{1}{n}\sum_{z_i \in \mathcal{D}} \nabla^2_{\theta}\,\ell(z_i, \hat{\theta}) \in \mathbb{R}^{p \times p}$ is the Hessian and positive definite by assumption, and $p$ is the number of network parameters. The proof of Eq. (2) can be found in influencefunction. We can then linearly approximate the parameter change due to removing $z$, without re-training the model, by computing $\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n}\mathcal{I}_{\text{param}}(z) = \frac{1}{n} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z, \hat{\theta})$. Similarly, we can approximate the parameter change caused by removing a subset $\hat{\mathcal{D}}$ by accumulating the parameter change of removing each example, $\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta} \approx \frac{1}{n}\sum_{z \in \hat{\mathcal{D}}} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z, \hat{\theta})$.
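To make the influence-function approximation concrete, the following toy sketch (our illustration, not the paper's code) compares the estimated leave-one-out parameter change with the exact value for one-dimensional least-squares regression, where the ERM solution, the Hessian, and the exact leave-one-out solution all have closed forms:

```python
# Toy model: f(x) = theta * x, per-example loss l(z, theta) = (theta*x - y)^2.
# For this model everything needed by the influence function is closed-form,
# so we can check the approximation against exact re-training.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 4.1, 4.8]
n = len(xs)

s_xx = sum(x * x for x in xs)
s_xy = sum(x * y for x, y in zip(xs, ys))
theta_hat = s_xy / s_xx                      # ERM on the full dataset

def influence_estimate(x, y):
    """IF estimate of theta_{-z} - theta_hat: (1/n) * H^{-1} * grad l(z, theta_hat)."""
    hessian = (2.0 / n) * s_xx               # second derivative of the empirical risk
    grad = 2.0 * x * (theta_hat * x - y)     # gradient of l(z, .) at theta_hat
    return (1.0 / n) * grad / hessian

def exact_change(x, y):
    """Exact leave-one-out parameter change, obtained by re-solving without (x, y)."""
    theta_loo = (s_xy - x * y) / (s_xx - x * x)
    return theta_loo - theta_hat

z = (1.0, 1.2)
est, act = influence_estimate(*z), exact_change(*z)
print(est, act)   # the two values should agree closely
```

For a deep network the Hessian is never formed explicitly; this scalar case only illustrates why one forward computation of $H^{-1}\nabla_\theta \ell$ replaces a full re-training run.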

4.2 Dataset Pruning as Discrete Optimization

Combining Definition 1 and the parameter influence function (Eq. (2)), it is easy to derive that if the parameter influence of a subset $\hat{\mathcal{D}}$ satisfies $\big\|\frac{1}{n}\sum_{z \in \hat{\mathcal{D}}} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z, \hat{\theta})\big\|_2 \le \epsilon$, then it is an $\epsilon$-redundant subset of $\mathcal{D}$. Denoting $\mathcal{I}(z_i) = \frac{1}{n} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z_i, \hat{\theta})$, to find the largest $\epsilon$-redundant subset satisfying both conditions simultaneously, we formulate generalization-guaranteed dataset pruning as the discrete optimization problem (a) below:

(a) generalization-guaranteed pruning:

$$\max_{W} \sum_{i=1}^{n} w_i \quad \text{subject to} \quad \left\|\sum_{i=1}^{n} w_i\,\mathcal{I}(z_i)\right\|_2 \le \epsilon, \tag{3}$$

(b) cardinality-guaranteed pruning:

$$\min_{W} \left\|\sum_{i=1}^{n} w_i\,\mathcal{I}(z_i)\right\|_2 \quad \text{subject to} \quad \sum_{i=1}^{n} w_i = m, \tag{4}$$

where $W = (w_1, \dots, w_n) \in \{0, 1\}^n$ is a discrete variable to be optimized. For each dimension $i$, $w_i = 1$ indicates that the training sample $z_i$ is selected into $\hat{\mathcal{D}}$, while $w_i = 0$ indicates that $z_i$ is not pruned. After solving Eq. (3) for $W$, the largest $\epsilon$-redundant subset can be constructed as $\hat{\mathcal{D}}_{\epsilon}^{\max} = \{z_i \mid w_i = 1\}$. For scenarios where we need to specify the removed subset size $m$ and minimize its influence on the parameter change, we provide cardinality-guaranteed dataset pruning in Eq. (4).
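The discrete problems above are solved exactly in the paper with an integer-programming solver; as a purely illustrative sketch (our own, not the paper's algorithm), the greedy heuristic below grows a subset while keeping the norm of the accumulated influence under a budget `eps`. It shows why subset-level optimization differs from score-based ranking: two samples with large but opposite influence vectors can be pruned together because their influences cancel, which no per-sample score would reveal.

```python
def greedy_epsilon_redundant(influences, eps):
    """Greedily select samples whose accumulated influence vector stays
    within an L2 ball of radius eps (a heuristic stand-in for Eq. (3))."""
    dim = len(influences[0])
    running = [0.0] * dim
    remaining = set(range(len(influences)))
    selected = []
    while remaining:
        # Pick the candidate whose addition keeps the accumulated norm smallest.
        best, best_norm = None, None
        for i in remaining:
            cand = [running[d] + influences[i][d] for d in range(dim)]
            norm = sum(c * c for c in cand) ** 0.5
            if best_norm is None or norm < best_norm:
                best, best_norm = i, norm
        if best_norm > eps:
            break                      # no remaining candidate fits the constraint
        running = [running[d] + influences[best][d] for d in range(dim)]
        selected.append(best)
        remaining.remove(best)
    return selected

# Samples 0 and 1 have large but opposite influences, so together they are
# nearly free to prune; sample 3 has a genuinely large influence and is kept.
infl = [[0.5, 0.0], [-0.5, 0.0], [0.1, 0.0], [2.0, 0.0]]
picked = greedy_epsilon_redundant(infl, eps=0.45)
print(sorted(picked))
```

A score-based method ranking by individual influence norm would treat samples 0 and 1 as equally expensive as they look in isolation; the subset view prunes both.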

Require: Dataset $\mathcal{D} = \{z_1, \dots, z_n\}$; randomly initialized network parameters $\theta$; expected generalization drop $\epsilon$.
1:  $\hat{\theta} \leftarrow \arg\min_{\theta} \frac{1}{n}\sum_{z_i \in \mathcal{D}} \ell(z_i, \theta)$; // compute the ERM on $\mathcal{D}$
2:  Initialize $\mathcal{I} \leftarrow \emptyset$;
3:  for $i = 1, \dots, n$ do
4:     $\mathcal{I}(z_i) \leftarrow \frac{1}{n} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z_i, \hat{\theta})$; // store the parameter influence of each example
5:  end for
6:  Initialize $W \leftarrow \mathbf{0} \in \{0, 1\}^n$;
7:  Solve the following problem to get $W$: $\max_{W} \sum_{i=1}^{n} w_i$ subject to $\left\|\sum_{i=1}^{n} w_i\,\mathcal{I}(z_i)\right\|_2 \le \epsilon$; // guarantee the generalization drop
8:  Construct the largest $\epsilon$-redundant subset $\hat{\mathcal{D}}_{\epsilon}^{\max} = \{z_i \mid w_i = 1\}$;
9:  return Pruned dataset $\mathcal{D} \setminus \hat{\mathcal{D}}_{\epsilon}^{\max}$;
Algorithm 1: Generalization-guaranteed dataset pruning.

5 Generalization Analysis

In this section, we theoretically formulate the generalization guarantee of the minimizer given by the pruned dataset. Our theoretical analysis suggests that the proposed dataset pruning method has a relatively tight upper bound on the expected test loss, given a small enough $\epsilon$.

For simplicity, we first assume that we prune only one training sample $z$ (namely, the weight $\delta$ is one-dimensional). Following the classical results, we may write the influence function of the test loss at a test point $z_{\text{test}}$ as

$$\mathcal{I}_{\text{loss}}(z, z_{\text{test}}) \triangleq \left.\frac{d\,\ell(z_{\text{test}}, \hat{\theta}_{\delta, z})}{d\delta}\right|_{\delta=0} = -\nabla_{\theta}\,\ell(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\nabla_{\theta}\,\ell(z, \hat{\theta}), \tag{5}$$

which is the first-order derivative of the test loss with respect to $\delta$.

We may easily generalize the theoretical analysis to the case of pruning $m$ training samples, if we let $\delta = (\delta_1, \dots, \delta_m)$ be an $m$-dimensional vector and $\hat{\theta}_{\delta} = \arg\min_{\theta} \frac{1}{n}\sum_{z_i \in \mathcal{D}} \ell(z_i, \theta) + \sum_{j=1}^{m} \delta_j\,\ell(z_j, \theta)$. Then we may write the multi-dimensional form of the influence function as

$$\left.\frac{d\hat{\theta}_{\delta}}{d\delta}\right|_{\delta=0} = -H_{\hat{\theta}}^{-1}\left[\nabla_{\theta}\,\ell(z_1, \hat{\theta}), \dots, \nabla_{\theta}\,\ell(z_m, \hat{\theta})\right], \tag{6}$$

which is a $p \times m$ matrix, where $p$ indicates the number of parameters.

We define the expected test loss over the data distribution $P$ as $L(\theta) = \mathbb{E}_{z \sim P}\left[\ell(z, \theta)\right]$ and define the generalization gap due to dataset pruning as $L(\hat{\theta}_{-\hat{\mathcal{D}}}) - L(\hat{\theta})$. By using the influence function of the test loss singh2021phenomenology, we obtain Theorem 1, which formulates the upper bound of the generalization gap.

Theorem 1 (Generalization Gap of Dataset Pruning).

Suppose that the original dataset is $\mathcal{D}$ and the pruned dataset is $\mathcal{D} \setminus \hat{\mathcal{D}}$. If $\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\|_2 \le \epsilon$, we have the upper bound of the generalization gap as

$$L(\hat{\theta}_{-\hat{\mathcal{D}}}) - L(\hat{\theta}) \le \left\|\nabla_{\theta} L(\hat{\theta})\right\|_2 \epsilon + O(\epsilon^2). \tag{7}$$

Proof. We express the test loss at $\hat{\theta}_{-\hat{\mathcal{D}}}$ using the first-order Taylor approximation as

$$L(\hat{\theta}_{-\hat{\mathcal{D}}}) = L(\hat{\theta}) + \nabla_{\theta} L(\hat{\theta})^{\top}\big(\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\big) + O\!\left(\big\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\big\|_2^2\right), \tag{8}$$

where the last term is usually ignorable in practice, because $\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\|_2$ is very small for popular benchmark datasets. The same approximation is also used in related papers that use influence functions for generalization analysis singh2021phenomenology. According to Equation (8), we have

$$L(\hat{\theta}_{-\hat{\mathcal{D}}}) - L(\hat{\theta}) \le \left\|\nabla_{\theta} L(\hat{\theta})\right\|_2 \left\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\right\|_2 + O(\epsilon^2) \le \left\|\nabla_{\theta} L(\hat{\theta})\right\|_2 \epsilon + O(\epsilon^2), \tag{9}$$

where the first inequality is the Cauchy–Schwarz inequality, the second inequality is based on the algorithmic guarantee that $\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\|_2 \le \epsilon$ in Eq. (3), and $\epsilon$ is a hyperparameter. Keeping the second-order Taylor term, we finally obtain the upper bound of the expected loss as

$$L(\hat{\theta}_{-\hat{\mathcal{D}}}) \le L(\hat{\theta}) + \left\|\nabla_{\theta} L(\hat{\theta})\right\|_2 \epsilon + O(\epsilon^2). \tag{10}$$

The proof is complete. ∎
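As a quick numerical sanity check of the first-order bound (our own illustration, not part of the paper), one can evaluate the gap on a tiny quadratic test loss, where the Taylor expansion is exact and the second-order term can be computed explicitly:

```python
# Tiny "test distribution": five held-out points for the toy model f(x) = theta * x.
xs = [0.5, 1.5, 2.5, 3.5, 4.5]
ys = [0.4, 1.6, 2.4, 3.7, 4.3]

def L(theta):
    """Expected (here: averaged) test loss."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def L_grad(theta):
    return sum(2 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def L_hess(theta):
    return sum(2 * x * x for x in xs) / len(xs)

theta_hat = 1.0
eps = 0.01
# Any pruned-dataset minimizer within distance eps of theta_hat ...
theta_pruned = theta_hat + eps
gap = L(theta_pruned) - L(theta_hat)
# ... obeys gap <= ||grad L|| * eps + (second-order term), matching the theorem.
bound = abs(L_grad(theta_hat)) * eps + 0.5 * L_hess(theta_hat) * eps ** 2
print(gap, bound)
```

Because the toy loss is quadratic in $\theta$, the half-Hessian term is the exact $O(\epsilon^2)$ remainder, so the bound holds with no slack beyond the sign of the gradient term.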

Theorem 1 demonstrates that, if we hope to effectively decrease the upper bound of the generalization gap due to dataset pruning, we should focus on constraining or even directly minimizing $\|\hat{\theta}_{-\hat{\mathcal{D}}} - \hat{\theta}\|_2$. The basic idea of the proposed optimization-based dataset pruning method is exactly to penalize this quantity, approximated by $\|\sum_i w_i\,\mathcal{I}(z_i)\|_2$, via discrete optimization, as shown in Eq. (3) and Eq. (4).

Moreover, our empirical results in the following section successfully verify the estimated generalization gap. The estimated generalization gap of the proposed optimization-based dataset pruning is small on CIFAR-10 and CIFAR-100, and it is smaller than the estimated generalization gap of random dataset pruning by more than one order of magnitude.

6 Experiment

In this section, we conduct experiments to verify the validity of the theoretical results and the effectiveness of the proposed dataset pruning method.

In Section 6.1, we introduce the experimental details. In Section 6.2, we empirically verify the validity of Theorem 1. In Sections 6.3 and 6.4, we compare our method with several baseline methods on dataset pruning performance and cross-architecture generalization. Finally, in Section 6.5, we show that the proposed dataset pruning method is extremely effective at improving training efficiency.

Figure 1: We compare our proposed optimization-based dataset pruning method with several sample-selection baselines on CIFAR-10 (left) and CIFAR-100 (right). Random selects training data randomly. Herding welling2009herding selects samples that are closest to the class center. Forgetting toneva2018empirical selects training samples that are easily forgotten during optimization.

6.1 Experiment Setup and Implementation Details

We evaluate the effectiveness of dataset pruning methods on the CIFAR10 and CIFAR100 datasets cifar10. To verify the cross-architecture generalization of the pruned dataset, we prune the dataset using ResNet50 he2016deep, then train GoogLeNet googlenet and DenseNet huang2017densely on the pruned dataset. All hyper-parameters and experimental settings of training before and after dataset pruning were controlled to be the same. Specifically, in all experiments, we train the model for 200 epochs with a batch size of 128, a learning rate of 0.01 with a cosine annealing learning-rate decay strategy, an SGD optimizer with momentum of 0.9 and weight decay of 5e-4, and data augmentation of random crop and random horizontal flip. In Eq. (2), we compute the inverse of the Hessian matrix using the second-order optimization trick agarwal2016second, which significantly reduces the estimation time of the Hessian inverse. To further improve efficiency, we only calculate the parameter influence in Eq. (2) for the last linear layer. In Eq. (3) and Eq. (4), we solve the discrete optimization problem using CVXPY diamond2016cvxpy with the CPLEX solver manual1987ibm. All experiments were run five times with different random seeds.
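The second-order trick of agarwal2016second avoids materializing $H^{-1}$ by estimating inverse-Hessian–vector products through a truncated Neumann series, $H^{-1}v = \sum_{j \ge 0}(I - H)^j v$, which only requires Hessian–vector products. A minimal sketch of the recursion on a toy diagonal Hessian (assuming the spectrum of $H$ lies in $(0,1)$, which in practice is arranged by scaling the loss) looks as follows:

```python
def inverse_hvp(hvp, v, iters=100):
    """Estimate H^{-1} v via the Neumann recursion
    est_{j+1} = v + (I - H) est_j, using only Hessian-vector products.
    Converges when the eigenvalues of H lie in (0, 1)."""
    est = list(v)
    for _ in range(iters):
        hv = hvp(est)
        est = [v_d + e_d - hv_d for v_d, e_d, hv_d in zip(v, est, hv)]
    return est

# Toy diagonal Hessian H = diag(0.5, 0.8); the exact answer is H^{-1} v = (2.0, 1.25).
def hvp(u):
    return [0.5 * u[0], 0.8 * u[1]]

approx = inverse_hvp(hvp, [1.0, 1.0])
print(approx)   # converges toward [2.0, 1.25]
```

In a real implementation `hvp` would be a stochastic Hessian–vector product computed by double backpropagation on mini-batches; the fixed point of the recursion satisfies $H\,\text{est} = v$ regardless of how the products are obtained.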

6.2 Theoretical Analysis Verification

Figure 2: The comparison of the empirically observed generalization gap and our theoretical expectation in Theorem 1. We ignore the $O(\epsilon^2)$ term since it has a much smaller magnitude than the first-order term $\|\nabla_{\theta} L(\hat{\theta})\|_2\,\epsilon$.

Our proposed optimization-based dataset pruning method tries to collect the smallest subset by constraining the parameter influence $\|\sum_i w_i\,\mathcal{I}(z_i)\|_2 \le \epsilon$. In Theorem 1, we demonstrate that the generalization gap of dataset pruning can be upper-bounded by $\|\nabla_{\theta} L(\hat{\theta})\|_2\,\epsilon + O(\epsilon^2)$. The $O(\epsilon^2)$ term can simply be ignored since it usually has a much smaller magnitude than the first-order term. To verify the validity of the generalization guarantee of dataset pruning, we compare the empirically observed test loss gap before and after dataset pruning with our theoretical expectation in Fig. 2. It can be clearly observed that the actual generalization gap is highly consistent with our theoretical prediction. We can also observe a strong correlation between the pruned-dataset generalization and $\epsilon$; therefore, Eq. (3) effectively guarantees the generalization of dataset pruning by constraining the parameter influence. Compared with random pruning, our proposed optimization-based dataset pruning exhibits a much smaller generalization gap.

6.3 Dataset Pruning

In the previous section, we motivated the optimization-based dataset pruning method by constraining or directly minimizing the network parameter influence of a selected data subset. The theoretical result and the empirical evidence show that constraining parameter influence can effectively bound the generalization gap of the pruned dataset. In this section, we evaluate the proposed optimization-based dataset pruning method empirically. We show that the test accuracy of a network trained on the pruned dataset is comparable to the test accuracy of the network trained on the whole dataset, and is competitive with other baseline methods.

We prune CIFAR10 and CIFAR100 cifar10 using a randomly initialized ResNet50 network he2016deep. We compare our proposed method with the following baselines: (a) Random pruning, which selects an expected number of training examples randomly. (b) Herding welling2009herding, which selects an expected number of training examples that are closest to the cluster center of each class. (c) Forgetting toneva2019empirical, which selects training examples that are easily forgotten. The forgetting score of a training example is defined as the number of times during training when its classification prediction switches from correct to incorrect. To make our method comparable to these cardinality-based pruning baselines, we prune datasets using the cardinality-guaranteed dataset pruning of Eq. (4). After dataset pruning and selecting a training subset, we obtain the test accuracy by retraining a new randomly initialized ResNet50 network on only the pruned dataset. For each experiment, we report the mean and standard deviation of the test accuracy over five individual runs with different random seeds. The experimental results are shown in Fig. 1, where our method consistently surpasses all baseline methods. The forgetting method achieves performance very close to ours when the pruning ratio is small, but the performance gap widens as the pruning ratio increases. This phenomenon also occurs with the other baselines, because all these baseline methods are score-based: they prune training examples by removing the lowest-score examples without considering the influence interactions between high-score and low-score examples. The influence of a combination of high-score examples may be minor, and vice versa. Our method overcomes this issue by considering the influence of a subset rather than each individual example; therefore, the performance of our method is superior, especially when the pruning ratio is high.

6.4 Unseen Architecture Generalization

Figure 3: To evaluate the unseen-architecture generalization of the pruned dataset, we prune the CIFAR10 dataset using ResNet50, then train a GoogLeNet and a DenseNet121 on the pruned dataset.

We conduct experiments to verify that the pruned dataset generalizes well to unknown network architectures that are inaccessible during dataset pruning. To this end, we use a ResNet50 to prune the dataset and then use the pruned dataset to train different network architectures. As shown in Fig. 3, we evaluate the CIFAR10 dataset pruned by ResNet50 on two unknown architectures, i.e., GoogLeNet googlenet and DenseNet121 huang2017densely. The experimental results show that the pruned dataset generalizes well to network architectures that are unknown during dataset pruning. This indicates that the pruned dataset can be used in a wide range of applications regardless of the specific network architecture.

6.5 Dataset Pruning Improves the Training Efficiency.

The pruned dataset can significantly improve training efficiency while maintaining performance, as shown in Fig. 4. Therefore, the proposed dataset pruning is beneficial when one needs to run many training trials on the same dataset. One such application is neural architecture search (NAS) zoph2018learning, which aims at searching for the network architecture that achieves the best performance on a specific dataset. A potentially powerful tool to accelerate NAS is to search architectures on the smaller pruned dataset, provided the pruned dataset has the same ability to identify the best network as the original dataset.

We construct a search space of 720 ConvNets with different depths, widths, pooling, activation, and normalization layers. We train all 720 models on the whole CIFAR10 training set and on four smaller proxy datasets constructed by random selection, herding welling2009herding, forgetting toneva2019empirical, and our proposed dataset pruning method. All four proxy datasets contain only 100 images per class. We train all models for 100 epochs. Each proxy dataset contains 1,000 images in total, which occupies only 2% of the storage cost of the whole dataset.

Figure 4: Dataset pruning significantly improves training efficiency with minor performance sacrifice. When pruning 40% of the training examples, the convergence time is nearly halved with only a 1.3% test accuracy drop. The pruned dataset can be used to tune hyper-parameters and network architectures to reduce the search time.

Table 1 reports (a) the average test performance of the best selected architectures trained on the whole dataset, (b) the Spearman's rank correlation coefficient between the validation accuracies obtained by training the selected top-10 models on the proxy dataset and on the whole dataset, (c) the time of training 720 architectures on a Tesla V100 GPU, and (d) the memory cost. Table 1 shows that searching 720 architectures on the whole dataset incurs a huge time cost. Randomly selecting 2% of the dataset for the architecture search decreases the search time from 3029 minutes to 113 minutes, but the searched architecture achieves much lower performance when trained on the whole dataset, making it far from the best architecture. Compared with the baselines, our proposed dataset pruning method achieves the best performance (the closest to that of the whole dataset) and significantly reduces the search time (to about 3% of that on the whole dataset).

                 Random   Herding   Forgetting   Ours    Whole Dataset
Performance (%)  79.4     80.1      82.5         85.7    85.9
Correlation      0.21     0.23      0.79         0.94    1.00
Time cost (min)  113      113       113          113     3029
Storage (imgs)   1,000    1,000     1,000        1,000   50,000
Table 1: Neural architecture search on proxy sets and the whole dataset. The search space is 720 ConvNets. We do experiments on CIFAR10 with a 100-images-per-class proxy dataset selected by random, herding, forgetting, and our proposed optimization-based dataset pruning. The network architecture selected by our pruned dataset achieves performance very close to the upper bound.

7 Conclusion

This paper studies the problem of dataset pruning, which aims at removing redundant training examples with minor impact on the model's performance. By theoretically examining the influence of removing a particular subset of training examples on the network's parameters, this paper proposes to model the sample selection procedure as a constrained discrete optimization problem. During sample selection, we constrain the network parameter change while maximizing the number of collected samples. The collected training examples can then be removed from the training set. Extensive theoretical and empirical studies demonstrate that the proposed optimization-based dataset pruning method is extremely effective at improving training efficiency while maintaining the model's generalization ability.