Finding High-Value Training Data Subset through Differentiable Convex Programming

04/28/2021
by Soumi Das, et al.

Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques for calculating the "value" of individual training datapoints have been proposed for explaining trained models. However, the value of a training datapoint also depends on the other selected training datapoints, a notion that existing methods do not explicitly capture. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is to design a learnable framework for online subset selection that can be trained on mini-batches of training data, which makes our method scalable. This results in a parameterized convex subset selection problem that is amenable to a differentiable convex programming paradigm, allowing us to learn the parameters of the selection model through end-to-end training. Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm finds consistently higher-value subsets of training data than recent state-of-the-art methods, in some cases with roughly 20% higher value. The subsets are also useful for finding mislabelled training data. The running time of our algorithm is comparable to that of existing valuation functions.
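To make the phrase "convex subset selection problem" concrete, the following is a minimal sketch of one possible relaxed selection problem. It is not the formulation used in the paper. The function name `select_weights`, the per-point `costs`, the budget `k`, and the regularization strength `lam` are all assumptions introduced for illustration. The sketch minimizes a linear cost plus a quadratic regularizer over soft selection weights in [0, 1] that sum to `k`. The optimality conditions give each weight as a clipped linear function of a single dual variable, which is found by bisection.

```python
from typing import List


def select_weights(costs: List[float], k: float, lam: float = 1.0,
                   iters: int = 100) -> List[float]:
    """Illustrative relaxed subset selection (hypothetical, not the paper's model).

    Solves   min_w  sum_i costs[i] * w[i] + (lam / 2) * sum_i w[i] ** 2
             s.t.   0 <= w[i] <= 1  and  sum_i w[i] = k,   with 0 < k < len(costs).

    The KKT conditions give w[i] = clip((tau - costs[i]) / lam, 0, 1),
    where tau is the multiplier of the sum constraint.
    """
    def weights(tau: float) -> List[float]:
        return [min(1.0, max(0.0, (tau - c) / lam)) for c in costs]

    lo = min(costs)        # at this tau every weight is 0
    hi = max(costs) + lam  # at this tau every weight is 1
    # The total weight is non-decreasing in tau, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(weights(mid)) < k:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))


# Three hypothetical datapoints with fixed costs and a budget of 1.5 points.
w = select_weights([0.0, 0.5, 1.0], k=1.5, lam=1.0)
```

Because every weight is tied to the shared multiplier `tau`, changing the cost of one point can shift the weights of the others, which loosely mirrors the observation that a datapoint's value depends on what else is selected. The solution is also piecewise linear in `costs`, so it is differentiable almost everywhere. In a learnable setting the costs would come from a parameterized model rather than being fixed as they are here.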


research
06/20/2022

Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

We develop a new, principled algorithm for estimating the contribution o...
research
03/24/2021

Convex Online Video Frame Subset Selection using Multiple Criteria for Data Efficient Autonomous Driving

Training vision-based Urban Autonomous driving models is a challenging p...
research
06/23/2021

Training Data Subset Selection for Regression with Controlled Generalization Error

Data subset selection from a large number of training instances has been...
research
03/14/2022

CheckSel: Efficient and Accurate Data-valuation Through Online Checkpoint Selection

Data valuation and subset selection have emerged as valuable tools for a...
research
01/30/2023

MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning

Training deep networks and tuning hyperparameters on large datasets is c...
research
07/28/2022

Adaptive Second Order Coresets for Data-efficient Machine Learning

Training machine learning models on massive datasets incurs substantial ...
research
02/01/2022

Datamodels: Predicting Predictions from Training Data

We present a conceptual framework, datamodeling, for analyzing the behav...
