How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

07/04/2022
by Rafid Mahmood, et al.

Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging, where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that a power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the data set size required to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting data over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate the data requirements of machine learning systems and save on both development time and data acquisition costs.
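
The estimation task described above can be illustrated with a minimal sketch. It assumes the simplest case: a pure power law err(n) ≈ a · n^(-b) fitted to validation error in log-log space, then inverted to find the data set size that reaches a target error. The function names, the synthetic learning curve, and the way the correction factor `tau` is applied (as a margin that tightens the target) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit err(n) ~ a * n**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), deg=1)
    return float(np.exp(intercept)), float(-slope)


def required_size(a, b, target_error, tau=0.0):
    """Smallest n whose fitted error is at most target_error - tau.

    tau is a hypothetical correction margin: a positive value makes the
    estimate more conservative, so more data is requested.
    """
    adjusted = target_error - tau
    if adjusted <= 0 or b <= 0:
        raise ValueError("target must stay positive and the curve must decrease")
    return int(np.ceil((a / adjusted) ** (1.0 / b)))


# Synthetic learning curve (made-up numbers, not results from the paper).
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])
errors = 2.0 * sizes ** -0.3

a, b = fit_power_law(sizes, errors)
n_plain = required_size(a, b, target_error=0.10)
n_corrected = required_size(a, b, target_error=0.10, tau=0.01)
print(f"a={a:.3f}, b={b:.3f}")
print(f"estimated size without correction: {n_plain}")
print(f"estimated size with correction:    {n_corrected}")
```

In a multi-round setting, one would collect only part of the estimated amount, retrain, append the new (size, error) point, and refit before estimating again, so that early extrapolation errors are corrected as more of the curve is observed.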


