Data Budgeting for Machine Learning

10/03/2022
by   Xinyi Zhao, et al.

Data is the fuel powering AI and creates tremendous value in many domains. However, collecting datasets for AI is a time-consuming, expensive, and complicated endeavor. For practitioners, investing in data remains a leap of faith. In this work, we study the data budgeting problem and formulate it as two sub-problems: predicting (1) the saturating performance that could be reached given enough data, and (2) how many data points are needed to get close to that saturating performance. Unlike traditional dataset-independent methods such as PowerLaw, we propose a learning-based method to solve data budgeting problems. To support and systematically evaluate this learning-based method, we curate a large collection of 383 tabular ML datasets, along with their data vs. performance curves. Our empirical evaluation shows that data budgeting is possible given a small pilot study dataset with as few as 50 data points.
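To make the two sub-problems concrete, here is a minimal sketch of the kind of power-law extrapolation the abstract names as the traditional baseline. It is not the learning-based method proposed in the paper. The functional form score(n) = a - b * n^(-c), the function names `fit_power_law` and `points_needed`, the grid-search fitting procedure, and the assumption that scores lie in [0, 1) are all illustrative choices, not details taken from the source.

```python
import math


def fit_power_law(sizes, scores, grid_steps=200):
    """Fit score(n) ~ a - b * n**(-c) to a pilot learning curve.

    Hypothetical helper: scans candidate plateaus `a` between the best
    observed score and 1.0, and for each one solves a linear least-squares
    problem in log space, log(a - score) = log(b) - c * log(n).
    Assumes every score is strictly below 1.0.
    """
    top = max(scores)
    if top >= 1.0:
        raise ValueError("scores must be strictly below 1.0")
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for i in range(1, grid_steps + 1):
        a = top + (1.0 - top) * i / grid_steps  # candidate saturating score
        ys = [math.log(a - s) for s in scores]
        mean_y = sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        c = -slope
        b = math.exp(mean_y - slope * mean_x)
        # Judge each candidate by squared error on the original scale.
        err = sum((a - b * n ** (-c) - s) ** 2 for n, s in zip(sizes, scores))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]


def points_needed(b, c, tolerance):
    """Smallest n whose predicted gap to the plateau is at most `tolerance`."""
    if c <= 0:
        raise ValueError("fitted curve does not saturate (c <= 0)")
    return math.ceil((b / tolerance) ** (1.0 / c))


if __name__ == "__main__":
    # Synthetic pilot curve generated from a known power law (not real data).
    pilot_sizes = [50, 100, 200, 400, 800]
    pilot_scores = [0.9 - 0.5 * n ** -0.5 for n in pilot_sizes]
    a, b, c = fit_power_law(pilot_sizes, pilot_scores)
    print("estimated saturating performance:", round(a, 3))
    print("points needed to get within 0.01:", points_needed(b, c, 0.01))
```

Here the fitted `a` answers sub-problem (1) and `points_needed` answers sub-problem (2). A fit of this kind uses only the pilot curve of the dataset at hand, which is the dataset-independent behavior the abstract contrasts with learning from a collection of many datasets' curves.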

research
10/04/2016

Micro-Data Learning: The Other End of the Spectrum

Many fields are now snowed under with an avalanche of data, which raises...
research
05/25/2021

Improving Machine Learning-Based Modeling of Semiconductor Devices by Data Self-Augmentation

In the electronics industry, introducing Machine Learning (ML)-based tec...
research
07/17/2018

Analyzing Hypersensitive AI: Instability in Corporate-Scale Machine Learning

Predictive geometric models deliver excellent results for many Machine L...
research
08/27/2017

Gatherplots: Generalized Scatterplots for Nominal Data

Overplotting of data points is a common problem when visualizing large d...
research
07/13/2021

DIVINE: Diverse Influential Training Points for Data Visualization and Model Refinement

As the complexity of machine learning (ML) models increases, resulting i...
research
06/16/2020

NodeNet: A Graph Regularised Neural Network for Node Classification

Real-world events exhibit a high degree of interdependence and connectio...
research
10/16/2020

On Automatic Feasibility Study for Machine Learning Application Development with ease.ml/snoopy

In our experience working with domain experts who are using today's Auto...
