Investigating minimizing the training set fill distance in machine learning regression

07/20/2023
by   Paolo Climaco, et al.

Many machine learning regression methods leverage large datasets for training predictive models. However, using large datasets may not be feasible due to computational limitations or high labelling costs. Therefore, sampling small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining computational efficiency. In this work, we study a sampling approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error that depends linearly on the training set fill distance, conditional on knowledge of the data features. For empirical validation, we perform experiments using two regression models on two datasets. We empirically show that selecting a training set with the aim of minimizing the fill distance, and thereby the bound, significantly reduces the maximum prediction error of various regression models, outperforming existing sampling approaches by a large margin.
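
To make the notion of fill distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes Euclidean distance on the feature vectors and defines the fill distance of a selected subset as the largest distance from any pool point to its nearest selected point. It uses greedy farthest point sampling as one common heuristic for keeping that quantity small; the abstract does not name a specific algorithm, so this choice is an assumption. The function names `fill_distance` and `farthest_point_sampling` and the synthetic two-dimensional pool are hypothetical.

```python
import math
import random


def fill_distance(pool, selected_idx):
    """Largest distance from any pool point to its nearest selected point."""
    return max(
        min(math.dist(x, pool[j]) for j in selected_idx)
        for x in pool
    )


def farthest_point_sampling(pool, k, first=0):
    """Greedy heuristic (an assumption, not taken from the abstract):
    repeatedly add the pool point farthest from the current selection."""
    selected = [first]
    # nearest[i] holds the distance from pool[i] to its closest selected point
    nearest = [math.dist(x, pool[first]) for x in pool]
    while len(selected) < k:
        nxt = max(range(len(pool)), key=nearest.__getitem__)
        selected.append(nxt)
        for i, x in enumerate(pool):
            d = math.dist(x, pool[nxt])
            if d < nearest[i]:
                nearest[i] = d
    return selected


# Synthetic, unlabelled pool of 2-D feature vectors (illustrative only)
rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(500)]

fps_idx = farthest_point_sampling(pool, 20)
rand_idx = rng.sample(range(len(pool)), 20)

print("fill distance, farthest point sampling:", fill_distance(pool, fps_idx))
print("fill distance, uniform random sample:  ", fill_distance(pool, rand_idx))
```

Only the features are needed to run the selection, which matches the setting described above: labels are collected for the selected points afterwards. Each greedy step costs one pass over the pool, so selecting k points from n takes on the order of n·k distance evaluations.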

