Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models

08/27/2022
by Ethan Pickering, et al.

Not all data are equal. Misleading or unnecessary data can critically hinder the accuracy of Machine Learning (ML) models. When data are plentiful, misleading effects can be overcome, but in many real-world applications data are sparse and expensive to acquire. We present a method that substantially reduces the amount of data needed to train ML models accurately, potentially opening the door to many new limited-data applications in ML. Our method extracts the most informative data while omitting data that mislead the ML model toward inferior generalization properties. Specifically, the method eliminates the phenomenon of "double descent", where more data lead to worse performance. This approach brings several key features to the ML community. Notably, the method naturally converges and removes the traditional need to divide the dataset into training, testing, and validation data. Instead, the selection metric inherently assesses testing error. This ensures that key information is never wasted in testing or validation.

research
08/24/2018

Unknown Examples & Machine Learning Model Generalization

Over the past decades, researchers and ML practitioners have come up wit...
research
08/10/2022

Capturing Dependencies within Machine Learning via a Formal Process Model

The development of Machine Learning (ML) models is more than just a spec...
research
08/16/2018

Identifying Implementation Bugs in Machine Learning based Image Classifiers using Metamorphic Testing

We have recently witnessed tremendous success of Machine Learning (ML) i...
research
07/30/2020

Machine learning for complete intersection Calabi-Yau manifolds: a methodological study

We revisit the question of predicting both Hodge numbers h^1,1 and h^2,1...
research
12/25/2019

A Study of the Learnability of Relational Properties (Model Counting Meets Machine Learning)

Relational properties, e.g., the connectivity structure of nodes in a di...
research
09/20/2023

Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework

Creating resilient machine learning (ML) systems has become necessary to...
research
10/28/2022

Estimating oil recovery factor using machine learning: Applications of XGBoost classification

In petroleum engineering, it is essential to determine the ultimate reco...
