A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data

05/01/2023
by Olivier Niel, et al.

Training machine learning models requires large datasets. However, collecting, curating, and operating large and complex sets of real-world data raises problems of cost, ethics, law, and data availability. Here we propose a novel algorithm that generates large artificial datasets for training machine learning models when real-world data are extremely scarce. The algorithm is based on a genetic algorithm, which mutates randomly generated datasets; each dataset is then used to train a neural network. After training, the performance of the neural network on a batch of real-world data serves as a surrogate for the fitness of the generated dataset it was trained on. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases over generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. When real-world data were abundant, the mean accuracy of machine learning models trained on generated data was comparable to that of models trained on real-world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). Under simulated extreme scarcity of real-world data, the mean accuracy of machine learning models trained on generated data was significantly higher than that of comparable models trained on scarce real-world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets to train machine learning models when real-world data are extremely scarce, or when cost or data sensitivity prevents the collection of large real-world datasets.
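The evolutionary loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay dependency-free, a nearest-centroid classifier stands in for the neural network, the tiny `REAL_BATCH` is a hypothetical toy batch, and every function name and parameter value (`pop_size`, `rate`, `scale`, the keep-the-top-half selection rule, mutating features but not labels) is an assumption chosen for illustration.

```python
import random

# Hypothetical toy "real world" batch: (features, label) pairs in two clusters.
REAL_BATCH = [
    ([1.0, 1.2], 0), ([0.8, 0.9], 0), ([1.3, 1.1], 0),
    ([4.0, 4.2], 1), ([3.8, 4.1], 1), ([4.3, 3.9], 1),
]

def random_dataset(rng, n_rows, n_features, n_classes, low=0.0, high=5.0):
    """One individual: rows with uniformly random features and random labels."""
    return [([rng.uniform(low, high) for _ in range(n_features)],
             rng.randrange(n_classes)) for _ in range(n_rows)]

def mutate(rng, dataset, rate=0.2, scale=0.5):
    """Copy a dataset, adding Gaussian noise to each feature with probability
    `rate`. Labels are left unchanged (an assumption of this sketch)."""
    return [([x + rng.gauss(0.0, scale) if rng.random() < rate else x
              for x in features], label) for features, label in dataset]

def train(dataset):
    """Stand-in for neural network training: compute one centroid per class."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label of the nearest class centroid."""
    def sq_dist(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], features))
    return min(centroids, key=sq_dist)

def fitness(dataset, real_batch):
    """Surrogate fitness: accuracy on the real batch of a model trained
    only on the generated dataset."""
    model = train(dataset)
    hits = sum(predict(model, features) == label for features, label in real_batch)
    return hits / len(real_batch)

def evolve(real_batch, n_features, n_classes,
           pop_size=20, n_rows=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [random_dataset(rng, n_rows, n_features, n_classes)
                  for _ in range(pop_size)]
    history = []  # best fitness found in each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda d: fitness(d, real_batch),
                        reverse=True)
        history.append(fitness(ranked[0], real_batch))
        survivors = ranked[: pop_size // 2]  # discard the unfit half
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children    # survivors are kept unchanged
    best = max(population, key=lambda d: fitness(d, real_batch))
    history.append(fitness(best, real_batch))
    return best, history

best, history = evolve(REAL_BATCH, n_features=2, n_classes=2)
print("best fitness, first vs. last generation:", history[0], "->", history[-1])
```

Because the surviving datasets are carried over unchanged, the best fitness in this sketch can never decrease from one generation to the next. Note also that the real batch is used only to score generated datasets, never to train on directly, which is what lets a small batch guide the generation of a much larger training set.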
