Neural Network Training with Highly Incomplete Datasets

07/01/2021
by   Yu-Wei Chang, et al.

Neural network training and validation rely on the availability of large, high-quality datasets. However, in many cases only incomplete datasets are available, particularly in health care applications, where each patient typically undergoes different clinical procedures or may drop out of a study. Since the data used to train neural networks must be complete, most studies either discard the incomplete datapoints, which reduces the size of the training data, or impute the missing features, which can lead to artefacts. Both approaches are inadequate when a large portion of the data is missing. Here, we introduce GapNet, an alternative deep-learning training approach that can use highly incomplete datasets. First, the dataset is split into subsets of samples containing all values for a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network, which is fine-tuned using all complete datapoints. Using two highly incomplete real-world medical datasets, we show that GapNet improves the identification of patients with underlying Alzheimer's disease pathology and of patients at risk of hospitalization due to Covid-19. By distilling the information available in incomplete datasets without having to reduce their size or impute missing values, GapNet makes it possible to extract valuable information from a wide range of datasets, benefiting diverse fields from medicine to engineering.
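
The first step described in the abstract, splitting the dataset into per-cluster subsets of complete samples, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cluster names (`cluster_a`, `cluster_b`), the toy data, and the convention that `None` or NaN marks a missing value are all assumptions made for illustration. The abstract does not say how feature clusters are chosen or how the networks are built, so training and combining the networks are left out.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rows_complete_for(rows, feature_indices):
    """Return indices of rows that have a value for every listed feature."""
    return [
        i for i, row in enumerate(rows)
        if not any(is_missing(row[j]) for j in feature_indices)
    ]


def split_by_cluster(rows, clusters):
    """Select the usable rows for each feature cluster.

    Returns a dict mapping each (hypothetical) cluster name to the rows
    that are complete for that cluster, plus the rows that are complete
    for all clusters, which the abstract says are used for fine-tuning.
    """
    subsets = {
        name: rows_complete_for(rows, indices)
        for name, indices in clusters.items()
    }
    all_features = sorted({j for indices in clusters.values() for j in indices})
    complete = rows_complete_for(rows, all_features)
    return subsets, complete


# Toy data: four features grouped into two hypothetical clusters.
rows = [
    [1.0, 2.0, 3.0, 4.0],             # complete
    [1.5, 2.5, None, None],           # only cluster_a available
    [None, 0.5, 3.5, 4.5],            # only cluster_b available
    [2.0, float("nan"), None, 1.0],   # neither cluster complete
]
clusters = {"cluster_a": [0, 1], "cluster_b": [2, 3]}

subsets, complete = split_by_cluster(rows, clusters)
print(subsets)   # rows usable for each per-cluster network
print(complete)  # rows usable for fine-tuning the combined network
```

In this toy example three of the four rows contribute to at least one per-cluster subset, while only one row is fully complete. Discarding incomplete datapoints would keep only that single row, which is the loss of training data the abstract describes.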

research
06/23/2023

Minibatch training of neural network ensembles via trajectory sampling

Most iterative neural network training methods use estimates of the loss...
research
08/14/2020

A Dynamic Deep Neural Network For Multimodal Clinical Data Analysis

Clinical data from electronic medical records, registries or trials prov...
research
10/05/2021

Networked Time Series Prediction with Incomplete Data

A networked time series (NETS) is a family of time series on a given gra...
research
05/14/2020

Simultaneous imputation and disease classification in incomplete medical datasets using Multigraph Geometric Matrix Completion (MGMC)

Large-scale population-based studies in medicine are a key resource towa...
research
06/03/2022

PROMISSING: Pruning Missing Values in Neural Networks

While data are the primary fuel for machine learning models, they often ...
research
11/28/2020

Learning from Incomplete Data by Simultaneous Training of Neural Networks and Sparse Coding

Handling correctly incomplete datasets in machine learning is a fundamen...
research
07/11/2022

Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

Many recent breakthroughs in deep learning were achieved by training inc...
