Learning from Data-Rich Problems: A Case Study on Genetic Variant Calling

11/12/2019
by Ren Yi, et al.

Next Generation Sequencing can sample the whole genome (WGS) or only the 1-2% of the genome that codes for proteins, called the whole exome (WES). Machine learning approaches to variant calling achieve high accuracy on WGS data, but WES data provides far fewer training examples, so a model trained on WES data alone achieves lower accuracy. We propose and compare three data augmentation strategies for improving performance on WES data: 1) joint training with WES and WGS data, 2) warmstarting the WES model from a WGS model, and 3) joint training with the sequencing type specified. All three approaches improve accuracy over a model trained on WES data alone, suggesting that models can generalize insights from the larger WGS dataset while retaining performance on the specialized WES problem. These data augmentation approaches may apply to other problem areas in genomics, where several specialized models would each see only a subset of the genome.
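The three strategies can be illustrated with a minimal sketch. This is not the authors' model or data: it assumes a toy logistic-regression classifier on synthetic features, and every name in it (`make_examples`, `featurize`, `train`, `accuracy`, the 0.5 boundary shift for WES) is hypothetical and chosen only to show how the training setups differ from one another.

```python
import math
import random

def make_examples(n, seq_type, rng):
    """Synthetic stand-in for labelled candidate variants (not real data)."""
    # Assumption: WES examples follow a slightly shifted decision boundary.
    shift = 0.5 if seq_type == "WES" else 0.0
    examples = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        label = 1 if x[0] + 0.5 * x[1] - shift > 0 else 0
        examples.append((x, label, seq_type))
    return examples

def featurize(x, seq_type, with_type_flag):
    feats = list(x) + [1.0]  # raw features plus a bias term
    if with_type_flag:
        # Strategy 3: tell the model which sequencing type produced the example.
        feats.append(1.0 if seq_type == "WES" else 0.0)
    return feats

def score(w, feats):
    z = sum(wi * fi for wi, fi in zip(w, feats))
    return max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved

def train(examples, with_type_flag=False, init=None, epochs=20, lr=0.1, seed=0):
    """Plain SGD on the logistic loss; `init` allows warmstarting."""
    dim = 4 + (1 if with_type_flag else 0)
    w = list(init) if init is not None else [0.0] * dim
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # mix WGS and WES examples within each epoch
        for x, y, seq_type in data:
            feats = featurize(x, seq_type, with_type_flag)
            p = 1.0 / (1.0 + math.exp(-score(w, feats)))
            for i in range(dim):
                w[i] -= lr * (p - y) * feats[i]
    return w

def accuracy(w, examples, with_type_flag=False):
    correct = 0
    for x, y, seq_type in examples:
        feats = featurize(x, seq_type, with_type_flag)
        correct += int((score(w, feats) > 0) == (y == 1))
    return correct / len(examples)

rng = random.Random(0)
wgs_train = make_examples(2000, "WGS", rng)  # plentiful WGS examples
wes_train = make_examples(50, "WES", rng)    # scarce WES examples
wes_test = make_examples(500, "WES", rng)

baseline = train(wes_train)                                # WES data only
joint = train(wgs_train + wes_train)                       # 1) joint training
warm = train(wes_train, init=train(wgs_train))             # 2) warmstart from WGS
typed = train(wgs_train + wes_train, with_type_flag=True)  # 3) joint + type flag

results = {
    "wes_only": accuracy(baseline, wes_test),
    "joint": accuracy(joint, wes_test),
    "warmstart": accuracy(warm, wes_test),
    "joint_with_type": accuracy(typed, wes_test, with_type_flag=True),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

The numbers this toy prints say nothing about the paper's results. The sketch only makes the structural difference concrete: joint training pools the two datasets, warmstarting reuses WGS-trained weights as the starting point for WES training, and the third variant adds the sequencing type as an extra input feature.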


