ProDOMA: improve PROtein DOMAin classification for third-generation sequencing reads using deep learning

09/26/2020
by Du Nan, et al.

Motivation: Third-generation sequencing technologies produce DNA reads ranging from tens to hundreds of kilobases in length. Reads this long allow protein domains to be annotated without assembly and can therefore yield important insights into the biological functions encoded in the underlying data. However, the high error rate of third-generation sequencing data poses a new challenge to established domain analysis pipelines. State-of-the-art methods are not optimized for noisy reads and classify domains in third-generation sequencing data with unsatisfactory accuracy. New computational methods are needed to improve domain prediction in long, noisy reads.

Results: We introduce ProDOMA, a deep learning model for domain classification of third-generation sequencing reads. It uses deep neural networks with a 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate the task as an open-set problem, so the model can reject unrelated DNA reads such as those from noncoding regions. In experiments on simulated reads of protein-coding sequences and on real reads from the human genome, ProDOMA outperforms HMMER and DeepFam on protein domain classification. In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long, noisy reads that does not rely on error correction.

Availability: The source code and the trained model are freely available at https://github.com/strideradu/ProDOMA.

Contact: yannisun@cityu.edu.hk
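To make the two ideas in the abstract concrete, the following is a minimal sketch of a 3-frame translation encoding and of open-set rejection. It is not taken from the ProDOMA source code. The function names (`translate_frames`, `one_hot`, `classify_or_reject`), the 22-symbol alphabet, the zero padding, and the use of a maximum-probability threshold for rejection are all assumptions chosen for illustration; the actual model may encode and reject reads differently.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical alphabet: 20 amino acids, stop (*), and unknown (X).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*X"


def translate_frames(dna):
    """Translate the forward strand in reading frames 0, 1 and 2."""
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        # Codons containing ambiguous bases (e.g. N) map to X.
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def one_hot(protein, length):
    """One-hot encode a translation, truncated or zero-padded to `length`."""
    rows = []
    for aa in protein[:length]:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(aa)] = 1
        rows.append(row)
    rows.extend([0] * len(ALPHABET) for _ in range(length - len(rows)))
    return rows


def classify_or_reject(probs, threshold):
    """Return the index of the top class, or None if confidence is too low.

    Assumption: rejection by thresholding the top class probability, a
    common open-set baseline that may differ from ProDOMA's own rule.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


frames = translate_frames("ATGGCCTAA")
encoded = [one_hot(f, length=4) for f in frames]  # three 4 x 22 matrices
```

Encoding all three frames side by side gives a network access to whichever frame is locally correct after an insertion or deletion shifts the reading frame, which is the intuition behind learning from partially correct translations.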
