ProtiGeno: a prokaryotic short gene finder using protein language models

07/19/2023
by Tony Tu, et al.

Prokaryotic gene prediction plays an important role in understanding the biology and function of organisms, with applications in medicine and biotechnology. Although current gene finders are highly sensitive for long genes, their sensitivity decreases noticeably for shorter genes (<180 nt). The culprit is a lack of annotated gene data from which to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, which specifically targets short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and its possible limitations by visualizing the three-dimensional structure of the predicted short genes. Data, code, and models are available at https://github.com/tonytu16/protigeno.
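To make the notion of a short ORF concrete, here is a minimal Python sketch that enumerates candidate ORFs shorter than 180 nt on both strands of a DNA string. It is an illustration only and not the ProtiGeno pipeline: the function name `short_orfs`, the `Orf` record, the 30 nt lower bound, and the choice of ATG/GTG/TTG as start codons are all assumptions made for this example. In a method like the one described above, candidates of this kind would presumably be translated and then scored by a protein language model, a step the sketch does not attempt.

```python
from typing import Iterator, NamedTuple

# Assumed start codons (common in prokaryotes) and the standard stop codons.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    strand: str  # "+" for the given strand, "-" for its reverse complement
    start: int   # 0-based offset on the scanned strand
    end: int     # exclusive offset; the stop codon is included
    seq: str     # nucleotide sequence from start codon through stop codon


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def short_orfs(dna: str, max_len: int = 180, min_len: int = 30) -> Iterator[Orf]:
    """Yield ORFs with min_len <= length < max_len (in nt) from all six frames.

    Each ORF runs from the first start codon after the previous stop
    to the next in-frame stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_len <= length < max_len:
                        yield Orf(strand, start, i + 3, seq[start:i + 3])
                    start = None


if __name__ == "__main__":
    example = "CC" + "ATG" + "AAA" * 10 + "TAA" + "CC"
    for orf in short_orfs(example):
        print(orf.strand, orf.start, orf.end, len(orf.seq))
```

Enumerating candidates is the easy part. The difficulty the abstract points to is deciding which of the many short ORFs are real genes, since short sequences carry little statistical signal.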

research
01/23/2022

OntoProtein: Protein Pretraining With Gene Ontology Embedding

Self-supervised protein language models have proved their effectiveness ...
research
12/28/2020

Mechanism of Evolution Shared by Gene and Language

We propose a general mechanism for evolution to explain the diversity of...
research
01/07/2016

Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations

Popular online enrichment analysis tools from the field of molecular sys...
research
12/07/2022

Unsupervised language models for disease variant prediction

There is considerable interest in predicting the pathogenicity of protei...
research
05/29/2003

Seven clusters in genomic triplet distributions

In several recent papers new gene-detection algorithms were proposed for...
research
07/29/2023

GeneMask: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Large-scale language models such as DNABert and LOGO aim to learn optima...
research
09/18/2023

DeepHEN: quantitative prediction essential lncRNA genes and rethinking essentialities of lncRNA genes

Gene essentiality refers to the degree to which a gene is necessary for ...
