DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks

by Daoan Zhang, et al.

The success of the GPT series shows that GPT-style models can extract general information from sequences, thereby benefiting all downstream tasks. This motivates us to use pre-trained models to explore the hidden information in DNA sequences. However, the data and task requirements in DNA sequence analysis are complex and diverse: DNA-related data includes different types of information, such as sequences and expression levels, and there is currently no model specifically designed for these characteristics. Here we present DNAGPT, a generalized foundation model pre-trained on over 10 billion base pairs from 9 species, which can be fine-tuned for any DNA sequence analysis task. Our model can simultaneously process or output DNA sequences and numbers. In addition, our unique token design allows users to design prompts according to their own task requirements, making the model applicable to any type of task. We have evaluated our model on classification, regression, and generation tasks. We demonstrate that DNAGPT benefits from pre-training and can therefore bring performance gains to any downstream task. Our model is not only a new attempt in the field of genome analysis, but also provides a new direction for the application of foundation models in biology.
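To make the idea of a prompt that mixes DNA sequences, numbers, and task tokens more concrete, here is a minimal sketch. It is not the authors' tokenizer: the abstract does not specify the token vocabulary, so the non-overlapping k-mer split, the `<NUM>` marker, the task-token format, and the function names `kmer_tokens` and `build_prompt` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Union

# A prompt element is either a string token or a raw numeric value.
Token = Union[str, float]


def kmer_tokens(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Assumption: non-overlapping k-mers are used only for illustration;
    the last chunk may be shorter than k.
    """
    seq = seq.upper()
    if set(seq) - set("ACGTN"):
        raise ValueError("unexpected character in DNA sequence")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def build_prompt(task: str, seq: str,
                 numbers: Sequence[float] = (), k: int = 6) -> List[Token]:
    """Assemble a toy prompt: task token, sequence tokens, then numbers.

    Assumption: "<task>" and "<NUM>" are hypothetical marker tokens,
    not names taken from the paper.
    """
    prompt: List[Token] = [f"<{task}>"]
    prompt.extend(kmer_tokens(seq, k))
    for x in numbers:
        prompt.append("<NUM>")   # marks that the next element is numeric
        prompt.append(float(x))
    return prompt


if __name__ == "__main__":
    print(build_prompt("CLS", "acgtacgtac", [0.5], k=4))
```

A classification-style prompt would carry only the task token and the sequence tokens, while a regression-style prompt could append numeric values such as expression levels after the sequence.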




Rethinking Visual Prompt Learning as Masked Visual Token Modeling

Prompt learning has achieved great success in efficiently exploiting lar...

DPCIPI: A pre-trained deep learning model for estimation of cross-immunity between drifted strains of Influenza A/H3N2

Motivation: This study aims to develop a novel model called DNA Pretrain...

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

In this paper, we investigate whether the power of the models pre-traine...

Generative Language Models on Nucleotide Sequences of Human Genes

Language models, primarily transformer-based ones, obtained colossal suc...

Knowledge distillation for fast and accurate DNA sequence correction

Accurate genome sequencing can improve our understanding of biology and ...

Model Provenance via Model DNA

Understanding the life cycle of the machine learning (ML) model is an in...

GeNet: Deep Representations for Metagenomics

We introduce GeNet, a method for shotgun metagenomic classification from...
