Generative Language Models on Nucleotide Sequences of Human Genes

07/20/2023
by   Musa Nuri Ihtiyar, et al.

Language models, primarily transformer-based ones, have achieved enormous success in NLP; studies such as BERT for natural language understanding and GPT-3 for natural language generation are landmark examples. DNA sequences are structurally very close to natural language, and on the discriminative side of DNA-related bioinformatics, models such as DNABERT already exist. The generative side of the coin, however, remains largely unexplored to the best of our knowledge. We therefore focused on developing an autoregressive generative language model, in the spirit of GPT-3, for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we carried out our study on a smaller scale, focusing on nucleotide sequences of human genes, i.e., unique parts of DNA with specific functionality, rather than whole DNA. This decision does not change the problem structure much, since both DNA and genes can be viewed as 1D sequences over four different nucleotides, without losing much information or oversimplifying. First, we systematically examined an almost entirely unexplored problem and observed that RNNs performed best, while simple techniques such as N-grams were also promising. Another benefit was learning how to work with generative models on languages we do not understand, unlike natural language. We also observed how essential it is to evaluate with real-life tasks beyond classical metrics such as perplexity. Furthermore, we examined whether the data-hungry nature of these models can be mitigated by selecting a language with a minimal vocabulary size (four, owing to the four nucleotide types), since such a language might make the problem easier. What we observed, however, was that it did not substantially change the amount of data needed.
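The abstract reports that RNNs performed best among the autoregressive models tried. As a rough illustration of the setup only (our own sketch, not the authors' published architecture or hyperparameters), a character-level LSTM language model over the four-nucleotide vocabulary, together with the perplexity metric mentioned above, could look like this in PyTorch:

```python
# Sketch of a character-level autoregressive LM over nucleotides.
# Architecture, dimensions, and names here are illustrative assumptions,
# not the paper's exact model.
import torch
import torch.nn as nn

VOCAB = ["A", "C", "G", "T"]              # four nucleotide tokens
stoi = {ch: i for i, ch in enumerate(VOCAB)}

class NucleotideLM(nn.Module):
    def __init__(self, vocab_size=4, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) integer-encoded nucleotides
        h, _ = self.lstm(self.embed(x))
        return self.head(h)               # next-nucleotide logits

def perplexity(model, seq):
    # seq: a string such as "ACGT..."; predict each nucleotide from
    # its prefix and exponentiate the mean cross-entropy.
    ids = torch.tensor([[stoi[c] for c in seq]])
    logits = model(ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(VOCAB)), ids[:, 1:].reshape(-1)
    )
    return loss.exp().item()

model = NucleotideLM()
print(perplexity(model, "ACGTACGTGGCA"))  # ~4 for an untrained model
```

An untrained model scores a perplexity near 4, the vocabulary size, which is the natural baseline for a four-symbol language; training should push it below that, though, as the abstract notes, perplexity alone is insufficient and evaluation on real-life tasks matters.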


