HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

06/27/2023
by Eric Nguyen, et al.

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at single nucleotide resolution, an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than a Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables, including the first use of in-context learning in genomics for simple adaptation to novel tasks without updating pretrained model weights. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 17 datasets using a model with orders of magnitude fewer parameters and less pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on all 8 datasets, by +9 accuracy points on average.
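To make the single-nucleotide-resolution claim concrete, the sketch below shows character-level tokenization of DNA, where every base maps to its own token rather than being aggregated into k-mers. The vocabulary and helper function here are illustrative assumptions for exposition, not the exact HyenaDNA tokenizer.

```python
# Minimal sketch of single-nucleotide (character-level) tokenization for DNA.
# The vocabulary and special handling of unknown bases are assumptions,
# not the exact HyenaDNA tokenizer.

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # 'N' = ambiguous/unknown base


def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to its own token ID (no k-mer aggregation),
    so a single-base change, e.g. an SNP, alters exactly one token."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


if __name__ == "__main__":
    reference = "GATTACA"
    variant = "GATCACA"  # single nucleotide polymorphism at position 3
    print(tokenize(reference))  # [2, 0, 3, 3, 0, 1, 0]
    print(tokenize(variant))    # [2, 0, 3, 1, 0, 1, 0] -- only one token differs
```

Because each token is one base, sequence length in tokens equals sequence length in nucleotides, which is why a 1 million token context corresponds directly to a 1 million base-pair window of the genome.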

