Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types

10/11/2021
by Shentong Mo, et al.

In genome biology research, regulatory genome modeling is an important topic underlying many downstream tasks, such as promoter classification and transcription factor binding site prediction. The core problem is to model how regulatory elements interact with each other and how these interactions vary across cell types. However, current deep learning methods often model genome sequences for a fixed set of cell types and do not account for interactions between multiple regulatory elements; as a result, they perform well only on the cell types seen during training and lack the generalizability required in biological applications. In this work, we propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. Specifically, we simultaneously take the 1d sequence of genome data and a 2d matrix of (transcription factors x regions) as input, and propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences. We evaluate GeneBERT on regulatory downstream tasks across different cell types, including promoter classification, transcription factor binding site prediction, disease risk estimation, and splicing site prediction. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale regulatory genomics data.
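The abstract describes two input modalities: a 1d genome sequence and a 2d (transcription factors x regions) matrix. As a rough illustration of how such inputs might be assembled before embedding into a BERT-style encoder, here is a minimal sketch; the function names, k-mer tokenization choice, and matrix shapes are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mer tokens
    (a common way to tokenize genome sequences for BERT-style models)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_inputs(seq, tf_region_matrix, k=6):
    """Assemble the two modalities: a list of k-mer tokens from the 1d
    sequence, paired with the 2d (transcription factors x regions) matrix.

    In a multi-modal pre-training setup, each modality would then be
    embedded and fed to transformer encoders; here we only prepare inputs.
    """
    tokens = kmer_tokenize(seq, k)
    assert tf_region_matrix.ndim == 2  # shape: (num_TFs, num_regions)
    return tokens, tf_region_matrix

# Toy example: a 12-bp sequence yields 12 - 6 + 1 = 7 overlapping 6-mers,
# alongside a hypothetical 4 TFs x 3 regions binary occupancy matrix.
seq = "ACGTACGTACGT"
tf_matrix = (np.random.rand(4, 3) > 0.5).astype(np.int8)
tokens, matrix = build_inputs(seq, tf_matrix)
print(len(tokens), matrix.shape)  # 7 (4, 3)
```

The 2d matrix supplies cross-element context (which transcription factors occupy which regions) that the 1d sequence alone cannot express, which is the motivation for the multi-modal design.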

