Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models

12/01/2020
by Pascal Sturmfels, et al.

For protein sequence datasets, unlabeled data has greatly outpaced labeled data due to the high cost of wet-lab characterization. Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. However, the optimal pre-training strategy remains an open question. Instead of strictly borrowing from natural language processing (NLP) in the form of masked or autoregressive language modeling, we introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. Using a set of five standardized downstream tasks for protein models, we demonstrate that our pre-training task, combined with a multi-task objective, outperforms masked language modeling alone on all five tasks. Our results suggest that protein sequence models may benefit from leveraging biologically inspired inductive biases that go beyond existing language modeling techniques in NLP.
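
To make the idea of a profile target concrete, here is a minimal sketch of how a per-position amino-acid distribution might be derived from a multiple sequence alignment and compared against a model's prediction. The abstract does not specify how the profiles are built or which loss is used, so everything here is an assumption for illustration: the column-frequency counting with pseudocounts, the KL-divergence loss, and the names `msa_profile`, `kl_divergence`, and `profile_loss` are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def msa_profile(msa, pseudocount=1.0):
    """Return one amino-acid distribution per alignment column.

    Assumption: a plain frequency profile with additive pseudocounts;
    gap characters and non-standard symbols are ignored.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append([(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS])
    return profile


def kl_divergence(target, predicted):
    """KL(target || predicted) for two distributions over the same alphabet."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def profile_loss(profile, predictions):
    """Mean per-position KL divergence (an assumed choice of loss)."""
    return sum(kl_divergence(t, p) for t, p in zip(profile, predictions)) / len(profile)


# Toy alignment of three sequences over four columns ("-" marks a gap).
toy_msa = ["MKV-", "MRVA", "MKLA"]
profile = msa_profile(toy_msa)

# Stand-ins for model output: a uniform guess and a perfect prediction.
uniform = [[1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS) for _ in profile]
loss_uniform = profile_loss(profile, uniform)
loss_perfect = profile_loss(profile, profile)
```

In this sketch the training target at each position is a full distribution rather than a single masked token, which is the contrast with masked language modeling that the abstract draws. A perfect prediction gives zero loss, while the uniform guess gives a positive one.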

research
12/05/2020

Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks

Less than 1 annotated. Natural Language Processing (NLP) community has r...
research
01/31/2021

Adversarial Contrastive Pre-training for Protein Sequences

Recent developments in Natural Language Processing (NLP) demonstrate tha...
research
10/29/2021

Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

Understanding protein sequences is vital and urgent for biology, healthc...
research
04/12/2020

Pre-training Text Representations as Meta Learning

Pre-training text representations has recently been shown to significant...
research
03/18/2021

Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences

Language model pre-training (LMPT) has achieved remarkable results in na...
research
08/02/2019

Deep learning languages: a key fundamental shift from probabilities to weights?

Recent successes in language modeling, notably with deep learning method...
research
06/08/2023

Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity Prediction

Protein-ligand binding affinity (PLBA) prediction is the fundamental tas...