Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks

12/05/2020
by Modestas Filipavicius, et al.

Less than 1% of protein sequences are annotated. The Natural Language Processing (NLP) community has recently embraced self-supervised learning as a powerful approach to learning representations from unlabeled text, in large part due to attention-based, context-aware Transformer models. In this work we present a modification to the RoBERTa model: during pre-training, the input is a mixture of binding and non-binding protein sequence pairs (from the STRING database). However, the sequence pairs carry no label indicating their binding status, as the model relies solely on the Masked Language Modeling (MLM) objective during pre-training. After fine-tuning, this approach surpasses models trained on single protein sequences for protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Furthermore, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary of 10K subwords, each around 3-4 amino acids long. Finally, to expand the model's input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support 2,048-token inputs. Further work on token-level classification for secondary structure prediction is needed. Code is available at: https://github.com/PaccMann/paccmann_proteomics
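As a rough illustration of the pre-training setup described above, the sketch below trains a 10K-subword BPE vocabulary on raw amino-acid strings and sets up a RoBERTa encoder for masked-language-model pre-training on label-agnostic sequence pairs, using the Hugging Face tokenizers and transformers libraries. The file name string_pairs.tsv, the helper functions, and all hyperparameters are illustrative assumptions rather than the authors' exact configuration; the reference implementation is in the linked PaccMann repository.

```python
# Minimal sketch, assuming a hypothetical `string_pairs.tsv` file with one
# tab-separated protein pair per line. Reference code:
# https://github.com/PaccMann/paccmann_proteomics
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
)


def sequence_iterator(path="string_pairs.tsv"):
    """Yield raw amino-acid strings from a tab-separated pair file."""
    with open(path) as handle:
        for line in handle:
            seq_a, seq_b = line.rstrip("\n").split("\t")
            yield seq_a
            yield seq_b


# 1) Learn a ~10K-subword BPE vocabulary; subwords come out roughly
#    3-4 residues long, compressing sequences versus per-residue tokens.
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    sequence_iterator(),
    vocab_size=10_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("protein_bpe", exist_ok=True)
bpe.save_model("protein_bpe")  # writes vocab.json and merges.txt

tokenizer = RobertaTokenizerFast(
    vocab_file="protein_bpe/vocab.json",
    merges_file="protein_bpe/merges.txt",
)


# 2) Encode each (binding or non-binding) pair as a single input,
#    <s> seq_a </s></s> seq_b </s>, with no label for binding status.
def encode_pair(seq_a, seq_b, max_length=512):
    return tokenizer(seq_a, seq_b, truncation=True, max_length=max_length)


# 3) A freshly initialised RoBERTa encoder trained only with the MLM
#    objective; the collator applies the standard 15% token masking.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,  # 512 tokens plus two special positions
)
model = RobertaForMaskedLM(config)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

Because no binding label enters pre-training, the MLM loss alone drives the encoder to relate residues across the two concatenated sequences; binding labels appear only at fine-tuning time, typically by swapping the MLM head for a sequence-pair classification head.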


Related research

12/01/2020 · Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models
For protein sequence datasets, unlabeled data has greatly outpaced label...

07/13/2020 · ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
Computational biology and bioinformatics provide vast data gold-mines fr...

01/31/2021 · Adversarial Contrastive Pre-training for Protein Sequences
Recent developments in Natural Language Processing (NLP) demonstrate tha...

07/28/2022 · HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative
AI-based protein structure prediction pipelines, such as AlphaFold2, hav...

08/08/2023 · PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer
Phosphorylation is central to numerous fundamental cellular processes, i...

06/30/2011 · On Prediction Using Variable Order Markov Models
This paper is concerned with algorithms for prediction of discrete seque...

03/18/2021 · Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences
Language model pre-training (LMPT) has achieved remarkable results in na...
