Enhancing Protein Language Models with Structure-based Encoder and Pre-training

03/11/2023
by Zuobai Zhang et al.

Protein language models (PLMs) pre-trained on large-scale protein sequence corpora have achieved impressive performance on various downstream protein understanding tasks. Although they implicitly capture inter-residue contact information, transformer-based PLMs cannot explicitly encode protein structures to obtain structure-aware protein representations. Moreover, the potential of pre-training on available protein structures to improve these PLMs has remained unexplored, even though structure is a key determinant of protein function. To address these limitations, this work enhances PLMs with a structure-based encoder and structure-based pre-training. We first explore feasible model architectures for combining the strengths of a state-of-the-art PLM (i.e., ESM-1b) and a state-of-the-art protein structure encoder (i.e., GearNet), and empirically verify that ESM-GearNet, which connects the two encoders in series, is the most effective combination. To further improve ESM-GearNet, we pre-train it on massive unlabeled protein structures with contrastive learning, which aligns the representations of co-occurring subsequences so as to capture their biological correlation. Extensive experiments on EC and GO protein function prediction benchmarks demonstrate the superiority of ESM-GearNet over previous PLMs and structure encoders, and structure-based pre-training yields further clear performance gains on top of ESM-GearNet. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.
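To make the serial fusion concrete, here is a minimal PyTorch sketch. It is an illustration of the idea, not the authors' implementation (which is available at the repository above): the `plm` and `gnn` arguments are hypothetical stand-ins for ESM-1b and GearNet, and the tensor shapes are assumptions for a single-protein forward pass.

```python
import torch
import torch.nn as nn

class SerialESMGearNet(nn.Module):
    """Serial fusion sketch: per-residue embeddings from a protein language
    model (e.g. ESM-1b) serve as the input node features of a structure-based
    encoder (e.g. GearNet) over the protein's residue graph."""

    def __init__(self, plm: nn.Module, gnn: nn.Module):
        super().__init__()
        self.plm = plm  # sequence encoder; often frozen or lightly fine-tuned
        self.gnn = gnn  # structure encoder over residue-level graphs

    def forward(self, tokens: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # tokens: [num_residues] amino-acid token ids for one protein
        # edge_index: [2, num_edges] residue-graph edges (e.g. k-NN in 3D space)
        node_feats = self.plm(tokens)            # [num_residues, plm_dim]
        return self.gnn(node_feats, edge_index)  # structure-aware representation
```

Connecting the encoders in series, rather than fusing their outputs in parallel, lets the structure encoder refine contextual sequence embeddings with 3D geometry; the paper reports this serial variant as the most effective combination.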

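The contrastive pre-training objective can likewise be sketched as an InfoNCE loss with in-batch negatives, a standard way to align the representations of two co-occurring views. The function below is a hypothetical illustration under that assumption; the temperature `tau` and the symmetrized form are choices for the sketch, not the paper's exact setting.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Pull together representations of two co-occurring views of the same
    protein (e.g. overlapping subsequences); other proteins in the batch
    act as negatives."""
    z1 = F.normalize(z1, dim=-1)   # [batch, dim], unit-normalized
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau     # [batch, batch] cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    # symmetrized cross-entropy: match view 1 -> view 2 and view 2 -> view 1
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```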


Related research

01/23/2022 · OntoProtein: Protein Pretraining With Gene Ontology Embedding
Self-supervised protein language models have proved their effectiveness ...

12/09/2021 · Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction
Protein-protein interactions (PPIs) are essential for many biological p...

01/28/2023 · Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction
Pre-training methods on proteins are recently gaining interest, leveragi...

01/16/2023 · Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
As opposed to scaling-up protein language models (PLMs), we seek improvi...

10/29/2021 · Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model
Understanding protein sequences is vital and urgent for biology, healthc...

01/28/2023 · ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Current protein language models (PLMs) learn protein representations mai...

06/26/2020 · BERTology Meets Biology: Interpreting Attention in Protein Language Models
Transformer architectures have proven to learn useful representations fo...
