Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction

01/28/2023
by Zuobai Zhang, et al.

Pre-training methods for proteins have recently gained interest, leveraging either protein sequences or structures, but modeling their joint energy landscape remains largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose DiffPreT, an approach that pre-trains a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from perturbed ones along the multimodal diffusion trajectory, and in doing so learns the joint distribution of sequences and structures. Because conformational variation is essential to proteins, we enhance DiffPreT with a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff), which captures the correlation between different conformers of a protein. SiamDiff does this by maximizing the mutual information between representations of the diffusion trajectories of structurally correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that DiffPreT is consistently competitive on all tasks and that SiamDiff achieves new state-of-the-art performance in terms of mean rank across all tasks. The source code will be released upon acceptance.
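
To make the idea of recovering native sequences and structures from perturbed ones more concrete, here is a minimal sketch of one joint corruption step. It is not the authors' implementation: the abstract does not specify the noise model, so the Gaussian coordinate noise, the random residue masking, the linear schedule, and every name (`perturb`, `denoising_targets`, `MASK_TOKEN`, `sigma_max`) are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical token index for a masked residue (0..19 stand for the 20 amino acids).
MASK_TOKEN = 20


def perturb(coords, residues, t, num_steps, rng, sigma_max=1.0):
    """Jointly corrupt a structure and its sequence at diffusion step t.

    Assumption: a linear schedule, where the corruption level grows from
    0 (native protein) at t = 0 to 1 (fully corrupted) at t = num_steps.
    """
    level = t / num_steps
    # Structure modality: add isotropic Gaussian noise to the 3D coordinates.
    noisy_coords = coords + rng.normal(scale=sigma_max * level, size=coords.shape)
    # Sequence modality: mask each residue type with probability `level`.
    mask = rng.random(residues.shape[0]) < level
    noisy_residues = np.where(mask, MASK_TOKEN, residues)
    return noisy_coords, noisy_residues, mask


def denoising_targets(coords, noisy_coords, residues, mask):
    """Targets an encoder would be trained to predict from the corrupted input."""
    coord_noise = noisy_coords - coords   # noise that was added to the structure
    masked_residues = residues[mask]      # native residue types that were hidden
    return coord_noise, masked_residues


rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 3))          # toy protein with 8 residues
residues = rng.integers(0, 20, size=8)    # toy amino-acid indices
noisy_coords, noisy_residues, mask = perturb(coords, residues, t=5, num_steps=10, rng=rng)
coord_noise, masked_residues = denoising_targets(coords, noisy_coords, residues, mask)
```

In this sketch, a pre-training loss would compare the encoder's predictions against `coord_noise` and `masked_residues`. The Siamese variant described in the abstract would additionally relate the trajectories of two correlated conformers, which is not shown here.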


research
03/11/2023

Enhancing Protein Language Models with Structure-based Encoder and Pre-training

Protein language models (PLMs) pre-trained on large-scale protein sequen...
research
12/09/2021

Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction

Protein-protein interactions (PPIs) are essentials for many biological p...
research
03/11/2022

Protein Representation Learning by Geometric Structure Pretraining

Learning effective protein representations is critical in a variety of t...
research
01/28/2023

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Current protein language models (PLMs) learn protein representations mai...
research
04/05/2023

EigenFold: Generative Protein Structure Prediction with Diffusion Models

Protein structure prediction has reached revolutionary levels of accurac...
research
08/20/2022

Few-Shot Learning of Accurate Folding Landscape for Protein Structure Prediction

Data-driven predictive methods which can efficiently and accurately tran...
research
06/22/2021

G-VAE, a Geometric Convolutional VAE for ProteinStructure Generation

Analyzing the structure of proteins is a key part of understanding their...
