PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

06/24/2022
by Sirui Liu, et al.

Proteins are essential components of human life, and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open source datasets fall far short of the needs of modern protein sequence-structure research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB). We additionally provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of this dataset for training by participating in the CAMEO contest, in which our model won first place. We hope that the PSP dataset, together with the training benchmark, can enable a broader community of AI and biology researchers to pursue AI-driven protein research.


research
08/10/2023

OpenProteinSet: Training data for structural biology at scale

Multiple sequence alignments (MSAs) of proteins encode rich biological i...
research
01/18/2023

Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction

The goal of Protein Structure Prediction (PSP) problem is to predict a p...
research
07/07/2023

Solvent: A Framework for Protein Folding

Consistency and reliability are crucial for conducting AI research. Many...
research
05/11/2022

MAS2HP: A Multi Agent System to predict protein structure in 2D HP model

Protein Structure Prediction (PSP) is an unsolved problem in the field o...
research
10/05/2022

AlphaFold Distillation for Improved Inverse Protein Folding

Inverse protein folding, i.e., designing sequences that fold into a give...
research
08/15/2023

APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics

The prediction of protein 3D structure from amino acid sequence is a com...
research
10/16/2021

DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction

How and where proteins interface with one another can ultimately impact ...
