Modeling Protein Using Large-scale Pretrain Language Model

08/17/2021
by   Yijia Xiao, et al.

Proteins are involved in almost every life process. Analyzing the biological structure and properties of protein sequences is therefore critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in the learned representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.


