
Modeling Protein Using Large-scale Pretrain Language Model

by Yijia Xiao, et al.
Tsinghua University

Protein is linked to almost every life process. Therefore, analyzing the biological structure and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model data patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding biological information in the learned representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at



1. Introduction

As an indispensable part of life activities, protein is responsible for catalysis (e.g., enzymes), transportation (e.g., hemoglobin), and more. Therefore, understanding the structure and functionality of proteins is critical to the study of life science, as well as to disease detection and drug discovery. Traditional protein analysis paradigms can be divided into experimental and analytical ones. Experimental methods usually require purification, crystallization, and X-ray crystallography. Analytical methods, like sequence alignment (Ma, 2015) and molecular dynamics simulation (Geng et al., 2019), tend to be incapable of handling large-scale protein data. Sequence alignment and similarity analysis leverage the idea that "structure determines properties": sequential molecules with similar sequence order tend to share common ancestors and to be relatively similar in structure and functionality. Similarity analysis therefore often requires a large-scale annotated database; the properties of the sequence to be analyzed can then be inferred from the labels of aligned sequences in the database. However, labeling such large databases requires substantial human and material resources. Molecular dynamics (MD) and Monte Carlo (MC) simulations can be applied to protein analysis (Gsponer and Caflisch, 2002; Karplus and Kuriyan, 2005) and can be quite accurate (simulation at atomic scale). However, they demand large amounts of computing resources and are time-consuming.

Generally speaking, most proteins that exist stably in nature have undergone millions of years of natural selection and evolution, and sit in low-energy stable states. The polarity of certain amino acids places particular amino acid arrangements in lower-energy states, and protein motifs are likewise folded from specific stretches of amino acids. Such patterns can be captured by deep learning models, and researchers have explored various strategies. Inspired by Word2Vec (Mikolov et al., 2013), BioVec (Asgari and Mofrad, 2015) proposed ProtVec for modeling protein sequences and GeneVec for gene sequences. However, in n-gram representations the vocabulary size grows exponentially with the dependency range n, making the cost of modeling long-range dependencies unbearable. With the rise of representation learning, sequence representation learning

(Alley et al., 2019) and transfer learning (Heinzinger et al., 2019) have also been introduced to protein analysis. In recent years, the emergence of the attention mechanism (Vaswani et al., 2017), which computes hidden representations in parallel, has allowed researchers to model long sequential data better. ProtTrans (Elnaggar et al., 2020) also shows that large-scale auto-regressive language models can model protein sequences quite well. Moreover, because the information encoded in an individual sequence is limited, MSA Transformer (Rao et al., 2021) and ESM (Rives et al., 2019) leverage sequence alignment information to model proteins even better. Other research, such as the Neural Potts Model (Sercu et al., 2021), draws inspiration from the Potts model.

Thanks to advances in high-throughput sequencing technology, we have larger amino acid sequence databases than ever before. However, most of these data are unlabeled primary structures of proteins; labeled sequences (with annotations such as structure or stability) are relatively scarce. The remarkable achievements of BERT (Devlin et al., 2018) reveal that data patterns can be extracted from massive unlabeled data with unsupervised learning, which inspired us to train language models on massive protein sequences. We trained multiple large-scale models on the PFAM (El-Gebali et al., 2018) dataset, the largest with 3 billion parameters, outperforming TAPE (Rao et al., 2019).

2. Related Works

2.1. Standardized Datasets and Tasks

There are plenty of data in the computational proteomics field; however, the current literature is fragmented in terms of unified datasets and evaluation metrics. Methods and models introduced by researchers are often evaluated on different datasets with different metrics. To resolve this dilemma, TAPE (Rao et al., 2019) put forward a set of five biologically relevant tasks: secondary structure prediction (Klausen et al., 2019; Berman et al., 2000; Moult et al., 2018), contact prediction (Fox et al., 2013; Berman et al., 2000; Moult et al., 2018), remote homology detection (Fox et al., 2013), fluorescence (Sarkisyan et al., 2016), and stability (Rocklin et al., 2017). Commonly used models, such as LSTM (Hochreiter and Schmidhuber, 1997), Transformer (Vaswani et al., 2017), and ResNet (Yu et al., 2017), are implemented for these tasks, serving as benchmarks for semi-supervised representation learning. One of their conclusions is that self-supervised pretraining benefits almost all models on all tasks, doubling performance on some downstream tasks. Our work builds on the standardized datasets and evaluation metrics provided in TAPE (Rao et al., 2019).

2.2. Large-scale Pretraining

The success of pretraining makes researchers wonder whether increases in language model scale can always bring improved performance. ProtTrans (Elnaggar et al., 2020) is one representative: the researchers trained a series of language models with up to tens of billions of parameters, the largest being ProtT5-XXL with 11B parameters, and achieved excellent performance on downstream tasks such as secondary structure prediction and solubility prediction.

2.3. Efficient Pretraining of Language Models

Different from usual pretraining, large-scale pretraining requires distributed training techniques, including model parallelism, data parallelism, memory optimization, and data synchronization. Fortunately, Megatron-LM (Shoeybi et al., 2020) provides an efficient training framework for language models. We implemented and trained our protein language model within this framework, as well as the downstream classification and regression tasks.

3. Methodology

3.1. Pretrain Tasks

The goal of protein pretraining is to model data patterns in massive unlabeled sequences. One closely related model is BERT (Devlin et al., 2018) from natural language processing; we made some modifications to its loss function and model structure.

Our work uses the dataset put forward by TAPE (Rao et al., 2019), so some data descriptions are inherited. PFAM (El-Gebali et al., 2018) is a widely used database of more than 32 million protein sequences. Sequences in PFAM are clustered into evolutionarily related groups (protein families). Leveraging this family information, TAPE constructed a test set (about 1% of the data) of fully held-out families. The remaining data are split randomly 95%/5% into training and validation sets. We use the preprocessed PFAM from TAPE as the pretraining corpus.

Training Objective
The original BERT (Devlin et al., 2018) loss consists of a masked language model loss and a next sentence prediction loss.

For protein pretraining, we inherit the masking strategy of BERT's masked language model (MLM): 15% of the amino acid tokens are randomly masked, and the model is trained to predict each masked token from the remaining tokens. As for next sentence prediction (NSP), the input sequences are randomly shuffled, so we assume there is no evident semantic or biological correlation between consecutive sequences. We therefore discard the NSP loss and keep only the MLM loss.
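The masking step can be sketched as follows. This is a minimal illustration, not our training code: the 20-letter `AMINO_ACIDS` vocabulary and the 80/10/10 split among mask/random/unchanged tokens follow BERT's published recipe, and special tokens and batching are omitted.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues
MASK = "[MASK]"

def mask_sequence(tokens, mask_prob=0.15, seed=None):
    """BERT-style masking: each token is selected with probability mask_prob;
    of the selected tokens, 80% become [MASK], 10% a random residue, and 10%
    stay unchanged. Returns (masked tokens, labels), where labels is None at
    unselected positions and the original residue at selected ones."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)
            elif r < 0.9:
                masked.append(rng.choice(AMINO_ACIDS))
            else:
                masked.append(tok)
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

seq = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
masked, labels = mask_sequence(seq, seed=0)
```

The MLM loss is then the cross-entropy between the model's predictions and the labels at the selected positions only.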

In terms of model structure, Megatron-LM (Shoeybi et al., 2020) observes that when the model grows very large, the position of the layernorm becomes critical. The sublayers in the transformer layer are therefore reordered: the layernorm originally applied to a sublayer's output is instead placed ahead of the sublayer's input, which prevents the activations from drifting.
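The reordering can be sketched as below. This is a hedged illustration rather than the Megatron-LM implementation: `f` stands in for an attention or MLP sublayer, and the learnable gain/bias parameters of layernorm are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over the hidden dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_post_ln(x, f):
    # Original (post-LN) ordering: residual add, then layernorm on the output.
    return layer_norm(x + f(x))

def sublayer_pre_ln(x, f):
    # Reordered (pre-LN): layernorm moves ahead of the sublayer input,
    # leaving the residual path unnormalized.
    return x + f(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))   # 4 tokens, hidden size 16
f = lambda h: 0.1 * h          # stand-in for an attention/MLP sublayer
y_post = sublayer_post_ln(x, f)
y_pre = sublayer_pre_ln(x, f)
```

Note that in the pre-LN variant the residual stream is never normalized, which is what keeps very deep stacks stable at scale.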

3.2. Downstream Classification Tasks

There are three classification tasks, corresponding to token, sequence, and token-pair classification.

3.2.1. Secondary Structure

Figure 1. Secondary Structure Task


Secondary structure classification is a token-level task. The input is a protein sequence; the output is a sequence of labels of the same length, indicating the secondary structure assignment of the corresponding amino acid. For Q3, the labels are Helix, Strand, and Other. For Q8, the labels are 3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), turn (T), bend (S), and others (C).




The dataset for the secondary structure task is the CB513 (Cuff and Barton, 1999) dataset.

Training Objective

A one-dimensional convolution layer can be applied to secondary structure prediction (Wang et al., 2016). However, thanks to the modeling capability of our model, the encoder output of the protein language model already contains sufficient information for this task, so we use ProteinLM followed by a multilayer perceptron as the secondary structure classifier.
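A minimal numpy sketch of such a token-level head follows. The hidden states here are random stand-ins for ProteinLM encodings, and the weight shapes and ReLU nonlinearity are illustrative assumptions, not our exact classifier.

```python
import numpy as np

Q3_LABELS = ["Helix", "Strand", "Other"]

def mlp_token_classifier(hidden, W1, b1, W2, b2):
    # Apply one shared two-layer MLP to every token encoding:
    # hidden (seq_len, d) -> logits (seq_len, n_classes).
    h = np.maximum(hidden @ W1 + b1, 0.0)   # ReLU
    return h @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d, n_cls = 30, 64, len(Q3_LABELS)
hidden = rng.normal(size=(seq_len, d))      # stand-in for ProteinLM encodings
W1, b1 = 0.1 * rng.normal(size=(d, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(32, n_cls)), np.zeros(n_cls)

logits = mlp_token_classifier(hidden, W1, b1, W2, b2)
preds = [Q3_LABELS[i] for i in logits.argmax(axis=-1)]
```

Because the same MLP is applied at every position, the classifier works for sequences of any length.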

3.2.2. Remote Homology

Figure 2. Remote Homology Task


Remote homology detection is a sequence-level classification task, introduced to measure a model's ability to detect structural similarity across distantly related inputs. The input is a protein sequence, and the target is to predict which fold family the sequence belongs to; there are 1195 classes in all. Similar to the token-level prediction tasks, we adopt a multilayer perceptron for classification.

Here, h_i denotes the encoder output for token i, and AC[i] denotes the i-th amino acid in the protein sequence.

Training Objective

This is a classical classification task; we use the standard cross-entropy loss.
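A minimal sketch of a sequence-level classifier with cross-entropy loss: the mean pooling of token encodings and the single linear layer are simplifying assumptions here, standing in for the multilayer perceptron described above.

```python
import numpy as np

N_FOLD_FAMILIES = 1195

def mean_pool(hidden):
    # Collapse per-token encodings (seq_len, d) into one sequence embedding (d,).
    return hidden.mean(axis=0)

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 64))                 # stand-in encoder output
W = 0.02 * rng.normal(size=(64, N_FOLD_FAMILIES))  # linear head over 1195 classes
logits = mean_pool(hidden) @ W
loss = cross_entropy(logits, target=7)             # target fold family index
```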

3.2.3. Contact Prediction

Figure 3. Contact Prediction Task


Contact prediction asks whether a pair of amino acids is in "contact" in the folded structure (in "contact" meaning the distance in the folded structure is within 8 angstroms), which facilitates 3-dimensional free modeling of proteins. It is a classification task that assigns a binary label to each amino acid pair, indicating whether they are in contact. The contact prediction task evaluates a model's ability to capture a protein sequence's global information. Unlike the commonly used residually connected 2D-convolution networks, we adopt a simple predictor: we concatenate embedding pairs and use a multilayer perceptron for the binary classification. The large number of hidden units and representation layers, as well as the huge corpus, allow our model to capture even more long-range dependency information than common models.
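The pair-concatenation predictor can be sketched as follows. The linear scoring head and the symmetrization step are simplifying assumptions standing in for the MLP; the embeddings are random stand-ins.

```python
import numpy as np

def pair_features(hidden):
    # Concatenate the embeddings of residues i and j for every pair:
    # (L, d) -> (L, L, 2d).
    L, d = hidden.shape
    hi = np.broadcast_to(hidden[:, None, :], (L, L, d))
    hj = np.broadcast_to(hidden[None, :, :], (L, L, d))
    return np.concatenate([hi, hj], axis=-1)

def contact_logits(hidden, w, b=0.0):
    # Score every pair, then symmetrize so score(i, j) == score(j, i).
    scores = pair_features(hidden) @ w + b
    return 0.5 * (scores + scores.T)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(20, 8))   # stand-in encoder output for 20 residues
w = rng.normal(size=(16,))          # weight over the concatenated 2d features
logits = contact_logits(hidden, w)
```

Symmetrizing is a common design choice for contact heads, since contact between residues i and j is an unordered relation.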


The dataset is from ProteinNet (AlQuraishi, 2019). The evaluation metric is P@L/5, the precision of the L/5 most likely contact predictions, where L is the length of the protein sequence.
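The P@L/5 metric can be computed as below. The minimum sequence separation of 6 residues is an assumption following common contact-evaluation practice, not a value stated in this paper.

```python
import numpy as np

def precision_at_l_over_k(scores, contacts, k=5, min_sep=6):
    """P@L/k: the fraction of true contacts among the L/k highest-scoring
    residue pairs with sequence separation >= min_sep.
    scores, contacts: (L, L) arrays (upper triangle used)."""
    L = scores.shape[0]
    pairs = [(i, j) for i in range(L) for j in range(i + min_sep, L)]
    pairs.sort(key=lambda p: scores[p], reverse=True)
    top = pairs[: max(L // k, 1)]
    return sum(contacts[p] for p in top) / len(top)

# Toy map: 20 contacts at separation 10 on a length-50 protein.
L = 50
contacts = np.zeros((L, L))
for i in range(20):
    contacts[i, i + 10] = 1.0

# Scores that exactly match the truth give perfect precision.
perfect = precision_at_l_over_k(contacts, contacts)
```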

3.3. Downstream Regression Tasks

3.3.1. Fluorescence

Figure 4. Fluorescence Task


Distinguishing protein sequences with different mutations can be difficult, since the number of sequences a given number of mutations away from a parent grows combinatorially with the number of mutations. The fluorescence prediction task evaluates the model's capacity to distinguish between very similar protein sequences, as well as its ability to generalize to unseen combinations of mutations (Rao et al., 2019). Accurate predictions can facilitate exploration of the protein landscape.


The training set (Sarkisyan et al., 2016) is made up of neighborhoods of the parent green fluorescent protein, while the test set contains sequences with four or more mutations.

3.3.2. Stability

Figure 5. Stability Task


Stability is very important in the design of protein drugs, because drugs with low stability are often degraded before they take effect. The stability of a protein sequence is measured experimentally and indirectly as the upper limit of the concentration at which the protein maintains its original folded structure (Rocklin et al., 2017). For this task, the input is the amino acid sequence, and the output is a continuous value predicting the extent to which the protein maintains its folded structure.


The training set consists of proteins from four rounds of experimental design, while the test set contains Hamming-distance-1 neighbors of the top candidate proteins (Rao et al., 2019).
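Both regression tasks are scored with Spearman's rho (see Tables 5 and 6). For illustration, a self-contained numpy version of the metric; the tie-averaging rank scheme follows the standard definition.

```python
import numpy as np

def rankdata(x):
    # Ranks 1..n, with tied values sharing their average rank.
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(pred, true):
    # Spearman's rho is the Pearson correlation of the ranks.
    rp = rankdata(np.asarray(pred, dtype=float))
    rt = rankdata(np.asarray(true, dtype=float))
    rp, rt = rp - rp.mean(), rt - rt.mean()
    return (rp @ rt) / np.sqrt((rp @ rp) * (rt @ rt))

true = np.array([0.1, 0.4, 0.2, 0.9, 0.6])
pred = true ** 3   # a monotone transform preserves ranks, so rho = 1.0
```

Rank correlation is a natural choice for these tasks: it rewards getting the ordering of mutants right even when the predicted values are on a different scale than the measurements.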

4. Results

Our model obtains strong results on downstream tasks, with large improvements on four tasks: secondary structure prediction, remote homology detection, stability, and contact prediction. Notably, the performance of the 3B model on contact prediction nearly doubles that of the baseline model.

Besides, we used 10 sets of model hyper-parameters in total and conducted extensive experiments across the full series of tasks. Pretraining details can be found in Section 7 (Supplementary Materials), and all results in Table 7.

4.1. Training

We pretrained two large models on a cluster of 480 GPUs (Tesla V100 32GB) for about three weeks. The MLM loss and PPL of the pretrained models are reported in Table 1.

The 3B-parameter model reached a masked language model loss of 1.318 and a perplexity of 3.736.

The 1.2B-parameter model reached a masked language model loss of 1.335 and a perplexity of 3.802.

In pretraining, although the 3-billion-parameter model was trained for only half as many iterations as the 1.2-billion-parameter model, it reached an even smaller MLM loss and PPL. The training MLM loss, validation MLM loss, and PPL curves can be found below.

  1. 1.2B model’s training MLM loss: Figure 6

  2. 3B model’s training MLM loss: Figure 7

  3. 1.2B model’s validation MLM loss and PPL Figure 8

  4. 3B model’s validation MLM loss and PPL Figure 9

Model Protein LM (1.2B) Protein LM (3B)
MLM Loss 1.335 1.318
PPL 3.802 3.736
Table 1. MLM loss and PPL
Figure 6. 1.2 Billion Model Training Loss
Figure 7. 3 Billion Model Training Loss
Figure 8. 1.2 Billion Model Validation
Figure 9. 3 Billion Model Validation

4.2. Evaluation

Out of the five tasks, four show improved results.

  1. Contact Prediction: Table 2

  2. Remote Homology: Table 3

  3. Secondary Structure: Table 4

  4. Fluorescence: Table 5

  5. Stability: Table 6

Task contact prediction
Metric P@L/5
TAPE 0.36
ProteinLM (200M) 0.52
ProteinLM (3B) 0.75
Table 2. Contact Prediction
Task remote homology
Metric Top 1 Accuracy
TAPE 0.21
ProteinLM (200M) 0.26
ProteinLM (3B) 0.30
Table 3. Remote Homology
Task secondary structure
Metric Accuracy (Q-3)
TAPE 0.73
ProteinLM (200M) 0.75
ProteinLM (3B) 0.79
Table 4. Secondary Structure
Task fluorescence
Metric Spearman’s rho
TAPE 0.68
ProteinLM (200M) 0.68
ProteinLM (3B) 0.68
Table 5. Fluorescence
Task stability
Metric Spearman’s rho
TAPE 0.73
ProteinLM (200M) 0.77
ProteinLM (3B) 0.79
Table 6. Stability
Model CC@L/5 CC@L/2 CC@L Fluorescence RH SS Q@3 SS Q@8 Stability
hidden-512-layer-32-head-8 0.503 0.477 0.409 0.679 0.205 0.716 0.578 0.758
hidden-768-layer-12-head-6 0.487 0.428 0.369 0.677 0.198 0.721 0.570 0.770
hidden-768-layer-16-head-16 0.534 0.469 0.396 0.676 0.205 0.722 0.575 0.762
hidden-768-layer-16-head-24 0.519 0.427 0.376 0.678 0.192 0.719 0.572 0.687
hidden-1024-layer-12-head-16 0.572 0.490 0.419 0.676 0.209 0.729 0.584 0.744
hidden-1024-layer-12-head-32 0.500 0.446 0.377 0.680 0.201 0.721 0.575 0.762
hidden-2048-layer-12-head-16 0.676 0.576 0.495 0.677 0.266 0.752 0.614 0.732
hidden-2048-layer-24-head-16 0.710 0.658 0.563 0.678 0.271 0.791 0.652 0.679
hidden-2048-layer-24-head-8 0.673 0.600 0.531 0.674 0.262 0.762 0.624 0.785
hidden-3072-layer-24-head-16 0.753 0.662 0.566 0.681 0.298 0.791 0.654 0.794
Table 7. Results for comparative experiments.

CC@L/k denotes contact prediction precision over the top L/k predicted contacts (P@L/k).
SS denotes secondary structure classification.
RH denotes remote homology.

5. Contact Map Visualization

Generally, the accuracy of predictions on the anti-diagonal can reflect the model’s ability to capture long-range dependency. Therefore, we also visualized the ground truth contact maps, as well as contact maps predicted by our model and TAPE. The contact map below demonstrates that our model is good at capturing long-range dependency.

5.1. Factual Contact Map

We visualize the contact map of protein #TBM-T0889 in Figure 10; we can intuitively see that there are many long-range contacts (contacts separated by at least 24 residues). This protein sequence can therefore be used to distinguish different models' ability to capture long-distance dependencies.
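The range buckets used in this section can be expressed as below. The `local`/`short` boundaries are assumptions following common contact-evaluation convention; the text itself only defines medium as 12-23 residues apart and long as at least 24.

```python
def contact_range(i, j):
    """Bucket a residue pair by sequence separation: long >= 24,
    medium 12-23, short 6-11, local otherwise (assumed boundaries)."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return "local"
```

Counting predicted contacts per bucket makes the visual comparison below quantitative: a model that only fills the band near the diagonal captures local structure but misses long-range dependencies.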

Figure 10. Factual Contact Map

5.2. TAPE Contact Map

From the visualized predictions of TAPE (Figure 11), we can see that the TAPE model (a small-scale transformer) captures medium-range contacts (contacts separated by 12-23 residues). For long-range contact prediction, however, there are many misses along the anti-diagonal belt.

Figure 11. TAPE Prediction

5.3. ProteinLM Contact Map

The ProteinLM-3B model shows very good performance in contact prediction, and the visualized prediction map confirms this. ProteinLM-3B captures both medium-range and long-range dependencies, with many hits on the anti-diagonal belt.

Figure 12. ProteinLM-3B Prediction

5.4. Analysis

In the visualized contact map of our model, the contacts on the anti-diagonal are predicted accurately.

6. Summary

We propose ProteinLM, a large-scale pretrained model for protein sequences. The optimizations we introduced into training make billion-parameter model training possible and efficient. The significantly improved performance on downstream tasks shows that as the model scale increases, the biological information and long-range dependencies in sequences can be captured more accurately. In addition, through a large number of controlled experiments, we found some rules of thumb for hyperparameter selection.


7. Supplementary Materials

The curves of training MLM loss, validation MLM loss, and validation PPL are provided below.

Model Description / MLM and PPL curve
hidden size = 256, transformer layers = 8, attention heads = 16: Figure 13
hidden size = 512, transformer layers = 32, attention heads = 8: Figure 14
hidden size = 768, transformer layers = 12, attention heads = 6: Figure 15
hidden size = 768, transformer layers = 16, attention heads = 16: Figure 16
hidden size = 768, transformer layers = 16, attention heads = 24: Figure 17
hidden size = 1024, transformer layers = 12, attention heads = 16: Figure 18
hidden size = 1024, transformer layers = 12, attention heads = 32: Figure 19
hidden size = 2048, transformer layers = 12, attention heads = 16: Figure 20
hidden size = 2048, transformer layers = 24, attention heads = 8: Figure 21

With limited computing resources and time, hyper-parameter selection is critical to model performance, and a trade-off is necessary. The depth (number of transformer layers) of the model has a greater impact on performance than its width (hidden size).

On the one hand, a model that is too flat performs poorly, even with a large hidden size. One wide, shallow model we trained had the fastest training speed (average time per iteration) among all models, yet failed to converge after 5 days of training.

On the other hand, a model that is too deep is not a feasible choice under limited training time. The training time of the 32-layer model (Figure 14) is 3.5 times that of the 12-layer, hidden-size-2048 model (Figure 20), and training the 32-layer model takes about 25 days.

Our empirical conclusion is that models of moderate depth and width are feasible and balance training efficiency and resource consumption well.

Figure 13. 8 layers, hidden size = 256, 16 attention heads
Figure 14. 32 layers, hidden size = 512, 8 attention heads
Figure 15. 12 layers, hidden size = 768, 6 attention heads
Figure 16. 16 layers, hidden size = 768, 16 attention heads
Figure 17. 16 layers, hidden size = 768, 24 attention heads
Figure 18. 12 layers, hidden size = 1024, 16 attention heads
Figure 19. 12 layers, hidden size = 1024, 32 attention heads
Figure 20. 12 layers, hidden size = 2048, 16 attention heads
Figure 21. 24 layers, hidden size = 2048, 8 attention heads