1 Introduction
Protein-protein interactions (PPIs) play an important role in most biological processes. In addition to direct physical binding, PPIs also cooperate and regulate each other in many indirect ways, such as exchanging reaction products, participating in signal relay mechanisms, or jointly contributing to specific organismal functions [24]. The study of PPIs and their interaction types is therefore essential to understanding cellular biological processes in normal and disease states, which in turn facilitates therapeutic target identification and novel drug design [20]. There are many experimental methods to detect PPIs, of which the most conventional and widely used high-throughput method is yeast two-hybrid screening [9]. However, experiment-based methods are expensive and time-consuming, and more importantly, even when an experiment detects a PPI, it cannot fully characterize its types [3]. We therefore urgently need reliable computational methods, learned from the accumulated PPI data, to predict unknown PPIs accurately.

Although long-term research efforts [10, 11, 2] have made noticeable progress, existing methods suffer from significant performance degradation when tested on unseen datasets. Take the state-of-the-art model PIPR [2] as an example: its micro-F1 score drops from 92.42 on the trainset-homologous SHS148k testset to 53.85 on the larger STRING testset. For further investigation, we divide the STRING testset into BS, ES and NS subsets, where BS denotes that Both of the paired proteins were Seen during training, ES denotes that Either (but not both) of the paired proteins was Seen, and NS denotes that Neither protein was Seen during training. As clearly shown in Figure 1, poor performance on the ES and NS subsets (collectively termed inter-novel-protein interactions in this paper) is the main reason for the performance degradation.
On the other hand, current evaluations on the trainset-homologous SHS148k testset apply a protein-irrespective, per-interaction randomized strategy to divide the trainset and testset; consequently, BS comprises over 92% of the whole testset and dominates the overall performance (see Appendix A for more discussion). These evaluations overlook the inter-novel-protein interactions and are thus not instructive about performance on other datasets. As a result, in this paper we first design a new evaluation framework with two per-protein randomized data partition strategies. Instead of simple protein-independent randomization, we also take the distance between proteins into consideration and utilize Breadth-First and Depth-First Search to construct the testset. Comparison experiments between the trainset-homologous testset and the unseen STRING testset demonstrate that the proposed evaluation gives consistent assessments across datasets.
Besides the evaluation, on the methodology side existing works treat PPIs as independent instances; correlations between proteins have long been ignored. Intuitively, for predicting the type of interaction between proteins A and B, the interactions between proteins A and C, as well as between B and C, should provide useful information. These correlations can be naturally modeled and exploited with a graph, where proteins serve as the nodes and interactions as the edges. In this paper, the graph is processed with a graph neural network based model (GNN-PPI). As demonstrated in Figure 1, the introduction of correlations and the proposed GNN-PPI model largely narrow the performance gap between the BS, ES and NS subsets.
In summary, the contributions of this paper are three-fold:
- We design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessments across datasets.
- We propose to incorporate correlations between proteins into the PPI prediction problem. A graph neural network based method is presented to model the correlations.
- The proposed GNN-PPI model achieves state-of-the-art performance on real datasets of different scales, especially for inter-novel-protein interaction prediction.
2 Related Work
The primary amino acid sequence is confirmed to contain all the information of a protein [1] and is extremely easy to obtain. Thus, there is longstanding interest in sequence-based methods for modeling protein-related tasks. Research on PPI prediction and classification can be summarized into two stages. Early research is based on Machine Learning (ML) [10, 25, 19, 18]. These methods provide feasible solutions, but their performance is limited by the PPI feature representation and model expressiveness. Deep Learning (DL) has recently been widely used in bioinformatics problems due to its powerful expressive ability, including PPI prediction and classification. These works [14, 11, 2, 22] typically use Convolutional Neural Networks or Recurrent Neural Networks to extract features from the amino acid sequence of the protein.
More recent work has focused on the feature representation of proteins. [17] proposes a novel deep multi-modal architecture that extracts multi-modal information from protein structure and existing textual information in the biomedical literature. [16] proposes a Transformer-based neural network to generate pre-trained protein embeddings. In the latest research, [27] considers the correlation of PPIs and first proposed using GCN [13] to automatically learn protein features in the PPI network. However, their work cannot be extended to multi-label PPI classification.
To the best of our knowledge, existing PPI work has not been concerned with the problem of inter-novel-protein interactions. However, in the field of Drug-drug Interaction (DDI), [4] divided the testset according to whether the drug was seen during training, and the results show that performance on inter-novel-drug interactions degrades severely; the original paper does not propose a solution.
3 Methodology

Figure 2: Development and evaluation of the GNN-PPI framework. Pairwise interaction data are first assembled to build the graph, where proteins serve as the nodes and interactions as the edges. The testset is constructed by first selecting the root node and then performing the proposed BFS or DFS strategy. The model first performs embedding for each protein to obtain predefined features, which are then processed by Convolution, Pooling, BiGRU and FC modules to extract protein-independent encoding (PIE) features; these are finally aggregated by graph convolutions to arrive at protein-graph encoding (PGE) features. Features of the paired proteins in an interaction are multiplied and classified, supervised by the trainset labels.
3.1 Problem Formulation
Suppose we have a protein set $\mathcal{P} = \{p_i\}_{i=1}^{N}$ and a PPI set $\mathcal{X} = \{x_{ij} \mid p_i, p_j \in \mathcal{P}\}$. $I(\cdot,\cdot)$ is the PPI indicator function: if $I(p_i, p_j) = 1$, then protein $p_i$ interacts with protein $p_j$. Note that when $I(p_i, p_j) = 0$, it may mean that proteins $p_i$ and $p_j$ do not interact, or that they have a potential interaction that has not been discovered so far. In order to avoid unnecessary errors, we do not perform any operation on unknown protein pairs (default $I(p_i, p_j) = 0$). We define the PPI label space as $\mathcal{Y} = \{1, \dots, C\}$ with $C$ possible interaction types. For each PPI $x_{ij}$, its labels are represented as a multi-hot vector $y_{ij} \in \{0, 1\}^{C}$. In summary, the multi-type PPI dataset is defined as $\mathcal{D} = \{(x_{ij}, y_{ij})\}$. Considering the correlation of PPIs, we use proteins as nodes and PPIs as edges to build the PPI graph $\mathcal{G} = (\mathcal{P}, \mathcal{X})$.
The task of multi-type PPI learning is to learn a model $\mathcal{F}$ from the training set $\mathcal{D}_{\text{train}}$. For any protein pair $x_{ij} \in \mathcal{D}_{\text{test}}$, the model predicts $\hat{y}_{ij} = \mathcal{F}(x_{ij})$ as the set of proper labels for $x_{ij}$. The above-mentioned $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ are obtained from $\mathcal{D}$ based on the evaluation, where $\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}} = \mathcal{D}$ and $\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{test}} = \emptyset$. Further, according to whether a protein was seen during training, the protein set is divided into known $\mathcal{P}_{\text{known}}$ and unknown $\mathcal{P}_{\text{unknown}}$. Moreover, as mentioned in Section 1, $\mathcal{D}_{\text{test}}$ can be divided into $\mathcal{D}_{\text{test}}^{BS}$, $\mathcal{D}_{\text{test}}^{ES}$ and $\mathcal{D}_{\text{test}}^{NS}$, defined as follows:

$$\mathcal{D}_{\text{test}}^{BS} = \{x_{ij} \mid p_i \in \mathcal{P}_{\text{known}} \wedge p_j \in \mathcal{P}_{\text{known}}\}, \quad \mathcal{D}_{\text{test}}^{ES} = \{x_{ij} \mid p_i \in \mathcal{P}_{\text{known}} \oplus p_j \in \mathcal{P}_{\text{known}}\}, \quad \mathcal{D}_{\text{test}}^{NS} = \{x_{ij} \mid p_i \in \mathcal{P}_{\text{unknown}} \wedge p_j \in \mathcal{P}_{\text{unknown}}\}.$$

Since inter-novel-protein interactions are the main bottleneck, we require the testset of the evaluation framework to satisfy $\mathcal{D}_{\text{test}} = \mathcal{D}_{\text{test}}^{ES} \cup \mathcal{D}_{\text{test}}^{NS}$. Our goal is that, under this evaluation, the model learned from $\mathcal{D}_{\text{train}}$ can accurately predict the multi-label types of the PPIs in $\mathcal{D}_{\text{test}}$.
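To make the subset definitions concrete, the following minimal sketch (ours, not from the paper's code release; names are illustrative) splits a testset into BS, ES and NS given the set of proteins seen during training:

```python
from typing import Dict, List, Set, Tuple

def split_bs_es_ns(
    test_ppis: List[Tuple[str, str]],
    train_proteins: Set[str],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group test PPIs by how many of their two proteins were seen in training."""
    subsets: Dict[str, List[Tuple[str, str]]] = {"BS": [], "ES": [], "NS": []}
    for p_i, p_j in test_ppis:
        seen = (p_i in train_proteins) + (p_j in train_proteins)
        key = {2: "BS", 1: "ES", 0: "NS"}[seen]  # both / either / neither seen
        subsets[key].append((p_i, p_j))
    return subsets

# Example: one BS, one ES and one NS pair.
print(split_bs_es_ns([("A", "B"), ("A", "X"), ("X", "Y")], {"A", "B", "C"}))
```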
3.2 Overview
The GNN-PPI framework and evaluation are shown in Figure 2. We introduce GNN-PPI from the following three aspects. First, the Evaluation Framework: we propose two heuristic data partition schemes based on the PPI network, and the generated testset meets the condition above. Second, Protein Feature Encoding: we design Protein-Independent Encoding (PIE) and Protein-Graph Encoding (PGE) modules to encode protein features. Last, Multi-label PPI Prediction: for unknown PPIs, we combine their protein features encoded by the previous modules, compute their scores for the different PPI types, and output the multi-label prediction.
3.3 Evaluation Framework
Existing machine learning algorithms usually divide off a random part of the dataset as a testset to evaluate the performance of the model. However, in PPI-related tasks, we have the following corollary, derived from the Erdős–Rényi (ER) random graph model [7, 8]:

Corollary 1.
Randomly divide the PPI dataset and select a subset as the testset; then most of the proteins in the testset were seen in training.
The detailed proof of the corollary is elaborated in Appendix A. It can be inferred from this corollary that performance on a testset obtained by random division only reflects the ability to predict PPIs between known proteins. In the real world, there are still many proteins, and many of their PPIs, that have not been discovered. We perform empirical studies by comparing two different time points, 2021/01/25 and 2020/04/11, of the Homo sapiens subset of the BioGRID (https://thebiogrid.org/) database [21]. We found that the newly discovered proteins exhibit BFS-like or DFS-like local patterns (more details in Appendix E). Even when the PPIs have been discovered, most of their types remain relatively unexplored. Therefore, we need a brand-new evaluation that can reflect the model's predictive performance on inter-novel-protein interactions. We describe the designed evaluation framework below.
We design two heuristic evaluation schemes based on the PPI network, namely BFS and DFS. They simulate two scenarios of unknown proteins in reality:
- Unknown proteins interact tightly with each other and exist in the form of clusters in the PPI network (see Figure 3(b)).
- Unknown proteins are sparsely distributed in the PPI network and have little interaction with each other (see Figure 3(c)).
We select a root node $p_{\text{root}}$, fix the size of the testset, and then use the Breadth-First Search (BFS) algorithm in the PPI network to obtain the proteins that meet scenario 1; all PPIs related to these proteins form the generated testset. For scenario 2, we could simply select proteins at random; however, in order to maintain the connectivity of the PPI networks of $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$, we use the Depth-First Search (DFS) algorithm to simulate it. The details of the data partition algorithm are shown in Algorithm 1, where we do not spell out the BFS and DFS algorithms themselves but use a search function that returns the next protein after the current protein under each search order, and a neighbor function that returns all neighbors of a protein. We also control the degree of the root node to simulate newly discovered proteins (usually few proteins interact with them).
Input: protein set $\mathcal{P}$; PPI set $\mathcal{X}$; testset size; root node selection threshold; search order (BFS or DFS).
Output: $\mathcal{D}_{\text{train}}$; $\mathcal{D}_{\text{test}}$.
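The paper states Algorithm 1 only at this level of detail, so the following Python sketch is our reading of it (function and variable names are illustrative): pick a low-degree root, grow the selected protein set by BFS or DFS, and collect every PPI incident to a selected protein as the testset.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple

def partition_ppis(
    proteins: List[str],
    ppis: List[Tuple[str, str]],
    test_size: int,          # target number of test PPIs
    degree_threshold: int,   # max root degree (simulates a newly discovered protein)
    order: str = "bfs",      # "bfs" or "dfs"
    seed: int = 0,
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    rng = random.Random(seed)
    edges = {tuple(sorted(e)) for e in ppis}      # canonical undirected edges
    neighbors: Dict[str, Set[str]] = {p: set() for p in proteins}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Pick a low-degree root, mimicking proteins with few known interactions.
    root = rng.choice([p for p in proteins if len(neighbors[p]) <= degree_threshold])

    selected, frontier = {root}, deque([root])
    test_edges: Set[Tuple[str, str]] = set()
    while frontier and len(test_edges) < test_size:
        # BFS pops from the front (queue); DFS pops from the back (stack).
        cur = frontier.popleft() if order == "bfs" else frontier.pop()
        for nxt in neighbors[cur]:
            test_edges.add(tuple(sorted((cur, nxt))))  # all PPIs of a selected protein
            if nxt not in selected:
                selected.add(nxt)
                frontier.append(nxt)

    return edges - test_edges, test_edges             # (train, test)
```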
3.4 Protein Feature Encoding
Previous work [2] has shown that protein features based on the amino acid sequence benefit the performance of PPI-related tasks. Therefore, we design a Protein-Independent Encoding (PIE) module, which contains Conv1d with pooling, a BiGRU, and a fully connected (FC) layer, to generate protein feature representations as input to the PPI network.
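A PIE-like encoder could be sketched as follows in PyTorch. The module order (Conv1d, pooling, BiGRU, FC) follows the description above, while kernel sizes and hidden widths are illustrative assumptions rather than the paper's exact configuration (the 13-dim input matches the per-residue embedding of Appendix C, and the 256-dim output the setting in Appendix Table 9):

```python
import torch
import torch.nn as nn

class PIE(nn.Module):
    """Protein-Independent Encoding: Conv1d + pooling + BiGRU + FC over residues."""
    def __init__(self, in_dim: int = 13, hidden: int = 128, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) per-residue embeddings
        h = self.pool(torch.relu(self.conv(x.transpose(1, 2))))  # (batch, hidden, seq/2)
        h, _ = self.gru(h.transpose(1, 2))                       # (batch, seq/2, 2*hidden)
        return self.fc(h.mean(dim=1))                            # one vector per protein

protein_vecs = PIE()(torch.randn(4, 2000, 13))  # 4 proteins of fixed length 2000
print(protein_vecs.shape)                       # torch.Size([4, 256])
```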
The subsequent Protein-Graph Encoding (PGE) module is the core of GNN-PPI. Inspired by the wide use of PPI networks in bioinformatics computing, we construct the PPI network $\mathcal{G} = (\mathcal{P}, \mathcal{X})$ and convert the original independent learning task into a graph-based learning task. GNNs are currently the most effective graph representation learning methods; their main idea is a recursive neighborhood aggregation scheme, where each node computes a new feature by aggregating the previous features of its neighbor nodes. After $k$ iterations, a node is represented by its transformed feature, which captures the structural information within the node's $k$-hop neighborhood. More specifically, the $k$-th iteration of a GNN is

$$a_v^{(k)} = \mathrm{Agg}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big), \qquad h_v^{(k)} = \mathrm{Update}\big(h_v^{(k-1)}, a_v^{(k)}\big),$$

where $h_v^{(k)}$ is the feature of node $v$ at the $k$-th iteration. The design of Agg(·) and Update(·) is the key to different GNN architectures. In this paper we use the Graph Isomorphism Network (GIN) [26], where the sum of the neighbor node features is used as the aggregation function, and a multi-layer perceptron (MLP) is used to update the aggregated features. GIN thus updates node features as

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big),$$

where $\epsilon$ can be a learnable parameter or a fixed scalar.
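As a concrete instance of the GIN update above, a single layer over a dense adjacency matrix can be written in a few lines of plain PyTorch (a sketch; production code would use sparse message passing):

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN iteration: h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)."""
    def __init__(self, dim: int):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))  # learnable epsilon
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1 matrix
        neighbor_sum = adj @ h                   # sum aggregation over neighbors
        return self.mlp((1.0 + self.eps) * h + neighbor_sum)
```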
3.5 Multi-label PPI Prediction
With the protein features $h_i$ and $h_j$ learned in the previous stages for the PPI $x_{ij}$, we combine the features of $p_i$ and $p_j$ with a dot-product operation, and then use a fully connected (FC) layer as the classifier for multi-label PPI prediction, expressed as $\hat{y}_{ij} = \mathrm{FC}(h_i \odot h_j)$. The PIE and PGE modules are jointly trained in an end-to-end way. Given a training set $\mathcal{D}_{\text{train}}$ and its ground-truth multi-label interactions $y_{ij}$, we use the multi-task binary cross-entropy as the loss function:

$$\mathcal{L} = -\sum_{x_{ij} \in \mathcal{D}_{\text{train}}} \sum_{c=1}^{C} \Big[ y_{ij}^{c} \log \hat{y}_{ij}^{c} + \big(1 - y_{ij}^{c}\big) \log\big(1 - \hat{y}_{ij}^{c}\big) \Big].$$
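A minimal sketch of this head follows; we read the dot-product combination as an element-wise product so that the FC classifier receives a vector (an assumption consistent with the formula above), and we use the logits form of binary cross-entropy for numerical stability:

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Combine a pair's features element-wise, then classify into C PPI types."""
    def __init__(self, dim: int = 50, num_types: int = 7):
        super().__init__()
        self.fc = nn.Linear(dim, num_types)

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        return self.fc(h_i * h_j)  # logits, one per interaction type

head = PairClassifier()
criterion = nn.BCEWithLogitsLoss()  # multi-task binary cross-entropy (logits form)
logits = head(torch.randn(8, 50), torch.randn(8, 50))
loss = criterion(logits, torch.randint(0, 2, (8, 7)).float())
```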
Different from algorithms that consider PPIs independently, GNN-PPI learns to combine protein neighbors to generate feature representations. Therefore, for a testset constructed by our proposed BFS or DFS partitions, GNN-PPI can still rely on the neighbors of a protein to generate suitable feature representations for multi-type PPI prediction. On the other hand, even if the PPI network used during training is constructed with only $\mathcal{D}_{\text{train}}$, the model performs well on unknown PPIs (see details in Table 4).
Table 1: Micro-F1 comparison of different methods under different datasets and partition schemes (mean±std over 3 random seeds).

| Dataset | Partition Scheme | SVM | RF | LR | DPPI | DNN-PPI | PIPR | GNN-PPI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SHS27k | Random | 75.35±1.05 | 78.45±0.88 | 71.55±0.93 | 73.99±5.04 | 77.89±4.97 | 83.31±0.75 | 87.91±0.39 |
| SHS27k | BFS | 42.98±6.15 | 37.67±1.57 | 43.06±5.05 | 41.43±0.56 | 48.90±7.24 | 44.48±4.44 | 63.81±1.79 |
| SHS27k | DFS | 53.07±5.16 | 35.55±2.22 | 48.51±1.87 | 46.12±3.02 | 54.34±1.30 | 57.80±3.24 | 74.72±5.26 |
| SHS148k | Random | 80.55±0.23 | 82.10±0.20 | 67.00±0.07 | 77.48±1.39 | 88.49±0.48 | 90.05±2.59 | 92.26±0.10 |
| SHS148k | BFS | 49.14±5.30 | 38.96±1.94 | 47.45±1.42 | 52.12±8.70 | 57.40±9.10 | 61.83±10.23 | 71.37±5.33 |
| SHS148k | DFS | 58.59±0.07 | 43.26±3.43 | 51.09±2.09 | 52.03±1.18 | 58.42±2.05 | 63.98±0.76 | 82.67±0.85 |
| STRING | Random | - | 88.91±0.08 | 67.74±0.16 | 94.85±0.13 | 83.08±0.11 | 94.43±0.10 | 95.43±0.10 |
| STRING | BFS | - | 55.31±1.02 | 50.54±2.00 | 56.68±1.04 | 53.05±0.82 | 55.65±1.60 | 78.37±5.40 |
| STRING | DFS | - | 70.80±0.45 | 61.28±0.53 | 66.82±0.29 | 64.94±0.93 | 67.45±0.34 | 91.07±0.58 |
4 Experiment
4.1 Dataset
We use multi-type PPI data from the STRING database (https://string-db.org/) [23] to evaluate our proposed GNN-PPI. The STRING database collected, scored, and integrated most publicly available sources of protein-protein interaction information and built a comprehensive and objective global PPI network, including direct (physical) and indirect (functional) interactions. In this paper, we focus on the multi-type classification of PPIs by STRING. It divides PPIs into 7 types, namely reaction, binding, post-translational modification (ptmod), activation, inhibition, catalysis, and expression; each pair of interacting proteins carries at least one of them. [2] randomly selected 1,690 and 5,189 proteins from the Homo sapiens subset of STRING, sharing less than 40% sequence identity, to generate two subsets, namely SHS27k and SHS148k, which contain 7,624 and 44,488 multi-label PPIs, respectively. In addition, we use all PPIs of Homo sapiens as our third dataset, namely STRING, which contains 15,335 proteins and 593,397 PPIs. We use these three PPI datasets of different sizes to evaluate GNN-PPI and the other PPI methods in the following experiments.
4.2 Experimental Details
4.2.1 Experimental Settings and Metrics
We select 20% of the PPIs for testing, using our proposed BFS and DFS partitions as well as the original evaluation (Random). The BFS or DFS partition algorithm yields completely different results for different root nodes; to simulate the realistic scenario mentioned in Section 3.3, the root node's degree should not be too large, so we set a threshold on the root node degree. To eliminate the influence of the randomness of data partitioning on the performance of PPI methods, we repeat each experiment under 3 different random seeds. We use protein features based on the amino acid sequence and, following [2], use an embedding method to represent each amino acid (details in Appendix C). We adopt the Adam algorithm [12] to optimize all trainable parameters. The other hyper-parameter settings are shown in Appendix Table 9.
We evaluate multi-label PPI prediction performance using micro-F1, because micro-averaging emphasizes the common labels in the dataset and gives each sample the same importance. Since the different PPI types in the datasets we use are highly imbalanced, micro-F1 is preferred. Even so, we also evaluate the F1 performance of each PPI type; the results are shown in Appendix Table 10.
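For reference, micro-F1 over multi-hot labels can be computed directly with scikit-learn; the arrays below are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy example: 4 test PPIs, 7 interaction types, multi-hot labels.
y_true = np.array([[1, 0, 0, 1, 0, 1, 0],
                   [0, 1, 0, 0, 0, 1, 0],
                   [1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 0, 1, 0],
                   [1, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 1, 0]])

# Micro-averaging pools TP/FP/FN across all types before computing F1,
# so frequent types dominate, matching the paper's choice for imbalanced labels.
print(f1_score(y_true, y_pred, average="micro"))
```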
4.2.2 Baselines
We compare GNN-PPI against a variety of baselines, which can be categorized as follows:
1. Machine Learning based: We choose three representative machine learning (ML) algorithms, SVM [10], RF [25], and LR [19]. As input features, all of them use common handcrafted protein features, AC [10] and CTD [5], of which CTD uses seven attributes for the division (see Appendix B).
2. Deep Learning based: We choose three representative deep learning (DL) algorithms for PPI prediction, PIPR [2], DNN-PPI [14] and DPPI [11]. We construct the same architectures as the original papers and modify the output of the original implementations from binary classification to multi-label prediction. The protein input features based on the amino acid sequence are consistent with GNN-PPI; the other settings are the same as in the original papers.
Table 2: Micro-F1 of PIPR and GNN-PPI on the BS, ES and NS subsets of the testset, together with the subset proportions.

| Dataset | Partition Scheme | BS (PIPR / GNN-PPI) | ES (PIPR / GNN-PPI) | NS (PIPR / GNN-PPI) | Proportion BS/ES/NS (%) | All (PIPR / GNN-PPI) |
| --- | --- | --- | --- | --- | --- | --- |
| SHS27k | Random | 83.12 / 88.31 | 64.48 / 74.28 | 35.29 / 33.33 | 92.2 / 7.5 / 0.3 | 81.58 / 87.11 |
| SHS27k | BFS | - | 44.92 / 68.08 | 30.34 / 46.25 | 0.0 / 72.6 / 27.4 | 40.92 / 62.10 |
| SHS27k | DFS | - | 58.25 / 72.22 | 48.77 / 63.22 | 0.0 / 88.6 / 11.4 | 57.17 / 71.19 |
| SHS148k | Random | 92.82 / 92.24 | 78.80 / 73.09 | 40.72 / 36.36 | 97.2 / 2.7 / 0.1 | 92.42 / 91.68 |
| SHS148k | BFS | - | 62.80 / 72.51 | 73.82 / 77.02 | 0.0 / 69.7 / 30.3 | 66.13 / 73.88 |
| SHS148k | DFS | - | 64.17 / 83.37 | 55.51 / 73.08 | 0.0 / 91.9 / 8.1 | 63.47 / 82.54 |
| STRING | Random | 94.32 / 95.42 | 61.65 / 77.68 | 33.33 / 57.14 | 99.7 / 0.3 / 0 | 94.23 / 95.37 |
| STRING | BFS | - | 56.71 / 83.99 | 39.87 / 72.83 | 0.0 / 85.8 / 14.2 | 54.31 / 82.41 |
| STRING | DFS | - | 68.61 / 90.38 | 55.22 / 87.07 | 0.0 / 94.3 / 5.7 | 67.84 / 90.19 |
4.3 Results and Analysis
Table 1 compares the performance of different methods under different evaluations and datasets. First, considering the impact of different evaluations, we can see that every method in Table 1 performs well under the Random partition; however, under the BFS or DFS partition, the performance of every method except GNN-PPI declines markedly. Moreover, performance under DFS is generally higher than under BFS, which means that the clustered distribution of unknown proteins in the PPI network is harder to learn than the discrete distribution. Next, observing the performance on different datasets: regardless of the evaluation, the performance of every method improves as the data size increases, but the problems mentioned above are not trivially solved by increasing the amount of data. Finally, comparing different methods, DL-based methods are generally better than ML-based ones, and GNN-PPI achieves state-of-the-art performance. Under the Random partition, the advantage of GNN-PPI over the DL-based methods shrinks as the dataset size increases. The most prominent advantage of GNN-PPI is that, under the BFS or DFS partition and for inter-novel-protein interactions, it can still learn useful feature representations from protein neighbors and thus obtain good multi-label PPI prediction performance. In summary, the experimental results show that GNN-PPI effectively improves the prediction accuracy of inter-novel-protein interactions. How to further push this performance to be comparable with the Random partition remains a problem worthy of discussion, and it is our future work.
We make a more in-depth analysis of the performance of PIPR and GNN-PPI on the BS, ES and NS subsets, as shown in Table 2. Observing the proportions of the different subsets of the testset, we find that under the Random partition more than 92% of the test samples belong to BS, which is consistent with our Corollary 1. PIPR performs well on the randomly divided testset (81.58 on SHS27k, 92.42 on SHS148k, and 94.23 on STRING), but if we investigate the testset further, we find that PIPR performs very poorly on the inter-novel-protein interactions (ES or NS); the overall score is dominated by BS, which has accurate performance and a high proportion. According to the results of Tables 1 and 2, with sufficient ES and NS data we can assert that methods which treat each PPI as an independent sample (represented by PIPR) cannot accurately predict inter-novel-protein interactions. On the contrary, our proposed GNN-PPI still performs well under BFS and DFS. Moreover, as the data size increases, its advantage grows (e.g., 82.41 vs. 54.31 on STRING-BFS and 90.19 vs. 67.84 on STRING-DFS).
Table 3: Micro-F1 under each partition scheme when testing on the trainset-homologous testset versus the full STRING dataset.

| Methods | Trainset | Testset | Random | BFS | DFS |
| --- | --- | --- | --- | --- | --- |
| PIPR | SHS27k-Train | SHS27k-Test | 81.58 | 40.92 | 57.17 |
| PIPR | SHS27k-Train | STRING | 42.79 | 48.55 | 57.44 |
| PIPR | SHS148k-Train | SHS148k-Test | 92.42 | 66.13 | 63.47 |
| PIPR | SHS148k-Train | STRING | 53.85 | 63.74 | 62.46 |
| GNN-PPI | SHS27k-Train | SHS27k-Test | 87.11 | 62.10 | 71.19 |
| GNN-PPI | SHS27k-Train | STRING | 66.85 | 66.39 | 67.43 |
| GNN-PPI | SHS148k-Train | SHS148k-Test | 91.68 | 73.88 | 82.54 |
| GNN-PPI | SHS148k-Train | STRING | 73.12 | 67.43 | 70.64 |
Table 4: Micro-F1 of GNN-PPI with the PPI graph constructed from all data (GCA) versus from the trainset only (GCT).

| Partition Scheme | Graph | SHS27k | SHS148k | STRING |
| --- | --- | --- | --- | --- |
| BFS | GCA | 63.81±1.79 | 71.37±5.33 | 78.37±5.40 |
| BFS | GCT | 60.61±5.32 | 69.56±6.89 | 73.23±3.93 |
| DFS | GCA | 74.72±5.26 | 82.67±0.85 | 91.07±0.58 |
| DFS | GCT | 73.42±5.50 | 80.35±2.20 | 89.04±1.06 |
Next, we study the ability of the different evaluations to assess a model's generalization. We take a trained model's test performance on the larger STRING dataset as its true generalization ability; if the gap between the trainset-homologous test performance and this generalization is small, the evaluation reflects the model's generalization well. The experimental results are shown in Table 3. Under the previous evaluation (Random), whether for PIPR or GNN-PPI, the test performance on the STRING dataset drops severely; as we speculated, it cannot reflect the generalization of the model. On the contrary, under the BFS or DFS evaluation, the trainset-homologous test performance does reflect the model's true performance, whether good or bad (e.g., 66.13 vs. 63.74 for PIPR-SHS148k-BFS and 71.19 vs. 67.43 for GNN-PPI-SHS27k-DFS). In fact, the testset obtained by BFS or DFS is theoretically of the same kind as the samples tested on STRING; the only difference is the proportion of the different PPI subsets (BS, ES and NS). When testing on STRING, the proportion of NS is higher.
Finally, we study the impact of the PPI network graph construction method (mentioned in Section 3.5) in GNN-PPI. There are two graph construction methods: graph constructed from all data (GCA, using $\mathcal{X}$) and graph constructed from the trainset only (GCT, using $\mathcal{X}_{\text{train}}$). The experimental results are shown in Table 4. The performance of GCA consistently exceeds that of GCT, which is reasonable because the graph construction of GCA accesses more complete information than GCT. Compared with BFS, under DFS the performance of GCT is closer to that of GCA, which seems to indicate that the more complete the protein neighborhood, the better the performance. More noteworthy is that GCT still performs much better than the non-graph algorithms, which shows the superiority of GNNs in the few-shot regime of multi-label PPI prediction. Moreover, for unknown proteins we often cannot know their neighbors in advance; the effectiveness of GCT shows that the trained model is robust to newly discovered proteins and their interactions.
5 Conclusion
In this paper, we study the significant performance degradation of existing PPI methods when tested on unseen datasets. Experimental results show that this problem stems from the models' poor performance on inter-novel-protein interactions. However, current evaluations overlook inter-novel-protein interactions and are thus not instructive about performance on unseen datasets. Therefore, we design a new evaluation framework with two per-protein randomized data partition strategies, namely BFS and DFS, and propose a GNN-based method, GNN-PPI, to model the correlations between PPIs. Our experimental results show that GNN-PPI outperforms state-of-the-art PPI prediction methods under both the original evaluation and our proposed one, especially for inter-novel-protein interaction prediction.
References
- [1] (1972) The formation and stabilization of protein structure. Biochemical Journal 128(4), pp. 737–749.
- [2] (2019) Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35(14), pp. i305–i314.
- [3] (2010) Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Computational Biology 6(6), pp. e1000807.
- [4] (2020) A multimodal deep learning framework for predicting drug–drug interaction events. Bioinformatics.
- [5] (2017) DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. Journal of Chemical Information and Modeling 57(6), pp. 1499–1510.
- [6] (1995) Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences 92(19), pp. 8700–8704.
- [7] (1959) On random graphs I. Publicationes Mathematicae Debrecen 6, pp. 290–297.
- [8] (1960) On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5(1), pp. 17–60.
- [9] (1989) A novel genetic system to detect protein–protein interactions. Nature 340(6230), pp. 245–246.
- [10] (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Research 36(9), pp. 3025–3030.
- [11] (2018) Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 34(17), pp. i802–i810.
- [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [13] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- [14] (2018) Deep neural network based predictions of protein interactions using primary sequences. Molecules 23(8), pp. 1923.
- [15] (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pp. 3111–3119.
- [16] (2020) Transforming the language of life: transformer neural networks for protein prediction tasks. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–8.
- [17] (2020) Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6396–6407.
- [18] (2007) Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences 104(11), pp. 4337–4341.
- [19] (2014) A method for predicting protein–protein interaction types. PLoS One 9(3), pp. e90904.
- [20] (2008) Computational prediction of protein–protein interactions. Molecular Biotechnology 38(1), pp. 1–17.
- [21] (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34(suppl_1), pp. D535–D539.
- [22] (2017) Sequence-based prediction of protein–protein interaction using a deep-learning algorithm. BMC Bioinformatics 18(1), pp. 1–8.
- [23] (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research 47(D1), pp. D607–D613.
- [24] (2016) The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Research, pp. gkw937.
- [25] (2015) Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In International Conference on Intelligent Computing, pp. 713–720.
- [26] (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
- [27] (2020) Graph-based prediction of protein–protein interactions with attributed signed graph embedding. BMC Bioinformatics 21(1), pp. 1–16.
Appendix A Corollary on Random Partition Strategy
A.1 ER Random Graph Model
Before the proof of the corollary, we first introduce the Erdős–Rényi (ER) random graph model [7] from graph theory. There are two closely related variants of the model:
- In the $G(n, M)$ model, a graph is chosen uniformly at random from the collection of all graphs that have $n$ nodes and $M$ edges.
- In the $G(n, p)$ model, a graph is constructed by connecting nodes randomly: each edge is included in the graph with probability $p$, independently of every other edge.
The behavior of random graphs is often studied in the case where $n$, the number of nodes, tends to infinity. Although $M$ and $p$ can be fixed in this case, they can also be functions depending on $n$.
[8] described the behavior of $G(n, p)$ very precisely for various values of $p$ as $n$ tends to infinity. Their results include the following lemma:

Lemma 1.
If $p > \frac{(1+\epsilon)\ln n}{n}$, then a graph in $G(n, p)$ will almost surely be connected.

The expected number of edges in $G(n, p)$ is $\binom{n}{2}p$, and by the law of large numbers any graph in $G(n, p)$ will almost surely have approximately this many edges (provided the expected number of edges tends to infinity). Therefore, a rough heuristic is that if $M = \binom{n}{2}p$, then $G(n, M)$ should behave similarly to $G(n, p)$ as $n$ increases [8]. We can thus obtain the following lemma based on Lemma 1:

Lemma 2.
If $M > \frac{(1+\epsilon)\, n \ln n}{2}$, then a graph in $G(n, M)$ will almost surely be connected.
A.2 Random Partition Strategy in the PPI Network
As mentioned in Section 3.3 of the main paper, we propose the following corollary:

Corollary 2.
Randomly divide the PPI dataset and select a subset as the testset; then most of the proteins in the testset were seen in training.

The corollary is equivalent to asking whether the training set proteins include most of the proteins in the dataset. Recall our problem formulation: given the protein set $\mathcal{P}$ with $|\mathcal{P}| = n$ and PPI set $\mathcal{X}$ with $|\mathcal{X}| = M$, the PPI network is denoted as $\mathcal{G} = (\mathcal{P}, \mathcal{X})$, and we assume $\mathcal{G}$ is connected (the PPI networks used in this paper are all connected). After the random data partition strategy, if the training PPI network, which keeps $0.8M$ edges, is almost surely connected, then it touches nearly every protein in $\mathcal{P}$ and Corollary 2 is proved. In a real-world PPI dataset the number of proteins is not infinite, but we can still roughly judge whether Corollary 2 holds based on Lemma 2. The experimental results are shown in Table 5: both the theoretical deduction (comparing $0.8M$ with $\frac{n \ln n}{2}$) and the real test results (the proportion of BS) show that Corollary 2 is right.

It is worth mentioning that for the datasets SHS148k and STRING, $0.8M > \frac{n \ln n}{2}$ and yet the proportion of BS still does not reach 1, the value for a connected training network. This is because many proteins in the PPI network interact with only one other protein (shown in the last column of Table 5).
Table 5: Statistics of the random partition on the three datasets.

| Dataset | Known proteins | Unknown proteins | Train PPIs | Test PPIs | $\frac{n\ln n}{2}$ | BS (%) | ES (%) | NS (%) | Degree-1 proteins |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SHS27k | 1587.6 | 102.3 | 6099 | 1525 | 6276.7 | 92.66 | 6.95 | 0.39 | 409 |
| SHS148k | 4971 | 218 | 35590 | 8898 | 22189.8 | 97.25 | 2.72 | 0.03 | 1016 |
| STRING | 15082.3 | 252.6 | 474717 | 118680 | 73893.7 | 99.75 | 0.25 | 0 | 1044 |
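The theoretical check in Table 5 is simple arithmetic; a quick sketch reproduces the comparison between the retained training edges ($0.8M$, given the 20% test split of Section 4.2.1) and the Lemma 2 threshold:

```python
import math

# (n proteins, M PPIs) from Section 4.1
datasets = {"SHS27k": (1690, 7624), "SHS148k": (5189, 44488), "STRING": (15335, 593397)}

for name, (n, m) in datasets.items():
    train_edges = 0.8 * m                 # 20% of PPIs held out for testing
    threshold = n * math.log(n) / 2       # connectivity threshold from Lemma 2
    verdict = "above" if train_edges > threshold else "below"
    print(f"{name}: 0.8M = {train_edges:.0f} vs n*ln(n)/2 = {threshold:.0f} -> {verdict}")
```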
Table 6: The seven attributes and the three amino acid classes per attribute used in the CTD descriptor.

| No. | Property | Class 1 | Class 2 | Class 3 |
| --- | --- | --- | --- | --- |
| 1 | Hydrophobicity | Polar: R,K,E,D,Q,N | Neutral: G,A,S,T,P,H,Y | Hydrophobic: C,L,V,I,M,F,W |
| 2 | Normalized van der Waals volume | 0–2.78: G,A,S,T,P,D | 2.95–4.0: N,V,E,Q,I,L | 4.03–8.08: M,H,K,F,R,Y,W |
| 3 | Polarity | 4.9–6.2: L,I,F,W,C,M,V,Y | 8.0–9.2: P,A,T,G,S | 10.4–13.0: H,Q,R,K,N,E,D |
| 4 | Charge | Positive: K,R | Neutral: A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V | Negative: D,E |
| 5 | Secondary Structure | Helix: E,A,L,M,Q,K,R,H | Strand: V,I,Y,C,W,F,T | Coil: G,N,P,S,D |
| 6 | Solvent Accessibility | Buried: A,L,F,C,G,I,V,W | Exposed: P,K,Q,E,N,D | Intermediate: M,P,S,T,H,Y |
| 7 | Polarizability | 0–1.08: G,A,S,D,T | 0.128–0.186: C,P,N,V,E,Q,I,L | 0.219–0.409: K,M,H,F,R,Y,W |
Appendix B Composition (C), Transition (T) and Distribution (D)
The CTD descriptor [6] characterizes a protein sequence by the Composition, Transition and Distribution of amino acids grouped into three classes for each of the seven attributes listed in Table 6.
Appendix C Pre-trained amino acid embeddings
We use an embedding method to represent each amino acid as a vector. Each embedding vector is a concatenation of two sub-embeddings, i.e. $e = [e_{\text{co}}, e_{\text{class}}]$. The first part measures the co-occurrence similarity of the amino acids, obtained by pre-training a Skip-Gram [15] model on protein sequences. The skip-gram model is trained using negative sampling, where the vocabulary samples are overlapping 3-mer amino acids, and the word vector size is 5. The second part is a one-hot encoding based on the classification defined by the similarity of electrostaticity and hydrophobicity among amino acids, where the 20 natural amino acids can be clustered into 7 classes [18], as shown in Table 7; the amino acid U (Selenocysteine), the amino acid O (Pyrrolysine) and the unknown amino acid X are included in an eighth class. In summary, each amino acid is expressed as a 13-dimensional vector.

Table 7: The seven amino acid classes defined by dipole and volume scales.

| No. | Dipole scale | Volume scale | Class |
| --- | --- | --- | --- |
| 1 | − | − | A, G, V |
| 2 | − | + | I, L, F, P |
| 3 | + | + | Y, M, T, S |
| 4 | ++ | + | H, N, Q, W |
| 5 | +++ | + | R, K |
| 6 | +′+′+′ | + | D, E |
| 7 | +″ | + | C |
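A minimal construction of this 13-dimensional embedding could look as follows; the 5-dim co-occurrence vectors would come from the pre-trained Skip-Gram model, for which random values stand in here:

```python
import numpy as np

AA_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]  # Table 7 classes

def aa_class_onehot(aa: str) -> np.ndarray:
    """8-dim one-hot: the 7 classes of Table 7 plus an 8th for U, O and unknown X."""
    onehot = np.zeros(8)
    idx = next((i for i, cls in enumerate(AA_CLASSES) if aa in cls), 7)
    onehot[idx] = 1.0
    return onehot

rng = np.random.default_rng(0)
# Stand-in for the pre-trained 5-dim Skip-Gram co-occurrence vectors.
cooc = {aa: rng.normal(size=5) for aa in "ACDEFGHIKLMNPQRSTVWYUOX"}

def embed(aa: str) -> np.ndarray:
    """Concatenate co-occurrence (5-dim) and class one-hot (8-dim) -> 13-dim."""
    return np.concatenate([cooc[aa], aa_class_onehot(aa)])

print(embed("A").shape)  # (13,)
```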
Appendix D Ablation Study of PIE and PGE
We perform ablation studies on the PIE and PGE components. As shown in Table 8, both components are beneficial to the overall performance.
Table 8: Ablation study of PIE and PGE (micro-F1 on STRING).

| PIE | PGE | Random | BFS | DFS |
| --- | --- | --- | --- | --- |
| ✓ | | 69.88±0.04 | 50.03±2.08 | 61.86±1.04 |
| | ✓ | 94.30±0.52 | 73.81±6.82 | 88.03±0.59 |
| ✓ | ✓ | 95.38±0.12 | 78.37±5.40 | 91.07±0.58 |
Table 9: Hyper-parameter settings.

| Group | Hyper-parameter | Value |
| --- | --- | --- |
| | Fixed amino acid length | 2000 |
| Model Architecture | Protein-I feature dim | 256 |
| Model Architecture | Protein-G feature dim | 50 |
| Model Architecture | Graph layers | 1 |
| Model Training | Learning rate (lr) | 0.001 |
| Model Training | lr reduce rate | 0.5 |
| Model Training | lr reduce patience | 20 |
| Model Training | L2 weight decay | 5e-4 |
| Model Training | Batch size | 1024 |
| Model Training | Epochs | 300 |
Appendix E Real-world PPI network
We perform empirical studies by comparing two different time points, 2021/01/25 and 2020/04/11, of the Homo sapiens subset of the BioGRID database. Some qualitative results are shown in Figure 4, where green and red nodes denote already discovered and newly discovered proteins, respectively. It can be seen that proteins are not discovered randomly; instead, the newly discovered proteins exhibit BFS-like or DFS-like local patterns. This may justify that the proposed partitions are more realistic.

Table 10: F1 performance of each PPI type (PIPR vs. GNN-PPI on STRING).

| PPI Type | Type Ratio (%) | Random PIPR | Random GNN-PPI | BFS PIPR | BFS GNN-PPI | DFS PIPR | DFS GNN-PPI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reaction | 51.08 | 96.39±0.12 | 97.62±0.07 | 55.96±7.59 | 83.19±4.01 | 68.09±1.19 | 93.28±1.44 |
| Binding | 67.87 | 95.63±0.23 | 96.43±0.07 | 71.34±0.36 | 83.80±3.70 | 81.79±1.90 | 94.06±0.51 |
| Ptmod | 6.92 | 86.94±0.26 | 87.28±0.27 | 25.91±13.8 | 70.48±7.44 | 17.09±5.31 | 82.12±1.40 |
| Activation | 17.53 | 86.31±0.31 | 87.96±0.37 | 39.17±13.3 | 66.20±15.5 | 27.28±10.0 | 81.58±0.94 |
| Inhibition | 6.58 | 90.36±0.34 | 91.49±0.14 | 12.08±7.55 | 65.58±12.4 | 19.16±0.94 | 82.62±2.80 |
| Catalysis | 48.59 | 96.28±0.19 | 97.58±0.08 | 58.84±6.39 | 83.94±4.53 | 66.24±4.32 | 92.79±0.51 |
| Expression | 2.07 | 39.06±1.42 | 32.55±1.53 | 0.81±1.27 | 15.67±10.8 | 1.04±1.80 | 23.22±9.21 |
| Micro-Avg | - | 94.43±0.10 | 95.43±0.10 | 55.65±1.60 | 78.37±5.40 | 67.45±0.34 | 91.07±0.58 |
Appendix F Performance of different PPI types
We show separate results for the 7 types in Table 10. Performance on "expression" is worse due to the limited samples (the positive-sample ratio is below 10%). Our model demonstrates a consistent advantage over PIPR across the 7 types under the BFS and DFS partitions.
Appendix G Proportions of Proteins
Table 2 in the main text gives a quantitative analysis of the BFS and DFS partitions in terms of the proportions of BS, ES and NS edges. We also calculate the proportions of proteins appearing only in the testset, in both sets, and only in the trainset, as shown in Table 11. Compared with the conventional random partition, the BFS and DFS partitions lead to more ES and NS edges and fewer proteins in both sets, and thus better evaluate the performance of models on unseen proteins.
Table 11: Protein proportions (%) in the trainset only, the testset only, and both sets under each partition scheme.

| Datasets | Partition Schemes | Trainset-only | Testset-only | Both-sets |
| --- | --- | --- | --- | --- |
| SHS27k | Random | 41.64 | 6.06 | 52.31 |
| SHS27k | BFS | 60.16 | 9.63 | 30.22 |
| SHS27k | DFS | 63.69 | 5.60 | 30.71 |
| SHS148k | Random | 34.79 | 4.20 | 61.01 |
| SHS148k | BFS | 52.54 | 9.62 | 37.84 |
| SHS148k | DFS | 51.51 | 5.72 | 42.77 |
| STRING | Random | 13.96 | 1.65 | 84.39 |
| STRING | BFS | 31.72 | 5.03 | 63.25 |
| STRING | DFS | 26.94 | 4.75 | 68.31 |