Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction

05/14/2021
by Guofeng Lv, et al.
SenseTime Corporation

The study of multi-type Protein-Protein Interactions (PPIs) is fundamental for understanding biological processes from a systematic perspective and revealing disease mechanisms. Existing methods suffer significant performance degradation when tested on unseen datasets. In this paper, we investigate the problem and find that it is mainly attributable to poor performance on inter-novel-protein interaction prediction. However, current evaluations overlook the inter-novel-protein interactions, and thus fail to give an instructive assessment. As a result, we propose to address the problem from both the evaluation and the methodology. Firstly, we design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets. Secondly, we argue that correlations between proteins provide useful information for the analysis of novel proteins, and based on this, we propose a graph neural network based method (GNN-PPI) for better inter-novel-protein interaction prediction. Experimental results on real-world datasets of different scales demonstrate that GNN-PPI significantly outperforms state-of-the-art PPI prediction methods, especially for inter-novel-protein interaction prediction.


1 Introduction

Protein-protein Interactions (PPIs) play an important role in most biological processes. In addition to direct physical binding, PPIs also involve many indirect forms of cooperation and mutual regulation, such as exchanging reaction products, participating in signal relay mechanisms, or jointly contributing to specific organismal functions [24]. It can be said that studying PPIs and their interaction types is essential for understanding cellular biological processes in normal and disease states, which in turn facilitates therapeutic target identification and novel drug design [20]. There are many experimental methods to detect PPIs, of which the most conventional and widely used high-throughput method is yeast two-hybrid screening [9]. However, experiment-based methods are expensive and time-consuming; more importantly, even when a single experiment detects a PPI, it cannot fully interpret its types [3]. Evidently, we urgently need reliable computational methods, learned from the accumulated PPI data, to predict unknown PPIs accurately.

Figure 1: Results of PIPR (baseline) and GNN-PPI (ours) when trained on the smaller dataset SHS148k and tested on the larger STRING dataset. The metric is micro F1 score for multi-label PPI type prediction. Avg is the overall result on the testset. For further investigation, we divide the testset into BS, ES and NS subsets, where BS denotes that Both of the paired proteins in an interaction were Seen during training, ES denotes that Either (but not both) of the paired proteins was Seen, and NS denotes that Neither protein was Seen during training. We regard ES and NS as inter-novel-protein interactions.

Although long-term research efforts [10, 11, 2] have made noticeable progress, existing methods suffer significant performance degradation when tested on unseen datasets. Take the state-of-the-art model PIPR [2] as an example: its micro F1 score drops from 92.42 when tested on the trainset-homologous SHS148k testset to 53.85 when tested on a larger STRING testset. For further investigation, we divide the STRING testset into BS, ES and NS subsets, where BS denotes that Both of the paired proteins in an interaction were Seen during training, ES denotes that Either (but not both) of the paired proteins was Seen, and NS denotes that Neither protein was Seen during training. As clearly shown in Figure 1, poor performance on the ES and NS subsets (collectively termed inter-novel-protein interactions in this paper) is the main reason for the performance degradation.

On the other hand, current evaluations on the trainset-homologous SHS148k testset apply a protein-irrespective, per-interaction randomized strategy to divide the trainset and testset; consequently, BS comprises over 92% of the whole testset and dominates the overall performance (see Appendix A for more discussion). Such evaluations overlook the inter-novel-protein interactions and are thus not instructive for the performance when tested on other datasets. As a result, in this paper we firstly design a new evaluation framework with two per-protein randomized data partition strategies. Instead of simple protein-independent randomization, we also take into consideration the distance between proteins and utilize Breadth-First and Depth-First Search to construct the testset. Comparison experiments between the trainset-homologous testset and the unseen STRING testset demonstrate that the proposed evaluation gives consistent assessment across datasets.

Besides the evaluation, on the methodology side existing works take PPIs as independent instances; correlations between proteins have long been ignored. Intuitively, for predicting the type of interaction between proteins A and B, the interactions between proteins A and C, as well as between B and C, should provide useful information. These correlations can be naturally modeled and exploited with a graph, where proteins serve as the nodes and interactions as the edges. In this paper, the graph is processed with a graph neural network based model (GNN-PPI). As demonstrated in Figure 1, the introduction of correlations and the proposed GNN-PPI model largely narrow the performance gap between the BS, ES and NS subsets.

In summary, the contribution of this paper is three-fold:

  1. We design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets.

  2. We propose to incorporate correlations between proteins into the PPI prediction problem. A graph neural network based method is presented to model the correlations.

  3. The proposed GNN-PPI model achieves state-of-the-art performance on real-world datasets of different scales, especially for inter-novel-protein interaction prediction.

2 Related Work

The primary amino acid sequence is confirmed to contain all the protein information [1] and is extremely easy to obtain. Thus, there is a longstanding interest in using sequence-based methods to model protein-related tasks. Research on PPI prediction and classification can be summarized into two stages. Early research is based on Machine Learning (ML) [10, 25, 19, 18]. These methods provide feasible solutions, but their performance is limited by the PPI feature representation and model expressiveness. Deep Learning (DL) has recently been widely used in bioinformatics problems due to its powerful expressive ability, including for PPI prediction and classification. These works [14, 11, 2, 22] typically use Convolutional Neural Networks or Recurrent Neural Networks to extract features from the amino acid sequence of the protein.

More recent work has focused on the feature representation of proteins. [17] proposes a novel deep multi-modal architecture that extracts multi-modal information from protein structure and existing text in the biomedical literature. [16] proposes a Transformer based neural network to generate pre-trained protein embeddings. In the latest research, [27] considers the correlation of PPIs and is the first to propose using GCN [13] to automatically learn protein features in the PPI network. However, their work cannot be extended to multi-label PPI classification.

To the best of our knowledge, existing PPI work has not been concerned with the problem of inter-novel-protein interactions. However, in the field of Drug-Drug Interaction (DDI), [4] mentions dividing the testset according to whether the drug was seen during training; the results show that performance on inter-novel-drug interactions is extremely degraded, but that paper does not propose a solution.

3 Methodology

Figure 2: Development and evaluation of the GNN-PPI framework. Pairwise interaction data are first assembled to build the graph, where proteins serve as the nodes and interactions as the edges. The testset is constructed by first selecting the root node and then performing the proposed BFS or DFS strategy. The model first performs embedding for each protein to obtain predefined features, which are then processed by Convolution, Pooling, BiGRU and FC modules to extract protein-independent encoding (PIE) features, and finally aggregated by graph convolutions to arrive at protein-graph encoding (PGE) features. Features of the paired proteins in an interaction are multiplied and classified, supervised by the trainset labels.

3.1 Problem Formulation

Suppose we have a protein set $\mathcal{P} = \{p_1, p_2, \dots, p_N\}$ and a PPI set $\mathcal{X} = \{x_{ij} \mid I(p_i, p_j) = 1\}$, where $I(\cdot, \cdot)$ is the PPI indicator function: if $I(p_i, p_j) = 1$, then protein $p_i$ interacts with protein $p_j$. Note that $I(p_i, p_j) = 0$ may mean either that $p_i$ and $p_j$ do not interact, or that they have a potential interaction that has not been discovered so far. In order to avoid unnecessary errors, we do not perform any operation on unknown protein pairs (default $I = 0$). We define the PPI label space as $\mathcal{L} = \{l_1, \dots, l_C\}$ with $C$ possible interaction types. For each PPI $x_{ij}$, its labels are represented as a subset $y_{ij} \subseteq \mathcal{L}$. In summary, the multi-type PPI dataset is defined as $\mathcal{D} = \{(x_{ij}, y_{ij})\}$. Considering the correlation of PPIs, we use proteins as nodes and PPIs as edges to build the PPI graph $\mathcal{G} = (\mathcal{P}, \mathcal{X})$.

The task of multi-type PPI learning is to learn a model $\mathcal{F}$ from the training set $\mathcal{D}_{train}$. For any protein pair $(p_i, p_j)$, the model predicts $\hat{y}_{ij} = \mathcal{F}(x_{ij})$ as the set of proper labels for $x_{ij}$. Here $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ are obtained from $\mathcal{D}$ based on the evaluation, where $\mathcal{D}_{train} \cup \mathcal{D}_{test} = \mathcal{D}$. Further, according to whether a protein was seen during training, the protein set is divided into known proteins $\mathcal{P}_{seen}$ and unknown proteins $\mathcal{P}_{unseen}$. Moreover, as mentioned in Section 1, $\mathcal{D}_{test}$ can be divided into $\mathcal{D}_{test}^{BS}$, $\mathcal{D}_{test}^{ES}$ and $\mathcal{D}_{test}^{NS}$, defined as follows:

$$\mathcal{D}_{test}^{BS} = \{(x_{ij}, y_{ij}) \in \mathcal{D}_{test} \mid p_i \in \mathcal{P}_{seen} \wedge p_j \in \mathcal{P}_{seen}\}$$
$$\mathcal{D}_{test}^{ES} = \{(x_{ij}, y_{ij}) \in \mathcal{D}_{test} \mid (p_i \in \mathcal{P}_{seen}) \oplus (p_j \in \mathcal{P}_{seen})\}$$
$$\mathcal{D}_{test}^{NS} = \{(x_{ij}, y_{ij}) \in \mathcal{D}_{test} \mid p_i \in \mathcal{P}_{unseen} \wedge p_j \in \mathcal{P}_{unseen}\}$$

Since inter-novel-protein interactions are the main bottleneck, we require the testset of the evaluation framework to meet the condition $\mathcal{D}_{test} = \mathcal{D}_{test}^{ES} \cup \mathcal{D}_{test}^{NS}$ (i.e., $\mathcal{D}_{test}^{BS} = \emptyset$). Our goal is that under this evaluation, the model learned from $\mathcal{D}_{train}$ can accurately predict the multi-label annotations of the PPIs in $\mathcal{D}_{test}$.
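To make these subset definitions concrete, the following minimal Python sketch (our illustration, not the authors' code; all names are hypothetical) splits a testset into BS, ES and NS given the set of proteins seen during training:

```python
def split_bs_es_ns(test_ppis, train_proteins):
    """test_ppis: iterable of (p_i, p_j) pairs; train_proteins: set of seen proteins."""
    bs, es, ns = [], [], []
    for p_i, p_j in test_ppis:
        seen = (p_i in train_proteins) + (p_j in train_proteins)
        if seen == 2:
            bs.append((p_i, p_j))   # Both Seen
        elif seen == 1:
            es.append((p_i, p_j))   # Either (but not both) Seen
        else:
            ns.append((p_i, p_j))   # Neither Seen
    return bs, es, ns

# Example: the train graph covers proteins {A, B, C}; D and E are novel.
bs, es, ns = split_bs_es_ns([("A", "B"), ("A", "D"), ("D", "E")], {"A", "B", "C"})
assert (len(bs), len(es), len(ns)) == (1, 1, 1)
```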

3.2 Overview

The GNN-PPI framework and evaluation are shown in Figure 2. We introduce GNN-PPI from the following three aspects. First, the Evaluation Framework: we propose two heuristic data partition schemes based on the PPI network, whose generated testsets meet the condition $\mathcal{D}_{test}^{BS} = \emptyset$ from Section 3.1. Second, Protein feature encoding: we design the Protein-Independent Encoding (PIE) and Protein-Graph Encoding (PGE) modules to encode protein features. Last, Multi-label PPI prediction: for unknown PPIs, we combine their protein features encoded by the previous modules, compute scores for the different PPI types, and output a multi-label prediction.

Figure 3: Examples of different testset construction strategies. Random is the current scheme, while Breadth-First Search (BFS) and Depth-First Search (DFS) are the proposed schemes.

3.3 Evaluation Framework

Generally, existing machine learning algorithms randomly divide part of the dataset as a testset to evaluate the performance of the model. However, in PPI-related tasks we have the following corollary, derived from the Erdős–Rényi (ER) random graph model [7, 8]:

Corollary 1.

Randomly divide the PPI dataset and select a proportion (e.g., 20%) as the testset; then most of the proteins in the testset were seen in training.

The detailed proof of the corollary is elaborated in Appendix A. It can be inferred from this corollary that the performance on a testset obtained by random division only reflects the predictive ability for PPIs between known proteins $\mathcal{P}_{seen}$. In the real world, there are still many proteins and PPIs that have not been discovered. We perform empirical studies by comparing two different time points, 2021/01/25 and 2020/04/11, of the Homo sapiens subset of the BioGRID database (https://thebiogrid.org/) [21]. We find that the newly discovered proteins exhibit some BFS-like or DFS-like local patterns (more details in Appendix E). Even where PPIs have been discovered, most of their types remain relatively unexplored. Therefore, we need a brand-new evaluation that reflects the model's predictive performance on inter-novel-protein interactions. The rest of this section introduces the evaluation framework we design.

We design two heuristic evaluation schemes based on the PPI network, namely BFS and DFS. They simulate two real-world scenarios for unknown proteins:

  1. Unknown proteins interact tightly with each other and exist in the form of clusters in the PPI network (see Figure 3(b)).

  2. Unknown proteins are sparsely distributed in the PPI network and have little interaction with each other (see Figure 3(c)).

We select a root node $p_{root}$, fix the size of the testset, and then use the Breadth-First Search (BFS) algorithm on the PPI network to obtain the proteins matching scenario 1; all PPIs related to these proteins form the generated testset. For scenario 2, we could simply select proteins at random to form $\mathcal{P}_{unseen}$; however, in order to maintain the connectivity of the PPI networks of $\mathcal{X}_{train}$ and $\mathcal{X}_{test}$, we use the Depth-First Search (DFS) algorithm for the simulation. The details of the data partition procedure are shown in Algorithm 1, where we do not spell out the BFS and DFS search orders themselves but use a function Search() to return the next protein after the current one under the chosen search order; Neighbor($p$) returns all neighbors of protein $p$. We also control the degree of the root node to simulate newly discovered proteins (usually few proteins interact with them).

Input: Protein set $\mathcal{P}$; PPI set $\mathcal{X}$; testset size $S$; root node selection threshold $t$; search order (BFS or DFS);
Output: $\mathcal{X}_{train}$; $\mathcal{X}_{test}$;

1: Build PPI graph $\mathcal{G} = (\mathcal{P}, \mathcal{X})$
2: ▷ Root node selection
3: repeat
4:     Randomly select a protein as root node $p_{root}$.
5: until $|\mathrm{Neighbor}(p_{root})| \leq t$
6: ▷ Testset construction
7: $\mathcal{P}_{unseen} \leftarrow \{p_{root}\}$, $\mathcal{X}_{test} \leftarrow \emptyset$
8: repeat
9:     $p \leftarrow \mathrm{Search}(\mathcal{G}, \mathcal{P}_{unseen})$ ▷ next protein under the chosen search order
10:     $\mathcal{P}_{unseen} \leftarrow \mathcal{P}_{unseen} \cup \{p\}$
11:     $\mathcal{X}_{test} \leftarrow \{x_{ij} \in \mathcal{X} \mid p_i \in \mathcal{P}_{unseen} \vee p_j \in \mathcal{P}_{unseen}\}$
12: until $|\mathcal{X}_{test}| \geq S$
13: ▷ Trainset construction
14: $\mathcal{X}_{train} \leftarrow \mathcal{X} \setminus \mathcal{X}_{test}$
15: return $\mathcal{X}_{train}$, $\mathcal{X}_{test}$
Algorithm 1: Data Partition Algorithm
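Below is a rough Python sketch of Algorithm 1 under stated assumptions: the graph is an adjacency dict, Search() is realized with an explicit queue (BFS) or stack (DFS), and the testset collects every PPI incident to a selected protein. It is an illustration only; the authors' implementation may differ in details such as tie-breaking and termination.

```python
import random
from collections import deque

def partition(adj, ppis, test_size, degree_threshold, order="bfs"):
    """adj: {protein: set(neighbors)}; ppis: list of (p_i, p_j) edges."""
    # Root node selection: a low-degree root simulates a newly discovered
    # protein, which usually has few known interactions.
    while True:
        root = random.choice(list(adj))
        if len(adj[root]) <= degree_threshold:
            break
    # Testset construction: grow the unseen-protein set by BFS (queue) or
    # DFS (stack); every PPI incident to an unseen protein goes to the testset.
    selected, frontier, test = {root}, deque([root]), set()
    while frontier and len(test) < test_size:
        node = frontier.popleft() if order == "bfs" else frontier.pop()
        for nbr in adj[node]:
            test.add(tuple(sorted((node, nbr))))
            if nbr not in selected:
                selected.add(nbr)
                frontier.append(nbr)
    # Trainset construction: the remaining PPIs, i.e. those between seen proteins.
    train = [e for e in ppis if tuple(sorted(e)) not in test]
    return train, sorted(test)

adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
ppis = [("A", "B"), ("A", "C"), ("C", "D")]
train, test = partition(adj, ppis, test_size=2, degree_threshold=1, order="bfs")
```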

3.4 Protein feature encoding

Previous work [2] has shown that protein features based on the amino acid sequence are beneficial to the performance of PPI-related tasks. Therefore, we design a Protein-Independent Encoding (PIE) module, which contains Conv1d with pooling, a BiGRU, and a fully connected (FC) layer, to generate protein feature representations as input to the PPI network.

The subsequent Protein-Graph Encoding (PGE) module is the core of GNN-PPI. Inspired by the wide use of PPI networks in bioinformatics computing, we construct the PPI network $\mathcal{G} = (\mathcal{P}, \mathcal{X})$ and convert the original independent learning task into a graph-based learning task over $\mathcal{G}$. GNNs are currently the most effective graph representation learning methods; their main idea is a recursive neighborhood aggregation scheme, where each node computes a new feature by aggregating the previous features of its neighbor nodes. After $k$ iterations, a node is represented by its transformed feature, which captures the structural information within the node's $k$-hop neighborhood. More specifically, the $k$-th iteration of a GNN is

$$a_v^{(k)} = \mathrm{Agg}\big(\{h_u^{(k-1)} : u \in \mathcal{N}(v)\}\big), \qquad h_v^{(k)} = \mathrm{Update}\big(h_v^{(k-1)}, a_v^{(k)}\big),$$

where $h_v^{(k)}$ is the feature of node $v$ at the $k$-th iteration. The design of Agg(·) and Update(·) is the key to different GNN architectures. In this paper we use the Graph Isomorphism Network (GIN) [26], where the sum of the neighbor node features is used as the aggregation function and a multi-layer perceptron (MLP) is used to update the aggregated features. GIN updates node features as

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big),$$

where $\epsilon$ can be a learnable parameter or a fixed scalar.
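As a concrete reference, here is a minimal PyTorch sketch of the GIN update above (our illustration; the paper's PGE module may differ in depth, hidden sizes, and the use of a sparse graph library):

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, dim, eps=0.0, learn_eps=True):
        super().__init__()
        # epsilon can be a learnable parameter or a fixed scalar, as stated above
        self.eps = nn.Parameter(torch.tensor(eps)) if learn_eps else eps
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h: [N, dim] node features; adj: [N, N] dense 0/1 adjacency matrix
        neighbor_sum = adj @ h                               # Agg: sum over neighbors
        return self.mlp((1 + self.eps) * h + neighbor_sum)  # Update: MLP

h = torch.randn(5, 16)                               # 5 proteins, 16-d PIE features
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)  # symmetric, no self-loops
out = GINLayer(16)(h, adj)                           # [5, 16] PGE features
```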

3.5 Multi-label PPI Prediction

With the features $h_i$ and $h_j$ of proteins $p_i$ and $p_j$ learned in the previous stages for the PPI $x_{ij}$, we use the dot product operation (element-wise multiplication) to combine the features of $p_i$ and $p_j$, and then use a fully connected (FC) layer as the classifier for multi-label PPI prediction, expressed as $\hat{y}_{ij} = \mathrm{FC}(h_i \odot h_j)$. The PIE and PGE modules are trained jointly in an end-to-end way. Given a training set and its ground-truth multi-label interactions $y_{ij}$, we use the multi-task binary cross-entropy as the loss function:

$$\mathcal{L} = -\sum_{x_{ij} \in \mathcal{X}_{train}} \sum_{c=1}^{C} \Big( y_{ij}^{c} \log \hat{y}_{ij}^{c} + (1 - y_{ij}^{c}) \log (1 - \hat{y}_{ij}^{c}) \Big)$$
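A small PyTorch sketch of this prediction head and loss, with hypothetical shapes (a batch of 32 protein pairs, 16-dimensional features, and the 7 STRING interaction types):

```python
import torch
import torch.nn as nn

dim, C = 16, 7                                 # 7 PPI types in STRING
fc = nn.Linear(dim, C)                         # the FC classifier
h_i = torch.randn(32, dim)                     # features of proteins p_i (batch of pairs)
h_j = torch.randn(32, dim)                     # features of proteins p_j
logits = fc(h_i * h_j)                         # element-wise product, then classify
labels = torch.randint(0, 2, (32, C)).float()  # multi-label ground truth y_ij
loss = nn.BCEWithLogitsLoss()(logits, labels)  # multi-task binary cross-entropy
loss.backward()
```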

Different from algorithms that consider each PPI independently, GNN-PPI learns to combine protein neighbors to generate feature representations. Therefore, for the $\mathcal{D}_{test}$ constructed by our proposed BFS or DFS, GNN-PPI can still generate suitable feature representations for multi-type PPI prediction based on protein neighbors. Moreover, even if the PPI network used during training is constructed with only $\mathcal{X}_{train}$, the model still performs well for unknown PPIs (see details in Table 4).

| Dataset | Scheme | SVM | RF | LR | DPPI | DNN-PPI | PIPR | GNN-PPI |
|---|---|---|---|---|---|---|---|---|
| SHS27k | Random | 75.35±1.05 | 78.45±0.88 | 71.55±0.93 | 73.99±5.04 | 77.89±4.97 | 83.31±0.75 | 87.91±0.39 |
| | BFS | 42.98±6.15 | 37.67±1.57 | 43.06±5.05 | 41.43±0.56 | 48.90±7.24 | 44.48±4.44 | 63.81±1.79 |
| | DFS | 53.07±5.16 | 35.55±2.22 | 48.51±1.87 | 46.12±3.02 | 54.34±1.30 | 57.80±3.24 | 74.72±5.26 |
| SHS148k | Random | 80.55±0.23 | 82.10±0.20 | 67.00±0.07 | 77.48±1.39 | 88.49±0.48 | 90.05±2.59 | 92.26±0.10 |
| | BFS | 49.14±5.30 | 38.96±1.94 | 47.45±1.42 | 52.12±8.70 | 57.40±9.10 | 61.83±10.23 | 71.37±5.33 |
| | DFS | 58.59±0.07 | 43.26±3.43 | 51.09±2.09 | 52.03±1.18 | 58.42±2.05 | 63.98±0.76 | 82.67±0.85 |
| STRING | Random | – | 88.91±0.08 | 67.74±0.16 | 94.85±0.13 | 83.08±0.11 | 94.43±0.10 | 95.43±0.10 |
| | BFS | – | 55.31±1.02 | 50.54±2.00 | 56.68±1.04 | 53.05±0.82 | 55.65±1.60 | 78.37±5.40 |
| | DFS | – | 70.80±0.45 | 61.28±0.53 | 66.82±0.29 | 64.94±0.93 | 67.45±0.34 | 91.07±0.58 |

Table 1: Performance of GNN-PPI against comparative methods over different datasets and data partition schemes. The reported results are mean±std micro-averaged F1 scores over three repeated experiments. Results of SVM on STRING are omitted due to unaffordable running time.

4 Experiment

4.1 Dataset

We use multi-type PPI data from the STRING database (https://string-db.org/) [23] to evaluate our proposed GNN-PPI. The STRING database collects, scores, and integrates most publicly available sources of protein-protein interaction information and builds a comprehensive and objective global PPI network, covering direct (physical) and indirect (functional) interactions. In this paper, we focus on the multi-type classification of PPIs in STRING, which divides PPIs into 7 types: reaction, binding, post-translational modification (ptmod), activation, inhibition, catalysis, and expression. Each pair of interacting proteins is annotated with at least one of these types. [2] randomly selected 1,690 and 5,189 proteins from the Homo sapiens subset of STRING, sharing less than 40% sequence identity, to generate two subsets, SHS27k and SHS148k, which contain 7,624 and 44,488 multi-label PPIs respectively. In addition, we use all PPIs of Homo sapiens as our third dataset, namely STRING, which contains 15,335 proteins and 593,397 PPIs. We use these three PPI datasets of different sizes to evaluate GNN-PPI and other PPI methods in the following.

4.2 Experimental Details

4.2.1 Experimental Settings and Metrics

We select 20% of PPIs for testing, using our proposed BFS and DFS evaluations as well as the original evaluation (Random). The BFS or DFS partition algorithm gives completely different results for different root nodes. To simulate the realistic scenarios mentioned in Section 3.3, the root node's degree should not be too large, so we set a threshold on the root node degree. To eliminate the influence of the randomness of data partitioning on the performance of PPI methods, we repeat each experiment under 3 different random seeds. We use protein features based on the amino acid sequence and, following [2], use an embedding method to represent each amino acid (details in Appendix C). We adopt the Adam algorithm [12] to optimize all trainable parameters. The other hyper-parameter settings are shown in Appendix Table 9.

We evaluate multi-label PPI prediction performance using micro-F1, because micro-averaging emphasizes the common labels in the dataset and gives each sample the same importance. Since the different PPI types in the datasets we use are very imbalanced, micro-F1 is preferred. Even so, we still evaluate the F1 performance of each PPI type, with results shown in Appendix Table 10.
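For reference, micro-F1 pools true/false positives over all samples and types before computing F1; a minimal sketch (equivalent to sklearn's f1_score with average="micro") is:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """y_true, y_pred: [num_samples, num_types] binary arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # pooled over all samples and types
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return 2 * precision * recall / (precision + recall + 1e-12)

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(micro_f1(y_true, y_pred))  # ~0.667 (2 TP, 1 FP, 1 FN)
```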

4.2.2 Baselines

We compare GNN-PPI against a variety of baselines, which can be categorized as follows:

1. Machine Learning based: We choose three representative machine learning (ML) algorithms: SVM [10], RF [25], and LR [19]. As input features, all of them use common handcrafted protein features, AC [10] and CTD [5], where CTD uses seven attributes for the division (see Appendix B).

2. Deep Learning based: We choose three representative deep learning (DL) algorithms for PPI prediction: PIPR [2], DNN-PPI [14] and DPPI [11]. We construct the same architectures as in the original papers and modify the output from binary classification to multi-label prediction. The protein input features based on the amino acid sequence are consistent with GNN-PPI. The other settings are the same as in the original papers.

| Dataset | Scheme | BS (PIPR / GNN-PPI) | ES (PIPR / GNN-PPI) | NS (PIPR / GNN-PPI) | Proportion (BS/ES/NS) | Avg (PIPR / GNN-PPI) |
|---|---|---|---|---|---|---|
| SHS27k | Random | 83.12 / 88.31 | 64.48 / 74.28 | 35.29 / 33.33 | 92.2 / 7.5 / 0.3 | 81.58 / 87.11 |
| | BFS | – / – | 44.92 / 68.08 | 30.34 / 46.25 | 0.0 / 72.6 / 27.4 | 40.92 / 62.10 |
| | DFS | – / – | 58.25 / 72.22 | 48.77 / 63.22 | 0.0 / 88.6 / 11.4 | 57.17 / 71.19 |
| SHS148k | Random | 92.82 / 92.24 | 78.80 / 73.09 | 40.72 / 36.36 | 97.2 / 2.7 / 0.1 | 92.42 / 91.68 |
| | BFS | – / – | 62.80 / 72.51 | 73.82 / 77.02 | 0.0 / 69.7 / 30.3 | 66.13 / 73.88 |
| | DFS | – / – | 64.17 / 83.37 | 55.51 / 73.08 | 0.0 / 91.9 / 8.1 | 63.47 / 82.54 |
| STRING | Random | 94.32 / 95.42 | 61.65 / 77.68 | 33.33 / 57.14 | 99.7 / 0.3 / 0 | 94.23 / 95.37 |
| | BFS | – / – | 56.71 / 83.99 | 39.87 / 72.83 | 0.0 / 85.8 / 14.2 | 54.31 / 82.41 |
| | DFS | – / – | 68.61 / 90.38 | 55.22 / 87.07 | 0.0 / 94.3 / 5.7 | 67.84 / 90.19 |

Table 2: In-depth analysis of PIPR and GNN-PPI over the BS, ES and NS subsets.

4.3 Results and Analysis

Table 1 compares the performance of different methods under different evaluations and datasets. First, considering the impact of different evaluations, every method in Table 1 performs well under the Random partition; however, under the BFS or DFS partition, the performance of all methods except GNN-PPI declines clearly. Moreover, performance under DFS is generally higher than under BFS, which means that a clustered distribution of unknown proteins in the PPI network is harder to learn from than a discrete distribution. Next, observing the performance on different datasets: regardless of the evaluation, the performance of every method improves as the data size increases, but the problems mentioned above are not trivially solved by increasing the amount of data. Finally, comparing different methods, DL-based methods are generally better than ML-based ones, and GNN-PPI achieves state-of-the-art performance. Under the Random partition, the advantage of GNN-PPI over DL-based methods shrinks as the dataset size increases. The most prominent advantage of GNN-PPI is that under the BFS or DFS partition, and for the inter-novel-protein interactions, it can still learn useful feature representations from protein neighbors and thus obtain good multi-label PPI prediction performance. In summary, the experimental results show that GNN-PPI can effectively improve the prediction accuracy of inter-novel-protein interactions. However, how to further push this performance to be comparable with the Random partition is still a problem worthy of discussion, and it is our future work.

We make a more in-depth analysis of the performance of PIPR and GNN-PPI on $\mathcal{D}_{test}$, as shown in Table 2. Observing the proportions of the different testset subsets, we find that under the Random partition more than 92% of test samples belong to BS, which is consistent with our Corollary 1. PIPR performs well on the randomly divided testset (81.58 on SHS27k, 92.42 on SHS148k, and 94.23 on STRING), but a closer look at the testset shows that PIPR performs very poorly on inter-novel-protein interactions (ES or NS); the overall number is dominated by BS, which has accurate performance and a high proportion. According to the results in Tables 1 and 2, with sufficient ES and NS data we can assert that methods which treat each PPI as an independent sample (represented by PIPR) cannot accurately predict inter-novel-protein interactions. On the contrary, our proposed GNN-PPI still performs well under BFS and DFS. Moreover, as the data size increases, the advantage of GNN-PPI grows (e.g., 82.41 vs. 54.31 on STRING-BFS and 90.19 vs. 67.84 on STRING-DFS).

| Method | Trainset | Testset | Random | BFS | DFS |
|---|---|---|---|---|---|
| PIPR | SHS27k-Train | SHS27k-Test | 81.58 | 40.92 | 57.17 |
| | | STRING | 42.79 | 48.55 | 57.44 |
| | SHS148k-Train | SHS148k-Test | 92.42 | 66.13 | 63.47 |
| | | STRING | 53.85 | 63.74 | 62.46 |
| GNN-PPI | SHS27k-Train | SHS27k-Test | 87.11 | 62.10 | 71.19 |
| | | STRING | 66.85 | 66.39 | 67.43 |
| | SHS148k-Train | SHS148k-Test | 91.68 | 73.88 | 82.54 |
| | | STRING | 73.12 | 67.43 | 70.64 |

Table 3: Performance comparison when tested on the trainset-homologous testset vs. the unseen STRING testset, under different evaluations (partition schemes).
| Scheme | Graph | SHS27k | SHS148k | STRING |
|---|---|---|---|---|
| BFS | GCA | 63.81±1.79 | 71.37±5.33 | 78.37±5.40 |
| | GCT | 60.61±5.32 | 69.56±6.89 | 73.23±3.93 |
| DFS | GCA | 74.72±5.26 | 82.67±0.85 | 91.07±0.58 |
| | GCT | 73.42±5.50 | 80.35±2.20 | 89.04±1.06 |

Table 4: Performance of GNN-PPI with different PPI graph construction methods.

Next, we study the ability of different evaluations to assess a model's generalization. We take the trained model's test performance on the larger STRING dataset as its true generalization ability. If the gap between the trainset-homologous test performance and this generalization is small, the evaluation reflects the model's generalization well. The experimental results are shown in Table 3. Under the previous evaluation (Random), whether for PIPR or GNN-PPI, the test performance on the STRING dataset drops severely; as we speculated, it cannot reflect the generalization of the model. On the contrary, under the BFS or DFS evaluation, the trainset-homologous test performance tracks the true performance of the model, whether good or bad (e.g., 66.13 vs. 63.74 for PIPR-SHS148k-BFS and 71.19 vs. 67.43 for GNN-PPI-SHS27k-DFS). In fact, the testset obtained by BFS or DFS is theoretically of the same kind as the samples tested on STRING; the only difference is the proportion of the different PPI subsets (BS, ES and NS). When testing on STRING, the proportion of NS is higher.

Finally, we study the impact of the PPI network construction method (mentioned in Section 3.5) on GNN-PPI. There are two graph construction methods: graph construction from all data (GCA, using $\mathcal{X}$) and graph construction from the trainset only (GCT, using $\mathcal{X}_{train}$). The experimental results are shown in Table 4. The performance of GCA consistently exceeds that of GCT, which is reasonable because GCA accesses more complete information than GCT. Compared with BFS, under DFS the performance of GCT is closer to GCA, which seems to indicate that the more complete the protein neighborhoods are, the better the performance. More noteworthy is that GCT is still much better than non-graph algorithms, which shows the superiority of GNNs in the few-shot regime of multi-label PPI prediction. Moreover, for unknown proteins we often cannot know their neighbors in advance; the effectiveness of GCT shows that the trained model is robust to newly discovered proteins and their interactions.

5 Conclusion

In this paper, we study the significant performance degradation of existing PPI methods when tested on unseen datasets. Experimental results show that this problem is due to the poor performance of the models on inter-novel-protein interactions. However, current evaluations overlook the inter-novel-protein interactions and are thus not instructive for performance on unseen datasets. Therefore, we design a new evaluation framework with two per-protein randomized data partition strategies, namely BFS and DFS, and propose a GNN based method, GNN-PPI, to model the correlations between PPIs. Our experimental results show that GNN-PPI outperforms state-of-the-art PPI prediction methods under both the original evaluation and the proposed one, especially for inter-novel-protein interaction prediction.

References

  • [1] C. B. Anfinsen (1972) The formation and stabilization of protein structure. Biochemical Journal 128 (4), pp. 737–749. Cited by: §2.
  • [2] M. Chen, C. J. Ju, G. Zhou, X. Chen, T. Zhang, K. Chang, C. Zaniolo, and W. Wang (2019) Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics 35 (14), pp. i305–i314. Cited by: §1, §2, §3.4, §4.1, §4.2.1, §4.2.2.
  • [3] J. De Las Rivas and C. Fontanillo (2010) Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6 (6), pp. e1000807. Cited by: §1.
  • [4] Y. Deng, X. Xu, Y. Qiu, J. Xia, W. Zhang, and S. Liu (2020) A multimodal deep learning framework for predicting drug-drug interaction events. Bioinformatics. Cited by: §2.
  • [5] X. Du, S. Sun, C. Hu, Y. Yao, Y. Yan, and Y. Zhang (2017) DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. Journal of chemical information and modeling 57 (6), pp. 1499–1510. Cited by: §4.2.2.
  • [6] I. Dubchak, I. Muchnik, S. R. Holbrook, and S. Kim (1995) Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences 92 (19), pp. 8700–8704. Cited by: Appendix B.
  • [7] P. Erdős and A. Rényi (1959) On random graphs I. Publicationes Mathematicae Debrecen 6, pp. 290–297. Cited by: §A.1, §3.3.
  • [8] P. Erdős and A. Rényi (1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1), pp. 17–60. Cited by: §A.1, §A.1, §3.3.
  • [9] S. Fields and O. Song (1989) A novel genetic system to detect protein–protein interactions. Nature 340 (6230), pp. 245–246. Cited by: §1.
  • [10] Y. Guo, L. Yu, Z. Wen, and M. Li (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic acids research 36 (9), pp. 3025–3030. Cited by: §1, §2, §4.2.2.
  • [11] S. Hashemifar, B. Neyshabur, A. A. Khan, and J. Xu (2018) Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 34 (17), pp. i802–i810. Cited by: §1, §2, §4.2.2.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.1.
  • [13] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
  • [14] H. Li, X. Gong, H. Yu, and C. Zhou (2018) Deep neural network based predictions of protein interactions using primary sequences. Molecules 23 (8), pp. 1923. Cited by: §2, §4.2.2.
  • [15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26, pp. 3111–3119. Cited by: Appendix C.
  • [16] A. Nambiar, M. Heflin, S. Liu, S. Maslov, M. Hopkins, and A. Ritz (2020) Transforming the language of life: transformer neural networks for protein prediction tasks. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–8. Cited by: §2.
  • [17] S. Saha et al. (2020) Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6396–6407. Cited by: §2.
  • [18] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang (2007) Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences 104 (11), pp. 4337–4341. Cited by: Appendix C, §2.
  • [19] Y. Silberberg, M. Kupiec, and R. Sharan (2014) A method for predicting protein-protein interaction types. PLoS One 9 (3), pp. e90904. Cited by: §2, §4.2.2.
  • [20] L. Skrabanek, H. K. Saini, G. D. Bader, and A. J. Enright (2008) Computational prediction of protein–protein interactions. Molecular biotechnology 38 (1), pp. 1–17. Cited by: §1.
  • [21] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers (2006) BioGRID: a general repository for interaction datasets. Nucleic acids research 34 (suppl_1), pp. D535–D539. Cited by: §3.3.
  • [22] T. Sun, B. Zhou, L. Lai, and J. Pei (2017) Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC bioinformatics 18 (1), pp. 1–8. Cited by: §2.
  • [23] D. Szklarczyk, A. L. Gable, D. Lyon, A. Junge, S. Wyder, J. Huerta-Cepas, M. Simonovic, N. T. Doncheva, J. H. Morris, P. Bork, et al. (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research 47 (D1), pp. D607–D613. Cited by: §4.1.
  • [24] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, et al. (2016) The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, pp. gkw937. Cited by: §1.
  • [25] L. Wong, Z. You, S. Li, Y. Huang, and G. Liu (2015) Detection of protein-protein interactions from amino acid sequences using a rotation forest model with a novel pr-lpq descriptor. In International Conference on Intelligent Computing, pp. 713–720. Cited by: §2, §4.2.2.
  • [26] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.4.
  • [27] F. Yang, K. Fan, D. Song, and H. Lin (2020) Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC bioinformatics 21 (1), pp. 1–16. Cited by: §2.

Appendix A Corollary on Random Partition Strategy

A.1 ER Random Graph Model

Before the proof of the corollary, we first introduce the Erdős–Rényi (ER) random graph model [7] from graph theory. There are two closely related variants of the model:

  1. In the $G(n, M)$ model, a graph is chosen uniformly at random from the collection of all graphs which have $n$ nodes and $M$ edges.

  2. In the $G(n, p)$ model, a graph is constructed by connecting nodes randomly: each edge is included in the graph with probability $p$, independently of every other edge.

The behavior of random graphs is often studied in the case where $n$, the number of nodes, tends to infinity. Although $M$ and $p$ can be fixed in this case, they can also be functions depending on $n$.

[8] described the behavior of $G(n, p)$ very precisely for various values of $p$ as $n$ tends to infinity. Their results include the following lemma:

Lemma 1.

If $p > \frac{(1+\epsilon)\ln n}{n}$, then a graph in $G(n, p)$ will almost surely be connected.

The expected number of edges in $G(n, p)$ is $\binom{n}{2}p$, and by the law of large numbers any graph in $G(n, p)$ will almost surely have approximately this many edges (provided the expected number of edges tends to infinity). Therefore, a rough heuristic is that if $pn^2 \to \infty$, then $G(n, p)$ should behave similarly to $G(n, M)$ with $M = \binom{n}{2}p$ as $n$ increases [8]. Based on Lemma 1, we can then obtain the following lemma:

Lemma 2.

If $M > \frac{(1+\epsilon)\,n\ln n}{2}$, then a graph in $G(n, M)$ will almost surely be connected.
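The step from Lemma 1 to Lemma 2 is simply substituting the connectivity threshold of Lemma 1 into the expected edge count; a sketch of the calculation, under the heuristic above:

```latex
M = \binom{n}{2}\,p
  \;\approx\; \frac{n^{2}}{2}\cdot\frac{(1+\epsilon)\ln n}{n}
  \;=\; \frac{(1+\epsilon)\,n\ln n}{2}.
```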

A.2 Random Partition Strategy in the PPI Network

As mentioned in section 3.3 of the original paper, we propose a corollary as follows:

Corollary 2.

Randomly divide the PPI dataset and select a proportion (e.g., 20%) as the test set; then most of the proteins in the test set were seen in training.

The above corollary is equivalent to asking whether the training set proteins include most of the proteins in the dataset. Recall our problem formulation in the original paper: given the protein set $\mathcal{P}$ and PPI set $\mathcal{X}$, the PPI network is denoted as $\mathcal{G} = (\mathcal{P}, \mathcal{X})$ and assumed connected (the datasets used in the original paper are all connected). After the random data partition, if the training PPI network is almost surely connected, it spans nearly all proteins and Corollary 2 follows. In a real-world PPI dataset the number of proteins is finite, but we can still roughly judge whether Corollary 2 holds based on Lemma 2. The experimental results are shown in Table 5. Both the theoretical deduction (comparing $|\mathcal{X}_{train}|$ with the threshold $\frac{n\ln n}{2}$) and the real test results (the proportion of BS) show that our Corollary 2 is right.

It is worth mentioning that for SHS148k and STRING, $|\mathcal{X}_{train}| > \frac{n\ln n}{2}$, yet the proportion of BS still does not reach 1, the proportion for a connected graph. This is because there are many proteins in the PPI network that interact with only one other protein (shown in the $N_1$ column of Table 5).

| Dataset | Seen proteins | Unseen proteins | Train PPIs | Test PPIs | n·ln(n)/2 | BS (%) | ES (%) | NS (%) | N₁ |
|---|---|---|---|---|---|---|---|---|---|
| SHS27k | 1587.6 | 102.3 | 6099 | 1525 | 6276.7 | 92.66 | 6.95 | 0.39 | 409 |
| SHS148k | 4971 | 218 | 35590 | 8898 | 22189.8 | 97.25 | 2.72 | 0.03 | 1016 |
| STRING | 15082.3 | 252.6 | 474717 | 118680 | 73893.7 | 99.75 | 0.25 | 0 | 1044 |

Table 5: Details of the real-world PPI datasets under the random partition strategy (protein counts averaged over random seeds). n·ln(n)/2 is the connectivity threshold of Lemma 2 with n = |𝒫|; N₁ is the number of proteins with only one interaction.
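As a sanity check (not new data), the threshold column of Table 5 can be reproduced from the dataset sizes in Section 4.1; the computed values agree with the table up to a small offset, presumably due to the exact node count used there:

```python
import math

for name, n in [("SHS27k", 1690), ("SHS148k", 5189), ("STRING", 15335)]:
    print(f"{name}: n*ln(n)/2 = {n * math.log(n) / 2:.1f}")
# -> SHS27k: 6280.5, SHS148k: 22194.3, STRING: 73898.7
#    (Table 5 reports 6276.7, 22189.8, 73893.7)
```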
| No. | Property | Class 1 | Class 2 | Class 3 |
|---|---|---|---|---|
| 1 | Hydrophobicity | Polar: R,K,E,D,Q,N | Neutral: G,A,S,T,P,H,Y | Hydrophobic: C,L,V,I,M,F,W |
| 2 | Normalized van der Waals volume | 0–2.78: G,A,S,T,P,D | 2.95–4.0: N,V,E,Q,I,L | 4.03–8.08: M,H,K,F,R,Y,W |
| 3 | Polarity | 4.9–6.2: L,I,F,W,C,M,V,Y | 8.0–9.2: P,A,T,G,S | 10.4–13.0: H,Q,R,K,N,E,D |
| 4 | Charge | Positive: K,R | Neutral: A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V | Negative: D,E |
| 5 | Secondary structure | Helix: E,A,L,M,Q,K,R,H | Strand: V,I,Y,C,W,F,T | Coil: G,N,P,S,D |
| 6 | Solvent accessibility | Buried: A,L,F,C,G,I,V,W | Exposed: P,K,Q,E,N,D | Intermediate: M,P,S,T,H,Y |
| 7 | Polarizability | 0–0.108: G,A,S,D,T | 0.128–0.186: C,P,N,V,E,Q,I,L | 0.219–0.409: K,M,H,F,R,Y,W |

Table 6: Seven attributes and the corresponding division of the amino acids.

Appendix B Composition(C), Transition(T) and Distribution(D)

[6] proposes to use these attributes to describe amino acids. For each attribute, the amino acids are divided into three classes, and each amino acid is encoded by one of the indices 1, 2, 3 according to the class it belongs to. Table 6 shows the seven amino acid attributes and the corresponding division.
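As an illustration, the Composition part of CTD for the hydrophobicity attribute (row 1 of Table 6) can be computed as below; Transition and Distribution follow from the same class-index encoding. This is our sketch, not the authors' code:

```python
# Hydrophobicity classes from row 1 of Table 6
CLASSES = {1: set("RKEDQN"), 2: set("GASTPHY"), 3: set("CLVIMFW")}

def composition(sequence):
    """Fraction of residues in each hydrophobicity class (the C of CTD)."""
    idx = [c for aa in sequence for c, members in CLASSES.items() if aa in members]
    return [idx.count(c) / len(idx) for c in (1, 2, 3)]

print(composition("MKTAYIAKQR"))  # [0.4, 0.4, 0.2]
```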

Appendix C Pre-trained amino acid embeddings

We use an embedding method to represent each amino acid as a vector. Each embedding vector is a concatenation of two sub-embeddings, $e = [e_{co}, e_{ph}]$. The first part, $e_{co}$, measures the co-occurrence similarity of the amino acids and is obtained by pre-training a Skip-Gram [15] model on protein sequences. The Skip-Gram model is trained with negative sampling, where the vocabulary samples are overlapping 3-mer amino acids and the word vector size is 5. The second part, $e_{ph}$, is a one-hot encoding based on the classification defined by the similarity of electrostaticity and hydrophobicity among amino acids, where the 20 natural amino acids are clustered into 7 classes [18], as shown in Table 7. The amino acids U (Selenocysteine) and O (Pyrrolysine) and the unknown amino acid X are included in an eighth category. In summary, each amino acid is expressed as a 13-dimensional vector (a 5-dimensional co-occurrence embedding concatenated with an 8-dimensional class one-hot).

| No. | Dipole scale | Volume scale | Class |
|---|---|---|---|
| 1 | − | − | A, G, V |
| 2 | − | + | I, L, F, P |
| 3 | + | + | Y, M, T, S |
| 4 | ++ | + | H, N, Q, W |
| 5 | +++ | + | R, K |
| 6 | +′+′+′ | + | D, E |
| 7 | +″ | + | C |

Table 7: Classification of amino acids. Dipole scale: −, Dipole < 1.0; +, 1.0 < Dipole < 2.0; ++, 2.0 < Dipole < 3.0; +++, Dipole > 3.0; +′+′+′, Dipole > 3.0 with opposite orientation; +″, Cys is separated from class 3 because of its ability to form disulfide bonds. Volume scale: −, Volume < 50; +, Volume > 50.
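A sketch of the one-hot part $e_{ph}$ of the embedding, using the 7 classes of Table 7 plus the eighth class for U, O and unknown residues (the Skip-Gram co-occurrence part would be concatenated in front and is not shown):

```python
# 7 electrostaticity/hydrophobicity classes of Table 7; class 8 catches U, O, X
CLASS_OF = {}
for cls, members in enumerate(["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]):
    for aa in members:
        CLASS_OF[aa] = cls

def one_hot(aa):
    vec = [0.0] * 8
    vec[CLASS_OF.get(aa, 7)] = 1.0   # unseen residues fall into the 8th class
    return vec

print(one_hot("K"))  # class 5 (R, K) -> 1.0 at index 4
print(one_hot("U"))  # Selenocysteine -> 1.0 at index 7
```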

Appendix D Ablation Study of PIE and PGE

We perform ablation studies on the PIE and PGE components. As shown in Table 8, both components are beneficial to the overall performance.

| PIE | PGE | Random | BFS | DFS |
|---|---|---|---|---|
| ✓ | ✗ | 69.88±0.04 | 50.03±2.08 | 61.86±1.04 |
| ✗ | ✓ | 94.30±0.52 | 73.81±6.82 | 88.03±0.59 |
| ✓ | ✓ | 95.38±0.12 | 78.37±5.40 | 91.07±0.58 |

Table 8: Results of ablation studies on the PIE and PGE components (micro-averaged F1 on the STRING dataset).
| Group | Hyper-parameter | Value |
|---|---|---|
| Model Architecture | Fixed amino acid length | 2000 |
| | Protein-I feature dim | 256 |
| | Protein-G feature dim | 50 |
| | Graph layers | 1 |
| Model Training | learning rate (lr) | 0.001 |
| | lr reduce rate | 0.5 |
| | lr reduce patience | 20 |
| | l2 weight decay | 5e-4 |
| | batch size | 1024 |
| | epochs | 300 |

Table 9: The hyper-parameter settings for GNN-PPI.

Appendix E Real-world PPI network

We perform empirical studies by comparing two different time points, 2021/01/25 and 2020/04/11, of the Homo sapiens subset of the BioGRID database. Some qualitative results are shown in Figure 4, where green and red nodes denote already and newly discovered proteins, respectively. It can be seen that proteins are not discovered at random; instead, the newly discovered proteins exhibit BFS-like or DFS-like local patterns. This suggests that the proposed partitions are more realistic.

Figure 4: Comparison of two different time points of the Homo sapiens of BioGRID database, where green and red nodes denote already and newly discovered proteins, respectively.
| PPI type | Ratio (%) | Random (PIPR / GNN-PPI) | BFS (PIPR / GNN-PPI) | DFS (PIPR / GNN-PPI) |
|---|---|---|---|---|
| Reaction | 51.08 | 96.39±0.12 / 97.62±0.07 | 55.96±7.59 / 83.19±4.01 | 68.09±1.19 / 93.28±1.44 |
| Binding | 67.87 | 95.63±0.23 / 96.43±0.07 | 71.34±0.36 / 83.80±3.70 | 81.79±1.90 / 94.06±0.51 |
| Ptmod | 6.92 | 86.94±0.26 / 87.28±0.27 | 25.91±13.8 / 70.48±7.44 | 17.09±5.31 / 82.12±1.40 |
| Activation | 17.53 | 86.31±0.31 / 87.96±0.37 | 39.17±13.3 / 66.20±15.5 | 27.28±10.0 / 81.58±0.94 |
| Inhibition | 6.58 | 90.36±0.34 / 91.49±0.14 | 12.08±7.55 / 65.58±12.4 | 19.16±0.94 / 82.62±2.80 |
| Catalysis | 48.59 | 96.28±0.19 / 97.58±0.08 | 58.84±6.39 / 83.94±4.53 | 66.24±4.32 / 92.79±0.51 |
| Expression | 2.07 | 39.06±1.42 / 32.55±1.53 | 0.81±1.27 / 15.67±10.8 | 1.04±1.80 / 23.22±9.21 |
| Micro-Avg | – | 94.43±0.10 / 95.43±0.10 | 55.65±1.60 / 78.37±5.40 | 67.45±0.34 / 91.07±0.58 |

Table 10: Separate results on the STRING dataset for the multi-label types, comparing PIPR and GNN-PPI over the Random, BFS and DFS partition schemes.

Appendix F Performance of different PPI types

We show separate results for the 7 PPI types in Table 10. Performance on "expression" is worse due to its limited samples (positive ratio lower than 10%). Our model demonstrates a consistent advantage over PIPR across the 7 types.

Appendix G Proportions of Protein

Table 2 in the main text gives a quantitative analysis of the BFS and DFS partitions in terms of the proportions of BS, ES and NS edges. We also calculate the proportions of proteins appearing only in the testset, in both sets, and only in the trainset, as shown in Table 11. Compared with the conventional random partition, the BFS and DFS partitions lead to more ES and NS edges and fewer proteins in both sets, and thus better evaluate the performance of models on unseen proteins.

| Dataset | Scheme | Trainset-only | Testset-only | Both-sets |
|---|---|---|---|---|
| SHS27k | Random | 41.64 | 6.06 | 52.31 |
| | BFS | 60.16 | 9.63 | 30.22 |
| | DFS | 63.69 | 5.60 | 30.71 |
| SHS148k | Random | 34.79 | 4.20 | 61.01 |
| | BFS | 52.54 | 9.62 | 37.84 |
| | DFS | 51.51 | 5.72 | 42.77 |
| STRING | Random | 13.96 | 1.65 | 84.39 |
| | BFS | 31.72 | 5.03 | 63.25 |
| | DFS | 26.94 | 4.75 | 68.31 |

Table 11: Proportion of proteins (%) in the trainset and testset over different datasets and data partition schemes.