Network and Sequence-Based Prediction of Protein-Protein Interactions

07/08/2021
by   Leonardo Martini, et al.

Background: Typically, proteins perform key biological functions by interacting with each other. As a consequence, predicting which protein pairs interact is a fundamental problem. Experimental methods are slow, expensive, and may be error-prone. Many computational methods have been proposed to identify candidate interacting pairs. When accurate, they can serve as an inexpensive, preliminary filtering stage, to be followed by downstream experimental validation. Among such methods, sequence-based ones are very promising. Results: We present a new algorithm that leverages both topological and biological information to predict protein-protein interactions. We comprehensively compare our framework with state-of-the-art approaches on reliable PPI datasets, showing that it has competitive or higher accuracy on biologically validated test sets. Conclusion: We show that combining topological and sequence-based computational methods predicts the human interactome more effectively than methods that leverage only one source of biological information.


1 Introduction

Protein-Protein Interactions (PPIs) play a crucial role in several biological processes since, in many cases, proteins perform vital functions by interacting with each other in the formation of protein complexes. The identification of new protein-protein interactions is thus crucial for understanding the biological mechanisms of cells. Furthermore, knowledge of the interactions can be exploited for applications such as drug repurposing, which leverages network topology to predict drug-disease associations, or network-based approaches to disease-gene prioritization, which leverage the PPI network to find new candidate disease genes. Consequently, charting protein-protein interaction maps remains a fundamental goal in biological research.

Protein-protein interactions can be most readily identified by protein affinity chromatography or pull-down experiments, yeast two-hybrid screens, or purification of protein complexes that have been tagged in vivo. These methods are all labor-intensive, time-consuming, and costly, and each has inherent advantages and disadvantages. For instance, the yeast two-hybrid system has the advantage of identifying the direct interaction between protein pairs. However, data gathered using this method may contain a high rate of false positives (as much as 50%). Therefore, in the absence of other lines of evidence, these data alone cannot be considered biologically significant. The high cost and the technical limitations associated with such biochemical approaches have resulted in a growing need for computational tools capable of identifying prospective protein-protein interactions.

Several tools have been designed for this purpose over the past few years. Some of these approaches predict protein interactions using the primary structure of the proteins themselves. In this line of research, SPRINT[li2017sprint] and PIPE[pitre2006pipe] are two well-known algorithms that rely on the same underlying hypothesis: a pair of proteins similar to a pair of interacting proteins has a higher chance of interacting. Along the same line of thought, some very promising approaches, like L3 and SIM, leverage network topology to define this notion of similarity.

We propose a new framework that can exploit topological and biological information to predict protein-protein interactions. The algorithm relies on the underlying hypothesis that two proteins interact in proportion to the structural similarity that one has with the most similar of the interactors of the other. From a topological perspective, the structural similarity between two proteins is proportional to the number of common neighbors. Instead, from a biological perspective, the structural similarity between two proteins is proportional to the similarity of their primary sequences.

We compare our framework with state-of-the-art approaches on several synthetic and human PPI networks. We show that it outperforms many heuristics in predicting protein interactions. Moreover, candidate protein pairs prioritized by our algorithm are involved in the same biological processes, molecular functions, and cellular components. Finally, a subset of the candidate protein pairs predicted by our framework has been experimentally validated by Xu-Wen Wang et al.

2 Materials and Methods

2.1 Scoring Protein Interaction

One way to detect protein-protein interactions consists of looking at each interactor's tertiary structure to find complementary binding sites. However, compared with topological information and protein primary structure, information on protein shape (tertiary and quaternary structure) is scarce, because determining the tertiary structure of a protein remains expensive even with new technologies. Network-based approaches, which score the likelihood of a direct interaction between a protein pair, do not need any protein structure information; they infer it by leveraging network topology. Furthermore, the increasing coverage of the interactome has inspired the development of network-based algorithms, which exploit the patterns characterizing already mapped interactions to identify missing ones. The problem of identifying new links in the interactome is known as the link prediction problem. To predict new links in the network, several types of measures have been studied and proposed. The most promising measures used in link prediction rely on two different hypotheses: that nodes that are topologically similar to each other (i.e., nodes that share several neighbours) tend to be linked, and that a node should be linked to a candidate if the candidate is similar to the node's known interactors. Figure 1 shows the difference between measures that leverage the former hypothesis and those that exploit the latter.

(a) TCP predicts (P) links based on node similarity (S), quantifying the number of shared neighbors between each node pair
(b) PPIs often require complementary interfaces. Hence, two proteins, X and Y, with similar interfaces share many of their neighbors. An additional interaction partner of X (protein U) might be also shared with protein Y (blue link)
Figure 1: Comparison of frameworks based on paths of length 2 and those based on paths of length 3.

More precisely, Figure 1 a) shows measures that score a possible interaction between nodes $i$ and $j$ by computing the number of neighbours that $i$ and $j$ share. Metrics that use this information include the Jaccard Index, Common Neighbours, Adamic-Adar, and the other indices based on paths of length 2 listed in Table 1. Figure 1 b) instead groups metrics that link $i$ and $j$ if $i$ is similar to the partners of $j$ and vice versa. Metrics that leverage common neighbours have been widely used for link prediction in social networks, whereas metrics that leverage partner similarity are more biologically driven: they are based on the hypothesis that if $i$ is similar to an interactor of $j$, then $i$ and $j$ are likely to interact (i.e., they share complementary binding sites).

To compare these two approaches, which rely on different ideas, we computed the probability that a pair $(i, j)$ is connected in the Protein-Protein Interaction (PPI) network as the Jaccard Index (Figure 2 a)) and the number of paths of length three between $i$ and $j$ (Figure 2 b)) increase. To estimate this probability, we first computed the Jaccard Index and the number of paths of length three for each pair of nodes in the PPI network G(V,E). We then estimated the probability of connection for a score value $s$ in the following way:

$$P(s) = \frac{1}{|S|} \sum_{(i,j) \in S} \delta_{ij} \qquad (1)$$

where $\delta_{ij}$ is 1 if $(i,j) \in E$ and 0 otherwise, while $S$ is the set that contains all pairs with a score $s$.

(a) High Jaccard similarity indicates a lower chance for the proteins to interact
(b) We observe a strong positive trend in HuRI between the probability of two interacting proteins and the number of paths between them
Figure 2: Network similarity does not imply connectivity. a and b show the difference between the TCP paradigm and the L3 principle.

Figure 2 shows that protein pairs with a high Jaccard similarity (Figure 2 a)) do not tend to interact, whereas the probability of interaction grows with the number of paths of length three between the two proteins (Figure 2 b)). This pattern is visible on several Human and non-Human PPI networks, as shown in Supp. Fig. 10.
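The estimation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the adjacency representation (a dict mapping each node to the set of its neighbours) and the function names `jaccard`, `three_paths`, and `connection_probability` are assumptions introduced here, and pairs are grouped by exact score value unless a bucketing function is supplied.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(adj, i, j):
    """Topological Jaccard Index: shared neighbours over all neighbours."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def three_paths(adj, i, j):
    """Number of simple paths i - a - b - j of length three."""
    return sum(
        1
        for a in adj[i] if a != j
        for b in adj[j] if b != i and b in adj[a]
    )

def connection_probability(adj, score, bucket=lambda s: s):
    """Fraction of connected pairs among all pairs sharing the same score."""
    hits, total = defaultdict(int), defaultdict(int)
    for i, j in combinations(sorted(adj), 2):
        key = bucket(score(adj, i, j))
        total[key] += 1
        hits[key] += j in adj[i]  # True counts as 1
    return {key: hits[key] / total[key] for key in total}

# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
prob_by_l3 = connection_probability(toy, three_paths)
```

On a real interactome the Jaccard Index is continuous, so a `bucket` function such as `lambda s: round(s, 1)` would be needed to obtain a readable curve.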

2.1.1 Jaccard Index as a Measure of Protein Interface Similarity

As discussed so far, protein pairs with high Jaccard similarity do not tend to interact. This could be because proteins that share several neighbours should have similar binding sites (i.e., similar tertiary structure). To test this hypothesis, we considered the correlation between sequence similarity and the Jaccard Index of a pair $(i, j)$. More formally, we define the protein sequence similarity of $(i, j)$ as the distance between the longest sequence and the global alignment of their primary structures, computed using the Needleman–Wunsch algorithm.
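As an illustration of the global alignment step, the following sketch computes a Needleman–Wunsch score with dynamic programming. The scoring scheme (match +1, mismatch -1, gap -1) and the normalization by the length of the longer sequence are assumptions made here for the example; the text does not specify the substitution matrix or gap penalties actually used, and the function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def sequence_similarity(a, b):
    """Alignment score divided by the length of the longer sequence (assumed normalization)."""
    longest = max(len(a), len(b))
    return needleman_wunsch(a, b) / longest if longest else 1.0
```

With this scheme, identical sequences obtain a similarity of 1, and the value decreases as mismatches and gaps accumulate.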

(a) HuRI
(b) STRING
Figure 3: Jaccard similarity as a measure to infer protein structure similarity.

Figure 3 a) shows the correlation between Jaccard similarity and protein sequence similarity. To better understand the correlation between the global alignment score and the Jaccard similarity of a protein pair, we grouped pairs by their Jaccard Index into four buckets, shown on the X axis of Fig. 3 a). For each bucket, we drew a box plot representing the distribution of the global alignment score of protein pairs with similar Jaccard Index. As illustrated in Fig. 3 a), when the Jaccard Similarity is in the interval , protein pairs share similar sequences. Finally, Fig. 3 b) illustrates the number of samples belonging to each bucket. Every bucket, with the exception of the bucket with interval , has a similar number of samples.

2.1.2 Jaccard Index as a Measure of Gene Duplication Phenomena

The Jaccard Index is not only useful for finding proteins with similar structures; it can also be applied to identify proteins originating from gene duplication. Indeed, in the course of evolution, duplicated genes may produce new proteins that retain many of the biological functions of the original ones. Since the structure of a protein is related to its biological functions, proteins born from gene duplication should share similar interactors. To quantify statistically whether proteins created by this evolutionary process show this behaviour, we downloaded from [ouedraogo2012duplicated] a data set consisting of groups of protein products generated by gene duplication. Firstly, we filtered out the smallest groups, keeping only groups consisting of a large number of proteins (i.e number of proteins ) that appear in the PPI network. Secondly, for each group $g$, we defined the Mean Jaccard Index of group $g$, $MJI(g)$, as:

$$MJI(g) = \frac{2}{n(n-1)} \sum_{i, j \in g,\; i < j} J(i, j) \qquad (2)$$

where $n$ is the size of $g$. Finally, we compared the value of each group's Mean Jaccard Index with a random distribution of the score, computed using random sets of proteins of the same size as the original group $g$.

Figure 4: Proteins originating from gene duplication share more common neighbors than expected at random.

Figure 4 shows the Mean Jaccard Index of the original groups and compares it with a random distribution. As illustrated by each plot, the Mean Jaccard Index of each group is very large compared with the associated random distribution. Indeed, each vertical line, representing the score of a group, lies in the tail of the corresponding distribution. As discussed before, each group consists of a number of proteins in the interval .

2.1.3 Jaccard Index to Model Evolutionary and Functional Similarity

The relationship between protein sequences and structures has long been a widely accepted tenet of biochemistry[kidera1985relation, guzzo1965influence]. Indeed, one of the foundations of molecular biology is that a protein’s sequence determines its structure, which determines how the protein functions.

Protein sequence-structure-function relationships have been investigated and quantified in various ways. Several studies[chothia1986relation, wilson2000assessing] have established the correlation between structural similarity and sequence similarity. Others[rost1999twilight, yang2000integrated] have studied the level of sequence similarity at which structural similarity is likely to be observed. Consequently, protein primary sequence similarity indices have been widely used to capture evolutionary relationships between paralogs (i.e., homologous proteins related by a gene duplication event)[jeffryes2018rapid].

Consequently, we modeled a pairwise protein similarity function, named Biological Jaccard Index, that leverages the primary sequence to score the resemblance of a protein pair. In more detail, given two proteins $i$ and $j$ and their associated protein sequences $s_i$ and $s_j$, we first identify the sets of k-mers $K(s_i)$ and $K(s_j)$ of each sequence (k-mers are substrings of length $k$ contained within a general biological sequence), and we define the Jaccard similarity between the sets of k-mers as:

$$J_B(i, j) = \frac{|K(s_i) \cap K(s_j)|}{|K(s_i) \cup K(s_j)|} \qquad (3)$$
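A minimal sketch of a k-mer based Jaccard similarity follows. The default value `k=3` and the function names are assumptions chosen for illustration; the text does not state which k-mer length was used.

```python
def kmers(seq, k=3):
    """Set of all substrings of length k contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def biological_jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because only set membership matters, this index ignores k-mer multiplicity and position, which makes it much cheaper than a full alignment.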

To statistically validate whether this index preserves evolutionary relationships among sets of proteins (gene duplication), we followed the same approach discussed in Section 2.1.2. As shown in Supplementary Figure 11, duplicated gene sets have a mean Jaccard similarity greater than random expectation. Furthermore, the mean of the pairwise similarities of each group is greater when the Biological Jaccard Index is considered instead of the topological index. This finding might be related to the incompleteness of the interactome.

Figure 5: Correlation between Sequence Similarity and Functional Similarity. Protein pairs with high Sequence Similarity tend to be involved in the same biological processes and be localized in the same cellular components.

Finally, as shown by several studies[hegyi1999relationship, joshi2007quantitative, sangar2007quantitative], protein pairs with higher sequence similarity tend to be more functionally similar than dissimilar proteins. Indeed, Figure 5 shows the correlation between the Biological Jaccard Index and the functional similarity of a protein pair. We considered Gene Ontology annotations (Biological Process, Cellular Component and Molecular Function) downloaded from the Gene Ontology Consortium, and we plotted the distribution of pairwise Gene Ontology similarity over intervals of the Biological Jaccard Index. Protein pairs with a Jaccard Index in the interval have several Gene Ontology annotations in common, compared with pairs with a low Jaccard Index.

2.1.4 Model Definition

From the discussion in Sections 2.1.1, 2.1.2 and 2.1.3, the similarity between the proteins of a pair can be used to measure interface similarity, paralogy and functional similarity. In other words, given a protein pair $(i, j)$, we can use the Jaccard functions to compute the similarity between $j$ and the neighbors of $i$. If $j$ is highly topologically or biologically similar to at least one neighbour of $i$, then there is a good chance that $j$ shares the same binding sites as that neighbour (i.e., has binding sites complementary to those of $i$) or that it is involved in the same function as a neighbour of $i$. Thus, we can model the probability of connection between $i$ and $j$ in two different ways: using the topological Jaccard Index or the Jaccard Index defined in Equation 3.

Figure 6: Topological and biological models of our framework: a) shows a candidate protein pair $(i, j)$ in which $j$ is topologically similar (i.e., high topological Jaccard similarity) to $k$, a neighbour of $i$. b) shows a candidate protein pair $(i, j)$ in which $j$ is highly biologically similar to a neighbour of $i$.

Figure 6 shows how we score the likelihood of interaction between $i$ and $j$. Figure 6 A) illustrates the model from a topological point of view, in which protein $j$ is very similar to protein $k$ (i.e., a neighbor of $i$) and consequently, as discussed previously, $j$ may share complementary binding sites with $i$. Figure 6 B), on the other hand, shows the model from a biological point of view in which, instead of the topological Jaccard similarity, the sequence-based Jaccard similarity is used to understand whether $j$ is functionally similar to at least one neighbor of $i$.

Indeed, as shown in figure 6 B), has similar tertiary structure to , but not with . In conclusion, we can formalize the model in the following way:

(4)

In Equation 4, the neighbor set of $i$ is the set of its direct neighbors, and the Jaccard Index $J$ can be replaced with the sequence-based index analyzed in Section 2.1.3. Notably, the pair scoring function that leverages protein primary structure is not tied to the topology of the PPI network. Indeed, it is able to rank candidate interacting pairs that are not close in the network (i.e., with shortest path length greater than 3). Thus, the sequence-based score is not affected by the bias induced by network topology or by network incompleteness (i.e., missing data).
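One possible reading of the model, following the stated hypothesis that a pair is scored by the similarity between one protein and the most similar interactor of the other, is sketched below. The exact form of Equation 4 is not reproduced here, so this is an assumption and not the authors' formula: in particular, taking the maximum over neighbours and symmetrizing over the two directions are illustrative choices, and all function names are hypothetical.

```python
def neighbor_similarity_score(adj, i, j, sim):
    """Highest similarity between j and any direct neighbour of i (excluding j)."""
    return max((sim(k, j) for k in adj[i] if k != j), default=0.0)

def symmetric_score(adj, i, j, sim):
    """Assumed symmetrization: best of the two directions."""
    return max(
        neighbor_similarity_score(adj, i, j, sim),
        neighbor_similarity_score(adj, j, i, sim),
    )

def topological_jaccard(adj):
    """Build a sim(a, b) function from the network; a sequence-based index could be swapped in."""
    def sim(a, b):
        union = adj[a] | adj[b]
        return len(adj[a] & adj[b]) / len(union) if union else 0.0
    return sim

# Toy example: j shares partners a and b with k, the only neighbour of i.
toy = {
    "i": {"k"}, "k": {"i", "a", "b"}, "j": {"a", "b"},
    "a": {"k", "j"}, "b": {"k", "j"},
}
score_ij = symmetric_score(toy, "i", "j", topological_jaccard(toy))
```

Passing a sequence-based `sim` (for instance a k-mer Jaccard over stored sequences) instead of `topological_jaccard` gives the biological variant without changing the scoring function.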

3 Results

3.1 Evaluation-Scheme and Accuracy-Measures

To evaluate the predictions of our frameworks and compare them with other algorithms, we selected several Protein-Protein Interaction networks:

  • From STRING[szklarczyk2015string] we downloaded PPI networks of different organisms (Yeast, C. elegans, Arabidopsis and Homo sapiens) and we selected only experimentally validated physical interactions. In other words, we removed those protein-protein interactions with an overall score lower than 900.

  • We downloaded a Human interactome from [luck2020reference], which consists of PPIs experimentally validated using yeast two-hybrid screening. From BioGRID, we downloaded another Human interactome, considering only interactions annotated as "physical" and proteins assigned to the studied species. Finally, we also included the Interactome3D dataset, which summarizes currently available interactions with structural evidence.

Finally, for each protein in each interactome, we downloaded its associated protein sequence from the UniProt Knowledgebase, and in each network we removed those proteins associated with more than one protein sequence (i.e., those proteins not manually curated in Swiss-Prot). The complete list of PPI networks used in this manuscript is given in Supp. Table 2. To assess accuracy in predicting protein interactions, we performed 10-fold cross-validation on each dataset, picking a portion of edges at random to create the training graph (90%) and using the remaining edges (10%) to test the algorithms' performance. To assess accuracy, we considered four standard prediction indices from data mining. For a given set of truly interacting pairs (test set) and an algorithm ranking protein pairs with respect to their interaction likelihood:

  • Precision@500: the fraction of the top 500 pairs in the ranking computed by the algorithm that belong to the test set.

  • nDCG[wang2013theoretical]: The normalized Discounted Cumulative Gain (nDCG) has been proven able to select the better of any two substantially different rankings. For binary classification, the nDCG is given by:

    $$nDCG = \frac{\sum_{i \in T} 1 / \log_2(1 + r_i)}{\sum_{i=1}^{|T|} 1 / \log_2(1 + i)}$$

    where $T$ is the test set and $r_i$ is the rank of positive instance $i$. The summation in the numerator runs over all positive instances, while the summation in the denominator quantifies the ideal case, in which positive instances appear in the top $|T|$ positions of the algorithm's ranked list.
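Both ranking measures are easy to compute from a ranked list of candidate pairs and a set of held-out positives. The sketch below assumes the standard binary-relevance form of nDCG with a base-2 logarithmic discount; the function names and the toy ranking are hypothetical. For undirected interactions, pairs would typically be stored in a canonical form (for example as sorted tuples) so that membership tests are order-independent.

```python
import math

def precision_at_k(ranking, positives, k=500):
    """Fraction of the top-k ranked items that are true positives."""
    return sum(item in positives for item in ranking[:k]) / k

def ndcg(ranking, positives):
    """Binary-relevance nDCG with a log2(1 + rank) discount."""
    dcg = sum(
        1 / math.log2(1 + rank)
        for rank, item in enumerate(ranking, start=1)
        if item in positives
    )
    ideal = sum(1 / math.log2(1 + rank) for rank in range(1, len(positives) + 1))
    return dcg / ideal if ideal else 0.0

ranking = ["a", "b", "c", "d"]  # hypothetical ranked candidates
```

A ranking that places every positive before every negative reaches an nDCG of 1, while pushing positives down the list lowers the score smoothly.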

For the sake of completeness, we also compare all the algorithms using AUROC and AUPRC, which are widely used to compare algorithm performance, even though it is well established that AUROC overestimates performance.
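The 10-fold edge split used in this evaluation scheme can be sketched as follows. This is an illustrative assumption about the mechanics (shuffle the edges once, then hold out every tenth edge in turn); the function name `edge_folds` and the fixed seed are hypothetical, and the sketch does not handle nodes that become isolated in a training graph.

```python
import random

def edge_folds(edges, n_folds=10, seed=0):
    """Yield (train_edges, test_edges) pairs; each edge is tested exactly once."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    for fold in range(n_folds):
        test = edges[fold::n_folds]
        held_out = set(test)
        train = [edge for edge in edges if edge not in held_out]
        yield train, test

toy_edges = [(n, n + 1) for n in range(20)]  # hypothetical chain of 20 edges
folds = list(edge_folds(toy_edges))
```

Each training list contains roughly 90% of the edges and is used to build the graph on which candidate pairs are scored; the matching test list supplies the positives for the accuracy measures.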

3.2 Competing Methods

Network-based approaches leverage network topology to estimate the likelihood of protein interactions. The advantages of network-based methods are high efficiency, easy access to input data (only network topology is needed), and good generalization schemes. Table 1 collects all PPIs prediction methods that have been implemented and tested.

PPIs Prediction Methods
Method | Reference | Index | Path Length
Common Neighbor | Newman (2001) [newman2001] | | l = 2
Jaccard Index | Jaccard (1912) [jaccard1912] | | l = 2
Adamic-Adar | Adamic and Adar (2003) [aa] | | l = 2
Resource Allocation | Zhou et al. (2009) [zhou2009predicting] | | l = 2
CH1_L2 | Cannistraci et al. (2018) [muscoloni2018local] | | l = 2
CH2_L2 | Cannistraci et al. (2018) [muscoloni2018local] | | l = 2
A3 | Barabási et al. (2019) [kovacs2019network] | | l = 3
L3 | Barabási et al. (2019) [kovacs2019network] | | l = 3
L3E | Yuen et al. (2020) [yuen2020better] | | l = 3
CH1_L3 | Cannistraci et al. (2018) [muscoloni2018local] | | l = 3
CH2_L3 | Cannistraci et al. (2018) [muscoloni2018local] | | l = 3
SIM | Chen et al. (2020) [chen2020protein] | | l = 3
Preferential Attachment | Barabási et al. (2002) [barabasi2002] | | any
MPS(B) | Wang et al. (2021) [wang2021assessment] | None | any
SPRINT | Li et al. (2017) [li2017sprint] | None | any
Table 1: All PPI prediction methods that have been implemented and tested on several PPI networks. The index definitions use the degree, the internal degree [muscoloni2018local], and the external degree [muscoloni2018local] of a node in the network.
The Preferential Attachment method:

The basic premise is that the probability that a new link has node i as an endpoint is proportional to the current number of neighbors of i (its degree).

Methods Based on Paths of Length 2:

These heuristics leverage the idea, very popular in social network mining, that two nodes $i$ and $j$ that share many common neighbors are more likely to interact. The most relevant metrics based on this hypothesis are the following. Common Neighbor (CN) is defined as the number of common interacting nodes between $i$ and $j$. The Jaccard Index[jaccard1912] (JC) is a commonly used similarity; intuitively, it measures the probability that a feature that either $i$ or $j$ possesses is actually shared by both. In our case, a feature is an interacting protein. Resource Allocation[zhou2009predicting] is a similarity index that indicates how well node $i$ can transmit information to node $j$ through its neighborhood. Finally, Cannistraci et al.[muscoloni2018local] designed a family of metrics (CH1_L2 and CH2_L2), based on Resource Allocation, in which a common neighbor is associated with an internal degree, the number of links it shares with the other common neighbors of $i$ and $j$, and an external degree, the number of links it shares with nodes that are not common neighbors of $i$ and $j$.
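The classical indices based on paths of length 2 can be written directly from their standard definitions. The sketch below uses the textbook forms of Common Neighbors, Jaccard, Adamic-Adar, and Resource Allocation on an adjacency dict of sets; the function names and the toy network are hypothetical, and the CH family is omitted because its exact formulas are not given in the text.

```python
import math

def common_neighbors(adj, i, j):
    return len(adj[i] & adj[j])

def jaccard(adj, i, j):
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def adamic_adar(adj, i, j):
    # A common neighbour of two distinct nodes has degree >= 2, so log(degree) > 0.
    return sum(1 / math.log(len(adj[z])) for z in adj[i] & adj[j])

def resource_allocation(adj, i, j):
    return sum(1 / len(adj[z]) for z in adj[i] & adj[j])

# Toy network: a and b share partners c (degree 2) and d (degree 3).
toy = {
    "a": {"c", "d"}, "b": {"c", "d"},
    "c": {"a", "b"}, "d": {"a", "b", "e"}, "e": {"d"},
}
```

Adamic-Adar and Resource Allocation differ only in how strongly they down-weight high-degree common neighbours (logarithmically versus linearly).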

Methods Based on Paths of Length 3:

These methods are based on the hypothesis that proteins should have complementary binding sites to be able to interact with each other. Following this idea, Barabási et al.[kovacs2019network] showed that nodes connected by a large number of paths of length 3 share complementary tertiary structures and are more likely to interact. This idea, i.e., the L3 principle, is the basis of several heuristics proposed since the work of Barabási et al. Indeed, Chen et al.[chen2020protein] proposed a network-based link prediction method for PPI networks, named Sim. This index is designed from two perspectives: the complementarity of protein interaction interfaces and gene duplication. Cannistraci et al.[muscoloni2018local] designed a family of metrics (CH1_L3 and CH2_L3) that plug the concept of the Local Community Paradigm into the L3 formula designed by Barabási et al. (2019). Finally, Yuen et al.[yuen2020better] designed a metric that is based on the L3 principle and uses the Jaccard Index as a penalty to score a candidate interaction.
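For reference, the degree-normalized L3 score of [kovacs2019network] sums, over every path x - u - v - y of length 3, the inverse square root of the product of the degrees of the two intermediate nodes. The sketch below implements that form on an adjacency dict of sets; the function name and the toy graph are hypothetical, and the score is meant for candidate pairs that are not already linked.

```python
import math

def l3_score(adj, x, y):
    """Sum over paths x - u - v - y of 1 / sqrt(degree(u) * degree(v))."""
    return sum(
        1 / math.sqrt(len(adj[u]) * len(adj[v]))
        for u in adj[x]
        for v in adj[y]
        if v in adj[u]
    )

# Toy path graph 0 - 1 - 2 - 3: exactly one path of length 3 joins 0 and 3.
toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

The degree normalization prevents hubs, which lie on many paths simply because of their high degree, from dominating the ranking.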

Methods Based on Protein Primary Structure:

SPRINT is a well-known method for predicting protein interactions from amino-acid sequences. It relies on the idea that proteins similar to interacting proteins are likely to interact as well. In one way or another, this is essentially the idea behind the brute-force calculation of PIPE as well as the machine learning algorithms of Martin, Shen, and Guo. Since all these methods rely on the same concept, we decided to consider SPRINT, which has been shown to have better predictive power and to be faster than the other algorithms.

Methods Implemented for Protein Interaction prediction Challenge:

The proposed methods were implemented to participate in the Protein Interaction Prediction Challenge organized by the Network Medicine Consortium (https://www.network-medicine.org/). At the challenge, we also proposed a biological score, named MPS(B), that estimates the likelihood of protein interaction by leveraging protein sequence in a different way from what we have discussed so far. We took inspiration from PIPE[pitre2006pipe] and designed a simplified and faster framework in which we do not directly consider all co-occurrences of two subsequence pairs, but use them to define a protein similarity measure.

3.3 Algorithm Comparison

Figure 7: Heatmap plots show the performance of each method on each Human interactome with the following evaluation metrics: Area Under the Receiver Operating Characteristic (AUROC), Area Under the Precision-Recall Curve (AUPRC), Precision of the top 500 predicted PPIs (P@500) and Normalized Discounted Cumulative Gain (NDCG).

This section presents the overall comparison of sixteen different methods using several Human and Not-Human Protein-Protein Interaction networks.

To begin, Figure 7 shows the algorithms' performance on four different Human PPI networks. Among frameworks that leverage only topological information, algorithms based on the L3 paradigm outperform metrics that can predict only interactions between proteins within two hops (TCP paradigm). Indeed, this pattern is visible on all the data sets considered: , sim, CH2L3, CH1L3, L3E, L3, and A3 consistently outperform Resource Allocation, Adamic-Adar, Jaccard Coefficient, CH1L2, and CH2L2. Furthermore, outperforms all the other heuristics in terms of AUPRC on three networks (INTERACTOME 3D, HuRI, and STRING) and returns the highest number of true positives (Precision@500) on two PPI networks (INTERACTOME 3D and HuRI). Interestingly, methods that leverage protein primary structure show lower computational performance than topology-based frameworks.

Since most topology-based algorithms show similar values of P@500, we plotted the precision curve of the best performers. Figure 8 a) shows the precision of the best algorithms as the number of top predictions varies. On INTERACTOME 3D, CH1L3, CH2L3 and show similar behavior, but beats when we consider the top 500 predicted interactions. The same signal is visible on STRING: our topological framework is the best predictor when we consider a larger number of top-ranked candidate protein pairs. Finally, on HuRI, is the best oracle. Surprisingly, shows promising results on this network. This result might be related to the positive correlation between Jaccard Index and sequence similarity among protein pairs in HuRI, as shown in Luck, K. et al.[luck2020reference].

In addition, we verified the validity of our approaches on non-Human PPI networks. Supp. Figure 12 shows the framework's validation on a synthetic interactome and three different non-Human interactomes (Yeast, C. elegans, and Arabidopsis). The results are consistent with the previous ones: L3 approaches outperform methods that only predict candidate protein pairs within 2 hops (i.e., Resource Allocation, Jaccard Coefficient and Adamic-Adar). Furthermore, methods that leverage protein sequence information perform better on these networks than on Human interactomes. This observation may be due to a greater incompleteness of the Human interactomes compared with the non-Human ones[hart2006complete].

Figure 8: Algorithm Comparison. CH1L2, CH1L3, CH2L3, , , and sim are compared using two different measures: a) shows the Precision@k plot and b) shows the MeanGOSimScore.

To further validate these heuristics, we compared their top candidate protein interactions using Gene Ontology annotations[ashburner2000gene, gene2019gene] (Biological Process, Molecular Function, and Cellular Component) downloaded from the GO Consortium[gene2015gene]. These annotations are routinely applied in the literature to validate computationally predicted links[you2010using] instead of performing high-throughput validations or pairwise testing experiments. To induce a biological score for each framework, we consider the top 500 candidate pairs of each framework and compute the GO sim score as in Kovács et al.[kovacs2019network]. More formally, for each candidate protein pair (u, v) in the top k positions, we define the sets of annotations in which u and v are respectively involved, and we estimate their similarity as:

(5)

Here the normalizing count is the total number of proteins involved in a given annotation. Finally, the framework's score consists of the mean of the similarities associated with the top k candidate predicted protein pairs:

(6)

Figure 8 b) shows the performance of some of the best network-based and sequence-based approaches. Despite their prediction power shown in Figure 7, candidate pairs seem to be less biologically similar than those ranked by Resource Allocation and Adamic-Adar (see Supp. Figure 13). Kovács et al.[kovacs2019network] observed the same pattern when they analyzed TCP-based methods and compared them with L3 using the same similarity metric. However, it is worth noting that SPRINT achieves the best performance among all the frameworks, and almost all of its predicted interactions do not share a large number of common neighbors compared with those of Resource Allocation (see Supp. Fig. 14). In conclusion, among methods based on the L3 principle, and return candidate pairs more similar than those returned by the other methods (A3, L3, L3E, CH1L3, CH2L3 and sim), as shown in Figure 8 b) and Supp. Figure 13.

3.4 An Unsupervised Approach to Combine Sequence Based and Topological Information

Section 3.3 allowed us to understand the limitations of each class of framework. First, topology-based approaches are unable to predict candidate protein pairs that are far apart in the interactome (i.e., with shortest path distance greater than 3). Secondly, as shown in Fig. 7 and Supp. Fig. 13, they are strongly biased by the topology of the Protein-Protein Interaction network and fail to rank highly similar candidate protein pairs. On the other hand, methods that rely on primary sequence can score the likelihood of interaction of candidate proteins localized in distant areas of the network. Also, because they rely on information from the protein's primary structure, they reduce the bias of the interactome.

Consequently, we investigate the effect of combining topological and biological frameworks to predict new protein interactions to moderate their topological bias and rank biologically similar protein pairs. Given the topological feature and the biological feature , we combine them using a linear combination of the two:

(7)

To choose the two coefficients, we first normalized (standardized) the two features. Then, we projected them onto a one-dimensional space using principal component analysis (PCA), a classical technique for extracting patterns from unlabeled data and reducing its dimensionality. PCA computes the linear combination of the two features that forms the direction capturing the most variance in the data set.
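The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `pca_combine` and the toy scores are hypothetical, both features are assumed to be non-constant, and the sign of the principal direction (which PCA leaves arbitrary) is fixed so that the weights sum to a non-negative value.

```python
import numpy as np

def pca_combine(topo, bio):
    """Combine two score vectors by projecting them on the first principal component.

    Returns the combined score per candidate pair and the two weights.
    """
    X = np.column_stack([topo, bio]).astype(float)
    # Standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decomposition of the 2x2 covariance matrix of the standardized data.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    w = eigvecs[:, np.argmax(eigvals)]  # direction of maximum variance
    if w.sum() < 0:  # resolve the sign ambiguity of the eigenvector
        w = -w
    return Z @ w, w

# Hypothetical scores for five candidate pairs.
topo = [0.10, 0.40, 0.35, 0.80, 0.90]
bio = [0.20, 0.30, 0.50, 0.60, 0.95]
combined, weights = pca_combine(topo, bio)
```

One consequence of this sketch is worth noting: with exactly two standardized, positively correlated features, the first principal direction has equal weights, so the combined score is proportional to the sum of the two z-scores.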

Figure 9: Algorithm Comparison. CH1L2, CH1L3, CH2L3, , , and sim are compared using two different measures: a) shows the Precision@k plot and b) shows the GOSimScore.

Figure 9 illustrates the performance of the combined metrics compared to the individual ones. We analyzed the performance on the HuRI interactome, the network on which both topological and biological metrics have excellent predictive power in terms of P@K and GOSimScore. Each plot represents a topological framework and its combinations. As a first result, can improve the retrieval of the top 200 candidate protein pairs of each topological algorithm considered, as shown in Figure 9 a). Furthermore, biological frameworks help network-based approaches to retrieve more biologically similar protein pairs, as shown in Figure 9 b).
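For reference, the Precision@k measure reported in Figure 9 a) can be sketched as below. This follows the standard definition (the fraction of the k highest-scored candidate pairs that are true held-out interactions); the function name `precision_at_k` and the toy data are hypothetical.

```python
def precision_at_k(scores, positives, k):
    """Fraction of the k best-scored candidate pairs that are true interactions.

    scores maps a candidate pair (u, v) to its predicted score;
    positives is the set of held-out interacting pairs.
    Pairs are treated as unordered.
    """
    def canon(pair):
        return tuple(sorted(pair))

    truth = {canon(p) for p in positives}
    ranked = sorted(scores, key=lambda p: scores[p], reverse=True)[:k]
    hits = sum(1 for p in ranked if canon(p) in truth)
    return hits / k

# Hypothetical example with four candidate pairs and two held-out interactions.
toy_scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.4, ("c", "d"): 0.1}
toy_positives = {("b", "a"), ("c", "d")}
p_at_2 = precision_at_k(toy_scores, toy_positives, 2)
```

With k = 200, the same function would give the quantity used to compare each topological algorithm with its combined variants.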

4 Conclusion

We presented two network-based approaches, named and , for predicting candidate interacting protein pairs. The success of our models depends on their ability to capture the structural and evolutionary principles driving protein-protein interactions, which the Jaccard indices infer, as discussed in Materials and Methods.

We compared and with state-of-the-art network-based approaches using 10-fold cross-validation on several human and non-human Protein-Protein Interaction networks. The ability of network-based approaches to predict protein interactions strongly depends on the network's topology. Indeed, L3 approaches have similar performances, and outperforms the other heuristics in the majority of the networks considered. On the other hand, sequence-based approaches that do not rely exclusively on network topology fail in predicting candidate interacting pairs. Furthermore, to better understand the biological similarity of the top candidate protein pairs ranked by each framework, we implemented the GOSimScore following Kovács et al. Surprisingly, SPRINT, a well-known sequence-based approach, returned candidate protein pairs involved in similar biological processes, with similar molecular functions, and located in the same cellular area.

Surprised by the excellent prediction power of on the HuRI interactome, we investigated possible combinations with the best network-based methods. We found that combining topological and biological methods helps in ranking interacting protein pairs without decreasing their biological similarity.

However, the framework is not without limitations. Like all L3-based methods, alone cannot find interacting partners for proteins without known links. For such proteins, we integrated information on sequence combining with , SPRINT or MPS(B).

Another limitation concerns the computational validation: Fig. 7 and Supp. Fig. 13 represent two different ways to validate the candidate protein pairs predicted by the different frameworks. It is worth noting that frameworks based on network topology perform better when evaluated with classical data-mining measures such as P@K and NDCG. At the same time, frameworks that rely on the protein's primary sequence outperform the others when GOSimScore is considered.

Much work remains to improve our framework. We can devise a more principled way to combine topology-based and sequence-based methods. We can also extend the framework to use additional biological information, such as the co-expression of protein pairs, their functional similarity, phylogenetic profile similarity, evolutionary history, or 3D structure, which we have not considered in this manuscript but which is integrated in several well-known bioinformatics tools[zhang2012structure, keskin2016predicting, szilagyi2005prediction, wuchty2006topology]. In conclusion, and with their combinations are promising tools for the completion of the human interactome, allowing us to exploit network effects as we aim to uncover the mechanistic roots of human disease[menche2015uncovering, huttlin2017architecture].

Acknowledgement

This work is partially supported by the ERC Advanced Grant 788893 AMDROMA “Algorithmic and Mechanism Design Research in Online Markets”, the EC H2020RIA project “SoBigData++” (871042), and the MIUR PRIN project ALGADIMAR “Algorithms, Games, and Digital Markets”.

References

Appendix A Appendix

a.1 Connection Probability

Figure 10: Network similarity does not imply connectivity.

a.2 Protein-Protein Interaction Network Dataset

| Data Set | Organism | Reference | Number of Nodes | Number of Edges |
|---|---|---|---|---|
| Synthetic | | Vázquez et al. (2003)[vazquez2003modeling] | | |
| STRING | Yeast | Szklarczyk et al. (2015)[szklarczyk2015string] | | |
| STRING | C. Elegans | Szklarczyk et al. (2015)[szklarczyk2015string] | | |
| STRING | Arabidopsis | Szklarczyk et al. (2015)[szklarczyk2015string] | | |
| STRING | Homo Sapiens | Szklarczyk et al. (2015)[szklarczyk2015string] | | |
| BioGRID | Homo Sapiens | Oughtred et al. (2019)[oughtred2019biogrid] | | |
| INTERACTOME 3D | Homo Sapiens | Mosca et al. (2013)[mosca2013interactome3d] | | |
| HuRI | Homo Sapiens | Luck et al. (2020)[luck2020reference] | | |

Table 2: Considered Protein-Protein Interaction Networks: a synthetic network created using the method discussed in [vazquez2003modeling] and several networks downloaded from STRING DB[szklarczyk2015string]

a.3 Jaccard Index to Model Evolutionary and Functional Similarity

To statistically quantify whether proteins created by this evolutionary process show this behaviour, we downloaded from [ouedraogo2012duplicated] a data set consisting of groups of protein products generated by the gene duplication process. First, we filtered out the smallest groups, keeping only groups consisting of a large number of proteins (i.e., number of proteins ) that appear in the PPI network. Second, for each group , we defined the Mean Jaccard Index of group () as:

(8)

Where is the size of and is the Biological Jaccard index defined in Section 2.1.3 of the main article. Finally, we compared each group's Mean Jaccard Index with a random distribution of the score, computed using random sets of proteins of the same size as the original group.
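The comparison against random expectation can be sketched as follows. Several parts of this sketch are assumptions rather than the procedure used in the paper: the mean is taken over all unordered pairs of the group, `kmer_jaccard` is a hypothetical stand-in for the Biological Jaccard index of Section 2.1.3, the empirical p-value with a +1 correction is one common choice, and the sequences are invented toy data.

```python
import random
from itertools import combinations

def kmer_jaccard(a, b, k=2):
    """Hypothetical stand-in similarity: Jaccard index of the k-mer sets of two sequences.

    Both sequences are assumed to have length >= k.
    """
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(ka & kb) / len(ka | kb)

def mean_pairwise(group, sim):
    """Mean similarity over all unordered pairs of the group (assumed normalization)."""
    pairs = list(combinations(group, 2))
    return sum(sim(u, v) for u, v in pairs) / len(pairs)

def empirical_p_value(group, universe, sim, n_random=1000, seed=0):
    """Compare a group's mean similarity with random groups of the same size."""
    rng = random.Random(seed)
    observed = mean_pairwise(group, sim)
    null = [mean_pairwise(rng.sample(universe, len(group)), sim)
            for _ in range(n_random)]
    exceed = sum(1 for s in null if s >= observed)
    # The +1 correction keeps the estimate strictly positive.
    return observed, (exceed + 1) / (n_random + 1)

# Invented toy sequences: a "duplicated" family and a background set.
family = ["MKTAYIAK", "MKTAYIAR", "MKTAYLAK"]
background = family + ["GGSPLWQE", "PQRNDCHH", "WWYFVLIM", "CDEHKNQS", "TTSSGGAA"]
observed, p_value = empirical_p_value(family, background, kmer_jaccard, n_random=200)
```

A small p-value indicates that the group is more similar than random sets of the same size, which is the pattern summarized in Figure 11.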

Figure 11: Proteins originated by Gene Duplication Phenomena have more similar primary structures than Random Expectation

a.4 Algorithm Comparison

Figure 12: Heatmap plots show the performance of each method on each human interactome with the following evaluation metrics: Area Under the Receiver Operating Characteristic (AUROC), Area Under the Precision-Recall Curve (AUPRC), Precision of the top 500 predicted PPIs (P@500) and Normalized Discounted Cumulative Gain (NDCG).
Figure 13: Heatmap plots show the performance of each method on each human interactome with the GOSimScore computed on Biological Process (B.P), Molecular Function (M.F) and Cellular Component (C.C).
Figure 14: Analysis of the top 500 candidate pairs predicted by SPRINT and by Resource Allocation. For each algorithm we plot the histogram of the number of common neighbours (CN) of the top 500 candidate pairs. Candidate pairs ranked by Resource Allocation show a greater number of CN than those ranked by SPRINT. We have taken into consideration the interactomes in which Resource Allocation is one of the best predictors.