A comprehensive statistical study of metabolic and protein-protein interaction network properties

by   D. Gamermann, et al.

Understanding the mathematical properties of graphs underling biological systems could give hints on the evolutionary mechanisms behind these structures. In this article we perform a complete statistical analysis over thousands of graphs representing metabolic and protein-protein interaction (PPI) networks. The focus of the analysis is, apart from the description of the main properties of the graphs, to identify those properties that deviate from the expected values had the networks been build by randomly linking nodes with the same degree distribution. This survey identifies the properties of biological networks which are not solely the result of the degree distribution of the networks, but emerge from the evolutionary pressures under which the network evolves. The findings suggest that, while PPI networks have properties that differ from their expected values in their randomized versions with great statistical significance, the differences for metabolic networks have a smaller statistical significance, though it is possible to identify some drift. We also investigate the quality of fits obtained for the nodes degree distributions to power-law functions. The fits for the metabolic networks do describe the distributions if one disregards nodes with degree equal to one, but in the case of PPI networks the power-law distribution poorly describes the data except for the far right tail covering around half or less of the total distribution.


Computational prediction and analysis of protein-protein interaction networks

Biological networks provide insight into the complex organization of bio...

Characterization of graphs for protein structure modeling and recognition of solubility

This paper deals with the relations among structural, topological, and c...

Edge-based analysis of networks: Curvatures of graphs and hypergraphs

The relations, rather than the elements, constitute the structure of net...

Exact p-values for global network alignments via combinatorial analysis of shared GO terms

Network alignment aims to uncover topologically similar regions in the p...

Sparse graphs using exchangeable random measures

Statistical network modeling has focused on representing the graph as a ...

Aligning Statistical Dynamics Captures Biological Network Functioning

Empirical studies of graphs have contributed enormously to our understan...

Preferential attachment hypergraph with vertex deactivation

In the area of complex networks so far hypergraph models have received s...

1 Introduction

Networks are an intuitive way to pictorially represent elements and their interactions in many complex systems. On top of its visual appeal, graph theory is a well established mathematical field which allows these structures to be quantitatively analyzed and have its properties objectively evaluated. As soon as the abstract graph theory began to be applied in order to describe real world networks, it became clear that graphs representing real systems differed vastly from what would be expected from the naive Ërdos-Rénny random model for networks studied in the fifties [1].

Graphs representing many different real world systems such as for example, author citations relations [2], biological networks [3, 4] or flight connections [5] present many similar topological properties that significantly distinguish them from random graphs. Some of these common characteristics, some times claimed to be ubiquitous in real world networks, are the presence of hubs (a few highly connected nodes), the so called small world property [6], scale-freeness or self-similarity as a consequence of the network’s nodes degree distribution often be similar to a power-law function () [7, 8, 9], high clusterization and hierarchical organization [10].

Biological systems are the result of evolution and natural selection. Therefore, the characteristics one observes in biological networks are indications of the evolutionary pressures under which these systems developed. The dense tails of the degree distributions in these networks, for example, has been suggested to give raise to the robustness of these systems against random node deletions [11, 12, 13], which would reflect the fact that live organisms are resilient to random mutations or to deprivation situations. In fact, many works study evolutionary models that attempt to generate graphs with similar degree distributions than the ones observed in real systems [14, 15]. These works usually focus in obtaining a random model that results in graphs with degree distributions following power-law functions.

Though it is a very common claim that biological networks are scale-free (meaning that their node’s degree distribution follows a power-law function), there are some studies that dispute this conclusion [16, 17]. Most works that fit a power-law to the degree distribution of a given network, overlook the quality of the fit. However, in [18], an objective statistical study of several different real world networks is made in order tho tackle this issue. In the study, the authors use robust statistical tools not only to fit the distributions but also to evaluate the quality of the fits (p-value), shedding light on which networks might and might not be called scale-free. Though a single metabolic and a single protein-protein interaction (PPI) network are analyzed in this paper, these data sets composed of single elements are not enough to extract strong general conclusions about these biological networks.

In the present work two huge datasets of biological networks are analyzed: over 3000 metabolic and more than 1000 PPI networks. The main graph attributes of each network are evaluated, which allows us to draw a general picture about the characteristics of these graphs for a wide range of organisms. In order to identify which properties are simply expected given the degree distributions of the networks and which ones are the result of possible evolutionary pressures over the underlying biological systems, the same properties are computed over randomized versions of each network. With this study, it is possible to evaluate the differences between the property in the real networks with respect to their randomized versions, to assess the statistical significance on the existence of such differences and, therefore, to identify attributes that deviate from their expected values. Finally, we use the same tools as in [18] to fit the degree distribution of each network to a power-law and to evaluate the quality of these fits.

The work is organized as follows: in the next section we define the graphs we study and from where the data in order to build them was retrieved. The section after that is dedicated to describe all the analysis done. Finally, we present the results and discussion and in the last section a brief overview and our conclusions.

2 Data and Graphs

In this section we define the graphs we analyze and describe the procedures followed in order to obtain the data from which we build the networks.

2.1 Metabolic Network

For an organism, we define its metabolic network as the graph resulting from connecting the molecules or metabolites appearing in its metabolism based on the biochemical reactions that keep its cells (or cell) alive. Two metabolites (nodes in the graph) are connected if they appear as a substrate-product pair in any chemical reaction in its metabolism.

Therefore, the data needed to build the metabolic network for one organism is the list of all biochemical reactions that can be found in its metabolism. This data was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [19, 20]. First, a list of all genes in an organism is obtained, from this list of genes, those annotated as coding for enzymes are identified, as well as the corresponding chemical reactions catalyzed by each enzyme. In a complementary step, the pathways identified in each organism are obtained and the corresponding KGML files (KEGG Markup Language) are retrieved. These files allow to identify the non-enzymatic reactions associated with known pathways. After the retrieval process, one has a list of chemical reactions from which one builds the network by listing all single metabolites appearing in the reactions and stipulating an undirected link between two metabolites whenever they appear in opposite sides of a reaction (as substrate-product).

An automated python script which connects to the database rest API was written, in order to obtain KEGG’s list of organisms and run through it retrieving the data needed to build the networks [21]. The networks for 3481 organisms were successfully build by the procedure.

2.2 PPI networks

In a protein-protein interaction network, every protein found in an organism’s proteome represents a graph node and two nodes are linked if the proteins have some kind of interaction between them.

Data for the production of PPI networks was downloaded from the STRING database [22]. From this database, for hundreds of organisms, one obtains lists of pairs of proteins present in the organisms and several scorings for each pair representing the confidence on the existence of some interaction between them (different scores are associated to different sources of evidence for the existence of the interaction). For all organisms downloaded, we built the network for each organism by setting the threshold on the minimal confidence level that an interaction must have in order to define an undirected link in the network. The threshold considered was 0.90 for the combined score.

With this procedure we built 1073 PPI networks.

3 Statistical Analysis

In this section we explain the graph parameters and characteristics that were analyzed in each network and the statistical tools employed in the analysis.

3.1 Graph Properties

The theory on measurements related to graphs and the study of network characteristics and parameters can be found in several books and reviews see, for example, [23, 24].

For every single graph produced (3481 metabolic networks and 1073 PPI networks), first the values for basic network parameters are obtained. Most of the obtained graphs contained small disconnected components, so we also count, for each graph, the number of disconnected components, the size of the biggest component and the average size of the smaller components. The most straight forward properties that are obtained from the graphs are its number of nodes and its number of links .

Also, for every network it is straightforward to obtain its node’s degree distribution i.e. , the number of nodes in each network that have links, for every possible integer . This is evaluated by first obtaining the degree for each node , which is the number of links node has:


where is the adjacency matrix of the graph (a square symmetrical matrix where each element is 1 if node is connected to node and 0 otherwise).

Another local property of each node is its clustering coefficient :


where is the number of links between the neighbors of node . This coefficient is the ratio between the number of triangles node actually forms with its neighbors and the total number of all possible ones given its degree .

The local parameters (properties associated with each node in a network) can be averaged over all nodes in the network in order to establish an average network parameter. In the case of the two above mentioned parameters, one has the network average degree and average clustering coefficient . It is also possible to define a global parameter representing the clustering of a network as the ratio between all triangles (size 3 clicks111A click is a complete subgraph of the network.) the network (as a whole) actually has and the number of possible triangles it could have, based on the number of connected triples (length 2 paths):


where is the number of triangles (tree nodes connected in a cycle) and the number of 2-paths (connected triples).

We also study two parameters related to node correlations, namely nodes distances and network assortativity.

First we evaluate the symmetric distance matrix, a matrix where every element is shortest path length between node to node , via the Dijkstra’s algorithm [25]. Since we consider the unweighted network (every link has weight 1), the size of the path is set as the distance between the two nodes. The average of all elements222Since some networks have disconnected components, the distance between nodes in different components is infinity (one is not reachable from the other). These distances are left out from the sum evaluating the average. in the distance matrix is the network’s average distance :


The network’s assortativity, , is a correlation coefficient between the node’s excess degree and its expected value in a random network:



is the probability that a node has degree

, is the fraction of links in the network connecting nodes of degree with , is called the excess degree distribution and

is its standard deviation.

Positive coefficient means an assortative network i.e. high degree nodes tend to be connected to other high degree nodes; while a negative coefficient means a dissortative network that is, a network where high degree nodes tend to connect with low degree nodes.

3.2 Degree Distribution

It is often claimed that biological networks have scale-free (power-law) node degree distribution. The discrete power-law distribution has the form:


where is the zeta Riemann function (modified such that the sum starts at a minimum value ). This distribution has and as parameters. The parameter is an integer indicating the smallest number in the distribution.

According to the above, we attempt to fit a power-law (scale-free) distribution to the nodes degree distribution of the studied networks. In order to do that, given the set of numbers representing the degree of every node in the network, we find the values of the parameter that maximize the likelihood333Since the logarithm is a monotonically increasing function, its maximum is at the same point as the maximum value of its argument., for a given value of :


where the sum is made over the degrees of every node in the network bigger than . To find the parameter that maximizes the likelihood, one must solve the equation:


where is the number of nodes in the network with degree bigger or equal to .

Once is established, for many possible values , we evaluate the goodness of fit through a test: the statistic is calculated and the left-cumulative distribution of the Pearson’s distribution at this point is obtained. The result is the p-value i.e. the probability of obtaining a statistical fluctuation bigger than the observed one if the ’s distribution does come from a power-law with parameters and . Therefore, big values of the p-value indicate a good fit.

For each network, we follow the same procedure: having its degree distribution, for every possible value of between 1 and some444As will be clear in the Results section, the p-values increase until some given value of and decrease afterwards. we solve Eq. (11) and find the value of that maximize the likelihood for the given . We also evaluate the p-value and count the amount (fraction) of nodes in the network with degree smaller than (these nodes did not participate in the fit procedure). We also evaluate an upper and lower uncertainty for the parameter by finding the two points around the maximum likelihood where it decreases half point (0.5) [26].

3.3 Network Randomization

We want to distinguish properties of the network that come solely as a natural consequence of its degree distribution and those that require some underlying mechanism to be achieved. To accomplish that, after having evaluated the network parameters, we compare those to averages of the same parameters evaluated over sets of randomized networks i.e. networks with the same degree distribution for their nodes, but where the links have been randomly exchanged.

The randomization process we implemented is the following: first two links of the network are randomly selected. The links are broken and the nodes participating in one of the original links are connected to the nodes participating in the other original link. Since we work with simple undirected networks, sometimes this process fails, because given the randomly selected links, the relinking of the network would either generate a node connected with itself or two nodes sharing multiple connections. Repeating multiple times this process, one can also estimate (by bootstrap) the amount of times the process failed estimating, in this way, the probability of success in the process, which will be a property solely of the degree distribution of the network. We call this parameter

. Note that this randomization process does not change the degree distribution of the network i.e. every node keeps its constant during the process.

One is left to decide how many times to repeat this randomization process in order to obtain a truly random network (with the same degree distribution). We adopt the paradigm that each link should have a 99.9% probability of having been touched at least once by the process.

Since in each step of the randomization process two links are selected, the probability that any given link is touched in a given step is . Therefore, the probability that a link is not touched is . If the randomization process is repeated times, the probability that a given link is never touched is . So the number of times we must repeat the randomization process in order that there is a 99.9% probability that any given link was touched by the process (probability of not being touched) is:


to give an idea of his number, in a typical metabolic network with 2400 links, this value is .

For each network, we obtain 10 randomized versions555The calculations become computationally intensive for big networks (some PPI networks have over 10000 nodes) and therefore, choosing a bigger number of random samples, would result in unreasonable running time for the calculations over the entire dataset, though it would increase the confidence level of the conclusions. of it, evaluate each network parameter in each randomized network and estimate an expected value and its uncertainty by evaluating the average and standard deviation of the parameter over the ten randomized network samples. In this way, we are able to evaluate, for each network, a measure of the parameter deviation in the real network from the expected value in the randomized versions of it, by computing the statistical for each parameter :


where is the value obtained for the parameter in the real network, is the average value of the parameter over the ten randomized versions of the network and the standard deviation of the parameter in the random samples.

The statistical

can be used to assess the statistical significance on the existence of a difference between the parameter value in the real network and its expected value in randomized versions of it, by evaluating the cumulative student’s t-distribution with 9 degrees of freedom at point

. Two time this value is interpreted as the p-value for the null hypothesis that the observed network has the

parameter equal to its expected value. High values of this p-value would indicate that the difference is not significant (high probability of making a mistake if the null hypothesis is rejected). Parameters for which the p-value is small would indicate a significant difference between the observed and expected value, hinting that evolution favors (selects) networks in which the parameter value is bigger (if is negative) or lower (if is positive) than what is expected in a random version of the network. Therefore, a good dynamical evolutionary model for these biological networks would have to incorporate mechanisms that result in networks with such characteristics.

4 Results and Discussion

For each dataset of networks studied (metabolic and PPI), we present first the results of the descriptive statistics for the network parameters, followed by the results from the comparison between real and its randomized versions and finally the results in the fitting procedure for the node’s degree distributions.

4.1 Metabolic Networks

For each one of the 3481 metabolic networks in our dataset, we evaluated the graph properties and characteristics described in subsection 3.1. Histograms depicting the distributions of the main parameters over our dataset are shown in figure 1

. The distributions show the bulk of the data distributed around a central value, but all of them also present a significant number of outliers.