Evolution of a Modular Software Network

11/22/2011, by Miguel A. Fortuna, et al. (CSIC)

"Evolution behaves like a tinkerer" (Francois Jacob, Science, 1977). Software systems provide a unique opportunity to understand biological processes using concepts from network theory. The Debian GNU/Linux operating system allows us to explore the evolution of a complex network in a novel way. The modular design detected during its growth is based on the reuse of existing code in order to minimize costs during programming. The increase of modularity experienced by the system over time has not counterbalanced the increase in incompatibilities between software packages within modules. This negative effect is far from being a failure of design. A random process of package installation shows that the higher the modularity the larger the fraction of packages working properly in a local computer. The decrease in the relative number of conflicts between packages from different modules avoids a failure in the functionality of one package spreading throughout the entire system. Some potential analogies with the evolutionary and ecological processes determining the structure of ecological networks of interacting species are discussed.

Results

We have compiled the binary i386 packages, including their dependencies and conflicts, of the first major stable versions of the Debian GNU/Linux operating system released since the project began in 1993 (see Methods and Supplementary Information). The growth of the operating system from one release to the next can be summarized in three steps: some packages are deprecated, others are kept between versions, and new ones are added (see Fig. 2). The number of packages deprecated between releases and the number that persisted both increased exponentially over time (see Material and Methods). The number of new packages added in the most recent version was slightly smaller than in the previous one; if we discard it from the analysis, the number of new packages also increased exponentially over time. The total number of packages, dependencies, and conflicts increased exponentially with each version, ranging from 448 to 28245, from 539 to 101521, and from 28 to 4755, respectively (see Table 1 in Supplementary Information). Data from the most recent release seem to indicate the beginning of an asymptotic stationary behavior in the growth of both packages and interactions (their exclusion improved the regression fits for dependencies and conflicts). Neither the ratio between the number of dependencies and the number of conflicts nor the fraction of packages without any interactions showed a linear tendency over time.
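To make the fitting procedure concrete, here is a minimal sketch of an exponential regression across releases, carried out as a linear regression on log-transformed counts. The counts below are synthetic, generated from the model itself purely to illustrate the procedure; they are not the Debian data.

# Sketch: exponential growth N(r) ~ a * exp(b * r) across releases,
# fitted as a linear regression on log-counts. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
releases = np.arange(1, 11)                                   # release index 1..10
counts = 448 * np.exp(0.46 * (releases - 1)) * rng.lognormal(0, 0.05, 10)

slope, intercept = np.polyfit(releases, np.log(counts), 1)    # log-linear fit
resid = np.log(counts) - (slope * releases + intercept)
r2 = 1 - resid.var() / np.log(counts).var()
print(f"growth rate per release ~ {slope:.2f}, R^2 = {r2:.3f}")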

The cumulative degree distribution for the outgoing dependencies (the number of packages necessary for package i to work) fit an exponential function in all releases (see Fig. 3). This means that there is a well-defined average number of packages used by others (see Fig. S1 in Supplementary Information). However, the cumulative degree distribution for the incoming dependencies (the number of packages that need package i to work) fit a power-law function in all releases (see Fig. 3). This means that a small number of packages are used by the vast majority while many programs are needed only by a few packages (see also Fig. S1 in Supplementary Information). In other words, the network of dependencies showed a scale-free distribution for the incoming dependencies over time, indicating that the new packages added in successive releases depended mainly on the most connected ones (i.e., those packages used by many others).
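As a rough illustration of how these distributions can be obtained (this is not the authors' code, and the package names are placeholders), the sketch below computes cumulative in- and out-degree distributions of a directed dependency network with networkx.

# Sketch: cumulative degree distributions of a directed dependency network.
# Edges (i, j) mean "package i depends on package j"; example edges are placeholders.
from collections import Counter

import networkx as nx

edges = [("gimp", "libc6"), ("vim", "libc6"), ("gimp", "libgtk2.0-0")]
G = nx.DiGraph(edges)

def cumulative_distribution(degrees):
    """P(K >= k) for each observed degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    cum = []
    remaining = n
    for k in sorted(counts):
        cum.append((k, remaining / n))   # fraction of nodes with degree >= k
        remaining -= counts[k]
    return cum

# Out-degree: packages i depends on; in-degree: packages that need i.
out_cdf = cumulative_distribution([d for _, d in G.out_degree()])
in_cdf = cumulative_distribution([d for _, d in G.in_degree()])
print(out_cdf, in_cdf)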

The modular structure of the network of dependencies was statistically significant for all releases. The z-score used to compare modularity across networks (see Methods) increased exponentially from the first version to the sixth. Since then, it has remained around a lower stationary value. Although a significant linear relationship between the number of modules and the number of packages with dependencies was found for each version, the number of modules containing at least 5% of the total number of packages remained constant through time (see Fig. S2 in Supplementary Information). Therefore, the new modules that originated in subsequent releases contained few packages, indicating that the bulk of new packages added over time joined the few large modules created in the earliest versions (see Fig. 4 and Fig. S3 in Supplementary Information).

Yet, the fraction of conflicts within modules increased linearly over time, while the fraction of dependencies within modules remained constant (see Fig. 4 and Fig. S4 in Supplementary Information). Therefore, the increase in the modularity of the dependencies has not prevented conflicts within modules during the exponential growth of the operating system. This means that, as the modular structure of the network of dependencies increased (up to the sixth release), the fraction of conflicts between modules decreased (Fig. 4). Since then, although the modular structure has not grown significantly, the fraction of conflicts between modules has continued to decrease.
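As a minimal sketch of how such fractions can be measured (assuming a module assignment has already been obtained, e.g. from the Louvain method described in Methods; all names here are placeholders), one can count the conflict edges whose endpoints share a module:

# Sketch: fraction of conflicts (or dependencies) falling within modules.
# `membership` maps package -> module id; `edges` is a list of package pairs.

def within_module_fraction(edges, membership):
    """Fraction of edges whose two endpoints belong to the same module."""
    counted = [(i, j) for i, j in edges if i in membership and j in membership]
    if not counted:
        return 0.0
    within = sum(1 for i, j in counted if membership[i] == membership[j])
    return within / len(counted)

membership = {"exim4": 0, "postfix": 0, "apache2": 1}
conflicts = [("exim4", "postfix"), ("exim4", "apache2")]
print(within_module_fraction(conflicts, membership))  # 0.5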

The dynamical implications of this result are shown by a random process of package installation on a local computer (see Fig. 1 and Methods). The fraction of packages that can be installed by a random process decreased linearly through time. A priori, we might think that the higher the modularity of the network, the lower the functionality of the operating system, measured as the number of packages installed from the pool of available software. However, other factors, such as the number of conflicts between packages (which also increased over time), may be responsible for the reduction in the fraction of software installed. To rule this effect out, we performed a random process of package installation in which the modular structure of the network of dependencies was deliberately broken by a local rewiring algorithm (see Methods). Hence, we can estimate the effect of the modular structure on the installation process. In almost all versions (releases 2.0 and 3.0 being the exceptions), the modularity of the network of dependencies significantly increased the fraction of packages installed on a local computer compared with what is expected from the randomization. The z-score calculated to compare this effect across releases (see Methods) did not show a significant linear increase over time until release 3.0. Since then, the z-score has increased notably (see Fig. 5).

In summary, from release 1.1 to release 2.2 the significant exponential increase in the modularity of the network of dependencies (measured by the z-score) did not have a significant positive effect on the fraction of software packages installed. However, from release 3.1 to the last release analyzed (5.0), the lower and non-increasing modularity was responsible for a strong positive effect on the fraction of software installed on a local computer (Fig. 5). Debian 3.1, released in 2005, proved to be a break point between these two opposite tendencies. Although versions 3.0 and 3.1 each practically doubled the amount of software relative to the previous release, Debian 3.1 updated 73% of the packages found in Debian 3.0. These and other important changes are mainly the result of the long time elapsed since the previous version was released (almost 3 years, the longest interval between releases in the history of Debian; see http://en.wikipedia.org/wiki/Debian).

Discussion

The increase in the modular structure of the operating system over time detected in this paper seems to be an effective strategy for allowing software to grow while minimizing the risk of collapse due to failures in the functionality of some packages. This strategy has also been reported for the ecological and evolutionary processes structuring food webs [19]. The failure in the functionality of a software package, or the extinction of a species in an ecological community, would not propagate its negative effects to packages (species) from other modules, minimizing the risk of a collapse of the entire system. Therefore, understanding the evolution of a computer operating system can shed light on the evolutionary and ecological constraints shaping communities of interacting species. For example, we can investigate how species richness increases without jeopardizing the coexistence of the entire community. Minimizing the risk of competitive exclusion between species playing the same role in a community is equivalent to reducing software incompatibilities between modules of dependencies to increase functionality. The spatial segregation in the distribution of species represents an effective modular process analogous to the compartmentalization of the software network: it allows a higher regional species richness (software package pool) at the expense of reducing local diversity, thereby minimizing competitive exclusion.

The Debian GNU/Linux operating system provides a unique opportunity to draw this and other analogies within the evolutionary and ecological framework determining the structure of ecological networks of interacting species. Both processes occur at different time scales. On the evolutionary time scale, speciation and extinction, i.e., macroevolution, can be translated into the creation of new packages and the deprecation of those rendered obsolete from one version to the next. On the ecological time scale, colonization and local extinction, i.e., community assembly, would be equivalent to the package installation process on a local computer. Dependencies and conflicts between packages mimic predator-prey interactions and competitive exclusion relationships, respectively. Because of these interactions, only a subset of the available packages can be installed on a computer, just as only a subset of the species pool can coexist in a local ecological community. Moreover, there is an interplay between macroevolution and community assembly, because the interactions introduced by the new species (packages) alter the dynamics of colonization/extinction (installation) in a local community (computer).

Conclusions

During the exponential growth of the Debian GNU/Linux operating system, the reuse of existing code produced a scale-free distribution for the incoming dependencies and an exponential one for the outgoing dependencies. The modularity of the network of dependencies between packages, as well as the number of structural modules, increased over time. However, this increase in modularity did not prevent the increase in software incompatibilities within modules. Far from being a failure of software design, the modular structure of the network allows a larger fraction of the pool of available software to work properly on a local computer when the installation follows a random process. Decreasing conflicts between modules prevents entire modules of packages from being excluded from the installation process. This positive effect of the modular structure was much larger in the last three releases, even though the increase in modularity was not as high as it was for the first ones.

Further research on network evolution and local assembly dynamics in this and other engineered systems will open a new window of opportunity for biologists and computer scientists to collaborate in addressing fundamental problems in biology. Let us keep in mind the words of Uri Alon [20]: “The similarity between the creations of a tinkerer and engineer also raises a fundamental scientific challenge: understanding the laws of nature that unite evolved and designed systems”.

Methods

Data set

In the Debian GNU/Linux operating system (www.debian.org) most software packages depend on, or have conflicts with, other packages in order to be installed on a local computer. By "dependencies" (package i depends on package j) we mean that package j has to be installed first on the computer for i to work. By "conflicts" (package i has a conflict with package j) we mean that package i cannot be installed if j is already on the computer. This does not necessarily mean that package j also has a conflict with package i: sometimes package j is an improved version of package i, so that if i is already installed then j upgrades it, but if j is installed then it already contains i and the latter cannot be installed. We have compiled the list of software packages, along with the network of dependencies and conflicts, of the ten major versions released since 1996. The list of packages and interactions can be downloaded from the website of this journal (see Supplementary Information for more details).
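As an illustration of how such a network might be compiled, the following sketch parses the Depends: and Conflicts: fields of a Debian Packages index into edge lists. It is deliberately simplified (it drops version constraints, keeps only the first alternative of "a | b" dependencies, and ignores multi-line fields), and the file path is a placeholder; the authors' actual compilation may differ.

# Sketch: build dependency and conflict edge lists from a Debian Packages index.
import re

def parse_packages(path):
    deps, confs = [], []
    package = None
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("Package:"):
                package = line.split(":", 1)[1].strip()
            elif line.startswith(("Depends:", "Conflicts:")) and package:
                field, value = line.split(":", 1)
                targets = []
                for item in value.split(","):
                    first = item.split("|")[0]                    # first alternative only
                    name = re.sub(r"\(.*?\)", "", first).strip()  # drop version constraint
                    if name:
                        targets.append(name)
                edges = [(package, t) for t in targets]
                (deps if field == "Depends" else confs).extend(edges)
            elif not line:
                package = None   # blank line ends the stanza
    return deps, confs

# dependencies, conflicts = parse_packages("Packages")  # placeholder path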

Statistical analysis

We performed exponential regressions to quantify the increase in the number of packages that were deprecated, the new ones that were added, and those that persisted among the ten releases analyzed. We also characterized the increase in the number of dependencies and conflicts through releases using exponential regressions. The change in the ratio between the number of dependencies and the number of conflicts through releases was tested using a linear regression. The fits of the cumulative degree distributions for dependencies and conflicts that are shown are those with the highest test statistic of the two applied functions (exponential and power law) using multiplicative bins (see Fig. S1). The increase in the fraction of dependencies and conflicts within modules through releases was tested using linear regressions. We used linear and exponential regressions to test the change in the z-score (obtained to allow the comparison of the modularity across networks) through releases. Linear regressions were also used to characterize the relationship between the number of modules and the number of packages with dependencies through releases. Finally, the decrease in the number of packages installed by a random process through releases and its relationship with the z-score of the modularity were tested using linear regressions.
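A hedged sketch of one way to compare the two candidate fits (not necessarily the authors' exact procedure): bin the degrees multiplicatively (logarithmic bins), then regress log-frequency against degree for the exponential model and against log-degree for the power-law model, keeping the fit with the higher R². The degree data below are synthetic.

# Sketch: compare exponential vs power-law fits to a multiplicatively binned
# degree distribution. Illustrative procedure only.
import numpy as np

def compare_fits(degrees, n_bins=10):
    degrees = np.asarray([d for d in degrees if d > 0], dtype=float)
    bins = np.logspace(np.log10(degrees.min()), np.log10(degrees.max()), n_bins + 1)
    counts, edges = np.histogram(degrees, bins=bins)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])        # geometric bin centers
    density = counts / (widths * counts.sum())
    keep = density > 0
    x, y = centers[keep], np.log(density[keep])

    def r_squared(xvals):
        slope, intercept = np.polyfit(xvals, y, 1)
        resid = y - (slope * xvals + intercept)
        return 1 - resid.var() / y.var()

    return {"exponential": r_squared(x),             # log(P) ~ k
            "power_law": r_squared(np.log(x))}       # log(P) ~ log(k)

# Example with synthetic degrees (placeholder data):
rng = np.random.default_rng(0)
print(compare_fits(rng.geometric(0.05, size=5000)))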

Modularity analysis

We used a heuristic method based on modularity optimization [21] to extract the modular structure of the network of dependencies of software packages constituting the different releases of the Debian GNU/Linux operating system. The "Louvain" method [22] is a greedy algorithm implemented in C++ that allows one to study very large networks (the code is available at http://www.lambiotte.be). The excellent results in terms of modularity optimization given by the well-known "Netcarto" software, based on simulated annealing [23, 24], are limited when dealing with large networks, for which modularity optimization is a computationally hard problem. It has been shown that the Louvain method outperforms Netcarto in terms of computation time [22]. In addition, the Louvain method is able to reveal the potential hierarchical structure of the network, thereby giving access to different resolutions of community detection [25]. The statistical significance of the modularity was calculated by performing, for each release, 1000 randomizations of the network of dependencies keeping exactly the same number of dependencies per package, but reshuffling them randomly using a local rewiring algorithm [26]. The p-value was calculated as the fraction of random networks with a modularity value equal to or higher than the value obtained for the compiled network. In order to rule out differences (in terms of connectance, number of packages, etc.) in the comparison of the modularity across networks, we calculated a z-score defined as the modularity of the compiled network of dependencies minus the mean modularity of the randomizations, divided by the standard deviation of the randomizations.
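A minimal sketch of the z-score calculation described above, using the python-louvain package (imported as community) and networkx's degree-preserving double_edge_swap as a stand-in for the local rewiring algorithm. It treats the dependency network as undirected and uses far fewer randomizations than the 1000 reported; it is illustrative, not the authors' implementation.

# Sketch: modularity z-score against a degree-preserving null model.
import community as community_louvain   # python-louvain package
import networkx as nx
import numpy as np

def modularity_zscore(G, n_random=100, seed=0):
    G = nx.Graph(G)                      # assumption: undirected projection
    rng = np.random.default_rng(seed)
    part = community_louvain.best_partition(G)
    q_obs = community_louvain.modularity(part, G)

    q_rand = []
    for _ in range(n_random):
        R = G.copy()
        nx.double_edge_swap(R, nswap=2 * R.number_of_edges(),
                            max_tries=20 * R.number_of_edges(),
                            seed=int(rng.integers(1 << 31)))
        p = community_louvain.best_partition(R)
        q_rand.append(community_louvain.modularity(p, R))

    q_rand = np.asarray(q_rand)
    return (q_obs - q_rand.mean()) / q_rand.std()

# Usage (placeholder graph):
# z = modularity_zscore(nx.karate_club_graph())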

Local installation process

The aim of the local installation process is to calculate the distribution of the maximum number of packages that can be correctly installed on a computer by a random process of software installation. We performed 1000 replicates of the local installation process for each release of the Debian GNU/Linux operating system, ensuring that the asymptotic behavior of the variance was reached. Only packages with interactions (dependencies and/or conflicts) were used in the process, and no subset of basic packages was installed beforehand (both conditions differ from the algorithm applied by Fortuna & Melián [27]). The algorithm randomly selects a package and checks whether the packages it depends on or has a conflict with have already been installed. If the package has a conflict with an already installed one, it is discarded. If it has no conflict with installed packages, the algorithm checks whether any of the packages it depends on, directly or indirectly (through successive dependencies), has been discarded or has a conflict with an already installed package. In that case, it is discarded too. Otherwise, it is installed together with all the packages it depends on, directly as well as indirectly. The process continues until no more packages are available to be installed. In the few cases where a package depends on two packages having a reciprocal conflict (because one or the other is needed for the installation of the selected package), we randomly choose one of them and discard the other. The randomization of the network of dependencies used for testing the effect of the modularity on the local installation process was the same as described above (Modularity analysis). The number of conflicts between packages and the identity of who has a conflict with whom remained unchanged, as in the compiled networks. We performed 1000 replicates of the installation process for each randomization, and generated 100 random networks of dependencies for each release. The fraction of random networks in which the fraction of packages installed was equal to or higher than the value obtained for the modular network was used as the p-value. A z-score was calculated for comparing, across releases, the fraction of packages installed using the modular networks with that of the randomizations. The z-score was defined as the mean fraction of packages installed using the modular network minus the mean fraction of packages installed using the randomizations, divided by the standard deviation of the randomizations.
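The sketch below is a simplified, hypothetical implementation of the random installation process described above, meant only to make the bookkeeping explicit; it omits the tie-breaking rule for reciprocal conflicts among a package's dependencies, and all package names are placeholders.

# Sketch: random package-installation process on a dependency/conflict network.
# `deps` maps a package to the packages it directly depends on; `conflicts`
# is a set of frozensets {i, j}.
import random

def dependency_closure(pkg, deps):
    """All packages pkg needs, directly or indirectly, including itself."""
    stack, seen = [pkg], {pkg}
    while stack:
        for d in deps.get(stack.pop(), ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def random_install(packages, deps, conflicts, seed=None):
    rng = random.Random(seed)
    order = list(packages)
    rng.shuffle(order)
    installed, discarded = set(), set()
    for pkg in order:
        if pkg in installed or pkg in discarded:
            continue
        closure = dependency_closure(pkg, deps)
        ok = not any(p in discarded for p in closure) and not any(
            frozenset((p, q)) in conflicts for p in closure for q in installed)
        if ok:
            installed |= closure       # install the package and its dependencies
        else:
            discarded.add(pkg)
    return installed

# Toy usage (placeholder packages):
deps = {"gimp": ["libgtk"], "libgtk": [], "oldlib": []}
conflicts = {frozenset(("libgtk", "oldlib"))}
print(len(random_install(deps.keys(), deps, conflicts, seed=1)) / len(deps))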

Acknowledgments

We thank Colin Twomey for useful discussions, and Nicholas Pippenger and Luís A. Nunes Amaral for their comments and suggestions, which have greatly improved the manuscript. This work was funded by a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme (to M.A.F.) and by the Defense Advanced Research Projects Agency (DARPA) under grant HR0011-09-1-055 (to S.A.L.).

References

  • [1] Strogatz, S. H. (2001). Exploring complex networks. Nature 410, 268-276.
  • [2] Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of ’small-world’ networks. Nature 393, 440-442.
  • [3] Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509-512.
  • [4] Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature 407, 651-654.
  • [5] Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E. and Aberg, Y. (2001). The web of human sexual contacts. Nature 411, 907-908.
  • [6] Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proc. Nat. Acad. Sci. USA 98, 404-409.
  • [7] Albert, R., Jeong, H. and Barabási, A.-L. (2000) Error and attack tolerance of complex networks. Nature 406, 378-382.
  • [8] Solé, R. V. and Montoya, J. M. (2001). Complexity and fragility in ecological networks. Proc. R. Soc. B. 268, 2039-2045.
  • [9] Kumar, R., Novak, J., Raghavan, P., and Tomkins, A. (2003). On the bursty evolution of blogspace. World Wide Web Internet and web information systems 8, 568-576.
  • [10] Aizen, J., Huttenlocher, D., Kleinberg, J. and Novak, A. (2004). Traffic-based feedback on the web. Proc. Nat. Acad. Sci. USA 101, 5254-5260.
  • [11] Guimerá, R., Uzzi, B., Spiro, J. and Amaral, L. A. N. (2005) Team assembly mechanisms determine collaboration network structure and team performance. Science 308, 696-702.
  • [12] Huberman, B. A. and Adamic, L. A. (1999). Growth dynamics of the World-Wide Web. Nature 401, 131.
  • [13] Albert, R., Albert, I. and Nakarado, G. (2004) Structural vulnerability of the North American power grid. Physical Review E 69, 1-4.
  • [14] Montoya, J. M., Pimm, S. L. and Solé, R. V. (2006). Ecological networks and their fragility. Nature 442, 259-264.
  • [15] Yan, K.-K., Fang, G., Bhardwaj, N., Alexander, R. P. and Gerstein, M. (2010). Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. Proc. Nat. Acad. Sci. USA 107, 9186-9191.
  • [16] Myers, C. (2003). Software systems as complex networks: structure, function, and evolvability of software collaboration graphs. Phys. Rev. E 68, 1-15.
  • [17] Simon, H. A. (1962). The architecture of complexity. Proc. Am. Philos. Soc. 106, 467-482.
  • [18] Levin, S. A. (1999). Fragile dominion: complexity and the commons. Perseus Books. Reading, Massachusetts, USA.
  • [19] Stouffer, D. B. and Bascompte, J. (2011). Compartmentalization increases food-web persistence. Proc. Nat. Acad. Sci. USA 108, 3648-3652.
  • [20] Alon, U. (2003) Biological networks: the tinkerer as an engineer. Science 301, 1866-1867.
  • [21] Newman, M. E. J. (2004). Fast algorithm for detecting community structure in networks. Phys. Rev. E. 69, 1-5.
  • [22] Blondel, V., Guillaume, J. L., Lambiotte, R. and Lefebvre, E. (2008). Fast unfolding of communities in large networks. J. Stat. Mech. Theor. Exp., P10008.
  • [23] Guimerá, R. and Amaral, L. A. N. (2005). Cartography of complex networks: modules and universal roles. J. Stat. Mech. Theor. Exp., P02001.
  • [24] Guimerá, R. and Amaral, L. A. N. (2005). Functional cartography of complex metabolic networks. Nature 433, 895-900.
  • [25] Fortunato, S. and Barthélemy, M. (2007). Resolution limit in community detection. Proc. Nat. Acad. Sci. USA 104, 36-41.
  • [26] Gale, D. (1957). A theorem on flows in networks. Pacific Journal of Mathematics 7, 1073-1082.
  • [27] Fortuna, M. A. and Melián, C. J. (2007). Do scale-free regulatory networks allow more expression than random ones? J. Theor. Biol. 247, 331-336.