1 Introduction
Hierarchically modular designs enhance evolvability in natural systems [15, 16, 19], ease maintenance in technological systems, and provide agility and better abstraction of the system design [9, 18].
In prior work [21], we presented EvoLexis, a modeling framework for the emergence and evolution of hierarchical structure in complex modular systems. The literature offers many hypotheses regarding the factors that contribute to either hierarchy or modularity. Local resource constraints in social networks and ecosystems [17], modularly varying goals [7, 13, 14], selection for more robust phenotypes [4, 24], and selection for lower connection costs in a network [15] are some of the mechanisms that have been explored previously and shown to lead to hierarchically modular systems. The main hypothesis behind EvoLexis is along the lines of [15], which assumes that systems in both nature and technology tend to minimize the cost of the interconnections or dependencies between their modules. We also studied the hourglass effect via EvoLexis. Informally, an hourglass architecture means that the system of interest produces many outputs from many inputs through a relatively small number of highly central intermediate modules, referred to as the "waist" of the hourglass. Hierarchically modular systems often exhibit an hourglass architecture; the phenomenon has been observed in fields such as computer networking [2, 11, 10], embryogenesis [5], metabolism [8, 23], and many others [19, 22]. A comprehensive survey of the literature on the evolution of hierarchical systems and the hourglass effect is presented in [19].
The motivation for this paper is that the EvoLexis model is quite general and abstract: it does not attempt to capture any domain-specific aspects of biological or technological evolution. As such, it makes several assumptions that can be criticized as unrealistic, such as the facts that all targets have the same length, that their length stays constant, and that the fitness of a sequence is based strictly on its hierarchical cost. We believe that such abstract modeling is still valuable because it can provide insights into the qualitative properties of the resulting hierarchies under different target generation models. However, we also believe that the predictions of the EvoLexis model should be tested with real data from evolving systems whose outputs can be well represented by sequences. One such system is the iGEM synthetic DNA dataset [1]. The target DNA sequences in the iGEM dataset are built from standard "BioBrick parts" (more elementary DNA sequences) that collectively form a library of synthetic DNA sequences. These sequences are submitted to the Registry of Standard Biological Parts in the annual iGEM competition. Previous research [3, 20] has provided some evidence that these synthetic DNA sequences are designed by reusing existing components and, as such, have a hierarchical organization. In this paper, we investigate how to apply the EvoLexis framework to the time series of iGEM sequences, and whether the resulting iGEM hierarchies exhibit the same qualitative properties we observed in [21], which was based solely on abstract target generation models. We ask the following questions:

How can we analyze the iGEM dataset using the evolutionary framework of EvoLexis? How are the batches of targets formed? Which properties of the iGEM batches differ from EvoLexis's setting?

When Lexis-DAGs are formed incrementally over the iGEM dataset, what are their architectural properties, and why?
2 Preliminaries
To develop EvoLexis, we extend Lexis, an optimization framework proposed previously in [20]. Lexis models the most elementary modules of the system as symbols ("sources") and the modules at the highest level of the hierarchy as sequences of those symbols ("targets"). EvoLexis is a dynamic or evolving version of Lexis, in the sense that the set of targets changes over time through additions (births) and removals (deaths) of targets. EvoLexis computes an (approximate) minimum-cost adjustment of a given hierarchy when the set of targets changes over time (a process we refer to as "incremental design").
2.1 Lexis Optimization
Given an alphabet S and a set of "target" strings T over S, we need to construct a Lexis-DAG. A Lexis-DAG is a directed acyclic graph D = (V, E), where V is the set of nodes and E the set of edges, that satisfies the following three constraints. (To simplify the notation, even though D is a function of S and T, we do not denote it as such.) a) Each node in a Lexis-DAG represents a string of characters from the alphabet S. The nodes that represent characters of S are referred to as sources, and they have zero in-degree. The nodes that represent target strings are referred to as targets, and they have zero out-degree. V also includes a set of intermediate nodes M, which represent substrings that appear in the targets. So, V = S ∪ M ∪ T. b) Each non-source node of a Lexis-DAG represents a string that is the concatenation of two or more substrings, specified by the incoming edges from other nodes to that node. Note that there may be more than one edge from a node u to a node v. c) A Lexis-DAG should only include intermediate nodes that have an out-degree of at least two, for a more parsimonious hierarchical representation. Fig. 4 illustrates the concepts introduced here.
2.1.1 The Lexis Optimization Problem
The Lexis optimization problem is to construct a minimum-cost Lexis-DAG for the given alphabet S and target strings T. In other words, the problem is to determine the set of intermediate nodes M and all required edges E so that the corresponding Lexis-DAG is optimal in terms of a given cost function C(D). This problem can be formulated as follows:
min_{M, E} C(D)   subject to   D = (V, E) being a Lexis-DAG for S and T    (1)
A natural cost function, as investigated in previous work [20], is the number of edges in the Lexis-DAG. The edge cost to construct a node v is defined as the number of incoming edges required to construct v from its in-neighbors, which is equal to the in-degree of v. The edge cost of source nodes is obviously zero. The edge cost of a Lexis-DAG D is defined as the sum of the edge costs of all nodes, which is equal to the number of edges |E|. With edge cost, the problem in Eq. (1) is NP-hard [20]. This problem is similar to the Smallest Grammar Problem (SGP) [6], and in fact its NP-hardness is shown by a reduction from SGP [20].
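To make the edge-cost objective and the savings from extracting a repeated substring concrete, here is a toy greedy sketch (our own illustration, not the authors' G-Lexis implementation; it considers only length-2 substrings, and its overlap-insensitive counting can slightly overestimate savings):

```python
from collections import Counter
from itertools import count

def glexis_digram_sketch(targets):
    """Toy greedy sketch in the spirit of G-Lexis: start from the flat
    hierarchy and repeatedly extract the repeated substring that reduces
    the edge cost the most.  Only substrings of length 2 (digrams) are
    considered here.  Extracting a digram that occurs r times costs 2
    edges to build the new node plus r edges to use it, while removing
    2*r flat edges, so the saving is r - 2."""
    seqs = [list(t) for t in targets]     # targets as lists of source symbols
    rules = {}                            # new intermediate node -> its digram
    fresh = count(1)
    while True:
        digrams = Counter()
        for s in seqs:
            for i in range(len(s) - 1):   # counts overlaps, a toy shortcut
                digrams[(s[i], s[i + 1])] += 1
        if not digrams:
            break
        pair, r = digrams.most_common(1)[0]
        if r - 2 <= 0:                    # no extraction reduces the edge cost
            break
        node = "N%d" % next(fresh)
        rules[node] = pair
        new_seqs = []
        for s in seqs:                    # rewrite occurrences left to right
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == pair:
                    out.append(node)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    # edge cost: edges into targets plus edges into intermediate nodes
    edge_cost = sum(len(s) for s in seqs) + sum(len(p) for p in rules.values())
    return rules, edge_cost
```

On two copies of "abab", the sketch extracts the digram "ab" once, lowering the edge cost from the flat value of 8 to 6; it then correctly stops, because extracting a node used only twice would save nothing.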
We solve the Lexis optimization problem in Eq. (1) with a greedy heuristic called G-Lexis [20]. G-Lexis starts with the trivial flat Lexis-DAG, and at each iteration it chooses the substring that maximally reduces the edge cost when it is added as a new intermediate node to the Lexis-DAG and the corresponding edges are rewired.
2.1.2 Path Centrality and the Core of a Lexis-DAG
After constructing a Lexis-DAG, an important question is how to rank the constructed intermediate nodes in terms of significance or centrality. More formally, let P(v) be the number of source-to-target paths that traverse node v; we refer to P(v) as the path centrality of intermediate node v. Path centrality can be computed as P(v) = P_S(v) × P_T(v), where P_S(v) is the number of paths from any source to v, and P_T(v) is the number of paths from v to any target. (A similar metric, called the stress centrality of a vertex, is studied in [12].)
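The two-pass computation of P(v) can be sketched as follows (a minimal illustration; the node and edge representations are our own assumptions):

```python
from collections import defaultdict

def path_centrality(topo_nodes, edges, sources, targets):
    """Path centrality P(v) = P_S(v) * P_T(v) on a DAG: P_S(v) counts
    paths from any source to v (forward pass) and P_T(v) counts paths
    from v to any target (backward pass).  `topo_nodes` must be given in
    topological order; `edges` is a list of (u, v) pairs, and parallel
    edges may simply be repeated in the list."""
    pred, succ = defaultdict(list), defaultdict(list)
    for u, v in edges:
        pred[v].append(u)
        succ[u].append(v)
    ps, pt = {}, {}
    for v in topo_nodes:                       # forward pass
        ps[v] = 1 if v in sources else sum(ps[u] for u in pred[v])
    for v in reversed(topo_nodes):             # backward pass
        pt[v] = 1 if v in targets else sum(pt[w] for w in succ[v])
    return {v: ps[v] * pt[v] for v in topo_nodes}
```

For a DAG with sources a, b, an intermediate node m, and targets t1, t2 all wired through m, every one of the four source-to-target paths traverses m, so P(m) = 2 × 2 = 4.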
An important follow-up question is to identify the core of a Lexis-DAG, i.e., a set of intermediate nodes that represent, as a whole, the most important substrings in that Lexis-DAG. Intuitively, we expect that the core should include nodes of high path centrality, and that almost all source-to-target dependency chains of the Lexis-DAG should traverse at least one of these core nodes. More formally, suppose K is a set of intermediate nodes and P_K is the set of source-to-target paths that remain after we remove the nodes in K from D. The core of D is defined as the minimum-cardinality set of intermediate nodes K such that the fraction of remaining source-to-target paths after the removal of K is at most a threshold τ. (To simplify notation, we do not denote the core set as a function of τ.)
Core(τ) = arg min_{K ⊆ M} |K|   such that   |P_K| ≤ τ |P|    (2)
where |P| is the number of source-to-target paths in the original Lexis-DAG, without removing any nodes. We solve the core identification problem with a greedy algorithm referred to as G-Core [20]. This algorithm adds in each iteration the node with the highest path-centrality value to the core set, updates the Lexis-DAG by removing that node and its edges, and recomputes the path centralities of the remaining nodes before the next iteration.
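A minimal sketch of this greedy core identification (our simplification of G-Core, assuming the DAG is given in topological order; it returns early if the threshold cannot be reached by removing intermediate nodes alone):

```python
from collections import defaultdict

def _path_counts(nodes, edges, sources, targets, removed):
    """Forward/backward path-count DP over a DAG (topological node order),
    ignoring any node in `removed` and the edges incident to it."""
    pred, succ = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u not in removed and v not in removed:
            pred[v].append(u)
            succ[u].append(v)
    ps, pt = {}, {}
    for v in nodes:
        ps[v] = 0 if v in removed else (
            1 if v in sources else sum(ps[u] for u in pred[v]))
    for v in reversed(nodes):
        pt[v] = 0 if v in removed else (
            1 if v in targets else sum(pt[w] for w in succ[v]))
    total = sum(ps[t] for t in targets if t not in removed)
    return ps, pt, total

def greedy_core(nodes, edges, sources, targets, tau):
    """G-Core sketch: repeatedly move the intermediate node with the
    highest path centrality into the core until at most a fraction tau
    of the original source-to-target paths survive."""
    inter = [v for v in nodes if v not in sources and v not in targets]
    _, _, p_total = _path_counts(nodes, edges, sources, targets, set())
    core, removed = [], set()
    while True:
        ps, pt, surviving = _path_counts(nodes, edges, sources, targets, removed)
        if p_total == 0 or surviving <= tau * p_total:
            return core
        remaining = [v for v in inter if v not in removed]
        if not remaining:      # threshold unreachable with intermediate nodes
            return core
        v = max(remaining, key=lambda u: ps[u] * pt[u])
        core.append(v)
        removed.add(v)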
2.1.3 Hourglass score
Intuitively, a Lexis-DAG exhibits the hourglass effect if it has a small core. To measure the "hourglassness" of a network, we use a metric named the Hourglass Score, or H-score, originally presented in [19]. To calculate the H-score, we create a flat Lexis-DAG D_f containing the same targets as the original Lexis-DAG D. Note that D_f preserves the source-target dependencies of D: each target in D_f is constructed from the same set of sources as in D. However, the dependency paths in D_f are direct, without forming any intermediate modules that could be reused across different targets. So, by construction, the flat Lexis-DAG cannot have a non-trivial core since it does not have any intermediate nodes. We define the H-score as H(τ) = 1 − |Core(τ)| / |Core_f(τ)|, where Core(τ) and Core_f(τ) are the core sets of D and D_f for a given threshold τ, respectively. Since Core_f(τ) can include a combination of sources and targets, it is never larger than either the set of sources or the set of targets, i.e., |Core_f(τ)| ≤ min(|S|, |T|). Thus, 0 ≤ H(τ) ≤ 1. The H-score of D is approximately one if the core size of the original Lexis-DAG is negligible compared to the core size of the corresponding flat Lexis-DAG.
2.2 EvoLexis Framework and Key Results
The EvoLexis framework includes a number of components that are described below. A general illustration of the framework is shown in Fig. 5. In every iteration, the following steps are performed: (1) A batch of new targets is generated via a target generation model. (2) In the "expansion phase", the new targets are added incrementally to the current Lexis-DAG by minimizing the marginal cost of adding every new target to the existing hierarchy. We refer to this incremental design algorithm as Inc-Lexis; it is described in detail in [21]. (3) If the number of targets present in the system has reached a steady-state threshold, we also remove the batch of oldest targets from the Lexis-DAG.
In general, a system interacts with its environment in a bidirectional manner: the environment imposes various constraints on the system, and the system also affects its environment. To capture this co-evolutionary setting in EvoLexis, we study how changes in the set of targets affect the resulting hierarchy, but also how the current hierarchy affects the selection of new targets (i.e., whether a new candidate target is selected or not depends on its fitness or cost, which in turn depends on how easily that target can be supported by the given hierarchy). By incorporating well-known evolutionary mechanisms, such as tinkering (mutation), recombination, and selection, EvoLexis can capture such co-evolutionary dynamics between the generation of new targets and the hierarchy that supports them. Fig. 6 is an overview of the following key results from the EvoLexis model: i) Tinkering/mutation in the target generation process is a strong initial force for the emergence of low-cost and deep hierarchies. ii) Selection enhances the emergence of more complex intermediate modules in optimized hierarchies. The bias toward reuse of complex modules results in an hourglass architecture in which almost all source-to-target dependency paths traverse a small set of intermediate modules. iii) The addition of recombination to the target generation process is essential for providing target diversity in optimized hierarchies.
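The per-iteration loop described above can be sketched schematically (the `generate_batch`, `expand`, and `contract` callbacks stand in for the target generation model, Inc-Lexis, and target removal; all names here are our own):

```python
from collections import deque

def evolexis_loop(generate_batch, expand, contract, n_iters, steady_state_size):
    """Schematic EvoLexis iteration: each round generates a batch of
    targets (which may depend on the current hierarchy, capturing the
    co-evolutionary feedback), expands the hierarchy incrementally with
    the new batch, and, once the target population reaches the
    steady-state threshold, retires the oldest batch."""
    batches = deque()       # FIFO of target batches currently in the system
    hierarchy = None
    for _ in range(n_iters):
        batch = generate_batch(hierarchy)
        hierarchy = expand(hierarchy, batch)
        batches.append(batch)
        if sum(len(b) for b in batches) >= steady_state_size:
            hierarchy = contract(hierarchy, batches.popleft())
    return hierarchy
```

With trivial stand-ins (a hierarchy that is just a list of targets), the loop settles at the steady-state population size, which is the intended behavior of steps (1) through (3).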
3 iGEM Dataset
3.1 Preliminaries
The International Genetically Engineered Machine (iGEM) competition is an annual worldwide synthetic biology competition. Student teams from diverse backgrounds, including biology, chemistry, physics, engineering, and computer science, compete to construct synthetic DNA structures with novel functionalities.
Every year at the beginning of the summer, teams receive a "Distribution Kit" that includes interchangeable parts (so-called "BioBricks") from the Registry of Standard Biological Parts, comprising various genetic components such as promoters, terminators, reporter elements, and plasmid backbones. The teams then use these parts, together with new standardized parts of their own, to build biological systems. The teams can build on previous projects or create completely new parts. At the end of the summer, all teams add their new BioBricks to the registry for possible reuse in subsequent years.
The iGEM Registry (i.e., the dataset we are working with) includes a set of standard biological parts. A [biological] part is a DNA sequence that encodes a biological function, e.g., a promoter or a protein-coding sequence. These biological parts are standardized so that they can be easily assembled together and reused with other standardized parts in the registry. A "basic part" is a functional unit of synthesized DNA that cannot be subdivided into smaller component parts. BBa_R0051 is an example of a promoter basic part. Basic parts play the role of sources in the Lexis setting. A "composite part" is a functional unit of DNA consisting of two or more basic parts assembled together. BBa_I13507 is an example of a composite part, consisting of the four basic parts "BBa_B0034 BBa_E1010 BBa_B0010 BBa_B0012". The dataset we analyze is the set of all composite parts submitted to the registry from 2003 to 2017. In this dataset, each composite part is represented by the string of its basic parts (i.e., a representation that does not subdivide basic parts). The sequence of iGEM composite parts can thus be considered a sequence of target strings over a set of sources (i.e., basic parts). We acquired the iGEM data from https://github.com/biohubx/igemdata; all BioBrick parts were crawled up to Dec 28th, 2017. Table 1 lists preliminary statistics about the dataset.
# Sources  # Targets  Total Length  Min/Max Target Length 
7,889  18,394  107,022  2 / 100 
The dataset mostly contains targets of small length. The five most common target lengths are 5, 2, 3, 4, and 6, together accounting for more than 70% of the dataset. Less than 10% of the targets have a length of more than 10.
3.2 Considering Annual Batches of Targets
The iGEM competition is conducted annually. Hence, it is reasonable to treat the sequence of targets as annual batches arriving each year. This treatment is in line with the incremental design process in EvoLexis.
To show some differences between iGEM and EvoLexis, Fig. 11 shows how the number of sources, the number of targets, the length statistics, and the source reuse statistics change over time. We can make the following observations from these figures:

The number of sources increases over time, whereas it was constant in EvoLexis.

In the first four years, the number of targets per year is noticeably small. Later, the number of targets increases up to 2,000 and then fluctuates between roughly 1,000 and 1,300 targets per year. In EvoLexis, the number of targets per batch is constant, and all targets have the same length.

The mean and median of the target lengths stay within the same range during all 15 years.
In the following sections, we show how these differences between the iGEM dataset and EvoLexis lead to differences in the resulting Lexis-DAGs.
4 Analysis of iGEM Dataset in EvoLexis Framework
From this section on, we compare the results on iGEM with the results obtained from EvoLexis in [21]. We refer the reader to [21] for details of the model and parameter settings.
4.1 LexisDAG Cost Analysis
In this section, we examine how cost-efficient the Lexis-DAGs over the iGEM dataset are. We consider an incremental setting similar to EvoLexis: in the first year, a clean-slate Lexis-DAG is constructed over the targets of that year; for the targets of the subsequent years, an incremental Lexis-DAG is constructed. Fig. 14 shows how the normalized cost of the Lexis-DAGs varies over the years on iGEM. We observe a major difference from EvoLexis, where the normalized cost remains almost constant.
To investigate the reasons for the above observations, in the same Fig. 14 we also track the cost reduction performance of the two stages of Inc-Lexis for each batch (as a reminder, in stage 1 we reuse intermediate nodes from the previous Lexis-DAG, and in stage 2 we further optimize the hierarchy using G-Lexis). This experiment examines how much stage 1 of Inc-Lexis contributes to the cost reduction on iGEM. There are two observations we can make:

In most batches, more than 50% of the cost reduction is achieved by stage 1, i.e., the reuse stage. The contribution of stage 2 of Inc-Lexis is roughly constant throughout the years. This suggests that iGEM targets reuse a significant number of sequences from previous years' submissions.

There is an increasing trend in the normalized cost after stage 1, meaning that the contribution of the reuse stage of Inc-Lexis decreases over the years. As mentioned, the contribution of stage 2 stays mostly constant. Hence, we can attribute the increasing trend of the normalized cost to the fact that the amount of reuse decreases from year to year.
We can trace the root cause of the decrease of reuse over time on iGEM to the growth of the set of sources. We observed in Fig. (a) that many new sources are introduced over the years. One of the requirements for reuse from one batch to another in EvoLexis is that the set of sources does not change drastically (in fact, it is constant in the EvoLexis framework). To investigate whether this holds in iGEM, we check the ratio of sources that remain the same from one year to the next. Specifically, if S_t and S_{t+1} are the sets of sources in years t and t+1, respectively, we check the ratio |S_t ∩ S_{t+1}| / |S_t|. This ratio, the year-by-year similarity, is the fraction of sources that remain from the previous year. Fig. (c) shows how this ratio changes from year to year. By 2008, the ratio drops significantly to a value around 0.2, which means that around 80% of the sources from the previous year are not reused. This reduces the amount of reuse that is possible in the iGEM dataset. The introduction of new sources also propagates to individual targets: as time progresses, targets are increasingly likely to contain several new sources. This is a further obstacle for reuse, especially given that the targets in iGEM are often short (5-7 subparts). Consistent with the increase in normalized cost, Fig. 17 shows that the DAGs become less deep and have a lower average node length as time progresses. Overall, the results of this section show a number of differences between iGEM and EvoLexis:
In iGEM, the set of sources in each year has low similarity to that of previous years, while in EvoLexis the source set is constant. The high churn in the set of sources is the primary reason for the lower reuse in the iGEM data compared to EvoLexis. The fact that the targets are shorter further lowers iGEM's potential for reuse of longer intermediate nodes.

The normalized cost is higher, and the depth and average node length are lower, in iGEM due to the reduced reuse potential discussed above.
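The year-by-year source similarity discussed above can be computed as follows (one plausible reading of the ratio, normalizing by the previous year's source set; the paper's exact normalization may differ):

```python
def year_by_year_similarity(prev_sources, curr_sources):
    """Fraction of the previous year's sources that appear again in the
    current year: |S_t intersect S_{t+1}| / |S_t|."""
    prev = set(prev_sources)
    return len(prev & set(curr_sources)) / len(prev)
```

For example, if only one of five previous-year sources reappears, the similarity is 0.2, matching the regime the text describes for the years after 2008.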
4.2 Hourglass Effect in iGEM
The results in this section show that in all years, there is a small number of core nodes in the iGEM Lexis-DAGs. Fig. 24 shows that such small cores make the topology of the iGEM Lexis-DAGs consistent with an hourglass organization (high H-score values, above 0.6, in Fig. 24(c)). In EvoLexis, we observe similar H-score values for DAGs constructed using synthetic data. Although the core size increases in iGEM over time, the size of the flat DAG's core increases more steeply, mostly due to the growth of the set of sources. In EvoLexis, the core size shows a decreasing trend while the size of the flat DAG's core does not change significantly, reflecting similarly high H-score values as in iGEM. Overall, the topology of the Lexis-DAGs in the iGEM data is in line with the EvoLexis model, although the bias in the selection of cost-saving nodes is not sufficiently strong to produce a non-increasing normalized cost.
4.3 Diversity among iGEM Targets
Another question concerns the degree of diversity among the targets of iGEM over time. We define the concept of Normalized Diversity as follows: suppose we have a set of strings T. The goal is to provide a single number that quantifies how dissimilar these elements are to each other.

We first identify the medoid of the set T, i.e., the element that has the lowest average distance to all other elements, using the Levenshtein distance as the measure of distance between targets.

To compute how diverse the elements are with respect to each other, we average the normalized distance of all elements from the medoid, where each distance is normalized by the maximum length of the two sequences in question. We call this measure the Normalized Diversity of the set T. The larger the metric, the more diverse the set of strings.
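The two steps above can be sketched as follows (assuming, per one reading of the definition, that the average runs over all elements and therefore includes the medoid's zero distance to itself):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_diversity(strings):
    """Medoid-based diversity: the medoid minimizes the total distance to
    the rest; the score averages, over all elements, the Levenshtein
    distance to the medoid normalized by the longer of the two strings."""
    idx = min(range(len(strings)),
              key=lambda i: sum(levenshtein(strings[i], t) for t in strings))
    medoid = strings[idx]
    return sum(levenshtein(medoid, t) / max(len(medoid), len(t))
               for t in strings) / len(strings)
```

On the toy set {"abc", "abd", "xyz"} the medoid is "abc", and the normalized distances 0, 1/3, and 1 average to 4/9.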
Fig. 28 shows that the normalized diversity metric stays above 0.5 throughout time and reaches up to 0.8 (meaning that, on average, 50% to 80% of a target would have to be changed to convert it into another target in the same year's set). Although such diversity values are in line with EvoLexis, the diversity in iGEM is understandably also pushed toward higher values by the introduction of new sources discussed before. For this reason, and because diversity is measured in a slightly different way in [21], we do not show a direct comparison in Fig. 28.
4.4 Core Stability in iGEM LexisDAGs
We have already defined the core size and the Hscore. Here we define an additional metric, related to the stability of the core across time.
We track the stability of the core set by comparing the core sets at two different times. A direct comparison of the core sets via the Jaccard index leads to poor results, because the strings of the two sets are often similar to each other but not completely identical.
Thus, we define a generalized version of the Jaccard similarity that we call the Levenshtein-Jaccard similarity:

Suppose we aim to compute the similarity of two sets of strings, A and B. We define the mapping m_{A→B}, where every element a ∈ A is mapped to the most similar element b ∈ B. We also define the mapping m_{B→A} from every element b ∈ B to the most similar element a ∈ A:
m_{A→B}(a) = arg max_{b ∈ B} w(a, b),   m_{B→A}(b) = arg max_{a ∈ A} w(a, b)    (3)
where w(x, y) is the similarity of x to y, calculated as w(x, y) = 1 − Lev(x, y) / max(|x|, |y|), with Lev(x, y) the Levenshtein distance. Notice that max(|x|, |y|) is the maximum possible value of the Levenshtein distance between x and y. This ensures that if x = y then w(x, y) = 1, and if x and y are at maximum distance then w(x, y) = 0.

Considering both m_{A→B} and m_{B→A}, we take the union of the two mappings and define the Levenshtein-Jaccard similarity as follows:
LJ(A, B) = ( Σ_{(x, y) ∈ m_{A→B} ∪ m_{B→A}} w(x, y) ) / |m_{A→B} ∪ m_{B→A}|    (4)
We can see that if every mapped pair is identical (all weights equal to one), then LJ(A, B) = 1. Also, if none of the elements of A are similar to those of B (all element pairs have zero similarity), then LJ(A, B) = 0.
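A sketch of this similarity measure (our illustration; tie-breaking between equally similar strings is arbitrary here):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lev_jaccard(A, B):
    """Levenshtein-Jaccard similarity sketch: map each string in A to its
    most similar string in B and vice versa, weight each mapped pair by
    w(x, y) = 1 - lev(x, y)/max(|x|, |y|), and average the weights over
    the union of the two mappings."""
    def w(x, y):
        return 1.0 - levenshtein(x, y) / max(len(x), len(y))
    pairs = {(a, max(B, key=lambda b: w(a, b))) for a in A}
    pairs |= {(max(A, key=lambda a: w(a, b)), b) for b in B}
    return sum(w(x, y) for x, y in pairs) / len(pairs)
```

Identical sets score 1.0, and sets whose strings share no characters score 0.0, matching the boundary cases noted after Eq. (4).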
As the results in Fig. (c) show, the core sets of the iGEM DAGs have relatively high values of the core stability measure (Eq. (4)), close to the values we observed in EvoLexis. This means that the core nodes stay similar across time, and there are no sudden changes in the content of the core set. One reason for this stability is that the set of core nodes includes several sources, and many of the core sources carry over to the next year.
Additionally, every year the iGEM designers focus on specific parts, most of which have high path centrality. For example, "BBa_B0010 BBa_B0012" (the most widely used "terminator" part) and "BBa_B0034" are almost always the top-2 central nodes (with the exception of 2011). Also, some sources, such as "BBa_R0011", always appear among the top-20 nodes in the core set. Recall that Fig. (d) shows that the reuse distribution of sources is highly skewed. In summary, the stability of the core set in iGEM has the same cause as in EvoLexis: a bias and selectivity toward using a specific set of nodes in consecutive years.
5 Conclusions
iGEM is a dataset that satisfies the basic assumption of the EvoLexis framework: a sequence of target strings with potential temporal reuse of previously introduced substrings. Because of this compatibility, we used this dataset in a case study and contrasted its qualitative properties with those of EvoLexis. We can summarize the answers to the questions posed earlier in this paper as follows:

We observe that although incremental design can build efficient hierarchies over the iGEM targets, the normalized cost increases over time. This is because the amount of reuse from previous years decreases, mainly due to the frequent introduction of new sources over time. The small length of the targets in iGEM further lowers the potential for reuse of previously constructed parts.

The increasing normalized cost causes the Lexis-DAGs to become less deep and to contain shorter nodes on average as time progresses. This differs from EvoLexis. In addition, compared to EvoLexis, a high fraction of the targets in each year are very short.

The iGEM Lexis-DAGs exhibit a bias toward reusing specific nodes more often than others. This biased reuse causes the Lexis-DAGs to take the shape of an hourglass, with relatively high H-score values and a stable set of core nodes over time. This observation is consistent with EvoLexis.

The core sets over the years remain stable and similar to those of previous years in the iGEM data, despite the fact that the set of sources changes significantly and the target sets are diverse each year. Most of the stability is contributed by a small set of central sources and central intermediate nodes that are heavily reused in the iGEM registry over time.
References
 [1] igem.org/Main_Page
 [2] Akhshabi, S., Dovrolis, C.: The evolution of layered protocol stacks leads to an hourglassshaped architecture. pp. 206–217. SIGCOMM ’11, ACM (2011)
 [3] Blakes, J., Raz, O., Feige, U., Bacardit, J., Widera, P., BenYehezkel, T., Shapiro, E., Krasnogor, N.: Heuristic for maximizing DNA reuse in synthetic DNA library assembly. ACS Synthetic Biology 3(8), 529–542 (2014)
 [4] Callebaut, W., RasskinGutman, D.: Modularity: Understanding the Development and Evolution of Natural Complex Systems. Vienna series in theoretical biology, MIT Press (2005)
 [5] Casci, T.: Hourglass theory gets molecular approval. Nature Reviews Genetics 12, 76 (Dec 2010)
 [6] Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The Smallest Grammar Problem. IEEE T. on Inf. Theory 51(7) (2005)
 [7] Clune, J., Mouret, J.B., Lipson, H.: The evolutionary origins of modularity. Proceedings of the Royal Society of London B: Biological Sciences 280(1755) (2013)
 [8] Csete, M., Doyle, J.C.: Bow ties, metabolism and disease. Trends in biotechnology 22 9, 446–50 (2004)
 [9] Fortuna, M.A., Bonachela, J.A., Levin, S.A.: Evolution of a modular software network. PNAS 108(50), 19985–19989 (2011)
 [10] Friedlander, T., Mayo, A.E., Tlusty, T., Alon, U.: Evolution of bowtie architectures in biology. PLOS Computational Biology 11(3), 1–19 (03 2015)
 [11] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
 [12] Ishakian, V., Erdös, D., Terzi, E., Bestavros, A.: A Framework for the Evaluation and Management of Network Centrality, pp. 427–438
 [13] Kashtan, N., Noor, E., Alon, U.: Varying environments can speed up evolution. PNAS 104(34), 13711–13716 (2007)
 [14] Kashtan, N., Alon, U.: Spontaneous evolution of modularity and network motifs. PNAS 102(39), 13773–13778 (2005)
 [15] Mengistu, H., Huizinga, J., Mouret, J.B., Clune, J.: The evolutionary origins of hierarchy. PLOS Computational Biology 12(6), 1–23 (06 2016)
 [16] Meunier, D., Lambiotte, R., Bullmore, E.: Modular and hierarchically modular organization of brain networks. Frontiers in Neuroscience 4, 200 (2010)
 [17] Miller, W.: The hierarchical structure of ecosystems: Connections to evolution. Evolution: Education and Outreach 1(1), 16–24 (Jan 2008)
 [18] Myers, C.R.: Software systems as complex networks: Structure, function, and evolvability of software collaboration graphs. Phys. Rev. E 68, 046116 (Oct 2003)
 [19] Sabrin, K.M., Dovrolis, C.: The hourglass effect in hierarchical dependency networks. Network Science 5(4), 490–528 (2017)
 [20] Siyari, P., Dilkina, B., Dovrolis, C.: Lexis: An optimization framework for discovering the hierarchical structure of sequential data. pp. 1185–1194. SIGKDD ’16, ACM (2016)

 [21] Siyari, P., Dilkina, B., Dovrolis, C.: Emergence and evolution of hierarchical structure in complex systems. To appear in Dynamics On and Of Complex Networks III: Machine Learning and Statistical Physics Approaches (2019)
 [22] Supper, J., Spangenberg, L., Planatscher, H., Dräger, A., Schröder, A., Zell, A.: Bowtiebuilder: modeling signal transduction pathways. BMC Systems Biology 3(1), 67 (Jun 2009)
 [23] Tanaka, R., Csete, M., Doyle, J.: Highly optimised global organisation of metabolic networks. IEE Proceedings  Systems Biology 2(4), 179–184 (Dec 2005)
 [24] Wagner, G.P., Pavlicev, M., Cheverud, J.M.: The road to modularity. Nature Reviews Genetics 8, 921 (Dec 2007)