Graph Based Analysis for Gene Segment Organization In a Scrambled Genome

by   Mustafa Hajij, et al.

DNA rearrangement processes recombine gene segments that are organized on the chromosome in a variety of ways. The segments can overlap, interleave or one may be a subsegment of another. We use directed graphs to represent segment organizations on a given locus where contigs containing rearranged segments represent vertices and the edges correspond to the segment relationships. Using graph properties we associate a point in a higher dimensional Euclidean space to each graph such that cluster formations and analysis can be performed with methods from topological data analysis. The method is applied to a recently sequenced model organism Oxytricha trifallax, a species of ciliate with highly scrambled genome that undergoes massive rearrangement process after conjugation. The analysis shows some emerging star-like graph structures indicating that segments of a single gene can interleave, or even contain all of the segments from fifteen or more other genes in between its segments. We also observe that as many as six genes can have their segments mutually interleaving or overlapping.



There are no comments yet.


page 1

page 2

page 3

page 4


Segment representations with small resolution

A segment representation of a graph is an assignment of line segments in...

Direct Segmented Sonification of Characteristic Features of the Data Domain

Sonification and audification create auditory displays of datasets. Audi...

Experimental analysis of the accessibility of drawings with few segments

The visual complexity of a graph drawing is defined as the number of geo...

A change-point approach to identify hierarchical organization of topologically associated domains in chromatin interaction

The identification of spatial and temporal three-dimensional (3D) genome...

A Triclustering Approach for Time Evolving Graphs

This paper introduces a novel technique to track structures in time evol...

Classification of abrupt changes along viewing profiles of scientific articles

With the expansion of electronic publishing, a new dynamics of scientifi...

Optimal Construction of Hierarchical Overlap Graphs

Genome assembly is a fundamental problem in Bioinformatics, where for a ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

It has long been observed that genome rearrangement processes on an evolutionary scale can lead to speciation [dobzhansky1933sterility], while on developmental scale they often involve DNA deletions [beermann1977diminution, shibata2012extrachromosomal] as well as wholescale programmed rearrangements [smith2012genetic, prescott1994dna]. For example, the highly diverse collection of antibodies often is attributed to somatic DNA recombination [tonegawa1983somatic], and rearrangements on a chromosomal levels can be observed during homologous recombination [rieseberg2001chromosomal]. In recent years there are numerous observations of alternative splicing where rearranging patterns of exons and introns of a single gene can produce different protein variants from a single mRNA (e.g. [haussmann2016m6a]). Rearranging segments of nucleotide sequences can be organized in a variety of arrangements on the locus, for example, they can be overlapping or interleaving [g3]. Oxytricha trifallax is a single cell organism that is often taken as a model organism to study DNA rearrangement processes. This, and similar species of ciliates undergo massive restructuring of a germline micronuclear DNA during development of a somatic macronucleous. Recent sequencing and annotation of the whole O. trifallax genome allow genome level studies [chen2014architecture, burns2015database]. Scrambling patterns within thousands of genes were observed revealing hidden structures among the scrambled gene/nanochromosome segments that explain over 95% of the scrambled genome [burns2016recurring]. While those studies were focused on scrambled recurrent patterns within a single gene, in this paper we study inter-gene segment arrangements. Through clustering techniques we identify patterns in which segments of different genes interleave or overlap throughout the genome. We represent a micronuclear locus with the interleaving and overlapping gene segments by a directed graph. Such obtained graph data is then converted to a set of points (point cloud) in a Euclidean space and we apply topological data analysis techniques to obtain clusters of similar graphs. This method was applied to the whole genome data of Oxytricha trifallax [chen2014architecture, burns2015database]. In the last decade, Topological Data Analysis (TDA) has shown to be another tool for data analysis and data mining that can be used to extract topological information from various types of data [carlsson2009topology]. TDA originated from computational topology with ideas inspired from statistics and computer science. One of the notable tools in TDA is persistence homology. Persistence homology, or for brevity PH, was introduced by Edelsbrunner et al. [edelsbrunner2000topological] and later studied further by Zomorodian and Carlsson [zomorodian2005computing]. The theory has been studied extensively since then and many theoretical advances have been made [carlsson2010zigzag, carlsson2009zigzag, chazal2013persistence]. Persistence Homology is defined in all dimensions

, but clustering analysis, as used in this paper, uses only dimension

. More details on TDA and persistence homology can be found in [carlsson2009topology, edelsbrunner2008persistent, ghrist2008barcodes]. Due to advances in bioscience and biotechnology, the growth of biomolecular data has exploded and many data analysis algorithms have been developed aiming to better understand the generated data [cheng2006dompro, fernandez2014improving, meinicke2015uproc]. Data analysis using topological methods has proven to be useful in showing general patterns that were difficult to observe with other techniques. TDA methods have been used recently in many applications including protein structure identification [gameiro2015topological, xia2014persistent], aggregation models for animal behavior [ballerini2008interaction], and fullerene stability [xia2015persistent]. Our data set consists of directed graphs which we convert to a set of points in a Euclidean space, point cloud. . For the clustering analysis we applied TDA with PH at dimension zero to the obtained point cloud in . This process is described in Section 2. Although this paper is focused on analysis of a specific data of interleaving/overlapping gene segments, the method that we propose for converting a graph data to a point cloud data is novel and general, and can be applied to analyze similarities in various graph data.

2. Method and Data Construction

2.1. Gene Segment Maps in Oxytricha trifallax

Oxytricha trifallax is a species of ciliate used as a model organism to study genome rearrangements. It undergoes massive genome rearrangements during the development of a somatic macronucleus (MAC) specializing in gene expression from an archival germline micronucleus (MIC) [chen2014architecture]. Within this process, over 16,000 macronuclear nanochromosomes assemble through DNA processing events involving global deletion of 90-95% of the germline DNA, effectively eliminating nearly all so-called “junk” DNA, including intervening DNA segments (internally eliminated sequences, IESs). Because these IES segments interrupt the coding regions of the precursor macronuclear gene loci in the micronucleus, each macronuclear gene may appear as several nonconsecutive segments (macronuclear destined sequences, MDSs) in the micronucleus. Moreover, the precursor order of these MDS segments for thousands of genes can be permuted or inverted in the micronucleus such that during the macronuclear development, all IESs are deleted and the MDSs are rearranged to form thousands of gene-sized chromosomes. In [chen2014architecture] and later in [g3], it was observed that an IES between consecutive MDSs of one gene can contain MDS segments from other genes, and that this process can be nested. Furthermore MDSs from different MAC genes can overlap or one MDS can be a subsegment of an MDS of another gene.

Figure 1. Rearrangement of gene segments in Oxytricha trifallax

The interleaving situation is schematically depicted in Figure 1. MDSs are represented by colored boxes with numbers. This example illustrates a MIC contig that has MDSs of three MAC contigs, each indicated with a different color. The numbers within the boxes indicate the order of the MDSs in the corresponding MAC contig. The barred numbers indicate MDSs in a reverse orientation (inverted) in the MIC contig relative to the ordering of the MDSs in the corresponding MAC contig. The thin black lines between the squares indicate IESs. The MDS interrelationship analyzed in this paper uses the genome sequencing data in [chen2014architecture], and can be downloaded from the Supplemental Information in  [chen2014architecture] and also in [burns2016recurring]. The data used for analysis in this paper is the processed data reported in [burns2016recurring]. This data was filtered so that consecutive MDSs of a single MAC contig that overlap or have no nucleotide gap (are adjacent) in the MIC are merged into a singe MDS (correcting to possible previous annotation artifacts). In addition, we excluded MAC contigs that contain segments that are distant in the MAC contig but overlap in a MIC contig, as well as MAC contigs that are alternative fragmentations of longer MAC contigs. Both of these cases could be considered artifacts of the MIC-MAC maps although we cannot exclude the possibility that these cases are genuine to the data. We refer to such processed data as data available at

2.2. Graphs Corresponding to MIC Contings

As MDS from different MAC contigs can overlap or interleave in a MIC contig we define the following types of relationships between MAC contigs located on a single MIC contig.

Figure 2. Three types of MDS segment organization of different MAC contigs. MDSs of the same MAC contig are colored the same.
  • (Type 1 : Overlapping) If an MDS of a MAC contig overlaps with an MDS of another, distinct MAC contig , then it is said that and overlap, or they are overlapping. We also say that has type 1 interaction with , or has interaction of type 1 with . Two MAC contigs are considered to be overlapping if they have at least one pair of MDSs that overlap with at least 20bp in common. This is because two consecutive MDSs of the same MAC contig usually share sequences at their ends (pointers) that guide the rearrangement process [prescott1994dna], and two MDSs from distinct MAC contigs can share the same pointer sequence. The length of these pointer sequences usually ranges between 2 to 20 nucleotides. The overlapping relation is symmetric, if overlaps with , then overlaps with . The situation is depicted in Figure 2 (Type 1). In the figure, MDSs of and are represented by blue and red rectangles, respectively. This case excludes the case when one MDS is completely included in another, even though being a subsequence is a particular type of “overlapping”. Such situation is included in Type 2 case (below).

  • (Type 2 : Containment) If an MDS of a MAC contig is contained in (is a subsegment of) an MDS of another distinct MAC contig , then it is said that an MDS of is contained in an MDS of , and we say has type 2 interaction with . For this interaction when an MDS of contains an MDS of , we require that both ends of have at least 5 bps that are not in common with . That is, we require a complete inclusion such that there are no pointer sequences in common. In Figure 2 (Type 2), MDSs of are depicted in blue, and those for in red. This relation is not symmetric. We distinguish this situation from the one in Type 1 because the unscrambling of an MDS that is next to at least one IES (Type 1) may use a different biological process involving Piwi-interacting RNA [Fang] rather than one that does not neighbor an IES ( in Type 2).

  • (Type 3 : Interleaving) If an IES of a MAC contig contains an MDS of another, distinct MAC contig , then it is said that an MDS of interleaves (or is interleaving) with , or has Type 3 interaction with . We allow that the ends of an interleaving MDS of and the MDSs of to intersect (overlap) up to (including) 5 bases. This requirement distinguishes type 3 case from the ‘overlapping’ case where the requirement is at least 20 bases.

We consider pairs of MAC contigs and that belong to the same MIC contig. To each pair of MAC contigs we associate a triple where each entry () indicates whether is in relationship of Type with . The value of is either (there is no relationship of Type ) or ( is related to of with Type ). To investigate the situations of these three types of interactions of MDSs, we associate a directed graph with labeled edges to each MIC contig in data as follows. Each graph , which may be disconnected and have multiple connected components, corresponds to a MIC contig . Each vertex corresponds to a MAC contig whose MDSs are segments of the MIC contig . A labeled directed edge is in if . In the figures below we use colors on the edges to indicate the labels of the edges: red, green, blue, orange, purple, cyan, and black.

Figure 3. A segment of MIC contig ctg7180000069854 (left) and its corresponding graph (right). There is a relatively short overlap of MDSs of the black and purple contigs and the purple contig interleaves with the black (hence there is a blue edge indicating label from the purple to the black vertex) but there is no MDS segment of the black contig that interleaves with the purple contig, so there is a purple edge indicating in the opposite direction. IESs of both black and purple contigs contain MDSs of the cyan contig, so there are two black edges indicating labels ending at the cyan vertex.

Figure 3 shows a locus of the MIC contig ctg7180000069854 containing MDSs of three MAC contigs 5027.0 (purple), 21621.0 (black), and 4739.0 (cyan). Figures 16 and 28 depict some other examples of graphs for MIC contigs in the data. The set of graphs corresponding to the data that is analyzed here is denoted such that is a MIC contig in . There are 629 distinct colored graph isomorphism classes and 288 isomorphism classes of colored connected components of the graphs in , and they can be found at:

2.3. Graph Features Selection

We describe a method of converting graph data set to a set of points in the Euclidean space , i.e., the point cloud. To each (directed and colored) graph

in our data set we associate a vector

obtained by using relevant numerical graph invariants. This vector is then considered as a point in (Figure 4).

Figure 4. Every graph is associated to a feature vector , a point in the Euclidean space .

The vector is obtained by using local, vertex specific, and global, graph specific, features of . In this first global analysis of the genome we cluster the data according to general graph structure properties, therefore for each we also consider a corresponding undirected graph . The undirected, uncolored graph is obtained from by replacing each pair of parallel edges with opposite directions in with an undirected edge, and by ignoring the direction and the colors of the edges as shown in Figure 5.

Figure 5. An undirected graph associated to a directed one.

Global Vector. A vector with three entries, called the global vector, is associated to each graph . This vector consists of three features where and are the numbers of vertices and edges in , respectively, and is the size of the largest clique in . The isolated vertices are not counted in as they represent MAC contigs that have no interrelation with any other MAC contig present in the MIC contig represented by the graph. In our data the maximum number of vertices is 43, the maximum number of edges is 74, and the largest clique size is 6 (appears twice in the data). Local Vector. Vectors that use local properties of the vertices are associated to each . For each vertex we consider two numbers, its valency , and the clique number . The valency is a summation of its out-degree and its in-degree (including the parallel oppositely oriented edges) and the number of cliques of given sizes. The clique number is the number of cliques (induced subgraphs of isomorphic to the complete graph for some ) that contain this vertex. The vertices in , are ordered such that their valences are non-increasing. Vertices that have the same valency are further ordered such that their clique numbers are non-increasing. This order remains fixed for the graph . The valency vector, denoted , consists of a list of valencies of the preordered vertices of the graph . The maximum valency of a vertex in our data is achieved by contig 67157 with 25 outgoing edges and 4 incoming edges. The 23 of the outgoing edges have , one edge has label and the other is labeled . The maximum outgoing valency is 25 and it is achieved by the contig 67157. The maximum incoming valency is 6 and it is achieved by contig 67223.

Figure 6. Vertices of a graph are ordered by their valances. Here, .

The vertex order of the clique vector follows the same predetermined order of vertices for . We denote this vector by . An example of construction of is depicted in Figure 7.

Figure 7. The number of cliques associated with the vertex , vertices ordered as in Figure 6. Left: vertices and edges. Middle: . Right: . This graph has no cliques of size higher than . The clique vector in this example is which is the sum of the vectors.

The Graph Vector. The graph feature vector is defined by concatenating the vectors , and . For a graph , the number of entries of the vectors and are the same and therefore is a vector in . We denote the set of vectors associated to with , or simply . The Point Cloud. Observe that the number of entries of the vectors in is not uniform, because this number depends on the number of vertices in the corresponding graph. In order to work in the common Euclidean space, we expand some of the vectors (by appending 0’s) to obtain a consistent number of entries in all vectors. This modification is obtained in the following way. Let . If the valence vector of is

then we construct an auxiliary valence vector for with

increasing the number of entries of to such that entries of zeros are added at the end. Similarly we construct auxiliary clique vector by adding zeros at the end of . The graph vector is redefined with the concatenation . For our graph data the the maximum number of vertices in a graph is . We abuse the notation and use instead of to refer to the zero-augmented feature vector associated with a graph . The final point cloud set forms a subset of . For comparison, we also consider the point cloud from the global features vector . The point cloud is in . There are vectors obtained with the above adjustments. It is important to notice that the entries of the vectors , and for a graph are graph isomorphism invariants.

Lemma 2.1.

If the graphs and are graph isomorphic (but not necessarily label preserving) then and .


Let be a graph isomorphism. Then and have the same number of vertices, edges and the size of the maximal cliques in their undirected versions and . Therefore and the first three entries of and are the same. Also the number of non-zero entries in and are the same. A graph isomorphism maps vertices of to vertices of with the same number of outgoing and incoming edges. Similarly, the number of cliques incident to a vertex in is the same to the number of cliques of the corresponding vertex in . Let be a partition of such that

  • for all , for all and , and

  • for all and with , either or and .

Then is a partition of the vertices of satisfying the properties [i] and [ii]. Any order of vertices of (resp. ) that has non-increasing valencies and non-increasing clique numbers must list vertices of (resp. ) before vertices of (resp. ) whenever . Therefore it must be that and . ∎

In our analysis there are three reasons that reduced the data from graphs to . Many of these graphs are isomorphic if the edge color is ignored, and directed graphs often reduce to isomorphic undirected graphs. Of course there are graphs and that are non isomorphic but . Consider attaching two edges to a 4-cycle to obtain a 6-vertex graph. They can be attached to neighboring or to diagonally opposite vertices of the cycle. In both cases the associated vectors will be the same.

3. Clustering Analysis with TDA

For a data set , in our case corresponding to a set of directed graphs, a TDA analysis of persistent -dimensional homology gives rise to a hierarchy of connected components of (clustered) graphs as described below. To understand the distribution of the points of in we use the notion of the neighborhood graph, as defined below, to construct a hierarchy of undirected graphs whose vertices are . The neighborhood graph of depends on a chosen distance function. In our case the distance is the Euclidean distance between two points .

Definition 3.1.

Let be a set of points in and let be a non-negative number. The -neighborhood graph is an undirected graph , where and .

The clustering analysis is done by considering a sequence of neighborhood graphs for obtained by a sequence of incrementally increasing values .

Definition 3.2.

A cluster of at level is a connected component in the neighborhood graph .

We observe some facts about the graphs vectors and . Suppose is a family of graphs and and are points in obtained as described above. The vectors in the set and are all part of the integer lattice of and respectively, therefore any distance between two distinct vectors is at least . The observation below indicates that small changes in the graphs represented by the vectors induce larger distances of the vectors in .

Lemma 3.3.

Let . Then the following hold.

  • If is obtained from by addition of one vertex and one edge incident to that vertex. Then and .

  • If is obtained from by addition of one directed edge without changing the total number of vertices, nor the number of cliques, then .

  • If is obtained from by addition of one edge that adds a clique to the graph without changing the number of vertices, then .

  • If is obtained from by changing the target of one edge from vertex to vertex without changing the number of the cliques, either or .


(a) The addition of a vertex in changes the number of non-zero entries in in two places, once at and again at . Let be the new vertex in added to and let be the new edge in connecting with the new vertex . Then can be taken to be the last vertex in in the order of the vertices, while the order of in might be either the same as its order in or different. In both cases the entries in corresponding to , , , , are at least one more than the corresponding entries in and the entry of is at least two more (a 1-clique vertex and a 2-clique the new edge) than the corresponding entry in which is 0. So , and . The proofs of (b) and (c) follow a similar argument. Note that in case of (b), if the new directed edge is incident to vertices and , then because the number of cliques in is not changed from the number of cliques in , there is an edge in incident to and in opposite direction. So has at least one more in the entries , and . Observe that this may imply a change in the order of the vertices, in which case there may be a difference in the entries corresponding to the and which would increase the distance between the vectors. Therefore . For the case of (c), the entries of , , , , in vector have a change of at least one and therefore the distance and . The case (d) follows the argument of (b) if the valencies change or, if valencies don’t change, the graphs are represented by the same vectors and the distance is 0. ∎

3.1. Analyzing The Data Using Neighborhood Graphs

A filtration of a graph is a sequence of nested graphs where each is a subgraph of . The definition of the neighborhood graph on a point cloud naturally induces a filtration for a connected graph with vertices . Namely, given a point cloud and a finite sequence of non-negative numbers we obtain a filtration . We assume that , which implies that . This filtration also helps to extract the connected components (clusters) of at various spatial resolutions. For a given , each connected component of corresponds to a cluster of directed graphs whose corresponding points in are connected by edges that are of lengths less than . This means that the vectors associated with the graphs are at most apart, i.e., the graph properties indicated in the vectors are similar and are within neighborhood from each other. To have a better information about the topological properties encoded in a filtration one usually considers the persistence diagram of the filtration. For our purpose, the persistence diagram describes a way the connected components of the neighborhood graph merge together as we increase the value of . The persistence diagram is also equivalently described by the persistence barcode [Ghrist2008]. The barcode construction is described as follows. Let , where (in our case ). In Figures 8 and 12, the vertical axis enumerates points of , and -values are listed on the horizontal axis. At , , and each point of forms a single connected component. There are connected components, and hence the number of bars in the barcode at value is equal to the number of data points in corresponding to the birth of all connected components. With appropriate increments of new edges are added to the neighborhood graph and the connected components start joining each other forming larger clusters. The merging event of connected components is represented by a termination of all but one of the corresponding bars of the barcode. The choice of the bar that does not terminate in a merge of components is arbitrary, and we use the established convention (see  [Ghrist2008]) where bars are vertically ordered by their length from the shortest at the bottom of the diagram to the longest on the top. The number of connected components of the graph is the number of horizontal bars intersecting the vertical line at distance equal to . For instance, from Figure 8 we deduce that the number of connected components in is for indicating two clusters at that distance. Typically, the filtration ends with a neighborhood graph that has a single connected component. That is, the sequence of values increase from to the value that gives rise to a single component graph. In the case of data for the set of global vectors and the point clouds and , the values range from to and to respectively.

Figure 8. The barcode diagram describing the birth and death of the connected components of the neighborhood graph of the dataset .

3.2. Tree Diagrams Representing Merging Components

The merging events of connected components described in the persistence diagram can be encoded using a tree diagram called a dendrogram [murtagh1983survey]. The bottom points of the tree diagram correspond to the points of (resp. ), that also correspond to the connected components of . The vertical direction of the tree diagram represents values of . At each level the connected components (clusters) are enumerated and each vertex in the tree is labeled by where is an index that corresponds to the th cluster of the graph at level . At each level , the number of nodes corresponds to the number of clusters of . For a node (vertex) at level , the children of correspond to the clusters at level (i.e., the graph ) that have joined to a single connected component represented by in . For a large enough value of , is connected, and it corresponds to the single node (root) of the tree. The dendrograms corresponding to the persistent diagrams for  and are shown in Figures 9 and Figure 13 in the supplementary documentation, respectively.

Figure 9. The dendrogram clustering tree of dataset .

3.3. Implementation

The point cloud generated from the data was generated using a custom Python script. The persistence diagrams were generated using Javaplex [Javaplex] and the dendrogram tree diagrams were generated using Mathematica [Mathematica]. The sequence data, the graph data and the scripts are available at

4. Results

The analyzed data consists of processed [burns2016recurring] micronuclear contigs obtained after sequencing of O. trifallax [chen2014architecture] and is available at [burns2016recurring]. The directed graphs that correspond to the contigs in can be found at As mentioned, the data produced 273 distinct vector entries in that correspond to the same number of isomorphism classes of graphs ranging from 2 to 43 vertices. Each MIC contig corresponds to a vector in while the MAC contigs whose MDS segments do not have any of the types 1, 2 or 3 interactions with MDSs of other contigs represent isolated vertices in the graphs and are not taken in consideration for the construction of . We constructed filtration with increments of in order to detect small neighborhood changes in the neighborhood graph, these sometimes are reflected by reorienting a directed edge.

4.1. Output of Hierarchal Clustering

The bar code diagram and the dendrogram for the filtration and clustering of the neighborhood graph of are depicted in Figure 8 and Figure 9. As expected by Lemma 3.3, the neighborhood graph consists of isolated vertices for and the fist edges appear at when there are 14 two point and 4 three point clusters. The two or three graphs joined at this distance differ from each other by small changes such as a single directed edge addition that does not change the cliques. At , as noted in Lemma 3.3, most points remain distant from each other and only those representing graphs with small changes in their structure are joined by en edge. In addition two, three and in one instance four of the previously formed clusters join in (also with some additional points) to form new clusters, and there are 25 new small two or three point clusters. Most of the points in remain as isolated vertices. At a dramatic change occurs and one large cluster of 155 elements is formed with a second cluster of 5 points, and several small (two or three point) clusters. All other points stay as isolated vertices. At this point the feature of the point-cloud becomes clear, it consists of a single large cluster, singletons, and some small two or three element components. At , there is one large cluster of 269 points while the second largest cluster is of 4 elements, and there are 10 isolated points. In the last 5 digits of contig numbers, the second largest cluster consists of:

Figure 10 shows the graphs contained in this cluster.

Figure 10. Top ctg7180000088928 and ctg7180000088096. Down ctg7180000067742 and ctg7180000067187.

All four of these graphs contain a ‘star’ vertex that is of high valency having multiple black (label ) outgoing edges. This indicates that there is one MAC contig whose MDSs interleave with MDSs of multiple other MAC contigs, and we observed that in most of these cases it is one IES of the central ‘star’ MAC contig that contains most or all of the MDSs of the other contigs. This coincides with the observation in [g3] where the depth of these embeddings were considered. The isolated points belong to 10 contigs

These are depicted in the corresponding figures in Supplementary Material Section. We note that some of these graphs have multiple ‘star’ vertices, or the component that contains a ‘star’ vertex also has additional cycles and cliques. In particular, the two graphs with 6-cliques (contigs and ) and the one with a 5-clique (contig ) are part of these isolated points. Furthermore, the graph with the longest path of 5 vertices (contig ) indicated in [g3] as one of the most in-depth embedding of genes within a single IES is also on this list. In all these cases we observe that the majority of the edges are black and purple, meaning that the prevailing inter-gene MDS organization is interleaving. As increases, the four-element cluster becomes part of the large cluster at and the isolated singleton points join the large cluster one or two at the time until when the two, most distant contigs and remain isolated until and respectively. The pattern of clusters for is similar to that of . A large single cluster is formed at value , with 2 clusters of 5 elements, 3 clusters of 2 elements, and 23 singleton clusters. At , the clusters consist of a large single cluster, the second largest of 9 elements, the two clusters of two elements, and 5 singleton clusters. The size two clusters are and . The elements of the former cluster appear as isolated points in the neighborhood of , while of the latter cluster, appears in the 4-elements cluster of , and is in the largest cluster of . The isolated points for are We note that all also appear as isolated points of the neighborhood graph of for . In this case, as in the case of , the two most distant graphs correspond to the contigs and that join the large cluster at and respectfully.

Figure 11. A 2d multidimensional scaling projection for . The points of are colored according to clustering at . At this level we have clusters, the largest cluster is colored red in the figure, the second largest cluster consists of elements and is colored green and the singletons are all colored blue.

Figures 11 and 14 represent the 2d multidimensional scaling projections[kruskal1978multidimensional] of the point clouds and , respectively.

5. Discussion

In this paper we initiated a mathematical method of representing and analyzing gene segment relationship in a scrambled genome of Oxytricha trifallax and up to our knowledge, such genome wide study for inter-gene segment arrangement has not been done before. The inter-gene segment arrangements are represented by graphs representing their segment relationship. We analyzed the graph data by converting these graphs to a point cloud in a higher dimensional Euclidean space and applied clustering methods from topological data analysis to identify patterns in the graph structures. The big majority of interactions within a single MIC contig are represented with small graphs up to five vertices (corresponding to the large cluster at ) and one can ‘move’ from one graph to another by small vertex/edge changes. This suggests that genes with complex interaction patterns are unique and often not found among macronuclear genes. The majority of the inter-gene segment arrangement within micronuclear chromosomes involve two or three genes and are pairwise interleaving and sometimes overlapping. The most prevalent multi-gene segment arrangements in the Oxytricha’s genome are interleaving and often this appears as one gene (or one IES in a MAC contig) interleaving with multiple other MAC contigs (as seen through the ‘star’ like vertices). In the cluster consisting of four graphs, a single IES of the ‘star’ MAC contig interleaves with multiple MAC contigs. All star contigs are scrambled, which follows the analysis in [g3]

where it was observed that contigs whose IESs interleave with other MAC contigs are mostly scrambled. The graph representation of the inter-gene segment relationship introduced here is novel, and we hope that a similar approach can be used in studies of the scrambled genomes of other species. Comparisons among orthologous genes in other species with scrambled genomes may reveal whether patterns in these graph structures are conserved or embellished over evolutionary time. Furthermore, studies whether genes with interleaved gene segments are co-expressed may indicate whether the rearrangement of these MAC segments are in parallel or sequential. We suggest that models of gene rearrangement should also focus on operations that can be applied to these frequent interleaving gene segments, which in some cases resemble the odd-even patterns detected within scrambled genes

[burns2016recurring]. The construction of the graph data into a point cloud in this paper is by a vector whose entries are common graph invariant properties, such as the number of vertices, edges and cliques. We used two vectors, one that had more local vertex properties and the other in which included only the number of vertices, edges and the maximal clique. It is interesting that in both cases the isolated points are the same, and much distant from the rest of the points. The rearrangement process of the MIC contigs corresponding to these isolated points may indicate specific biological process that include multiple genes simultaneously. The graphs with large cliques (5 and 6) imply that segments of up to 5 or 6 genes all mutually interleave and we suggest further rearrangement gene analysis for these situations. In our study we did not consider the length of overlapping segments, nor the number of interleaving gene segments. Further methods that include edge weights may be suitable for more detailed analysis. Our representation of graphs relied mainly on representation of a graph via a feature vector in . Similar attempts in this direction have been made for graph similarities [gibert2012graph, riesen2010graph]. Their work focus on undirected graphs and do not consider the local properties that we used. There are other venues that rely on developing a similarity measure between graph [bunke2011recent, papadimitriou2010web, hajij2017visual] or graph kernels [gartner2003graph, baur2005network] that we have not explored here. These are methods that often relay on the structural properties of the graph sometimes identified through topological methods and they may also reveal other properties in the genome. For example, such methods have been successfully applied in protein function prediction [borgwardt2005protein] and chemical informatics [ralaivola2005graph]. Comparison of such graph analysis methods will be subject of another study.



Supplementary Material

Figure 12. The persistence diagram of the data set obtained from the global features.
Figure 13. The clustering tree of the data set obtained from the global features.
Figure 14. A 2d multidimensional scaling projection for . The points of are colored according to clustering at . At this level we have clusters, the largest cluster is colored red, the second two largest clusters consists of elements and they are colored magenta and green, and the singletons are all colored black.

The graphs in the cluster of at consisting of four elements depicted in Figures 16,16,18 and 18.

Figure 15. ctg7180000088928
Figure 16. ctg7180000088096
Figure 17. ctg7180000067742
Figure 18. ctg7180000067187

The graphs that form singleton clusters (isolated points) in () are depicted in the remaining figures.

Figure 19. ctg7180000067761
Figure 20. ctg7180000087162
Figure 21. ctg7180000087484
Figure 22. ctg7180000067363
Figure 23. ctg7180000067280
Figure 24. ctg7180000067243
Figure 25. ctg7180000067157
Figure 26. ctg7180000067223
Figure 27. ctg7180000067417
Figure 28. ctg7180000067411