Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data
Data represented as strings abound in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that repeatedly creates more complex strings from simpler substrings. We propose a framework, referred to as "Lexis", that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, the "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
In both nature and technology, information is often represented in sequential form, as strings of characters from a given alphabet [11]. Such data often exhibit a hierarchical structure in which previously constructed strings are reused in composing longer strings [21]. In some cases this hierarchy is formed “by design” in synthetic processes where there are some cost savings associated with the reuse of existing modules [15, 19]. In other cases, the hierarchy emerges naturally when there is an underlying evolutionary process that repeatedly creates more complex strings from simpler ones, conserving only those that are being reused [19, 21]. For instance, language is hierarchically organized starting from phonemes to stems, words, compound words, phrases, and so on [20]. In the biological world, genetic information is also represented sequentially and there is ample evidence that evolution has led to a hierarchical structure in which sequences of DNA bases are first translated into amino acids, then form motifs, regions, domains, and this process continues to create many thousands of distinct proteins [8].
In the context of synthetic design, an important problem is to construct a minimum-cost Directed Acyclic Graph (DAG) that shows how to produce a given set of "target strings" from a given "alphabet" in a hierarchical manner, through the construction of intermediate substrings that are reused in at least two higher-level strings. The cost of a DAG should be related to the amount of "concatenation work" (defined more precisely in the next section) that the corresponding hierarchy would require. For instance, in de novo DNA synthesis [4, 7], biologists aim to construct target DNA sequences by concatenating previously synthesized DNA sequences in the most cost-efficient manner.
In other contexts, the target strings may have been constructed through an evolutionary process (not necessarily biological), or the synthetic process that was followed to create the targets may be unknown. Our main premise is that even in those cases it is still useful to construct a cost-minimizing DAG that composes the given set of targets hierarchically, through the creation of intermediate substrings. The resulting DAG shows the most parsimonious way to represent the given targets hierarchically, revealing substrings of different lengths that are highly reused in the targets and identifying the dependencies between the reused substrings. Even though it is not possible to prove that the given targets were actually constructed through the inferred DAG, this optimized DAG can be thought of as a plausible hypothesis for the unknown process that created the given targets, as long as we have reasons to believe that that process aims to minimize, even heuristically, the same cost function that the DAG optimization considers. Additionally, even if our goal is not to reverse-engineer the process that generated the given targets, the derived DAG can have practical value in applications such as compression or feature extraction.
In this paper, we propose an optimization framework, referred to as Lexis (Lexis means "word" in Greek), that designs a minimum-cost hierarchical representation of a given set of target strings. The resulting hierarchy, referred to as a "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, which themselves might be the result of concatenating other, shorter substrings, all the way down to a given alphabet of elementary symbols. We consider two cost functions: minimizing the total number of concatenations and minimizing the number of DAG edges. The choice of cost function is application-specific. The Lexis optimization problem is related to the smallest grammar problem [6, 14]. We show that Lexis is NP-hard for both cost functions, and propose an efficient greedy algorithm for the construction of Lexis-DAGs. Interestingly, the same algorithm can be used for both cost functions. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG. This is the minimal set of DAG nodes that can cover a given fraction of source-to-target paths, from alphabet symbols to target strings. The core of a Lexis-DAG represents the most central substrings in the corresponding hierarchy. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
Given an alphabet S and a set of "target" strings T over S, we need to construct a Lexis-DAG. A Lexis-DAG is a directed acyclic graph D = (V, E), where V is the set of nodes and E the set of edges, that satisfies the following three constraints. (To simplify the notation, even though D is a function of S and T, we do not denote it as such.)
First, each node v ∈ V in a Lexis-DAG represents a string s(v) of characters from the alphabet S. The nodes that represent the characters of S are referred to as sources, denoted V_S, and they have zero in-degree. The nodes that represent target strings are referred to as targets, denoted V_T, and they have zero out-degree. V also includes a set of intermediate nodes V_M, which represent substrings that appear in the targets T. So, V = V_S ∪ V_M ∪ V_T.
Second, each node in V_M ∪ V_T of a Lexis-DAG represents a string that is the concatenation of two or more substrings, specified by the incoming edges from other nodes to that node. Specifically, an edge from node u to node v is a triplet (u, v, i) such that the string s(u) appears as a substring of s(v) at index i (the first character of a string has index 1). Note that there may be more than one edge from node u to node v. The number of incoming and outgoing edges of v is denoted by d_in(v) and d_out(v), respectively. I(v) is the sequence of nodes that appear in the incoming edges of v, ordered by the edge index i. We require that, for each node v in V_M ∪ V_T, replacing the sequence of nodes in I(v) with their corresponding strings results in exactly s(v).
Third, a Lexis-DAG should only include intermediate nodes that have an out-degree of at least two:

d_out(v) ≥ 2 for all v ∈ V_M.   (1)
In other words, every intermediate node v ∈ V_M in a Lexis-DAG should be such that the string s(v) is reused in at least two concatenation operations. Otherwise, v is either not used in any concatenation operation, or it is used only once, in which case the outgoing edge of v can be replaced by rewiring the incoming edges of v straight to the single occurrence of s(v). In both cases node v can be removed from the Lexis-DAG, resulting in a more parsimonious hierarchical representation of the targets. Fig. 1 illustrates the concepts introduced in this section.
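As a concrete illustration, here is a minimal sketch of how a Lexis-DAG could be represented in code. The class and method names are ours, not part of the Lexis implementation, and edge indices are 0-based here rather than the 1-based convention used in the text.

```python
class LexisDAG:
    """Toy Lexis-DAG: nodes are identified by the strings they represent."""

    def __init__(self, alphabet, targets):
        self.alphabet = set(alphabet)   # source nodes V_S
        self.targets = set(targets)     # target nodes V_T
        self.intermediate = set()       # intermediate nodes V_M
        self.in_edges = {}              # node -> ordered list of (part, start index)

    def add_node(self, s, parts):
        """Define node s as the concatenation of `parts` (node strings),
        recording one incoming edge per part with its start index in s."""
        assert "".join(parts) == s      # concatenation must reproduce s exactly
        idx, edges = 0, []
        for p in parts:
            edges.append((p, idx))
            idx += len(p)
        if s not in self.targets:
            self.intermediate.add(s)
        self.in_edges[s] = edges

    def out_degree(self, u):
        # number of concatenation operations (edges) in which u is used
        return sum(1 for edges in self.in_edges.values()
                   for (p, _) in edges if p == u)

    def satisfies_reuse(self):
        # constraint (1): every intermediate node has out-degree >= 2
        return all(self.out_degree(m) >= 2 for m in self.intermediate)
```

For example, building the target "abab" via the reused intermediate node "ab" yields a valid Lexis-DAG, since "ab" participates in two concatenations.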
The Lexis optimization problem is to construct a minimum-cost Lexis-DAG for the given alphabet S and target strings T. In other words, the problem is to determine the set of intermediate nodes V_M and all required edges E so that the corresponding Lexis-DAG is optimal in terms of a given cost function C(D):

min_{V_M, E} C(D)  subject to D = (V, E) being a valid Lexis-DAG for S and T.   (2)
The selection of an appropriate cost function is somewhat application-specific. A natural cost function to consider is the number of edges in the Lexis-DAG. In certain applications, such as DNA synthesis, the cost is usually measured in terms of the number of required concatenation operations. In the following, we consider both cost functions. Note that we choose not to explicitly minimize the number of intermediate nodes in V_M; minimizing the number of edges or concatenations, however, tends to also reduce the number of required intermediate nodes. Additionally, constraint (1) means that the optimal Lexis-DAG will not have redundant intermediate nodes that could be removed without increasing the concatenation or edge cost. More general cost formulations, such as a variable edge cost or a weighted average of a node cost and an edge cost, are interesting but not pursued in this paper.
Suppose that the cost of each edge is one. The edge cost to construct a node v is defined as the number of incoming edges required to construct v from its in-neighbors, which is equal to d_in(v). The edge cost of source nodes is obviously zero. The edge cost C_E(D) of a Lexis-DAG D is defined as the edge cost of all nodes, equal to the number of edges in D:

C_E(D) = Σ_{v ∈ V} d_in(v) = |E|.   (3)
Suppose that the cost of each concatenation operation is one. The concatenation cost to construct a node v is defined as the number of concatenations required to construct v from its in-neighbors, which is equal to d_in(v) − 1. The concatenation cost C_C(D) of a Lexis-DAG D is defined as the concatenation cost of all non-source nodes; it is easy to see that this is equal to the number of edges in D minus the number of non-source nodes:

C_C(D) = Σ_{v ∈ V∖V_S} (d_in(v) − 1) = |E| − |V_M| − |V_T|.   (4)
The proof is given in the Appendix. Note that the objective in Eq. (4) is an explicit function of the number of intermediate nodes in the Lexis-DAG. Hence the optimal solutions for the concatenation cost can be different from those for the edge cost. An example is shown in the Appendix.
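Under a simple representation in which each non-source node maps to the ordered list of node strings concatenated to form it, the two cost functions of Eqs. (3) and (4) can be sketched as follows (an illustrative reimplementation, not the paper's code):

```python
def edge_cost(in_edges):
    """Eq. (3): the total number of edges, i.e. the sum of in-degrees
    over all non-source nodes (sources contribute zero)."""
    return sum(len(parts) for parts in in_edges.values())

def concat_cost(in_edges):
    """Eq. (4): a node built from d_in parts needs d_in - 1 concatenations,
    so the total equals |E| minus the number of non-source nodes."""
    return sum(len(parts) - 1 for parts in in_edges.values())
```

For the toy DAG that builds "abab" from the reused node "ab", the edge cost is 4 while the concatenation cost is 2, matching |E| − |V_M| − |V_T| = 4 − 1 − 1.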
In this section, we describe a greedy algorithm, referred to as G-Lexis, for both of the previous optimization problems. The basic idea of G-Lexis is to search for the substring that will lead, under certain assumptions, to the maximum cost reduction when added as a new intermediate node in the Lexis-DAG. The algorithm starts from the trivial Lexis-DAG, which has no intermediate nodes and has edges from the source nodes representing alphabet symbols to each of their occurrences in the target strings.
Recall that for every node v, I(v) is the sequence of nodes appearing in the incoming edges of v, i.e., the sequence of nodes whose string concatenation results in the string represented by v. The sequences I(v) can be interpreted as strings over the "alphabet" of Lexis-DAG nodes. Note that every symbol in a string I(v) has a corresponding edge in the Lexis-DAG. We look for a repeated substring w in the strings I(v) that can be used to construct a new intermediate node. We construct a new intermediate node for w, create its incoming edges based on the symbols in w (recall that w is a string over the alphabet of nodes), and replace the incoming edges at each of the non-overlapping repeated occurrences of w with a single outgoing edge from the new node.
Consider the edge cost first. Suppose that w, of length l, is repeated r times in the strings I(v). If these occurrences of w are non-overlapping, the number of edges they require would be r·l. After we construct a new intermediate node for w as outlined above, the edge cost becomes l + r: l edges into the new node and r edges out of it. So, the reduction in edge cost from reusing w would be r·l − (l + r) = (r − 1)(l − 1) − 1. Under the stated assumptions about w, this reduction is non-negative if w is repeated at least twice and its length is at least two.
Consider the concatenation cost now. If the r occurrences of w are non-overlapping, the number of concatenations they require would be r·(l − 1). After we construct a new intermediate node for w as outlined above, the concatenation cost becomes l − 1, for the new node itself. We therefore expect a reduction in the number of required concatenations of r·(l − 1) − (l − 1) = (r − 1)(l − 1).
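As a quick sanity check on the arithmetic above, the two savings can be expressed as tiny helpers (hypothetical names, following the derivation):

```python
def saved_edge_cost(l, r):
    # r non-overlapping occurrences of a length-l substring cost r*l edges;
    # afterwards: l edges into the new node plus r edges out of it
    return r * l - (l + r)            # equals (r - 1) * (l - 1) - 1

def saved_concat_cost(l, r):
    # before: r * (l - 1) concatenations; after: l - 1 for the new node alone
    return r * (l - 1) - (l - 1)      # equals (r - 1) * (l - 1)
```

For the boundary case l = 2, r = 2 the edge saving is exactly zero while the concatenation saving is one, which is why a substring must be at least two symbols long and repeated at least twice to be worth extracting.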
So, the greedy choice for both cost functions is the same: select the substring w that maximizes the term SavedCost(w) = (r − 1)(l − 1). For this reason, our G-Lexis algorithm can be used for both cost functions we consider. It starts with the trivial Lexis-DAG, and at each iteration it chooses the substring of the strings I(v) in the Lexis-DAG that maximizes SavedCost, creates a new intermediate node for that substring, and updates the edges of the Lexis-DAG accordingly. The algorithm terminates when there are no more substrings with length at least two that are repeated at least twice. The pseudocode for G-Lexis is shown in Algorithm 1. An example of the application of the G-Lexis algorithm is shown in Fig. 2.
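The greedy loop can be sketched in a few lines. This toy version scans all substrings with a quadratic search instead of the suffix-tree machinery used in the actual implementation, and its symbol-naming convention (N1, N2, ...) is ours; ties in SavedCost are broken arbitrarily.

```python
def count_nonoverlap(seq, pat):
    """Left-to-right count of non-overlapping occurrences of pat in seq."""
    n, i, c = len(pat), 0, 0
    while i + n <= len(seq):
        if seq[i:i + n] == pat:
            c, i = c + 1, i + n
        else:
            i += 1
    return c

def best_repeat(seqs):
    """Substring (length >= 2, repeated >= 2 times) maximizing
    SavedCost = (r - 1) * (l - 1); None if no repeat qualifies."""
    cands = {tuple(s[i:i + l])
             for s in seqs
             for l in range(2, len(s) + 1)
             for i in range(len(s) - l + 1)}
    best, best_gain = None, 0
    for pat in cands:
        r = sum(count_nonoverlap(s, list(pat)) for s in seqs)
        gain = (r - 1) * (len(pat) - 1)
        if r >= 2 and gain > best_gain:
            best, best_gain = pat, gain
    return best

def g_lexis(targets):
    """targets: list of symbol lists. Returns {new_symbol: definition},
    where each definition is the I(v) sequence of the new node."""
    seqs = [list(t) for t in targets]
    rules = {}
    while True:
        pat = best_repeat(seqs + list(rules.values()))
        if pat is None:
            return rules
        sym = "N%d" % (len(rules) + 1)      # fresh node name (our convention)
        def replace(seq):
            out, i, n = [], 0, len(pat)
            while i < len(seq):
                if tuple(seq[i:i + n]) == pat:
                    out.append(sym)
                    i += n
                else:
                    out.append(seq[i])
                    i += 1
            return out
        for s in list(rules):               # rewrite existing node definitions
            rules[s] = replace(rules[s])
        rules[sym] = list(pat)
        seqs = [replace(s) for s in seqs]   # rewrite the target definitions
```

On the targets "ababab" and "abab", the sketch first extracts "ab" (r = 5, SavedCost 4) and then the pair of "ab" nodes (SavedCost 1), producing a two-level hierarchy.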
At each iteration of G-Lexis, we need to efficiently find the substring of the strings I(v) with maximum SavedCost. We observe that the substring that maximizes SavedCost is a "maximal repeat". Maximal repeats are substrings of length at least two whose extension to the right or left would reduce their number of occurrences in the given set of strings. To see this, suppose the substring w that maximizes SavedCost is not a maximal repeat. Then we can extend w to the left or right, increasing its length without reducing its number of occurrences. By doing so, we construct a new substring with higher SavedCost than w, a contradiction. So, the substring that maximizes SavedCost is a maximal repeat. A suffix tree over a set of input strings captures all right-maximal repeats, and right-maximal repeats are a superset of all maximal repeats [11]. To pick the one with maximum SavedCost, we need the count of non-overlapping occurrences of these substrings. A Minimal Augmented Suffix Tree [5] over the strings I(v) can be constructed and used to count the number of non-overlapping occurrences of all right-maximal repeats in O(L log L) overall time, where L is the total length of the target strings. Using a regular suffix tree instead, this can be achieved in only O(L) time; a regular suffix tree, however, may count overlapping occurrences. In our implementation we use a regular suffix tree, following related work [10] which has shown that this performance optimization has negligible impact on solution quality. So, the substring chosen for the new Lexis-DAG node is based on its length and its possibly-overlapping occurrence count. We then use the suffix tree to iterate over all occurrences of the selected substring, skipping overlapping occurrences. If a selected substring has fewer than two non-overlapping occurrences, we skip to the next best substring. Using the suffix tree, we can update the Lexis-DAG with the new intermediate node, and with the corresponding edges for all occurrences of that substring, in O(L) time.
The maximum number of iterations of G-Lexis is L, because each iteration reduces the number of edges (or concatenations), which at the start is at most L. So, the overall runtime complexity using a regular suffix tree is O(L²).
We have also experimented with other algorithms, such as a greedy heuristic that selects the longest repeat in each iteration of building the DAG, i.e., it chooses based on length among all substrings that appear at least twice in the targets or intermediate node strings. This heuristic can be implemented to run in only O(L) time [13]. Our evaluation shows that G-Lexis performs significantly better than the longest-repeat heuristic in terms of solution quality, despite some running-time overhead. Running both algorithms on a machine with an Intel Core i7 2.9 GHz CPU and 16 GB of RAM on the NSF abstracts dataset (introduced in Section 5) takes 562 seconds for G-Lexis and 408 seconds for the longest-repeat algorithm. The edge cost with G-Lexis is 169,060, compared to 183,961 with the longest-repeat algorithm. More detailed results can be found in the Appendix.
After constructing a Lexis-DAG, an important question is how to rank the constructed intermediate nodes in terms of significance or centrality. Even though there are many related metrics in the network analysis literature, such as closeness, betweenness or eigenvector centrality [22], none of them captures well the semantics of a Lexis-DAG. In a Lexis-DAG, a path that starts from a source and terminates at a target represents a dependency chain in which each node depends on all previous nodes in that path. So, the higher the number of such source-to-target paths traversing an intermediate node v, the more important v is in terms of the number of dependency chains it participates in. More formally, let P(v) be the number of source-to-target paths that traverse node v; we refer to P(v) as the path centrality of intermediate node v. The path centrality of sources and targets is zero by definition. First, note that:

P(v) = P_S(v) · P_T(v)   (5)
where P_S(v) is the number of paths from any source to v, and P_T(v) is the number of paths from v to any target. This suggests an efficient way to calculate the path centrality of all nodes in a Lexis-DAG in O(|V| + |E|) time: perform two DFS traversals, one starting from the sources and following the direction of the edges, and another starting from the targets and following the opposite direction. The first DFS traversal recursively produces P_S(v), while the second produces P_T(v), for all intermediate nodes.
Second, it is easy to see that P_T(v) is equal to the number of times the string s(v) is used for replacement in the target strings T. Similarly, P_S(v) is equal to the number of times a source node is repeated in s(v), which is simply the length of s(v). So, the path centrality of a node in a Lexis-DAG can also be interpreted as its "reuse count" (the number of replaced occurrences in the targets) times its length. Thus, an intermediate node ranks highly in terms of path centrality if it is both long and frequently reused.
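Putting Eq. (5) and the two traversals together, path centrality can be sketched as follows; this is our own compact reimplementation over a parent-list representation in_edges[v] (the I(v) sequence of each non-source node), not the paper's code.

```python
def path_centrality(in_edges, sources, targets):
    """Returns {v: P(v)} for intermediate nodes, with P(v) = P_S(v) * P_T(v)."""
    order, seen = [], set()
    def visit(v):                        # DFS producing a topological order
        if v not in seen:
            seen.add(v)
            for u in in_edges.get(v, []):
                visit(u)
            order.append(v)              # parents appear before v
    for t in targets:
        visit(t)

    ps = {}                              # P_S: number of source-to-v paths
    for v in order:
        ps[v] = 1 if v in sources else sum(ps[u] for u in in_edges.get(v, []))

    children = {v: [] for v in order}    # downward adjacency, with multiplicity
    for v in order:
        for u in in_edges.get(v, []):
            children[u].append(v)

    pt = {}                              # P_T: number of v-to-target paths
    for v in reversed(order):
        pt[v] = 1 if v in targets else sum(pt[w] for w in children[v])

    return {v: ps[v] * pt[v] for v in order
            if v not in sources and v not in targets}
```

For the toy DAG building "abab" from "ab", the node "ab" has P_S = 2 (its length) and P_T = 2 (its reuse count), so P = 4, consistent with the "length times reuse" interpretation above.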
An important follow-up question is to identify the core of a Lexis-DAG, i.e., a set of intermediate nodes that represents, as a whole, the most important substrings in that Lexis-DAG. Intuitively, we expect that the core should include nodes of high path centrality, and that almost all source-to-target dependency chains of the Lexis-DAG should traverse at least one of these core nodes.
More formally, suppose X ⊆ V_M is a set of intermediate nodes and P_{D−X} is the number of source-to-target paths that remain after we remove the nodes in X from D. The core of D is defined as the minimum-cardinality set of intermediate nodes X such that the fraction of remaining source-to-target paths after the removal of X is at most a threshold τ:

Core(τ) = arg min_{X ⊆ V_M} |X|  subject to  P_{D−X} / P_D ≤ τ   (6)

where P_D is the number of source-to-target paths in the original Lexis-DAG, without removing any nodes. (It is easy to see that P_D is equal to the cumulative length of all target strings in T.)
Note that if τ = 0, the core identification problem becomes equivalent to finding the min-vertex-cut of the given Lexis-DAG. In practice, a Lexis-DAG often includes some tendril-like source-to-target paths that traverse a small number of intermediate nodes which few other paths traverse. These paths can cause a large increase in the size of the core. For this reason, we prefer to consider a positive, but potentially small, value of the threshold τ.
We solve the core identification problem with a greedy algorithm referred to as G-Core. In each iteration, this algorithm adds the node with the highest path-centrality value to the core set, updates the Lexis-DAG by removing that node and its edges, and recomputes the path centralities before the next iteration. The algorithm terminates when the desired fraction of source-to-target paths has been removed. G-Core requires at most |V_M| iterations, and in each iteration we update the path centralities in O(|V| + |E|) time. So the runtime complexity of G-Core is O(|V_M| · (|V| + |E|)).
We now discuss a variety of applications of the proposed framework. In all experiments, we use the library from [10] for extracting maximal repeats and NetworkX [12] as the graph library in our implementation.
Lexis can be used as an optimization tool for the hierarchical synthesis of sequences. One such application comes from synthetic biology, where novel DNA sequences are created by concatenating existing DNA sequences in a hierarchical process [7]. The cost of DNA synthesis is considerable today due to the biochemical operations that are required to perform this “genetic merging” [7, 4]. Hence, it is desirable to reuse existing DNA sequences, and more generally, to perform this design process in an efficient hierarchical manner.
Biologists have created a library of synthetic DNA sequences, referred to as iGEM [2]. The library provides elementary "BioBrick parts" from which longer composite sequences can be created. Longer sequences are submitted to the Registry of Standard Biological Parts in the annual iGEM competition, and are then functionally evaluated and labeled. In the following, we analyze a subset of the iGEM dataset. In particular, this dataset contains the composite DNA sequences that are labeled as iGEM devices because they have distinct biological functions; we treat these sequences as Lexis targets. The elementary BioBrick parts are treated as the Lexis sources. The iGEM dataset also includes other BioBrick parts that are neither devices nor elementary, and that have been used to construct more complex parts in iGEM; we ignore those because they do not have a distinct biological function (i.e., they should not be viewed as targets but as intermediate sequences that different teams of biologists have previously constructed).
We constructed an optimized Lexis-DAG for the given sets of iGEM sources and targets. To quantify the gain that results from using a hierarchical synthesis process, we compare the number of edges and concatenations in the Lexis-DAG against a flat synthesis process in which each target is independently constructed from the required sources. The Lexis solution requires only 52% of the edges (and 56% of the concatenations) that the flat process would require. The sequence with the highest path centrality in the Lexis-DAG is B0010-B0012 (BioBrick names start with the BBa_ prefix, which is omitted here). This part is registered as B0015 in the iGEM library and is the most common "terminator" in iGEM devices. Lexis identified several more high-centrality parts that are already in iGEM, such as B0032-E0040-B0010-B0012, registered as E0240. Interestingly, however, the Lexis-DAG also includes some high-centrality parts that have not been registered in iGEM yet, such as B0034-C0062-B0010-B0012-R0062. A list of the top 15 nodes in terms of path centrality is given in the Appendix.
To explore the hierarchical nature of the iGEM sequences, we compared the "Original" Lexis-DAG, constructed with the actual iGEM devices as targets, against "Randomized" Lexis-DAGs. A Randomized Lexis-DAG is the result of applying G-Lexis to a target set in which each iGEM device sequence has been randomly reshuffled. We compare the Original Lexis-DAG characteristics to the average characteristics over ten randomized experiments. The Original Lexis-DAG has fewer intermediate nodes than the Randomized ones (169 in the Original vs 359 in the Randomized), and its depth is about twice as large (8 vs 4.4). Importantly, the Randomized DAGs are significantly more costly: 44% higher cost in terms of edges and 52% in terms of concatenations.
To further understand these differences from a topological perspective, Fig. 3 shows scatter plots of the length, path centrality, and reuse (number of replacements) of each intermediate node in the Original Lexis-DAG vs one of the Randomized Lexis-DAGs. With randomized targets, the intermediate nodes are short (mostly 2-3 symbols), their reuse is roughly equal to their out-degree, and their path centrality is determined by their out-degree; in other words, most intermediate nodes are directly connected to the targets that include them, and the most central nodes are those with the highest number of such edges. In contrast, with the original targets we find longer intermediate nodes (up to 11-12 symbols), and their number of replacements in the targets can be up to an order of magnitude higher than their out-degree. This happens when intermediate nodes with a large number of replacements are not only used directly to construct targets but are also repeatedly combined to construct longer intermediate nodes, creating a deeper hierarchy of reuse. In this case, the high path-centrality nodes tend to be those that are both relatively long and common, achieving a good trade-off between specificity and generality.
As mentioned in the introduction, it is often the case that the hierarchical process that created the observed sequences is unknown. Lexis can be used to discover the underlying hierarchical structure as long as we have reasons to believe that this process aims to minimize, even heuristically, the same cost function that Lexis considers (i.e., the number of edges or concatenations). A related reason to apply Lexis in the analysis of sequential data is to identify the most parsimonious way, in terms of the number of edges or concatenations, to represent the given sequences hierarchically. Even though this representation may not be related to the process that generated the given targets, it can reveal whether the given data have an inherent hierarchical structure.
As an illustration of this process, we apply Lexis to a set of protein sequences. Even though it is well known that such sequences include conserved and repeated subsequences (such as motifs) of various lengths, it is not currently known whether these repeats form a hierarchical structure. That would be the case if one or more short conserved sequences are often combined to form longer conserved sequences, which can themselves be combined with others to form even longer sequences, and so on. If we accept the premise that a conserved sequence serves a distinct biological function, the discovery of hierarchical structure in protein sequences would suggest that elementary biological functions are combined in a Lego-like manner to construct the complexity and diversity of the proteome. In other words, the presence of hierarchical structure would suggest that proteins satisfy, at least to a certain extent, the composability principle, meaning that the function of each protein is composed of, and can be understood through, the simpler functions of its hierarchical components.
Our dataset is the proteome of baker's yeast (http://www.uniprot.org/proteomes/UP000002311), which consists of 6,721 proteins. However, this set includes many protein homologues. It is important to replace each cluster of homologues with a single protein; otherwise Lexis would detect repeated sequences shared by two or more homologues. To remedy this issue, we use the UCLUST sequence clustering tool [1], which is based on the USEARCH similarity measure (or identity search) [9]. The Percentage of Identity (PID) parameter controls how similar two sequences should be to be assigned to the same cluster. We set PID to 50%, which reduces the number of proteins to 6,033. Much higher PID values do not cluster together some obvious homologues, while lower PID values are too restrictive (see http://drive5.com/usearch/manual/uclust_algo.html). To reduce the running time of the randomization experiments described next, we randomly sample 1,500 proteins from the output of UCLUST.
The total length of the protein targets is about 344K amino acids. The resulting Lexis-DAG has about 151K edges and 5,171 intermediate nodes, and its maximum depth is 7. Fig. 4(a) shows a scatter plot of the length and number of replacements of these intermediate Lexis nodes (the repeated sequences discovered by Lexis).
Of course, some of these sequences may not have any biological significance, because their length and number of replacements may be so low that they are likely to occur just by chance. For instance, a sequence of two amino acids that is repeated just twice in a sequence of thousands of amino acids is almost certain to occur (the distribution of amino acids is not very skewed). To filter out the sequences that are not statistically significant, we rely on the following hypothesis test. Consider a node that corresponds to a sequence with length l and number of replacements r in the given targets. The null hypothesis is that sequences with these values of l and r will occur in a Lexis-DAG that is constructed for a random permutation of the given targets. To evaluate this hypothesis, we randomize the given target sequences multiple times and construct a Lexis-DAG for each randomized sequence. We then estimate the probability that sequences of length l and number of replacements r occur in the randomized-target Lexis-DAGs, as the fraction of 500 experiments in which this is true. For a given significance level α, we can then identify the (l, r) pairs for which we can reject the null hypothesis; these pairs correspond to the nodes that we view as statistically significant. (Another way to conduct this hypothesis test would be to estimate the probability that a specific sequence of length l will be repeated r times in a permutation of the targets. The number of randomization experiments would need to be much higher in that case, however, to cover all sequences that we see in the actual Lexis-DAG, each with a given value of r.) On average, the randomized-target Lexis-DAGs have a smaller depth and more edges than the original Lexis-DAG. Fig. 4(b) shows the intermediate nodes of the original Lexis-DAG that pass the previous hypothesis test.
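A simplified version of this permutation test can be sketched as follows. Note that this is only an approximation of the procedure above: instead of rebuilding a Lexis-DAG per shuffle, it directly checks whether any length-l substring repeats at least r times (non-overlapping) in the shuffled text, and all names are ours.

```python
import random

def max_nonoverlap_repeats(text, l):
    """Largest count of non-overlapping occurrences of any length-l substring."""
    best = 0
    for i in range(len(text) - l + 1):
        pat = text[i:i + l]
        c, j = 0, 0
        while j + l <= len(text):       # left-to-right non-overlapping count
            if text[j:j + l] == pat:
                c, j = c + 1, j + l
            else:
                j += 1
        best = max(best, c)
    return best

def p_value(text, l, r, trials=100, seed=0):
    """Fraction of random shuffles of `text` in which some length-l
    substring is repeated at least r times (non-overlapping)."""
    rng = random.Random(seed)
    chars, hits = list(text), 0
    for _ in range(trials):
        rng.shuffle(chars)
        if max_nonoverlap_repeats("".join(chars), l) >= r:
            hits += 1
    return hits / trials
```

For a strongly periodic input such as four copies of an 8-symbol block, the observed (l = 8, r = 4) pair essentially never survives shuffling, so the null hypothesis is rejected at any reasonable significance level.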
Fig. 5 shows a small subgraph of the Lexis-DAG, containing only about 30 intermediate nodes; all of these nodes have passed the previous significance test. The grey lines represent indirect connections, going through nodes that have not passed the significance test (not shown), while the bold lines represent direct connections. Interestingly, there seems to be a non-trivial hierarchical structure with several long paths, and with sequences of several amino acids that repeat several times even in this relatively small sample of proteins. Despite these preliminary results, it is still too early to draw any firm conclusions about the presence of hierarchical structure in protein sequences. We plan to pursue this question further using Lexis, in collaboration with domain experts.
Recent work has highlighted the connection between pattern mining and sequence compressibility [16]. Data compression looks for regularities that can be used to compress the data, and patterns often serve as such regularities. In dictionary-based lossless compression, the original sequence is encoded with the help of a dictionary, which is a subset of the sequence's substrings, together with a series of pointers indicating the location(s) at which each dictionary element should be placed to fully reconstruct the original sequence. Following the Minimum Description Length principle, one strives for a compression scheme that results in the smallest size for the joint representation of both the dictionary and the encoding of the data using that dictionary. The size of this joint representation is the total space needed to store the dictionary entries plus the total space needed for the required pointers. We assume for simplicity that the space needed to store an individual character and a pointer is the same.
We now evaluate the use of a LexisDAG for compression (or compact representation) of strings. To do so, we need to decide 1) how to choose the patterns that will be used for compression, and 2) if a pattern appears more than once, which occurrences of that pattern to replace. A naive approach is to simply use the set of substrings that appear in the LexisDAG as dictionary entries, and compare them to sets of patterns found by other substring mining algorithms. Given a set of patterns as potential dictionary entries, selecting the best dictionary and pointer placement is NP-hard. A simple greedy compression scheme, which we refer to as CompressLR, is to iteratively add to the dictionary the substring that gives the highest compression gain when replacing all left-to-right non-overlapping occurrences of that substring with pointers. We re-evaluate the compression gain of candidate dictionary entries in each iteration. For a substring of length l with r left-to-right non-overlapping occurrences, the compression gain is:

g = r·l − (r + l)    (7)

since replacing the occurrences saves r·l characters, at the cost of r pointers plus the l characters needed to store the new dictionary entry.
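As a concrete illustration, the following Python sketch implements a CompressLR-style loop over the gain formula above. The pointer representation, the cap on candidate length, and all function names are our own assumptions, not the authors' implementation; in particular this sketch compresses only the target sequence and does not recursively compress dictionary entries.

```python
from itertools import count

def lr_occurrences(seq, pat):
    """Count left-to-right non-overlapping occurrences of pat in seq."""
    hits, i, n, m = 0, 0, len(seq), len(pat)
    while i <= n - m:
        if seq[i:i + m] == pat:
            hits += 1
            i += m
        else:
            i += 1
    return hits

def compress_lr(string, max_len=8):
    """Greedy CompressLR sketch: repeatedly add to the dictionary the substring
    (length >= 2) with the highest compression gain r*l - (r + l), replacing its
    left-to-right non-overlapping occurrences with a fresh pointer symbol.
    Gains are re-evaluated in every iteration, as in the text."""
    seq = tuple(string)          # work over tuples so pointers can be any symbol
    dictionary = []              # list of (pointer, replaced substring) pairs
    ptr_ids = count()
    while True:
        best, best_gain = None, 0
        seen = set()
        for l in range(2, min(max_len, len(seq)) + 1):
            for i in range(len(seq) - l + 1):
                pat = seq[i:i + l]
                if pat in seen:
                    continue
                seen.add(pat)
                r = lr_occurrences(seq, pat)
                gain = r * l - (r + l)   # Eq. (7): saved chars minus pointer/entry cost
                if gain > best_gain:
                    best, best_gain = pat, gain
        if best is None:         # no substring yields a positive gain
            break
        ptr = ("ptr", next(ptr_ids))
        dictionary.append((ptr, best))
        out, i = [], 0
        while i < len(seq):      # left-to-right replacement with the new pointer
            if seq[i:i + len(best)] == best:
                out.append(ptr)
                i += len(best)
            else:
                out.append(seq[i])
                i += 1
        seq = tuple(out)
    size = len(seq) + sum(len(p) for _, p in dictionary)  # encoded seq + dictionary
    return seq, dictionary, size
```

On the toy input "abababab", the first iteration picks "ab" (r = 4, l = 2, gain 2) and the loop then stops, yielding a compressed size of 6 (four pointers plus a two-character dictionary entry) versus the original 8 characters.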
We compare the substrings identified by Lexis with the substrings generated by a recent contiguous pattern mining algorithm called ConSgen [27] (we could only run it on the smallest available dataset). Additionally, we compare the Lexis substrings with the set of patterns containing all 2-grams and 3-grams of the data. The comparisons are performed on six sequence datasets: the "Yeast" and iGEM datasets of the previous sections, as well as four "NSF CS awards" datasets that will be described in more detail in the next section.
Table 1 shows the comparison results under the headings: Lexis-CompressLR, 2+3grams-CompressLR and ConSgen-CompressLR. These naive approaches are all on par with each other. This comparison, however, treats G-Lexis as a mere pattern mining algorithm. Instead, the G-Lexis algorithm constructs a LexisDAG that puts the generated patterns in a hierarchical context. One can think of the LexisDAG as the instructions for constructing a hierarchical "Lego-like" sequence. The edges into the targets tell us how to place the final pointers, i.e., which occurrences of a dictionary entry to replace in the targets. Further, the rest of the DAG shows how to compress the patterns that appear in the targets using smaller patterns. It is easy to see that with this strategy the compressed size becomes equal to the number of edges in the DAG. Using the strategy encoded in the LexisDAG results in an additional 2% to 20% reduction in the compressed size over the CompressLR approaches.
Table 1: Compression comparison ("—" means ConSgen could not be run on that dataset).

Dataset   | Lexis-DAG | Lexis-CompressLR | 2+3gram-CompressLR | ConSgen-CompressLR
----------|-----------|------------------|--------------------|-------------------
Know&Cog  | 68.69     | 76.58            | 77.13              | —
Networks  | 78.48     | 86.47            | 86.43              | —
Robotics  | 73.19     | 80.62            | 79.69              | —
Theory    | 79.41     | 81.89            | 82.63              | —
Yeast     | 44.28     | 51.08            | 50.71              | —
iGEM      | 47.86     | 67.47            | 67.75              | 67.47
The LexisDAG can also be used to extract machine learning features for sequential data. The intermediate nodes that form the core of a LexisDAG, in particular, correspond to sequences that are both relatively long and frequently reused in the targets. We hypothesize that such sequences will be good features for machine learning tasks such as classification or clustering because they can discriminate different classes of objects (due to their longer length) and at the same time they are general within the same class of objects (due to their frequent occurrence in the targets of that class).
To test this hypothesis, we used Lexis to extract text features for four classes of NSF research award abstracts during the 1990–2003 time period.^8 We preprocessed each award's abstract through stop-word removal and Porter stemming. The alphabet is the set of word stems that appear at least once in any of these abstracts. Table 2 describes this dataset in terms of number of abstracts, cumulative abstract length, and average length per abstract for each class.
^8 archive.ics.uci.edu/ml/machinelearningdatabases/nsfabsmld/nsfawards.data.html
Table 2: NSF award abstract datasets.

Class               | # Abstracts | Total length | Avg. length | Alphabet size
--------------------|-------------|--------------|-------------|--------------
Knowledge & Cog Sci | 411         | 47,858       | 116         | 2,902
Networks            | 836         | 74,738       | 89          | 3,730
Robotics            | 496         | 56,481       | 113         | 4,560
Theory              | 566         | 66,891       | 118         | 4,247
We constructed the LexisDAG for each class of abstracts, and then used the G-Core algorithm to identify the core of each DAG. We stopped G-Core at the point where a given fraction of the indirect paths in the LexisDAG is covered. The strings in each core are the extracted features for the corresponding class of abstracts. Table 3 shows the 5 strings extracted by G-Core for each class. We create a common set of G-Core features by taking the union of the sets of core substrings derived for each class. The next step is to construct the feature vector for each abstract. We do so by representing each abstract as a vector of counts, with a count for each substring feature.
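The count-vector construction can be sketched as follows. This is a minimal illustration with hypothetical function names: each abstract is a list of word stems, and each feature is a tuple of stems taken from the union of the per-class G-Core outputs.

```python
def count_feature(tokens, feature):
    """Count (possibly overlapping) occurrences of a substring feature,
    given as a tuple of word stems, in a tokenized abstract."""
    m = len(feature)
    return sum(1 for i in range(len(tokens) - m + 1)
               if tuple(tokens[i:i + m]) == feature)

def feature_vectors(abstracts, features):
    """Represent each tokenized abstract as a vector of counts, one entry per
    core-substring feature (the union of the per-class G-Core outputs)."""
    return [[count_feature(doc, f) for f in features] for doc in abstracts]
```

For example, with features ("machine", "learn") and ("neural", "network"), an abstract containing "machine learn" twice maps to the vector [2, 0].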
Table 3: The 5 core substrings extracted by G-Core for each class.

Knowledge & Cog    | Networks
-------------------|----------------------------
machine learn      | request support nsf connect
knowledg base      | bit per second
natur languag      | two year
artifici intellig  | provide partial support
neural network     | high perform

Robotics           | Theory
-------------------|----------------------------
comput vision      | comput scienc
first year         | real world
robot system       | complex class
object recognit    | complex theori
real time          | approxim algorithm
Table 4: Classification comparison.

Method (# features)   | # Nonzeros | SVM params | Accuracy
----------------------|------------|------------|---------
Bag-of-words (9.7k)   | 190.2k     | (0.0015, 3)| 88.3%
G-Core (14.4k)        | 55.5k      | (0.02, 1)  | 90.0%
2-gram (124.9k)       | 228.0k     | (0.0015, 3)| 91.3%
3-gram (186.3k)       | 234.5k     | (0.001, 1) | 75.8%
1+2-gram (134.6k)     | 418.2k     | (0.001, 1) | 90.9%
1+2+3-gram (321.0k)   | 652.8k     | (0.001, 1) | 89.2%
To assess how good these features are, we compare the classification accuracy obtained using the Lexis features against more mainstream text-mining representations on the NSF data: "bag-of-words", 2-gram, 3-gram, and two combinations of these representations. We use a basic SVM classifier with an RBF kernel, using the SVM implementation in MATLAB. The accuracy results are similar to those of a KNN classifier with cosine distance that we also tried, and accuracy is evaluated with 10-fold cross-validation.
Table 4 shows that the Lexis features result in a much sparser term-document matrix, and thus in smaller data overhead in learning tasks, without sacrificing classification accuracy. Lexis also results in a lower feature dimensionality (with the exception of the 1-gram method, but the accuracy of that method is much lower). Note that most Lexis features are 2-grams, but the Lexis core (for the chosen path-coverage threshold) may also include longer grams. Lexis becomes better, relative to the other feature sets, as we decrease the number of considered features. For instance, with a small number of features the accuracy with Lexis is 74%, while the accuracy with the 1-gram and 2-gram features is 69% and 64%, respectively.
Lexis is closely related to the Smallest Grammar Problem (SGP), which focuses on the following question: what is the smallest context-free grammar that generates only a given string? The constraint that the grammar should generate only one string is important because otherwise we could simply use a trivial grammar that generates every string over the alphabet. The SGP is NP-hard, and hard to approximate beyond a small constant ratio [6]. Algorithms for the SGP have been used for string compression [18] and structure discovery [21] (for a survey see [10]). There are major differences between Lexis and the SGP. First, in Lexis we are given a set of several target strings, not only one string. Second, Lexis infers a network representation (a DAG) instead of a grammar, and so it is a natural tool for questions relating to the hierarchical structure of the targets. For instance, the centrality analysis of intermediate nodes and the core identification question are well-understood problems in network analysis, while it is not obvious how to approach them with a grammar-based representation.
One can also relate Lexis to the body of work on sequence pattern mining, where one is interested in discovering frequent or interesting patterns in a given sequence. Most work in this area has focused on mining subsequences, i.e., sets of ordered but not necessarily adjacent characters from the given sequence. In Lexis, we focus on identifying substrings, also known as contiguous sequence patterns. A couple of recent papers develop algorithms for mining substring patterns [27, 26], since sequence mining algorithms do not readily apply to the contiguous case. However, they rely on the methodology of candidate generation (commonly used in sequence pattern mining), where all patterns meeting a certain criterion are found, such as having a frequency of at least two or being maximal. In the sequence mining literature, it has recently been observed that the size of the discovered set of patterns, as well as its redundancy, can be better controlled by mining for a set of patterns that meet a criterion collectively, as opposed to individually. This is useful when these patterns are used as features in other tasks such as summarization or classification. Algorithms for such set-based pattern discovery have recently been developed for sequence pattern mining [16, 25]. In the context of substring pattern mining, Paskov et al. [24] show how to identify a set of patterns with optimal lossless compression cost in an unsupervised setting, to be then used as features in supervised learning for classification. In a follow-up paper [23], DRACULA provides a "deep variant" of [24] that is similar to Lexis in terms of the general problem setup. DRACULA's focus is mostly on complexity and learning aspects of the problem, while Lexis focuses on network analysis of the resulting optimized hierarchy. For instance, DRACULA considers how to take into account how dictionary strings are constructed to regularize learning problems, and how the optimal DRACULA solution behaves as the cost varies. We have shown that, although not specifically designed for feature extraction or compression, the Lexis framework also results in a small and non-redundant set of substring patterns that can be used in classification and compression tasks.

Optimal DNA synthesis is a new application domain, and we are only aware of the work by Blakes et al. [4]; they describe DNALD, an algorithm that greedily attempts to maximize DNA reuse for multistage assembly of DNA libraries with shared intermediates. Even though the Lexis framework was not specifically designed for DNA synthesis, LexisDAGs can be seamlessly used as solutions for this task. In our illustrative example with the iGEM dataset, G-Lexis returns solutions with 11% lower synthesis cost (equivalent to concatenation cost) than DNALD.
Structure discovery in strings has been explored from several different perspectives. For example, the grammar-based algorithm SEQUITUR [21] has interesting applications in natural-language and musical structure identification. In an information-theoretic context, Lanctot et al. [17] show how to distinguish between coding and non-coding regions by analyzing the hierarchical structure of genomic sequences.
Lexis is a novel optimization-based framework for exploring the hierarchical nature of sequence data. In this paper, we stated the corresponding optimization problems in the admittedly limited context of two simple cost functions (number of edges and number of concatenations), proved their NP-hardness, and proposed a greedy algorithm for the construction of LexisDAGs. This research can be extended in the direction of more general cost formulations and more efficient algorithms. Additionally, we are working on an incremental version of Lexis in which new targets are added to an existing LexisDAG without redesigning the hierarchy.
We also applied network analysis to LexisDAGs, proposing a new centrality metric that can be used to identify the most important intermediate nodes, corresponding to substrings that are both relatively long and frequently occurring in the target sequences. This network analysis connection raises an interesting question: are there certain topological properties that are common across LexisDAGs that have resulted from long-running evolutionary processes? We have some evidence that one such topological property is that these DAGs exhibit the hourglass effect [3].
Finally, we gave four examples of how Lexis can be used in practice, applied in optimized hierarchical synthesis, structure discovery, compression and feature extraction. In future work, we plan to apply Lexis in various domainspecific problems. In biology, in particular, we can use Lexis in comparing the hierarchical structure of protein sequences between healthy and cancerous tissues. Another related application is generalized phylogeny inference, considering that horizontal gene transfers (which are common in viruses and bacteria) result in DAGlike phylogenies (as opposed to trees).
This research was supported by the National Science Foundation under Grant No. 1319549.
We prove that the Lexis problem is NP-hard through a reduction from the Smallest Grammar Problem (SGP) [6].
Formally, the Smallest Grammar Problem for a string s is to identify a straight-line grammar (SLG) G* such that L(G*) = {s} and |G*| <= |G| for any other SLG G with L(G) = {s}, where |G| denotes the size of grammar G. Charikar et al. [6] define the size of a grammar as the cumulative length of the right-hand sides of all rules, i.e., |G| = sum_i |R_i|, where |R_i| is the number of symbols appearing in the right-hand side of the i-th grammar rule. Under this definition of grammar size, Charikar et al. show that the Smallest Grammar Problem is NP-hard [6].
Theorem 1. The Lexis problem in Eq. (2) is NP-hard when the cost function is either the edge cost (Eq. (3)) or the concatenation cost (Eq. (4)).
Proof: Let us first start with the edge cost. Consider an instance of the SGP in which we are given a string s and we are asked to compute an SLG G such that L(G) = {s} and |G| <= k, for a given bound k. We reduce it to an instance of the Lexis problem with a single target string s, in which we are asked to compute a LexisDAG D with edge cost E(D) <= k.
Given a grammar G as a solution to the SGP instance, we construct a solution D for the reduced Lexis problem. For each symbol in G, construct a node in D. For a nonterminal v, we refer to the corresponding node also as v, and associate that node with the string that is produced by expanding v according to grammar G. Also, for each rule in G, we scan its right-hand side and add an edge in D from every node that corresponds to a terminal or nonterminal in the right-hand side to the node that corresponds to the rule's left-hand side (along with the corresponding index). It is easy to see that D is acyclic, since G is a straight-line grammar, and that the number of edges in D is |G|.
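The forward construction above can be made concrete with a short sketch. We assume (as an illustration only) that the grammar is given as a mapping from each nonterminal to its right-hand-side symbol list, and we emit one DAG edge per right-hand-side symbol, so the edge count equals the grammar size |G|.

```python
def grammar_to_dag_edges(rules):
    """Reduction sketch: given a straight-line grammar as a mapping
    {nonterminal: right-hand-side symbol list}, emit one LexisDAG edge
    (source symbol, rule head, position index) per right-hand-side symbol.
    The number of edges then equals the grammar size |G| = sum_i |R_i|."""
    edges = []
    for head, rhs in rules.items():
        for idx, sym in enumerate(rhs):
            edges.append((sym, head, idx))
    return edges
```

For the grammar S -> AAb, A -> ab, the construction yields 5 edges, matching |G| = 3 + 2 = 5.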
Conversely, consider a LexisDAG D which is a solution to the Lexis problem instance from our reduction above, i.e., it has a single target string s and E(D) <= k. We can construct a corresponding grammar G for the SGP as follows. For each source node in D, construct a terminal in G. For each intermediate node in D, construct a nonterminal in G. For the single target node in D, designate the corresponding nonterminal as the start symbol. For each non-source node v in D, add a rule in G whose right-hand side lists the corresponding symbols of all nodes with edges into v (ordered as their respective strings appear concatenated in the string of v). The constructed grammar is straight-line, since every nonterminal has exactly one rule associated with it, and the grammar is also acyclic because the LexisDAG is acyclic. It is easy to see that |G| = E(D).
The NP-hardness proof of the Smallest Grammar Problem with grammar size defined as |G| = sum_i |R_i| [6] can be adapted for a modified grammar size definition, i.e., |G| = sum_i (|R_i| - 1), which counts the number of concatenations in each rule. We can then use the same reduction from the SGP to Lexis as in the case of edge cost to show that Lexis with concatenation cost is also NP-hard.
Fig. 6 illustrates an example in which the LexisDAG obtained by optimizing the edge cost differs from the LexisDAG obtained by optimizing the concatenation cost.
We compare G-Lexis with an algorithm that greedily replaces the longest repeated substring, in terms of both runtime and cost. We implemented the latter, originally proposed in [13] using suffix trees, with our own efficient linked suffix array. We used the NSF data described in the main text and ran the two algorithms on different fractions of the total dataset, repeating the experiments 10 times and recording the average runtime and edge cost. As seen in Fig. 7, the longest-substring-replacement heuristic offers a better runtime, but its cost becomes increasingly worse than that of G-Lexis as the size of the dataset grows. Also, G-Lexis is still reasonably fast on all datasets we have analyzed so far.
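The core primitive of the competing heuristic, finding a longest repeated substring, can be sketched without suffix structures via binary search on the repeat length. This simple stand-in (our own, not the suffix-array implementation described above) exploits the fact that "some length-k substring repeats" is monotone in k.

```python
def longest_repeated_substring(s):
    """Find a longest substring occurring at least twice (overlaps allowed),
    by binary search on the length with a set of k-grams per probe.
    A simple stand-in for suffix-tree/suffix-array implementations."""
    def repeated(k):
        # return some substring of length k that occurs twice, or None
        seen = set()
        for i in range(len(s) - k + 1):
            sub = s[i:i + k]
            if sub in seen:
                return sub
            seen.add(sub)
        return None

    lo, hi, best = 1, len(s) - 1, ""
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = repeated(mid)
        if hit is not None:
            best, lo = hit, mid + 1   # a repeat exists: try longer lengths
        else:
            hi = mid - 1              # no repeat at this length: try shorter
    return best
```

For example, in "banana" the longest repeated substring is "ana" (its two occurrences overlap), while "abcd" has none.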
B0010B0012                       | E0040B0010B0012
B0034E0040B0010B0012             | B0034C0062B0010B0012
B0032E0040B0010B0012             | B0034C0062B0010B0012R0062
B0034E1010B0010B0012             | B0034E1010
B0034I732006B0034E0040B0010B0012 | B0034C0062
B0030E0040B0010B0012             | R0010B0034
K228000B0010B0012                | R0040B0034
C0051B0010B0012                  |
