
Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data

by Payam Siyari, et al.
Georgia Institute of Technology

Data represented as strings abound in biology, linguistics, document mining, web search, and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because an underlying evolutionary process repeatedly creates more complex strings from simpler substrings. We propose a framework, referred to as "Lexis", that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, the "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or of DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After proving that the problem is NP-hard for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
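To make the idea of building targets from shared intermediate substrings concrete, here is a minimal Python sketch of a greedy construction. It is an illustration under stated assumptions, not the algorithm from the paper: the cost is assumed to be the total number of DAG edges (the summed length of every node's concatenation recipe), the saving formula is derived from that assumption, and the names `greedy_dag`, `occurrences`, `substitute`, `total_edges`, and `expand` are hypothetical. The brute-force candidate search is only meant for short inputs.

```python
from itertools import count


def occurrences(seq, pat):
    """Count non-overlapping, left-to-right occurrences of pat in seq."""
    n, i, hits = len(pat), 0, 0
    while i + n <= len(seq):
        if tuple(seq[i:i + n]) == pat:
            hits += 1
            i += n
        else:
            i += 1
    return hits


def substitute(seq, pat, symbol):
    """Replace each non-overlapping occurrence of pat in seq with symbol."""
    n, i, out = len(pat), 0, []
    while i < len(seq):
        if tuple(seq[i:i + n]) == pat:
            out.append(symbol)
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_dag(targets):
    """Map each node name to the list of symbols concatenated to build it.

    Targets are named T0, T1, ...; intermediate nodes are named N0, N1, ...
    Single characters are the source nodes (assumed naming scheme).
    """
    rules = {f"T{k}": list(t) for k, t in enumerate(targets)}
    fresh = (f"N{k}" for k in count())
    while True:
        # Every contiguous run of at least two symbols is a candidate node.
        candidates = {
            tuple(rhs[i:j])
            for rhs in rules.values()
            for i in range(len(rhs))
            for j in range(i + 2, len(rhs) + 1)
        }
        best, best_saving = None, 0
        for pat in sorted(candidates):  # sorted for deterministic tie-breaking
            occ = sum(occurrences(rhs, pat) for rhs in rules.values())
            # Each replacement removes len(pat) - 1 edges; the new node
            # itself costs len(pat) edges (assumed edge-count cost).
            saving = occ * (len(pat) - 1) - len(pat)
            if saving > best_saving:
                best, best_saving = pat, saving
        if best is None:
            return rules  # no candidate reduces the edge count any further
        symbol = next(fresh)
        rules = {name: substitute(rhs, best, symbol)
                 for name, rhs in rules.items()}
        rules[symbol] = list(best)


def total_edges(rules):
    """Total number of edges, counting repeated inputs separately."""
    return sum(len(rhs) for rhs in rules.values())


def expand(rules, name):
    """Rebuild the string represented by a node."""
    return "".join(expand(rules, s) if s in rules else s for s in rules[name])


if __name__ == "__main__":
    dag = greedy_dag(["abcabc", "abcd"])
    for name, rhs in dag.items():
        print(name, "<-", " ".join(rhs))
    print("edges:", total_edges(dag))
```

In this toy run the substring "abc" becomes a shared intermediate node, so both targets are rebuilt from it instead of from individual characters. Choosing the candidate with the largest immediate saving is a greedy heuristic and does not guarantee a minimum-cost DAG, which is consistent with the NP-hardness result stated in the abstract.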




