A Data-Structure for Approximate Longest Common Subsequence of A Set of Strings

08/04/2020
by Sepideh Aghamolaei, et al.
Sharif Accelerator

Given a set of k strings I, their longest common subsequence (LCS) is the string with the maximum length that is a subsequence of all the strings in I. A data-structure for this problem preprocesses I so that the LCS of a set of query strings Q with the strings of I can be computed faster. Since the problem is NP-hard for arbitrary k, we allow an error in which some characters may be replaced by other characters. We define the approximation version of the problem with an extra input m, which is the length of the regular expression (regex) that describes the input. The approximation factor is the logarithm of the number of possibilities in the regex returned by the algorithm, divided by the logarithm of the number of possibilities in the regex with the minimum number of possibilities.

1 Introduction

Summarizing input strings in a preprocessing step for string matching algorithms has been used before [4]. We use a similar approach for the longest common subsequence problem.

A special case of this problem is matching regular expressions, which has been studied for the case with two strings [5].

For two strings, a small finite state machine that separates them exists, and the bound on its size improves on the earlier sublinear results [3]. However, that paper does not provide an algorithm for building the machine.

Longest Common Substring and Shortest Common Super-sequence.

For a given set of strings, their longest common substring is the longest string that occurs as a contiguous substring of every string in the set. This problem is also NP-hard and W[1]-hard [9, 6, 11]. For the longest common substring problem, data-structures such as prefix/suffix trees and tries are known for matching one string with a set of query strings.

The shortest common super-sequence of a set of strings is the shortest string that contains every string in the set as a subsequence. This problem is NP-hard even for alphabets of bounded size [10], and it can be solved using dynamic programming [9, 6]. The problem is also W[1]-hard [11].
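As a minimal sketch of the dynamic-programming idea, the following code computes a shortest common super-sequence for the special case of two strings only. The function name `scs` and the table layout are illustrative assumptions, not the algorithms of [9, 6], and the sketch does not address the general problem on a set of strings.

```python
def scs(a, b):
    """Return one shortest common super-sequence of the two strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of a shortest common super-sequence of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j          # only b[j:] is left
            elif j == n:
                dp[i][j] = m - i          # only a[i:] is left
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the top-left corner to rebuild one optimal answer.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.append(a[i:])
    out.append(b[j:])
    return "".join(out)
```

For two strings the length of the result equals the sum of the two lengths minus the length of their LCS, which gives a quick sanity check on small inputs.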

Multifurcating Phylogenetic Tree

A multifurcating phylogenetic tree of a set of biological strings is a tree with nodes of arbitrary degree where each intermediate node corresponds to a common ancestor and each leaf is a living organism. Using the input strings in the leaves of the tree and their common super-sequences at intermediate nodes, this gives a tree on the strings.

Agglomerative Clustering Tree

A method of constructing phylogenetic trees is to use a hierarchical clustering algorithm and report the resulting tree. Using single-linkage clustering, at each step the two closest points are clustered together and an intermediate node representing them is added to the tree, and this proceeds until the desired number of clusters remains. If a single cluster is requested, this is equivalent to Kruskal’s algorithm for minimum spanning tree.
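The connection to Kruskal’s algorithm can be sketched as follows. This is an illustrative example only: the function names, the union-find helper, and the choice of Hamming distance between equal-length strings are assumptions made for the sketch and are not taken from the paper.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(points, dist, k=1):
    """Merge the two closest clusters until k clusters remain.

    Returns the list of merges as (point, point, distance) triples, which are
    exactly the edges Kruskal's algorithm would pick in the same order.
    """
    # Sort all pairs by distance, as Kruskal's algorithm sorts edges.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    clusters = len(points)
    for d, i, j in edges:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:              # the pair lies in two different clusters
            parent[ri] = rj
            merges.append((points[i], points[j], d))
            clusters -= 1
    return merges
```

Running it with `k=1` on n points returns n - 1 merges, the edges of a minimum spanning tree; a larger `k` stops the same process early.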

String Matching using Regular Expressions.

Regular expressions are specifications of regular languages using a set of characters called the alphabet and a set of wildcards. The symbols include parentheses “( )” for grouping, “|” for choosing one of the characters (or groups) on each of its sides, and “.” for representing an arbitrary character. A wildcard character denotes the repetitions of other character groups: for example, “*” means an arbitrary number of repetitions, “+” means at least one repetition, and “?” means zero or one repetition of the character, or of the group of characters inside the parentheses, just before the wildcard. In regular expressions, another notation, a repetition count in braces such as “{3}”, can also be used for any fixed number of repetitions.
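The notation can be tried out with Python’s standard `re` module, whose syntax includes all of the symbols above. The helper name `matches` is an assumption introduced for this illustration.

```python
import re


def matches(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None


# "|" chooses between the alternatives on its two sides.
print(matches(r"a|b", "b"))
# "." stands for an arbitrary single character.
print(matches(r"b.n", "ban"))
# "*" allows any number of repetitions of the preceding group, including none.
print(matches(r"(an)*", ""))
# "+" requires at least one repetition.
print(matches(r"(an)+", ""))
# "?" allows zero or one repetition.
print(matches(r"bana?", "ban"))
# "{2}" requires exactly two repetitions of the preceding group.
print(matches(r"b(an){2}a", "banana"))
```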

String matching with wildcards was discussed in [5] for two strings, where the authors use hashing to solve the problem in the massively parallel model [2].

2 Wildcard Compression of Strings

The goal is to compute a regular expression of length of a string of length , such that the number of strings in the language of the regular expression is minimized. Assume the length of the regular expression of the strings separated by is the maximum of their lengths, and all the characters that do not appear in the output string have length .

Example.

Consider and , then has possible cases and has possible cases.

If is not allowed and no error is allowed, the regular expression with the minimum length can be computed by DFA minimization and writing the equivalent regular expression in for an alphabet of size using Hopcroft’s algorithm [7].

The DFA for a single string can be built by setting states, where the state goes to state with character , for , and to a trap state with any other character. The start state is and the terminal is .
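A minimal sketch of this construction is given below. The state numbering (states 0 to len(s), plus one extra trap state) and the function names are assumptions chosen for the illustration, since any consistent numbering gives the same machine.

```python
def build_string_dfa(s):
    """Build a DFA that accepts exactly the string s.

    States 0..len(s) form a chain: state i moves to state i + 1 on s[i].
    Every other character leads to the trap state len(s) + 1.
    """
    trap = len(s) + 1
    delta = {i: {s[i]: i + 1} for i in range(len(s))}
    start, final = 0, len(s)
    return delta, start, final, trap


def accepts(dfa, word):
    """Run the DFA on word and report whether it ends in the terminal state."""
    delta, state, final, trap = dfa
    for ch in word:
        # Missing transitions (including all moves out of the terminal
        # state and the trap state) go to the trap state.
        state = delta.get(state, {}).get(ch, trap)
    return state == final
```

For example, `accepts(build_string_dfa("banana"), "banana")` holds, while any proper prefix or extension of the string is rejected.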

For a set of strings, we first need to build their DFA, which has size assuming all prefixes of the strings at the states. Building the DFA takes , for choosing a character and a string to add the next character, at each state. Computing the regular expression of this state machine takes time.

We define an -approximation of the wildcard compression with possibilities for a string , as a regular expression with possible cases for . A similar definition can be used for a set of strings instead of one string .

The regular expression with possible cases has an unbounded approximation factor for a string that repeats a character times. So, there is no obvious bound on the approximation factor of the problem.

Example.

A string with all distinct characters can be summarized as , which has possibilities. This is an exponential number of possibilities in the input size, unless . A shorter representation is , with possibilities. The second representation is a -approximation.

Algorithm.

The following operations are repeatedly applied to the string or the DFA of the language of that string to get the formula:

  • The operation of unifying two characters, i.e. replacing two characters such as and with is equivalent to merging their states. The newly equivalent states can now be merged together. Two states are equivalent if their incoming and outgoing states are the same.

  • A sequence of consecutive states representing string , if a sequence of consecutive states are equal to , for an integer , repeated times, it can be replaced with , and the strings between the repetitions are merged by .

In each iteration, the operation which minimizes is chosen, where is the set of possibilities after the operation divided by the number of possibilities before it, and is the amount of length reduction done by . The algorithm ends when the length of the string reaches . Also, the second operation has priority over the first one.
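The second operation can be illustrated in a simplified form that handles only exact, contiguous repetitions and ignores the merging of strings between repetitions. The function name `compress_repeat` and the rule of picking the repetition that covers the most characters (leftmost on ties) are assumptions of this sketch, not the greedy criterion of the algorithm. The input is assumed to contain no regex metacharacters.

```python
def compress_repeat(s):
    """Replace one contiguous exact repetition in s by the form (unit){count}.

    The repetition covering the most characters is chosen; ties go to the
    leftmost one. If nothing repeats, s is returned unchanged.
    """
    best = None  # (covered_length, start, block_length, repeats)
    n = len(s)
    for start in range(n):
        for block in range(1, (n - start) // 2 + 1):
            unit = s[start:start + block]
            repeats = 1
            # Count how many consecutive copies of unit follow each other.
            while s[start + repeats * block:start + (repeats + 1) * block] == unit:
                repeats += 1
            covered = repeats * block
            if repeats >= 2 and (best is None or covered > best[0]):
                best = (covered, start, block, repeats)
    if best is None:
        return s
    covered, start, block, repeats = best
    unit = s[start:start + block]
    return f"{s[:start]}({unit}){{{repeats}}}{s[start + covered:]}"
```

Because no wildcard is introduced, the resulting expression still describes exactly one string; it only shortens the description.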

Theorem 2.1

The approximation factor of the wildcard simplification algorithm is .

Proof

By checking subsequences of length instead of , the length of the repeated sequence is approximated by at most . Finding a solution of the same size would require adding as many new characters as the length of the string. So, the first operation is within a factor of the optimal solution.

If the second operation concatenates strings that appear separately in the optimal solution, it can add as much as the length of the concatenated string to the length of the solution, but since that is also a copy of the current string, the approximation factor is again .

Any operation that is blocked after a sequence of operations can still be done with approximation factor . Based on the greedy choice of the algorithm, the ratio between the number of possibilities to the length of the string used in the regular expression of the best operation in the optimal solution is less than or equal to the one chosen by the algorithm in the first step, i.e. . If the first operation changes the string used in an operation from the optimal solution, the ratio will decrease or remain the same, since the length of the string is at least as much as the length of the optimal string and the approximation factor gives . The algorithm needs to check all the copies of a string at the same time for the greedy choice to work.

Assuming the number of operations resulting in the optimal solution is , then the sum of the lengths of the repeated strings is , where is the length of the union of the strings. By replacing each operation with a greedy choice, the length of each string either remains the same or increases, and their intersections is at least as much as their lengths, so the number of operations of the algorithm is at most . Applying the bound for each operation, the total possibilities of the algorithm would be , which is a -approximation of the optimal solution, i.e. . ∎

Example.

For the string “banana” from the previous example, one application of the second operation gives , and then one application of the first operation which gives . The number of possible cases is , so it is a -approximation.

A Simple Algorithm for Strings of Equal Length.

In case the input set consists of strings with length , it is enough to keep the set of possible characters at each location. This can be done by computing the distinct characters at each location, by sorting or marking the elements of the alphabet. This can be done in time.
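A short sketch of this idea follows. The function names and the output format, one alternation group per position, are assumptions made for the illustration.

```python
import re
from math import prod


def position_sets(strings):
    """For equal-length strings, return the sorted distinct characters per position."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    # zip(*strings) yields one column of characters per position.
    return [sorted(set(column)) for column in zip(*strings)]


def to_regex(sets):
    """Write the position sets as a regular expression."""
    parts = []
    for chars in sets:
        if len(chars) == 1:
            parts.append(re.escape(chars[0]))
        else:
            parts.append("(" + "|".join(re.escape(c) for c in chars) + ")")
    return "".join(parts)


def count_possibilities(sets):
    """Number of strings in the language: the product of the set sizes."""
    return prod(len(chars) for chars in sets)
```

For the input aac, abc, bbc this gives the expression (a|b)(a|b)c with 4 possibilities, one of which (bac) is not an input string; this is the error the approximation allows.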

Matching Regular Expressions

For a set of input strings, and the nearest neighbor query, we can first build a wildcard compression, the shortest common super-sequence or any other regular expression, build a finite state machine for it, and then use it at the query time.

3 An MPC Algorithm for LCS of Wildcard Compressions

The LCS problem does not have the mergeability property [1], i.e. merging the solutions of subsets of the input does not give the solution to the LCS of the whole input. For example, consider the set {aac, caa, cbb, bbc}, and assume the partitions are {aac, caa} and {cbb, bbc}; then the LCSs of the partitions are aa and bb, whose LCS is empty, while the solution for the whole set is the string “c”.
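The counterexample can be checked with a brute-force search. The code below is only a verification aid for tiny inputs: it enumerates subsequences of the shortest string, so its running time is exponential, and the function names are assumptions of this sketch.

```python
from itertools import combinations


def is_subsequence(candidate, text):
    """True if candidate can be obtained from text by deleting characters."""
    it = iter(text)
    # "ch in it" advances the iterator, so matches must appear in order.
    return all(ch in it for ch in candidate)


def lcs_of_set(strings):
    """Return one longest common subsequence of all strings (brute force)."""
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest.
    for length in range(len(shortest), 0, -1):
        for idx in combinations(range(len(shortest)), length):
            candidate = "".join(shortest[i] for i in idx)
            if all(is_subsequence(candidate, s) for s in strings):
                return candidate
    return ""


left = lcs_of_set(["aac", "caa"])
right = lcs_of_set(["cbb", "bbc"])
merged = lcs_of_set([left, right])
whole = lcs_of_set(["aac", "caa", "cbb", "bbc"])
print(repr(left), repr(right), repr(merged), repr(whole))
```

Merging the two partial answers yields the empty string, while the search over all four strings finds a common subsequence of length one.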

While the wildcard compressions are not mergeable either, they are composable [8], meaning the solutions to subsets of the input give an approximate solution to the original problem. In the worst case the wildcard characters are at different locations, so they have no intersection with each other and the number of possibilities is the product of the number of possibilities of each subproblem. So, the approximation factor of the composable solution is the sum of the approximation factors of the subproblems.

References

  • [1] Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. pp. 23–34 (2012)
  • [2] Beame, P., Koutris, P., Suciu, D.: Communication steps for parallel query processing. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systems. pp. 273–284. ACM (2013)
  • [3] Chase, Z.: A new upper bound for separating words. arXiv preprint arXiv:2007.12097 (2020)
  • [4] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms (3rd edition). MIT Press and McGraw-Hill (2009)
  • [5] Hajiaghayi, M., Saleh, H., Seddighin, S., Sun, X.: Massively parallel algorithms for string matching with wildcards. arXiv preprint arXiv:1910.11829 (2019)
  • [6] Hakata, K., Imai, H.: The longest common subsequence problem for small alphabet size between many strings. In: International Symposium on Algorithms and Computation. pp. 469–478. Springer (1992)
  • [7] Hopcroft, J.: An n log n algorithm for minimizing states in a finite automaton. In: Theory of machines and computations, pp. 189–196. Elsevier (1971)
  • [8] Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. pp. 100–108 (2014)
  • [9] Irving, R.W., Fraser, C.B.: Two algorithms for the longest common subsequence of three (or more) strings. In: Annual Symposium on Combinatorial Pattern Matching. pp. 214–229. Springer (1992)
  • [10] Maier, D.: The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM) 25(2), 322–336 (1978)
  • [11] Pietrzak, K.: On the parameterized complexity of the fixed alphabet shortest common supersequence and longest common subsequence problems. Journal of Computer and System Sciences 67(4), 757–771 (2003)