1 Introduction
Summarizing input strings in a preprocessing step for string matching algorithms has been used before [4]. We use a similar approach for the longest common subsequence problem.
A special case of this problem is matching regular expressions, which has been studied for the case with two strings [5].
For two strings, there exists a finite state machine with a sublinear number of states that separates them, which improves on earlier sublinear bounds [3]. However, the paper does not provide an algorithm for building it.
Longest Common Substring and Shortest Common Supersequence.
For a given set of strings, their longest common subsequence is the longest string that is a subsequence of every string in the set. This problem is also NP-hard and W[1]-hard [9, 6, 11]. For the longest common substring problem, data structures such as prefix/suffix trees and tries are known for matching one string against a set of query strings.
Multifurcating Phylogenetic Tree
A multifurcating phylogenetic tree of a set of biological strings is a tree with nodes of arbitrary degree where each intermediate node corresponds to a common ancestor and each leaf is a living organism. Using the input strings in the leaves of the tree and their common supersequences at intermediate nodes, this gives a tree on the strings.
Agglomerative Clustering Tree
A method of constructing phylogenetic trees is to use a hierarchical clustering algorithm and report the resulting tree. Using single-linkage clustering, at each step the two closest points are clustered together and an intermediate node representing them is added to the tree, and this proceeds until there are $k$ clusters. If $k = 1$, this is equivalent to Kruskal’s algorithm for the minimum spanning tree.
String Matching using Regular Expressions.
Regular expressions are specifications of regular languages using a set of characters called the alphabet and a set of wildcards. They use symbols such as parentheses for grouping, $|$ for choosing one of the characters (or groups) on each of its sides, and $.$ for representing an arbitrary character. A wildcard character denotes repetitions of other character groups: for example, $*$ means an arbitrary number of repetitions, $+$ means at least one repetition, and $?$ means $0$ or $1$ repetition of the character or group of characters inside the parentheses just before the wildcard. Another notation, $\{k\}$, can also be used to represent $k$ repetitions for any $k$.
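The operators described above can be checked with a short sketch using Python's standard `re` module. The helper name `matches` and the sample patterns are illustrative assumptions, not part of the original text; note that in Python `.` matches any character except a newline by default.

```python
import re

def matches(pattern, text):
    """Return True if the whole text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, strings in the language, strings outside the language)
examples = [
    (r"(ab|cd)", ["ab", "cd"], ["ac"]),        # | chooses one of the two groups
    (r"a.c", ["abc", "axc"], ["ac"]),          # . stands for one arbitrary character
    (r"(ab)*", ["", "ab", "abab"], ["aba"]),   # * is any number of repetitions
    (r"(ab)+", ["ab", "abab"], [""]),          # + is at least one repetition
    (r"(ab)?", ["", "ab"], ["abab"]),          # ? is zero or one repetition
    (r"(ab){3}", ["ababab"], ["abab"]),        # {k} is exactly k repetitions
]

for pattern, inside, outside in examples:
    assert all(matches(pattern, s) for s in inside)
    assert not any(matches(pattern, s) for s in outside)
```

Using `re.fullmatch` rather than `re.search` keeps the check aligned with language membership, since the whole string must be generated by the expression.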
2 Wildcard Compression of Strings
The goal is to compute a regular expression of length $k$ for a string of length $n$, such that the number of strings in the language of the regular expression is minimized. Assume that the length of a regular expression made of strings separated by $|$ is the maximum of their lengths, and that all the characters that do not appear in the output string have length $0$.
Example.
Consider and , then has possible cases and has possible cases.
If is not allowed and no error is allowed, the regular expression with the minimum length can be computed by DFA minimization and writing the equivalent regular expression in for an alphabet of size using Hopcroft’s algorithm [7].
The DFA for a single string $s = s_1 s_2 \dots s_n$ can be built with $n+1$ states $0, 1, \dots, n$, where state $i-1$ goes to state $i$ on character $s_i$, for $i = 1, \dots, n$, and to a trap state on any other character. The start state is $0$ and the terminal state is $n$.
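A minimal sketch of this construction follows, assuming the states are numbered from 0 to the length of the string and the trap state is given the next free index. The names `build_string_dfa` and `accepts` are hypothetical helpers introduced for illustration.

```python
def build_string_dfa(s):
    """Return (transitions, start, accept, trap) for a DFA accepting only s."""
    n = len(s)
    trap = n + 1  # assumed index for the trap state
    # State i moves to state i + 1 on the next character of s.
    transitions = {i: {s[i]: i + 1} for i in range(n)}
    transitions[n] = {}  # the accepting state has no outgoing non-trap edge
    return transitions, 0, n, trap

def accepts(dfa, text):
    """Simulate the DFA on text; any missing edge leads to the trap state."""
    transitions, state, accept, trap = dfa
    for ch in text:
        if state == trap:
            break
        state = transitions[state].get(ch, trap)
    return state == accept

dfa = build_string_dfa("banana")
print(accepts(dfa, "banana"), accepts(dfa, "bananas"), accepts(dfa, "banan"))
```

Storing only the non-trap edges keeps the table linear in the length of the string, with every absent edge implicitly routed to the trap state.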
For a set of strings, we first need to build their DFA, which has size assuming all prefixes of the strings at the states. Building the DFA takes , for choosing a character and a string to add the next character, at each state. Computing the regular expression of this state machine takes time.
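One way to picture a DFA whose states are the prefixes of the input strings is a trie. The sketch below is written under that assumption; `build_prefix_dfa` and `accepts_any` are hypothetical names, and the sketch does not perform the regular-expression extraction step.

```python
def build_prefix_dfa(strings):
    """Build a trie-shaped DFA with one state per distinct prefix of the inputs."""
    transitions = [{}]   # state 0 is the empty prefix (start state)
    accepting = set()
    for s in strings:
        state = 0
        for ch in s:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)  # the full string ends in an accepting state
    return transitions, accepting

def accepts_any(transitions, accepting, text):
    """Return True if text is exactly one of the input strings."""
    state = 0
    for ch in text:
        state = transitions[state].get(ch)
        if state is None:  # a missing edge plays the role of the trap state
            return False
    return state in accepting

transitions, accepting = build_prefix_dfa(["banana", "bandana"])
print(len(transitions), accepts_any(transitions, accepting, "bandana"))
```

Shared prefixes such as "ban" are stored once, so the number of states is at most one plus the total length of the input strings.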
We define an approximation of the wildcard compression with possibilities for a string , as a regular expression with possible cases for . A similar definition can be used for a set of strings instead of one string .
The regular expression with possible cases has an unbounded approximation factor for a string that repeats a character times. So, there is no obvious bound on the approximation factor of the problem.
Example.
A string with all distinct characters can be summarized as , which has possibilities. This is an exponential number of possibilities in the input size, unless . A shorter representation is , with possibilities. The second representation is a approximation.
Algorithm.
The following operations are repeatedly applied to the string or the DFA of the language of that string to get the formula:

The operation of unifying two characters, i.e. replacing two characters such as $a$ and $b$ with $(a|b)$, is equivalent to merging their states. The newly equivalent states can now be merged together. Two states are equivalent if their incoming and outgoing states are the same.

If a sequence of consecutive states spells a string $w$ repeated $t$ times, for an integer $t$, it can be replaced with $(w)\{t\}$, and the strings between the repetitions are merged by $|$.
In each iteration, the operation which minimizes is chosen, where is the set of possibilities after the operation divided by the number of possibilities before it, and is the amount of length reduction done by . The algorithm ends when the length of the string reaches . Also, the second operation has priority over the first one.
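The repetition operation can be illustrated on plain strings with the sketch below. It covers only the second operation, and it picks the repetition that removes the most characters, which is a simplifying assumption and not the ratio-based greedy criterion described above. The function name `collapse_best_repeat` is hypothetical.

```python
def collapse_best_repeat(s):
    """Replace one run of consecutive copies of a substring w by (w){t}."""
    best = None  # (characters saved, start index, length of w, repetitions t)
    n = len(s)
    for start in range(n):
        for unit in range(1, (n - start) // 2 + 1):
            w = s[start:start + unit]
            t = 1
            # Count how many consecutive copies of w begin at this position.
            while s[start + t * unit:start + (t + 1) * unit] == w:
                t += 1
            if t >= 2:
                saved = unit * (t - 1)  # assumed measure of length reduction
                if best is None or saved > best[0]:
                    best = (saved, start, unit, t)
    if best is None:
        return s  # no consecutive repetition exists
    _, start, unit, t = best
    w = s[start:start + unit]
    return f"{s[:start]}({w}){{{t}}}{s[start + unit * t:]}"

print(collapse_best_repeat("banana"))
```

Ties are broken in favor of the leftmost, shortest repeated unit because a candidate only replaces the current best when it saves strictly more characters.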
Theorem 2.1
The approximation factor of the wildcard simplification algorithm is .
Proof
By checking subsequences of length instead of , the length of the repeated sequence is approximated by at most . Finding a solution of the same size would require adding as many new characters as the length of the string. So, the first operation is within a factor of the optimal solution.
If the second operation concatenates strings that appear separately in the optimal solution, it can add as much as the length of the concatenated string to the length of the solution, but since that is also a copy of the current string, the approximation factor is again .
Any operation that is blocked after a sequence of operations can still be done with approximation factor . Based on the greedy choice of the algorithm, the ratio of the number of possibilities to the length of the string used in the regular expression of the best operation in the optimal solution is less than or equal to the one chosen by the algorithm in the first step, i.e. . If the first operation changes the string used in an operation from the optimal solution, the ratio will decrease or remain the same, since the length of the string is at least as much as the length of the optimal string and the approximation factor gives . The algorithm needs to check all the copies of a string at the same time for the greedy choice to work.
Assuming the number of operations resulting in the optimal solution is , then the sum of the lengths of the repeated strings is , where is the length of the union of the strings. By replacing each operation with a greedy choice, the length of each string either remains the same or increases, and their intersections is at least as much as their lengths, so the number of operations of the algorithm is at most . Applying the bound for each operation, the total possibilities of the algorithm would be , which is a approximation of the optimal solution, i.e. . ∎
Example.
For the string “banana” from the previous example, one application of the second operation gives , and then one application of the first operation which gives . The number of possible cases is , so it is a approximation.
A Simple Algorithm for Strings of Equal Length.
In case the input set consists of strings with length , it is enough to keep the set of possible characters at each location. This can be done by computing the distinct characters at each location, by sorting or marking the elements of the alphabet. This can be done in time.
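A minimal sketch of the equal-length case follows, assuming the summary is written as one character class per position. The helper names `positional_sets`, `to_regex`, and `count_possibilities` are illustrative assumptions.

```python
import re

def positional_sets(strings):
    """Collect the distinct characters seen at each position."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    return [set(column) for column in zip(*strings)]

def to_regex(sets):
    """Write the per-position sets as a regular expression with character classes."""
    parts = []
    for chars in sets:
        ordered = sorted(chars)
        if len(ordered) == 1:
            parts.append(re.escape(ordered[0]))
        else:
            parts.append("[" + "".join(re.escape(c) for c in ordered) + "]")
    return "".join(parts)

def count_possibilities(sets):
    """Number of strings in the language: the product of the set sizes."""
    total = 1
    for chars in sets:
        total *= len(chars)
    return total

sets = positional_sets(["cat", "car", "bat"])
print(to_regex(sets), count_possibilities(sets))
```

In this sample the summary also admits "bar", which is not an input string, so the number of possibilities (4) exceeds the number of inputs (3).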
Matching Regular Expressions
For a set of input strings and a nearest-neighbor query, we can first build a wildcard compression, the shortest common supersequence, or any other regular expression, then build a finite state machine for it, and use that machine at query time.
3 An MPC Algorithm for LCS of Wildcard Compressions
The LCS problem does not have the mergeability property [1], i.e. merging the solutions of subsets of the input does not give the solution to the LCS of the whole input. For example, consider the set $\{aac, caa, cbb, bbc\}$, and assume the partitions are $\{aac, caa\}$ and $\{cbb, bbc\}$; then the LCSs of the partitions are $aa$ and $bb$, whose LCS is empty, while the solution is the string “c”.
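The counterexample can be verified with a brute-force sketch that enumerates the subsequences of the shortest string, longest first. It is exponential and intended only for tiny inputs; `is_subsequence` and `lcs_of_set` are hypothetical helper names.

```python
from itertools import combinations

def is_subsequence(candidate, text):
    """Return True if candidate can be obtained from text by deleting characters."""
    it = iter(text)
    return all(ch in it for ch in candidate)

def lcs_of_set(strings):
    """Brute-force longest common subsequence of a small set of strings."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for idx in combinations(range(len(shortest)), length):
            candidate = "".join(shortest[i] for i in idx)
            if all(is_subsequence(candidate, s) for s in strings):
                return candidate
    return ""

left = lcs_of_set(["aac", "caa"])    # LCS of the first partition
right = lcs_of_set(["cbb", "bbc"])   # LCS of the second partition
merged = lcs_of_set([left, right])   # LCS of the two partial solutions
direct = lcs_of_set(["aac", "caa", "cbb", "bbc"])
print(repr(left), repr(right), repr(merged), repr(direct))
```

Merging the partial solutions loses the character "c" because it is not part of either partition's LCS, even though it is common to all four strings.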
While the wildcard compressions are not mergeable either, they are composable [8], meaning the solutions to subsets of the input give an approximate solution to the original problem. In the worst case the wildcard characters are at different locations, so they have no intersection with each other and the number of possibilities is the product of the number of possibilities of each subproblem. So, the approximation factor of the composable solution is the sum of the approximation factors of the subproblems.
References
 [1] Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. pp. 23–34 (2012)
 [2] Beame, P., Koutris, P., Suciu, D.: Communication steps for parallel query processing. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. pp. 273–284. ACM (2013)
 [3] Chase, Z.: A new upper bound for separating words. arXiv preprint arXiv:2007.12097 (2020)
 [4] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms (3rd edition). MIT Press and McGraw-Hill (2009)
 [5] Hajiaghayi, M., Saleh, H., Seddighin, S., Sun, X.: Massively parallel algorithms for string matching with wildcards. arXiv preprint arXiv:1910.11829 (2019)
 [6] Hakata, K., Imai, H.: The longest common subsequence problem for small alphabet size between many strings. In: International Symposium on Algorithms and Computation. pp. 469–478. Springer (1992)
 [7] Hopcroft, J.: An n log n algorithm for minimizing states in a finite automaton. In: Theory of machines and computations, pp. 189–196. Elsevier (1971)
 [8] Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S.: Composable core-sets for diversity and coverage maximization. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. pp. 100–108 (2014)
 [9] Irving, R.W., Fraser, C.B.: Two algorithms for the longest common subsequence of three (or more) strings. In: Annual Symposium on Combinatorial Pattern Matching. pp. 214–229. Springer (1992)
 [10] Maier, D.: The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM) 25(2), 322–336 (1978)
 [11] Pietrzak, K.: On the parameterized complexity of the fixed alphabet shortest common supersequence and longest common subsequence problems. Journal of Computer and System Sciences 67(4), 757–771 (2003)