Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions

04/20/2020
by Patty Commins, et al.

Algorithms to find optimal alignments among strings, or to find a parsimonious summary of a collection of strings, are well studied in a variety of contexts and address a wide range of applications. In this paper, we consider chain letters, which contain a growing sequence of signatories added as the letter propagates. The unusual constellation of features exhibited by chain letters (one-ended growth, divergence, and mutation) makes their propagation, and thus the corresponding reconstruction problem, both distinctive and rich. Inspired by these chain letters, we formally define the problem of computing an optimal summary of a set of diverging string sequences: given a collection of sequences of names, each noisily corresponding to a branch of the unknown tree T representing the letter's true dissemination, can we efficiently and accurately reconstruct a tree T' ≈ T? We give efficient exact algorithms for this summarization problem when the number of sequences is small; for larger sets of sequences, we prove hardness and provide an efficient heuristic algorithm. We evaluate this heuristic on synthetic datasets chosen to emulate real chain letters, showing that our algorithm is competitive with or better than previous approaches and that it comes close to finding the true trees in these synthetic datasets.
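
To make the problem setup concrete, the following minimal sketch merges noise-free signatory lists into a prefix tree, so that shared prefixes become shared branches. It is an illustration only and not the algorithm of the paper: the function names (`merge_into_trie`, `count_nodes`) and the example signatories are hypothetical, and exact prefix matching is an assumption that real chain letters, with their mutations, would violate.

```python
from typing import Dict, List

# A tree is a nested dict: each key is a signatory, each value is that
# signatory's subtree. This representation is an assumption of the sketch.
Tree = Dict[str, "Tree"]


def merge_into_trie(sequences: List[List[str]]) -> Tree:
    """Merge signatory lists into a prefix tree using exact name matching."""
    root: Tree = {}
    for seq in sequences:
        node = root
        for name in seq:
            # Reuse the existing child if this name was already seen at this
            # position in the tree; otherwise start a new branch.
            node = node.setdefault(name, {})
    return root


def count_nodes(tree: Tree) -> int:
    """Count the signatories (nodes) in the merged tree."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Hypothetical copies of one letter, each recording one branch of its spread.
letters = [
    ["Ana", "Ben", "Cleo", "Dev"],
    ["Ana", "Ben", "Eli"],
    ["Ana", "Ben", "Cleo", "Fay"],
]
tree = merge_into_trie(letters)
```

In this toy input the three copies diverge after "Ben" and again after "Cleo". If one copy instead recorded a mutated name (say, a misspelling of "Ben"), exact matching would split the tree at that point and produce a spurious branch, which is why the noisy version of the problem calls for more than a simple prefix merge.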
