Linear Time Construction of Indexable Elastic Founder Graphs
Pattern matching on graphs has been widely studied lately due to its importance in genomics applications. Unfortunately, even the simplest problem of deciding whether a string appears as a subpath of a graph admits a quadratic lower bound under the Orthogonal Vectors Hypothesis (Equi et al. ICALP 2019, SOFSEM 2021). To avoid this bottleneck, research has shifted towards more specific graph classes, e.g. those induced from multiple sequence alignments (MSAs). Consider segmenting 𝖬𝖲𝖠[1..m,1..n] into b blocks 𝖬𝖲𝖠[1..m,1..j_1], 𝖬𝖲𝖠[1..m,j_1+1..j_2], …, 𝖬𝖲𝖠[1..m,j_{b-1}+1..n]. The distinct strings in the rows of the blocks, after the removal of gap symbols, form the nodes of an elastic founder graph (EFG), whose edges represent the original connections observed in the MSA. An EFG is called indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching and gave an O(mn log m)-time algorithm for preprocessing the MSA. After this preprocessing, an indexable EFG maximizing the number of blocks can be constructed in O(n) time, and one minimizing the maximum length of a block in O(n log log n) time. Using the suffix tree and solving a novel ancestor problem on trees, we improve the preprocessing to O(mn) time and the O(n log log n)-time EFG construction to O(n) time, thus showing that both types of indexable EFGs can be constructed in time linear in the input size.
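To make the segmentation concrete, the following is a minimal Python sketch of how an EFG could be induced from an MSA and a given list of block boundaries. It is not the linear-time algorithm of the paper and does not check indexability; it only illustrates the definition of nodes and edges. The function name `build_efg`, the gap symbol `-`, the representation of the boundaries as a list `cuts` of 1-based inclusive end columns j_1 < … < j_b = n, and the rejection of blocks in which some row consists only of gaps are all assumptions made for illustration.

```python
def build_efg(msa, cuts, gap="-"):
    """Induce EFG nodes and edges from an MSA and a segmentation (illustrative sketch).

    msa  : list of m strings, all of length n (gap symbol assumed to be `gap`)
    cuts : end columns j_1 < j_2 < ... < j_b = n, 1-based and inclusive
    Returns (blocks, edges): blocks[k] is the sorted list of distinct
    gap-free strings of block k; edges connect (k, label) to (k+1, label).
    """
    n = len(msa[0])
    if any(len(row) != n for row in msa):
        raise ValueError("all MSA rows must have the same length")
    if not cuts or cuts[-1] != n or any(a >= b for a, b in zip(cuts, cuts[1:])):
        raise ValueError("cuts must be strictly increasing and end at n")

    starts = [0] + cuts[:-1]  # 0-based start column of each block

    # For every row, the label it spells in each block after removing gaps.
    row_labels = [
        [row[s:e].replace(gap, "") for s, e in zip(starts, cuts)]
        for row in msa
    ]
    # Assumption of this sketch: a row that is all gaps within a block is rejected.
    if any(label == "" for labels in row_labels for label in labels):
        raise ValueError("some row consists only of gaps within a block")

    # Nodes: distinct gap-free strings per block.
    blocks = [sorted({labels[k] for labels in row_labels}) for k in range(len(cuts))]

    # Edges: connections between consecutive blocks observed in some row.
    edges = set()
    for labels in row_labels:
        for k in range(len(cuts) - 1):
            edges.add(((k, labels[k]), (k + 1, labels[k + 1])))
    return blocks, edges


msa = ["AC-GT", "ACTGT", "A--GA"]
blocks, edges = build_efg(msa, cuts=[2, 5])
print(blocks)
print(sorted(edges))
```

In this toy input, the first block (columns 1..2) yields the nodes A and AC, and the second block (columns 3..5) yields GA, GT and TGT; each row contributes one edge between its labels in consecutive blocks.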