Linear Time Construction of Indexable Founder Block Graphs

05/19/2020 ∙ by Veli Mäkinen, et al. ∙ Helsingin yliopisto 0

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size 410× 29811 is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only 3% of the size of the MSA.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Computational pangenomics [26] ponders around the problem of expressing a reference genome of a species in a more meaningful way than as a string of symbols. The basic problem in such generalized representations is that one should still be able to support string matching type of operations on the content. Another problem is that any representation generalizing set of sequences also expresses sequences that may not be part of the real pangenome. That is, a good representation should have a feature to control over-expressiveness and simultaneously support efficient queries.

In this paper, we develop the theory around one promising pangenome representation candidate, the founder block graph. This graph is a natural derivative of segmentation algorithms [28, 13] related to founder sequences [34].

Consider a set of individuals represented as lists of variations from a common reference genome. Such a set can be expressed as a variation graph or as a multiple sequence alignment. The former expresses reference as a backbone of an automaton, and adds a subpath for each variant. The latter inputs all variations of an individual to the reference, creating a row for each individual into a multiple alignment. Figure 1 shows an example of both structures with 6 very short genomes.

A multiple alignment of much fewer founder sequences can be used to approximate the input represented as a multiple alignment as well as possible, meaning that each original row can be mapped to the founder multiple alignment with a minimum amount of row changes (discontinuities). Finding an optimal set of founders is NP-hard [29], but one can solve relaxed problem statements in linear time [28, 13], which are sufficient for our purposes. As an example on the usefulness of founders, Norri et al. [28] showed that, on a large public dataset of haplotypes of human genome, the solution was able to replace 5009 haplotypes with only 130 founders so that the average distance between row jumps was over 9000 base pairs [28]. This means that alignments of short reads (e.g. 100 bp) very rarely hit a discontinuity, and the space requirement drops from terabytes to just tens of gigabytes. Figure 1 shows such a solution on our toy example.

(a)

(b)
(c)
(d)
Figure 1: (fig:sub:sequences) Input multiple alignment, (fig:sub:founders) a set of founders with common recombination positions as a solution to a relaxed version of founder reconstruction, (fig:sub:graph) a variation graph encoding the input and (fig:sub:founder_graph) a founder block graph. Here the input alignment and the resulting variation graph are unrealistically bad; the example is made to illustrate the founders.

A block graph is a labelled directed acyclic graph consisting of consecutive blocks, where a block represents a set of sequences of the same length as parallel (unconnected) nodes. There are edges only from nodes of one block to the nodes of the next block. A founder block graph is a block graph with blocks representing the segments of founder sequences corresponding to the optimal segmentation [28]. Fig. 1 visualises such a founder block graph: There the founder set is divided into 3 blocks with the first, the second, and the third containing sequences of length 4, 4, and 3, respectively. The coloured connections between sequences in consecutive blocks define the edges. Such graphs interpreted as automata recognise the input sequences just like variation graphs, but otherwise recognise a much smaller subset of the language. With different optimisation criteria to compute the founder blocks, one can control the expressiveness of this pangenome representation.

In this paper, we show that there is a natural subclass of founder block graphs that admit efficient index structures to be built that support exact string matching on the paths of such graphs. Moreover, we give a linear time algorithm to construct such founder block graphs from a given multiple alignment. The construction algorithm can also be adjusted to produce a subclass of elastic degeneralized strings [9], which also support efficient indexing.

The founder block graph definition given above only makes sense if we assume that our input multiple alignment is gapless, meaning that the alignment is simply produced by putting strings of equal length under each other, like in Figure 1. We develop the theory around founder block graphs under gapless multiple alignments. However, most of the results can be easily extended to handle gaps properly.

We start in Sect. 2 by putting the work into the context of related work. In Sect. 3 we introduce the basic notions and tools. In Sect. 4 we study the property of founder block graphs that enable indexing. In Sect. 5 we give the linear time construction algorithm. In Sect. 6 we develop a succinct index structure that supports exact string matching in linear time. In Sect. 7 we consider the general case of having gap symbols in multiple alignment. We report some preliminary experiments in Sect. 8 on the construction and indexing of founder block graphs for a collection of SARS-CoV-2 strains. We consider future directions in Sect. 9.

2 Related work

Indexing directed acyclic graphs (DAGs) for exact string matching on its paths was first studied by Sirén et al. in WABI 2011 [32]. A generalization of Burrows-Wheeler transform [12] was proposed that supported near-linear time queries. However, the proposed transformation can grow exponential size in the worst case. Many practical solutions have been proposed since then, that either limit the search to short queries or use more time on queries [33, 22, 24, 31, 19]. More recently, such approaches have been captured by the theory on Wheeler graphs [18, 20, 2].

Since it is NP-hard to recognize if a given graph is Wheeler [20], it is of interest to look for other graph classes that could provide some indexability functionality. Unfortunately, quite simple graphs turn out to be hard to index [15, 16] (under the Strong Exponential Time Hypothesis). In fact, the reductions by Equi et al. [15, 16] can be adjusted to show that block graphs cannot be indexed in polynomial time to support fast string matching. But as we will later see, further restrictions on block graphs change the situation.

Block graphs have also tight connection to generalized degenerate (GD) strings and their elastic version. These can also be seen as DAGs with a very specific structure. Matching a GD string is computationally easier and even linear time online algorithms can be achieved to compare two such strings, as analyzed by Alzamel et al. [3]. The elastic counterpart requires more care, as studied by Bernardini et al. [9]. Our results on founder block graphs can be casted on GD strings and elastic strings, as we will show later.

Finally, our indexing solution has connections to succinct representations of de Bruijn graphs [11, 10, 8]. Compared to de Bruijn graphs that are cyclic and have limited memory (-mer length), our solution retains the linear structure of the block graph.

3 Definitions and basic tools

3.1 Strings

We denote integer intervals by . Let be an alphabet of size . A string is a sequence of symbols from , i.e. , where denotes the set of strings of length under the alphabet . A suffix of string is for . A prefix of string is for . A substring of string is for . Substring where is defined as the empty string.

The lexicographic order of two strings and is naturally defined by the order of the alphabet: iff and for some . If , then the shorter one is regarded as smaller. However, we usually avoid this implicit comparison by adding end marker to the strings.

Concatenation of strings and is denoted .

3.2 Founder block graphs

As mentioned in the introduction, our goal is to compactly represent a gapless multiple sequence alignment (MSA) using a founder block graph. In this section we formalize these concepts.

A gapless multiple sequence alignment is a set of strings drawn from , each of length . Intuitively, it can be thought of as a matrix in which each row is one of the strings. Such a structure can be partitioned into what we call a segmentation, that is, a collection of sets of shorter strings that can represent the original alignment.

[Segmentation] Let be a gapless multiple alignment and let be the strings in MSA. A segmentation of MSA is a set of sets of strings such that for each there exist and such that for each , for some . Furthermore, it holds , , and for all , so that covers the MSA. We call the width of .

A segmentation of a MSA can naturally lead to the construction of a founder block graph. Let us first introduce the definition of a block graph.

[Block Graph] A block graph is a graph where is a function that assigns a string label to every node and for which the following properties hold.

  1. is a partition of , that is, and for all ;

  2. if then and for some ;

  3. if then for each and if , .

As a convention we call every a block and every a segment.

Given a segmentation of a MSA, we can define the founder block graph as a block graph induced by . The idea is to have a graph in which the nodes represents the strings in while the edges retain the information of how such strings can be recombined to spell any sequence in the original MSA.

[Founder Block Graph] A founder block graph is a block graph induced by as follows: For each we have and if and only if there exists and such that , and with .

We regard the edges of (founder) block graphs to be directed from left to right. Consider a path in between any two nodes. The label of is the concatenation of labels of the nodes in the path. Let be a query string. We say that occurs in if is a substring of for any path of . Figure 2 illustrates such queries.

Figure 2: An example of two strings, GCG and ATTCGATA, occurring in .

In our example in Figure 1, segmentation would be , and the induced founder block graph has thus 3 blocks with 9 nodes and 11 edges in total.

3.3 Basic tools

A trie [14] of a set of strings is a rooted directed tree with outgoing edges of each node labeled by distinct characters such that there is a root to leaf path spelling each string in the set; shared part of the root to leaf paths to two different leaves spell the common prefix of the corresponding strings. Such a trie can be computed in time, where is the total length of the strings, and it supports string queries that require time, where is the length of the queried string.

An Aho-Corasick automaton [1] is a trie of a set of strings with additional pointers (fail-links). While scanning a query string, these pointers (and some shortcut links on them) allow to identify all the positions in the query at which a match for any of the strings occurs. Construction of the automaton takes the same time as that of the trie. Queries take time, where is the number of matches.

A suffix array [25] of string is an array such that if is the -th smallest suffix of string , where , and is the end marker. Thus, .

Burrows-Wheeler transform BWT [12] of string is such that BWT, where and is regarded as .

A bidirectional BWT index [30, 7] is a succinct index structure based on some auxiliary data structures on BWT. Given a string , with , such index occupying bits of space can be build in time and it supports finding in time if a query string appears as substring of [7]. Moreover, such query returns an interval pair , such that suffixes of starting at positions share a common prefix matching the query. Interval is the corresponding interval in the suffix array of the reverse of . Let , be the interval corresponding to query substring . A bidirectional backward step updates the interval pair , to the corresponding interval pair when the query substring is extended to the left into or to the right into . This takes constant time [7]. A fully-functional bidirectional BWT index [5] expands the steps to allow contracting symbols from the left or from the right. That is, substring can be modified into or to and the the corresponding interval pair can be updated in constant time.

Among the auxiliary structures used in BWT-based indexes, we explicitly use the rank and select structures: String from binary alphabet is called a bitvector. Operation returns the number of s in . Operation returns the index containing the -th in . Both queries can be answered in constant time using an index requiring bits in addition to the bitvector itself [23].

4 Subclass of founder block graphs admitting indexing

We now show that there exist a family of founder block graphs that admit a polynomial time constructable index structure supporting fast string matching. First, a trivial observation: the input multiple alignment is a founder block graph for the segmentation consisting of only one segment. Such founder block graph (set of sequences) can be indexed in linear time to support linear time string matching [7]. Now, the question is, are there other segmentations that allow the resulting founder block graph to be indexed in polynomial time? We show that this is the case.

Founder block graph is segment repeat-free if each for occurs exactly once in .

Our example graph (Fig. 1) is not quite segment repeat-free, as TAT occurs also as substring of paths starting with ATAT.

Segment repeat-free founder block graphs can be indexed in polynomial time to support polynomial time string queries.

To prove the proposition, we construct such an index and show how queries can be answered efficiently.

Let be the set of all paths starting from node and ending in a sink node. Let be the set of suffix path labels for . Consider sorting in lexicographic order. Then one can binary search any query string in to find out if it occurs in or not. The problem with this approach is that is of exponential size.

However, if we know that is segment repeat-free, we know that the lexicographic order of , , is fully determined by the prefix of , where is the node following on the path . Let denote the set of suffix path labels cut at this manner. Now the corresponding set is no longer of exponential size. Consider again binary searching a string on sorted . If occurs in then it occurs in . If not, has to have some for as its substring in order to occur in .

To figure out if contains for some as its substring, we build an Aho-Corasick automaton [1] for . Scanning this automaton takes time and returns such if it exist.

To verify such potential match, we need several tries [14]. For each , we build tries and on the sets and , respectively, where denotes the reverse of string .

Assume now we have located (using Aho-Corasick automaton) with such that , where is at the -th block of . We continue searching from right to left in trie . If we reach a leaf after scanning , we continue the search with on trie , where is the node at block of corresponding to the leaf we reached in the trie. If the search succeeds after reading we have found a path in spelling . We repeat the analogous procedure with starting from trie . That is, we can verify a candidate occurrence of in in time, as the search in the tries takes time per step. Note however, that there could be several labels occurring as substrings of , so we need to do the verification process for each one of them separately. There can be at most such candidate occurrences, due to the distinctness of node labels in . In total, this search can take at most time.

We are now ready to specify a theorem that reformulates Proposition 4 in detailed form.

Let be a segment repeat-free founder block graph with blocks such that . We can preprocess an index structure for in time, where is the alphabet for node labels, , and . Given a query string , we can use the index structure to find out if occurs in . This query takes time.

Proof.

With preprocessing time we can build the Aho-Corasick automaton [1]. The tries can be built in time. The required queries on these structures take time.

To avoid the costly binary search in sorted , we instead construct the bidirectional BWT index [7] for the concatenation . Concatenation is thus a string of length from alphabet . The bidirectional BWT index for can be constructed in time, so that in time, one can find out if occurs in [7]. This query equals that of binary search in . ∎

We remark that founder block graphs have a connection with generalized degenerate strings (GD strings) [4]. In a GD string, sets of strings of equal length are placed one after the other to represent in a compact way a bigger set of strings. Such set contains all possible concatenations of those strings, which are obtained by scanning the individual sets from left to right and selecting one string from each set. The length of the strings in a specific set is called width, and sum of all the width of all sets in a GD string is the total width. Given two GD strings of the same total width it is possible to determine if the intersection of the sets of strings that they represent is non empty in linear time in the size of the GD strings [4]. Thus, the special case in which one of the two GD string is just a standard string can be seen also as a special case of a founder block graph in which every segment is fully connected with the next one and the length of the query string is equal to the maximal length of a path in the graph.

We consider the question of indexing GD strings (fully connected block graphs) to search for queries shorter than the total width. We can exploit the segment repeat-free property to yield such an index.

Let be a segment repeat-free GD string a.k.a. a fully connected segment repeat-free founder block graph with blocks such that and for all and , . We can preprocess an index structure for in time, where is the alphabet for node labels, , and . Given a query string , we can use the index structure to find out if occurs in . This query takes time.

Proof.

Recall the index structure of Theorem 4. for the case of GD strings, we can simplify it as follow.

We keep the same BWT index structure and the Aho-Corasick automaton, but we do not need any tries. After finding at most occurrences of substrings of in the graph using the Aho-Corasick automaton on node labels, we mark the matching blocks accordingly. If 2 marked blocks have exactly one marked neighboring block and blocks have 2 marked neighboring blocks, then we have found an occurrence, otherwise not. ∎

Observe that , where and are the number of rows and number of columns, respectively, in the multiple sequence alignment from where the founder block graph is induced. That is, the index construction algorithms of the above theorems can be seen to be almost linear time in the (original) input size. We study succinct variants of these indexes in Sect. 6, and also improve the construction and query times to linear as side product.

5 Construction of segment repeat-free founder block graphs

We can adapt the dynamic programming segmentation algorithms for founders [28, 13] to produce segment repeat-free founder block graphs.

The idea is as follows. Let be a segmentation of . We say is valid if it induces a segment repeat-free founder block graph . We build such valid for prefixes of MSA from left to right, minimizing the maximum block length needed.

5.1 Characterization lemma

Given a segmentation and founder block graph induced by , we can ensure that it is valid by checking if, for all , occurs in the rows of the MSA only in the interval of the block , where is the block of such that .

[Characterization] Let . A segmentation is valid if and only if, for all , and , if then .

Proof.

To see that this is a necessary condition for the validity of , notice that each row of MSA can be read through , so if occurs elsewhere than inside the block, then these extra occurrences make invalid. To see that this is a sufficient condition for the validity of , we observe the following:

  1. For all , is a substring of some row of the input MSA.

  2. Let be two edges such that is not a substring of any row of input MSA. Then any substring of either occurs in some row of the input MSA or it includes as its substring.

  3. If for some does not occur as substring of some input MSA row, but it occurs as a substring of some path in , then it includes as its substring for some , where . Such occurs at least once outside the block , where . That is, if makes invalid then also makes it invalid.

  4. Continuing case c) inductively, at some point there has to be for some such that is substring of for some . By a), such is also a substring of some row of the input MSA. That is, of case c) cannot exist.

5.2 From characterization to a segmentation

Let be the score of an optimal (minimum scoring) valid segmentation of prefix , where the score is defined as . We can compute

(1)

where is the largest integer such that segment is valid. The segment is valid iff each substring , for , occurs as many times in as in the whole MSA. If such does not exist for some , we set .

We can compute the score of the optimal segmentation of MSA in time after preprocessing values in time, where . For the former, one can start comparing from decreasing by one, and then the value grows bigger than at latest after steps. For preprocessing, we build the bidirectional BWT index of the MSA in time [7]. At column , consider the trie containing the reverse of the rows of . Search the trie paths from the bidirectional BWT index until the number of leaves in each trie subtree equals the length of the corresponding BWT interval. Let be the column closest to where this holds for all trie paths. Then one can set . The time construction of the trie has to be repeated for each column. As , the claimed preprocessing time follows.

5.3 Faster preprocessing

We can do the preprocessing in time.

Given a multiple sequence alignment , values for each can be computed in time, where is the largest integer such that segment is valid.

Proof.

Let us build the bidirectional BWT index [7] of MSA rows concatenated into one long string. We will run several algorithms in synchronization over this BWT index, but we explain them first as if they would be run independently.

Algorithm 1 searches in parallel all rows from right to left advancing each by one position at a time. Let be the number parallel of steps done so far. It follows that we can maintain a bitvector that at -th step stores iff is the -th last symbol of some row.

Algorithm 2 uses the variable length sliding window approach of Belazzougui and Cunial [5] to compute values . Let the first row of MSA be . Search backwards in the fully-functional bidirectional BWT index [5]. Stop the search at such that the corresponding BWT interval contains only suffixes originating from column of the MSA, that is, spelling in the concatenation, for some rows . Set . Contract from the search string and modify BWT interval accordingly [5]. Continue the search to find s.t. again the corresponding BWT interval contains only suffixes originating from column . Update . Continue like this throughout . Repeat the process for all rows, to compute for all . Set for all .

Let us call the instances of the Algorithm 2 run on the rest of the rows as Algorithms .

Let the current BWT interval in Algorithms to be . The problematic part in them is checking if the corresponding active BWT intervals for Algorithms contain only suffixes originating from column . To solve this, we run Algorithm 1 as well as Algorithms to in synchronization so that we are at the -th step in Algorithm 1 when we are processing interval in rest of the algorithms, for . In addition, we maintain bitvectors and such that and for . For each that we set to 1 at step with and , we check if and . If and only if this check fails on any , there is a suffix starting outside column . This follows from the fact that each suffix starting at column must be contained in exactly one of the distinct intervals of the set . This is because cannot contain nested interval pairs as all strings in segment of MSA are equal length, and thus their BWT intervals cannot overlap except if the intervals are exactly the same.

Finally, the running time is , since each extend-left and contract-right operations take constant time [5], and since the bitvectors are manipulated locally only on indexes that are maintained as variables during the execution. ∎

5.4 Faster main algorithm

Recall Eq. (1). Before proceeding to the involved optimal solution, we give some insights by first improving the running time to logarithmic per entry.

As it holds , the range where the minimum is taken grows as grows. Now, can be seen as the effect range of : for columns the maximum from the options is . Consider maintaining (key, value) pairs in a binary search tree (BST). When computing we should have pairs for in BST. Value can be computed by taking range minimum on BST values with keys in range . Such query is easy to solve in time. If there is nothing in the interval, . Since this is semi-open interval on keys in range , BST can be replaced by van Emde Boas tree to obtain time computation of all values [17]. Alternatively, we can remove elements from the BST once they no longer can be answers to queries, and we can get solution. To obtain better running time, we need to exploit more structural properties of the recurrence.

Cazaux et al. [13] considered a similar recurrence and gave a linear time solution for it. In what follows we modify that technique to work with valid ranges.

For between and , we define

For any , we have .

Proof.

By the definition of , for any , we have for , and for , .

We assume that there exists , such that . In this case, and we have . As , and thus . As , we have . To simplify the proof, we take , , and . Hence, we have , and . Now we are going to prove that this system admits no solution.

  • Case where and . As , we have and thus which is impossible because .

  • Case where and . We can assume that (in the other case, we take ) and as , we have and thus which is impossible because .

  • Case where and . We have and , thus which is impossible because .

  • Case where and . We have and , thus which is impossible because .

By initialising to a threshold , for any , we have .

Proof.

We are going to show by induction. The base case is obvious because . As , by using induction,

By using , we can have that each is in .

Given , we can compute in constant time if

Proof.

We need just to compare and where is in
. If is smaller than , is smaller than all the with and thus for all . Hence we have
.

Otherwise, and as , . In this case . By using the constant time semi-dynamic range maximum query by Cazaux et al. [13] on the array , we can obtain in constant time and thus check the equality in constant time. ∎

We can build all the values in time after preprocessing.

Proof.

After preprocessing in to compute all the values , we can build all the values by increasing and computing the values . For any , we check any from to with the equality of Lemma 5.4 until one is true. With the property of the equality, we know that this is . After that we compute which corresponds to the value , we add to the constant time semi-dynamic range maximum query and we compute . ∎

6 Succinct index for segment-free founder block graphs

Recall the indexing solutions of Sect. 4 and the definitions from Sect. 3.

We now show that explicit tries and Aho-Corasick automaton can be replaced by some auxiliary data structures associated with the Burrows-Wheeler transformation of the concatenation .

Consider interval in the suffix array of corresponding to suffixes having as prefix for some . From the segment repeat-free property it follows that this interval can be split into two subintervals, and , such that suffixes in start with and suffixes in start with , where . Moreover, BWT equals multiset sorted in lexicographic order. This follows by considering the lexicographic order of suffixes for . That is, BWT (interpreted as a set) represents the children of the root of the trie considered in Sect. 4.

We are now ready to present the search algorithm that uses only the BWT of and some small auxiliary data structures. We associate two bitvectors and to the BWT of as follows. We set and iff is maximal interval with all suffixes starting with for some .

Consider the backward search with query . Let be the interval after matching the shortest suffix such that BWT. Let and . If and , index lies inside an interval where all suffixes start with for some . We modify the range into , and continue with the backward step on . We check the same condition in each step and expand the interval if the condition is met. Let us call this procedure expanded backward search.

We can now strictly improve Theorems 4 and 4 as follows.

Let be a segment repeat-free founder block graph (or a segment repeat-free GD string) with blocks such that . We can preprocess an index structure for occupying bits in time, where is the alphabet for node labels and . Given a query string , we can use expanded backward search with the index structure to find out if occurs in . This query takes time.

Proof.

(sketch) As we expand the search interval in BWT, it is evident that we still find all occurrences for short patterns that span at most two nodes, like in the proof of Theorem 4. We need to show that a) the expansions do not yield spurious occurrences for such short patterns and b) the expansions yield exactly the occurrences for long patterns that we earlier found with the Aho-Corasick and tries approach.

In case b), notice that after an expansion step we are indeed in an interval where all suffixes match and thus corresponds to a node . The suffix of the query processed before reaching interval must be at least of length . That is, to mimic Aho-Corasick approach, we should continue with the trie . This is identical to taking backward step from BWT, and continuing therein to follow the rest of this implicit trie.

To conclude case b), we still need to show that we reach all the same nodes as when using Aho-Corasick, and that the search to other direction with can be avoided. These follow from case a), as we see.

In case a), before doing the first expansion, the search is identical to the original algorithm in the proof of Theorem 4. After the expansion, all matches to be found are those of case b). That is, no spurious matches are reported. Finally, no search interval can include two distinct node labels, so the search reaches the only relevant node label, where the Aho-Corasick and trie search simulation takes place. We reach all such nodes that can yield a full match for the query. ∎

7 Gaps in multiple alignment

We have so far assumed that our input is a gapless multiple alignment. Let us now consider how to extend the results to the general case. The idea is that gaps are only used in the segmentation algorithm to define the valid ranges, and that is the only place where special attention needs to be taken; elsewhere, whenever a substring from MSA rows is read, gaps are treated as empty strings. That is, A-GC-TA- becomes AGCTA.

It turns out that allowing gaps in MSA indeed makes the computation of valid ranges more difficult. To see this, consider a segment in MSA containing strings:

-AC-CGATC-
-A-CCGATCC
AAC-CGATC-
AAC-CGA-C-

After gaps are removed this segment becomes:

ACCGATC
ACCGATCC
AACCGATC
AACCGAC

Without even seeing the rest of the MSA, one can see that this is not a valid block, as the first string is a prefix of the second. With gapless MSAs this was not possible and the algorithm in Sect. 5.3 exploited this fact. Modifying that algorithm to handle gaps properly is possible, but non-trivial.

We leave this extension to future work; see however Sections 8 and 9 for some further insights.

Despite the preprocessing for the segmentation is affected by gaps in MSA, once such valid segmentation is found, the rest of the results stay unaffected. All the proposed definitions extend to this interpretation by just omitting gap symbols when reading the strings. The consequence for founder block graph is that strings inside a block can be variable length. Interestingly, with this interpretation Theorem 4 can be expressed with GD strings replaced by elastic strings [9].

8 Implementation and experiments

We implemented several methods proposed in this paper. The implementation is available at https://github.com/algbio/founderblockgraphs. Some preliminary experiments are reported below.

8.1 Construction

We implemented the founder block graph construction algorithm of Sect. 5.2 with the faster preprocessing routine of Sect. 5.3; in place of fully-functional bidirectional BWT index, we used similar routines implemented in compressed suffix trees of SDSL library [21].

To test the implementation we downloaded 1484 strains of SARS-CoV-2 strains stored in NCBI database.111https://www.ncbi.nlm.nih.gov/, accessed 24.04.2020. We created a multiple sequence alignment of the strains using ViralMSA[27]. We filtered out rows that contained gaps or N’s. We were left with a multiple sequence alignment of 410 rows and 29811 columns. Our algorithm took 58 seconds to produce the optimal segmentation on Intel(R) Xeon(R) CPU, E5-2690, v4, 2.60GHz. There were 3352 segments in the segmentation, the maximum segment length was 12, and the maximum number of founder segments in a block was 12. The founder block graph had 3900 nodes and 4440 edges. The total length of node labels was 34968. The graph size is thus less than of the MSA size.

We also implemented support to construct founder block graphs for general MSA’s that contain gaps. The modification to the gapless case was that nested BWT intervals needed to be detected. We stopped left-extension as soon as the BWT intervals contained no other repeats than those caused by nestedness. This left valid range undefined on such columns, but for the rest the valid range can still be computed correctly (undefined values could be postprocessed using matching statistics [6] on all pairs of prefixes preceding suffixes causing nested intervals). Initial experiments show very similar behaviour to the gapless case, but we defer further experiments until the implementation is mature enough.

8.2 Indexing

We implemented the succinct indexing approach of Sect. 6. On the founder block graph of the previous experiment, the index occupied 87 KB. This is of the original input size, as the encoding of the input MSA with 2 bits per nucleotide takes 2984 KB.

Figure 3 shows an experiment with indexes built on different size samples of the MSA rows, and with querying patterns of varying length sampled from the same rows. As can be seen, the query times are not affected by the size of the MSA samples (showing independence of the input MSA), but only on their length (showing linear dependency on the query length).

Figure 3: Running time of querying patterns from the founder block graph. The sample sizes (for MSA row subsets) are shown on the right-hand side of each plot. The plots show averages and distribution over 10 repeats of each search, where one search consist of a set of query patterns of given length randomly sampled from the respective MSA row subset. The pattern set sample size (10,20,30,40, respectively) grows by the MSA sample size, but the reported numbers are normalized so that the query time (milliseconds) is per pattern. This experiment was run on Intel(R) Core(TM) i5-4308U CPU, 2.80GHz.

9 Discussion

One characterization of our solution is that we compact those vertical repeats in MSA that are not horizontal repeats. This can be seen as positional extension of variable order de Bruijn graphs. Also, our solution is parameter-free unlike de Bruijn approaches that always need some threshold , even in the variable order case.

The founder block graph concept could also be generalized so that it is not directly induced from a segmentation. One could consider cyclic graphs having the same segment repeat-free property. This could be interesting direction in defining parameter-free de Bruijn graphs.

Future work include a proper extension of the approach to general MSA’s containing gaps. Our implementation already contains support for such MSA’s, but the theory framework still requires some more work to show that such extension can be done without any effect on the running time.

On the experimental side, there is still much more work to be done. So far, we have not optimized any of the algorithms for multithreading nor for space usage. For example, one could use BWT indexes engineered for highly repetitive text collections to build the founder block graphs in space proportional of the compressed MSA. Such optimizations are essential for applying the approach on e.g. human genome data. Past experience on similar solutions [28] indicate that our approach should easily be applicable to much larger datasets than those we covered in our preliminary experiments.

Finally, this paper only scratches the surface of a new family of pangenome representations. There are myriad of options how to optimize among the valid segmentations [28, 13], e.g. by optimizing the number of founder segments [28] or controlling the over-expressiveness of the graph, rather than minimizing the maximum block size as studied here.

References

  • [1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.
  • [2] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 911–930. SIAM, 2020.
  • [3] Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Degenerate String Comparison and Applications. In Laxmi Parida and Esko Ukkonen, editors, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), volume 113 of Leibniz International Proceedings in Informatics (LIPIcs), pages 21:1–21:14, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. URL: http://drops.dagstuhl.de/opus/volltexte/2018/9323, doi:10.4230/LIPIcs.WABI.2018.21.
  • [4] Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Degenerate string comparison and applications. In Laxmi Parida and Esko Ukkonen, editors, 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, August 20-22, 2018, Helsinki, Finland, volume 113 of LIPIcs, pages 21:1–21:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [5] Djamal Belazzougui and Fabio Cunial. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 10:1–10:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [6] Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In Gianlorenzo D’Angelo, editor, 17th International Symposium on Experimental Algorithms, SEA 2018, June 27-29, 2018, L’Aquila, Italy, volume 103 of LIPIcs, pages 17:1–17:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.SEA.2018.17, doi:10.4230/LIPIcs.SEA.2018.17.
  • [7] Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. ACM Trans. Algorithms, 16(2), March 2020. URL: https://doi.org/10.1145/3381417, doi:10.1145/3381417.
  • [8] Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, and Simon J. Puglisi. Bidirectional variable-order de bruijn graphs. Int. J. Found. Comput. Sci., 29(8):1279–1295, 2018. URL: https://doi.org/10.1142/S0129054118430037, doi:10.1142/S0129054118430037.
  • [9] Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), volume 132 of Leibniz International Proceedings in Informatics (LIPIcs), pages 21:1–21:15, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. URL: http://drops.dagstuhl.de/opus/volltexte/2019/10597, doi:10.4230/LIPIcs.ICALP.2019.21.
  • [10] Christina Boucher, Alexander Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de bruijn graphs. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, 2015, pages 383–392. IEEE, 2015. URL: https://doi.org/10.1109/DCC.2015.70, doi:10.1109/DCC.2015.70.
  • [11] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de bruijn graphs. In Benjamin J. Raphael and Jijun Tang, editors, Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, volume 7534 of Lecture Notes in Computer Science, pages 225–235. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-33122-0_18, doi:10.1007/978-3-642-33122-0_18.
  • [12] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  • [13] Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen, and Tuukka Norri. Linear time maximum segmentation problems in column stream model. In Nieves R. Brisaboa and Simon J. Puglisi, editors, String Processing and Information Retrieval - 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7-9, 2019, Proceedings, volume 11811 of Lecture Notes in Computer Science, pages 322–336. Springer, 2019.
  • [14] Rene De La Briandais. File searching using variable length keys. In Papers Presented at the the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM ’59 (Western), page 295–298, New York, NY, USA, 1959. Association for Computing Machinery. URL: https://doi.org/10.1145/1457838.1457895, doi:10.1145/1457838.1457895.
  • [15] Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 55:1–55:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [16] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless seth fails, 2020. arXiv:2002.00629.
  • [17] Harold N. Gabow, Jon Louis Bentley, and Robert Endre Tarjan. Scaling and related techniques for geometry problems. In Richard A. DeMillo, editor, Proceedings of the 16th Annual ACM Symposium on Theory of Computing, April 30 - May 2, 1984, Washington, DC, USA, pages 135–143. ACM, 1984. URL: https://doi.org/10.1145/800057.808675, doi:10.1145/800057.808675.
  • [18] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for bwt-based data structures. Theor. Comput. Sci., 698:67–78, 2017.
  • [19] Erik Garrison, Jouni Sirén, Adam Novak, Glenn Hickey, Jordan Eizenga, Eric Dawson, William Jones, Shilpa Garg, Charles Markello, Michael Lin, and Benedict Paten. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36, 08 2018. doi:10.1038/nbt.4227.
  • [20] Daniel Gibney and Sharma V. Thankachan. On the hardness and inapproximability of recognizing wheeler graphs. In Michael A. Bender, Ola Svensson, and Grzegorz Herman, editors, 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany, volume 144 of LIPIcs, pages 51:1–51:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [21] Simon Gog. Sdsl – succinct data structure library, 2020. https://github.com/simongog/sdsl-lite.
  • [22] Lin Huang, Victoria Popic, and Serafim Batzoglou. Short read alignment with populations of genomes. Bioinformatics, 29(13):361–370, 2013.
  • [23] G. Jacobson. Space-efficient static trees and graphs. In Proc. FOCS, pages 549–554, 1989.
  • [24] Sorina Maciuca, Carlos del Ojo Elias, Gil McVean, and Zamin Iqbal. A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 222–233. Springer, 2016.
  • [25] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. URL: https://doi.org/10.1137/0222058, doi:10.1137/0222058.
  • [26] Tobias Marschall, Manja Marz, Thomas Abeel, Louis Dijkstra, Bas E Dutilh, Ali Ghaffaari, Paul Kersey, Wigard Kloosterman, Veli Mäkinen, Adam Novak, et al. Computational pan-genomics: status, promises and challenges. BioRxiv, page 043430, 2016.
  • [27] Niema Moshiri. Viralmsa: Massively scalable reference-guided multiple sequence alignment of viral genomes. bioRxiv, 2020. doi:10.1101/2020.04.20.052068.
  • [28] Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Linear time minimum segmentation enables scalable founder reconstruction. Algorithms Mol. Biol., 14(1):12:1–12:15, 2019.
  • [29] Pasi Rastas and Esko Ukkonen. Haplotype inference via hierarchical genotype parsing. In Algorithms in Bioinformatics, 7th International Workshop, WABI 2007, Philadelphia, PA, USA, September 8-9, 2007, Proceedings, pages 85–97, 2007.
  • [30] Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput., 213:13–22, 2012.
  • [31] Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. arXiv preprint arXiv:1805.03834, 2018.
  • [32] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014.
  • [33] Chris Thachuk. Indexing hypertext. Journal of Discrete Algorithms, 18:113–122, 2012.
  • [34] Esko Ukkonen. Finding founder sequences from a set of recombinants. In Algorithms in Bioinformatics, Second International Workshop, WABI 2002, Rome, Italy, September 17-21, 2002, Proceedings, pages 277–286, 2002.