In the last several decades, the number of sequenced human genomes has been growing at unprecedented pace. In 2015 the number of sequenced genomes was doubling every months [stephens_big_2015] – a pace that has not slowed into the current decade. The plethora of resulting sequencing data has expanded our knowledge of the biomarkers responsible for human disease and phenotypes [Berner, 100K, 1000genomes], the evolutionary history between and among species [VGP], and will eventually help realize the personalization of healthcare [personal]. However, the amount of data for any individual species is large enough that it poses challenges with respect to storage and analysis. One of the most well-known and widely-used methods for compressing and indexing data that has been applied in bioinformatics is the Burrows-Wheeler Transform (), which is a text transformation that compresses the input in a manner that also allows for efficient substring queries. Not only can it be constructed in linear-time in the length of the input, it is also reversible – meaning the original input can be constructed from its compressed form. The is formally defined over a single input string; thus, in order to define and construct it for one or more strings, the input strings need to be concatenated or modified in some way. In 2007 Mantaci et al. [MantaciRRS07] presented a formal definition of the for a multiset of strings, which they call the extended Burrows-Wheeler Transform (). It is a bijective transformation that sorts the cyclic rotations of the strings of the multiset according to the -order relation, an order, defined by considering infinite iterations of each string, which is different from the lexicographic order.
Since its introduction several algorithms have been developed that construct the of collection of strings for various types of biological data including short sequence reads [DBLP:journals/tcs/BonizzoniVPPR21, DBLP:journals/tcs/BauerCR13, DBLP:journals/bioinformatics/CoxBJR12, egidi2019external, louza2020gsufsort, diaz2021efficient, egidi2019external, Ander2013, GuerriniRosone_Alcob2019, PrezzaPSR19, PrezzaPSR20], protein sequences [YANG2010742], metagenomic data [meta] and longer DNA sequences such as long sequence reads and whole chromosomes [rope]. However, we note that in the development of some of these methods the underlying definition of was loosened. For example, ropebwt2 [rope] tackles a similar problem of building what they describe as the FM-index for a multiset of long sequence reads, however, they do not construct the suffix array () or samples, and also, require that the sequences are delimited by separator symbols. Similarly, gsufsort [louza2020gsufsort] and egap [egidi2019external] construct the for a collection of strings but do not construct the according to its original definition. gsufsort [louza2020gsufsort] requires the collection of strings to be concatenated in a manner that the strings are deliminated by separator symbols that have an augmented relative order among them. egap [egidi2019external], which was developed to construct the and for a collection of strings in external memory, uses the gSACA-K algorithm to construct the suffix array of the concatenated input using an additional bits, and then constructs the for the collection from the resulting suffix array. Lastly, we note that there exists a number of methods for construction of the for a collection of short sequence reads, including ble [DBLP:journals/tcs/BonizzoniVPPR21], BCR [DBLP:journals/tcs/BauerCR13], G2BWT [diaz2021efficient], egsa [egsa]; however, these methods make implicit or explicit use of end-of-string symbols appended to strings in the collection. For an example of the effects of these manipulations, see Section 2, and [CL21] for a more detailed study.
We present an efficient algorithm for constructing the that preserves the original definition of Mantaci et al. [MantaciRRS07]—thus, it does not impose any ordering of the input strings or delimiter symbols. It is an adaptation of the well-known Suffix Array Induced Sorting (SAIS) algorithm of Nong et al. [NongZC2011], which computes the suffix array of a single string ending with an end-of-string character . Our adaptation is similar to the algorithm proposed by Bannai et al. [BannaiKKP21] for computing the , which can also be used for computing the , after linear-time preprocessing of the input strings. The key change in our approach is based on the insight that the properties necessary for applying Induced Sorting are valid also for the -order between different strings. As a result, is it not necessary that the input be Lyndon words, or that their relative order be known at the beginning. Furthermore, our algorithmic strategy, when applied to a single string, provides the first linear-time algorithm for computing the of the string that uses neither an end-of-string symbol nor Lyndon rotations.
We then combine our new construction with a variation of a preprocessing technique called prefix free parsing (). was introduced by Boucher et al. [DBLP:journals/almob/BoucherGKLMM19] for building the (run length encoded) of large and highly repetitive input text. Since its original introduction, it has been extended to construct the -index [recomb19], been applied as a preprocessing step for building grammars [BigRePair], and used as a data structure itself [boucher2020pfp]. Briefly, is a one-pass algorithm that divides the input into overlapping variable length phrases with delimiting prefixes and suffixes; which in effect, leads to the construction of what is referred to as the dictionary and parse of the input. It follows that the can be constructed in the space that is proportional to the size of the dictionary and parse, which is expected to be significantly smaller than linear for repetitive text.
In our approach, prefix-free parsing is applied to obtain a parse that is a multiset of cyclic strings (cyclic prefix-free parse) on which our construction is applied. We implement our approach (called pfpebwt), measure the time and memory required to build the for sets of increasing size of chromosome 19, Salmonella, and SARS-CoV2 genomes, and compare this to that required by gsufsort, ropebwt2, and egap. We show that pfpebwt is consistently faster and uses less memory than gsufsort and egap on reasonably large input ( 4 copies of chromosome 19, 50 Salmonella genomes, and 25,000 SARS-CoV2 genomes). Although ropebwt2 uses less memory than pfpebwt on large input, pfpebwt is 7x more efficient in terms of wall clock time, and 2.8x in terms of CPU time. Moreover, pfpebwt is capable of reporting samples in addition to the with a negligible increase in time and memory [recomb19], whereas ropebwt2 does not have that ability. If we compare pfpebwt only with methods that are able to report samples in addition to the (e.g., egap and gsufsort), we obtain a 57.1x improvement in peak memory.
A string is a sequence of characters drawn from an ordered alphabet of size . We denote by the length of , and by the empty string, the only string of length . Given two integers , we denote by the string , if , while if . We refer to as a substring (or factor) of , to as the -th prefix of , and to as the -th suffix of . A substring of is called proper if . Given two strings and , we denote by the length of the longest common prefix of and , i.e., .
Given a string and an integer , we denote by the -length string (-fold concatenation of ), and by the infinite string obtained by concatenating an infinite number of copies of . A string is called primitive if implies and . For any string , there exists a unique primitive word and a unique integer such that . We refer to as and to as . Thus,
We denote by the lexicographic order: for two strings and , if is a proper prefix of , or there exists an index such that and . Given a string , the suffix array [mm1993], denoted by , is the permutation of such that is the -th lexicographically smallest suffix of .
We denote by the -order [GeRe93, MantaciRRS07], defined as follows: for two strings and , if and , or (this implies ). One can verify that the -order relation is different from the lexicographic one. For instance, but .
The string is a conjugate of the string if , for some (also called the -th rotation of ). The conjugate is also denoted . It is easy to see that is primitive if and only if it has distinct conjugates. A Lyndon word is a primitive string which is lexicographically smaller than all of its conjugates. For a string , the conjugate array111Our conjugate array is called circular suffix array and denoted in [HonKLST12, BannaiKKP21], and BW-array in [KucherovTV13, EnCombWords], but in both cases defined for primitive strings only. of is the permutation of such that if is the -th conjugate of with respect to the lexicographic order, with ties broken according to string order, i.e. if and for some , then either , or and . Note that if is a Lyndon word, then for all [GIA07].
Given a string , a circular or cyclic substring of if it is a factor of of length at most , or equivalently, if it is the prefix of some conjugate of . For instance, ATA is a cyclic substring of AGCAT. It is sometimes also convenient to regard a given string itself as circular (or cyclic); in this case we set and .
The Burrows-Wheeler Transform [BW94] of , denoted , is a reversible transformation extensively used in data compression. Given a string , is a permutation of the letters of which equals the last column of the matrix of the lexicographically sorted conjugates of . The mapping is reversible, up to rotation. It can be made uniquely reversible by adding to and index indicating the rank of in the lexicographic order of all of its conjugates. Given and an index , the original string can be computed in linear time [BW94]. The itself can be computed from the conjugate array, since for all , , where is considered to be cyclic.
It should be noted that in many applications, it is assumed that an end-of-string-character (usually denoted ), which is not element of , is appended to the string; this character is assumed to be smaller than all characters from . Since has exactly one occurrence of , is now uniquely reversible, without the need for the additional index , since is the unique conjugate ending in . Moreover, adding a final makes the string primitive, and is a Lyndon word. Therefore, computing the conjugate array becomes equivalent to computing the suffix array, since . Thus, applying one of the linear-time suffix-array computation algorithms [gonzalo-book] leads to linear-time computation of the .
When no -character is appended to the string, the situation is slightly more complex. For primitive strings , first the Lyndon conjugate of has to be computed (in linear time, [Shiloach81]) and then a linear-time suffix array algorithm can be employed [GIA07]. For strings which are not primitive, one can take advantage of the following well-known property of the : let and , then (Prop. 2 in [MantaciRS03]). Thus, it suffices to compute the of . The root of can be found by computing the border array of : is a power if and only if is an integer, which is then also the length of . The border array can be computed, for example, by the preprocessing phase of the KMP-algorithm for pattern matching [KMP77], in linear time in the length of .
2.2 Generalized Conjugate Array and Extended Burrows-Wheeler Transform
Given a multiset of strings , the generalized conjugate array of , denoted by or just by , contains the list of the conjugates of all strings in , sorted according to the -order relation. More formally, if is the -th string in the -sorted list of the conjugates of all strings of , with ties broken first w.r.t. the index of the string (in case of identical strings), and then w.r.t. the index in the string itself.
The extended Burrows-Wheeler Transform () is an extension of the to a multiset of strings [MantaciRRS07]. It is a bijective transformation that, given a multiset of strings , produces a permutation of the characters on the strings in the multiset . Formally, can be computed by sorting all the conjugates of the strings in the multiset according to the -order, and the output is the string obtained by concatenating the last character of each conjugate in the sorted list, together with the set of indices representing the positions of the original strings of in the list. Similarly to the , the is thus uniquely reversible. The can be computed from the generalized conjugate array of in linear time, since if , where again, the strings in are considered to be cyclic. It is easy to see that when consists of only one string, i.e. , then .
Let . Then is as follows, where we give the pair vertically, i.e. the first row contains the position in the string, and the second row the index of the string:
From the we can compute , with index set . Note that e.g. the conjugate comes before , since , because holds. The full list of conjugates is in Appendix A.
Note that if end-of-string symbols are appended to the string of the collection the output of could be quite different. For instance, if , .
Note that while in the original definition of [MantaciRRS07], the multiset was assumed to contain only primitive strings, our definition is more general and allows also for non-primitive strings. For example, , with index set , while , with index set . Also the linear-time algorithm for recovering the original multiset can be straightforwardly extended.
The following lemma shows how to construct the generalized conjugate array of a multiset of strings (not necessarily primitive), once we know the generalized conjugate array of the multiset of the roots of the strings in . It follows straightforwardly from the fact that equal conjugates will end up consecutively in the .
Let be a multiset of strings and let the multiset of the roots of the strings in , i.e. , where , with for . Let , where . The generalized conjugate array is then given by
From now on we will assume that the multiset consists of primitive strings.
3 A simpler algorithm for computing the and
In this section, we describe our algorithm to compute the of a multiset of strings . We will assume that all strings in are primitive, since we can use Lemma 2.2 to compute the of otherwise. Our algorithm is an adaptation of the well-known SAIS algorithm of Nong et al. [NongZC2011], which computes the suffix array of a single string ending with an end-of-string character . Our adaptation is similar to that of Bannai et al. [BannaiKKP21] for computing the , which can also be used for computing the . Even though our algorithm does not improve the latter asymptotically (both are linear time), it is significantly simpler, since it does not require first computing and sorting the Lyndon rotations of the input strings.
In the following, we assume some familiarity with the SAIS algorithm, focusing on the differences between our algorithm and the original SAIS. Detailed explanations of SAIS can be found in the original paper [NongZC2011], or in the books [ohlebusch-book, louza-book].
The main differences between our algorithm and the original SAIS algorithm are: (1) we are comparing conjugates rather than suffixes, (2) we have a multiset of strings rather than just one string, (3) the comparison is done w.r.t. the omega-order rather than the lexicographic order, and (4) the strings are not terminated by an end-of-string symbol.
We need the following definition, which is the cyclic version of the definition in [NongZC2011] (where stands for smaller, for larger, and LMS for leftmost-S):
[Cyclic types, LMS-substrings] Let be a primitive string of length at least , and . Position of is called (cyclic) S-type if , and (cyclic) L-type if . An S-type position is called (cyclic) LMS if is L-type (where we view as a cyclic string). An LMS-substring is a cyclic substring of such that both and are LMS-positions, but there is no LMS-position between and . Given a conjugate , its LMS-prefix is the cyclic substring from to the first LMS-position strictly greater than (viewed cyclically).
Since is primitive, no two conjugates are equal, and in particular, no two adjacent conjugates are equal. Therefore, the type of every position of is defined.
Continuing Example 2.2,
where we mark LMS-positions with a . The LMS-substrings are ACA, AACGGTA, CGGCA, and ACGTC. The LMS-prefix of the conjugate is CGGTA.
[Cyclic type properties] Let be primitive string of length at least . Let be the smallest and the largest character of the alphabet. Then the following hold, where is viewed cyclically:
if , then is of type , and if , then is of type ,
if , then the type of is the same as the type of ,
is of type iff , where ,
if , then is of type , and if , then is of type .
1. follows from the fact that for all , if then for all , ; 2. follows by induction from the fact that for all , if , then ; 3. and 4. follow from 2. by induction. ∎
[Linear-time cyclic type assignment] Let be a primitive string of length at least . Then all positions can be assigned a type in altogether at most steps.
Once the type of one position is known, then the assignment can be done in one cyclic pass over from right to left, by Lemma 3. Therefore, it suffices to find the type of one single position. Any position of character or of character will do; alternatively, any position such that , again by Lemma 3. Since is primitive and has length at least , the latter must exist and can be found in at most one pass over . ∎
Let be the total length of the strings in . The algorithm constructs an initially empty array of size , which, at termination, will contain the of . The algorithm also returns the set containing the set of indices in representing the positions of the strings of . The overall procedure consists of the following steps:
remove strings of length from (these will be added back at the end)
assign cyclic types to all positions of strings from
use procedure Induced Sorting to sort cyclic LMS-substrings
assign names to cyclic LMS-substrings; if all distinct, go to Step 6
recurse on new string multiset , returning array , map back to
use procedure Induced Sorting to sort all positions in , add length-1 strings in their respective positions, return
At the heart of the algorithm is the procedure Induced Sorting of [NongZC2011] (Algorithms 3.3 and 3.4), which is used once to sort the LMS-substrings (Step 3), and once to induce the order of all conjugates from the correct order of the LMS-positions (Step 6), as in the original SAIS. Before sketching this procedure, we need to define the order according to which the LMS-substrings are sorted in Step 2. Note that our definition of LMS-order is an extension of the LMS-order defined in [NongZC2011], to LMS-prefixes. It can be proved that these definitions coincide for LMS-substrings.
[LMS-order] Given two strings and , let resp. be their LMS-prefixes. We define if either is a proper prefix of , or neither is a proper prefix of the other and .
The procedure Induced Sorting for the conjugates of the multiset is analogous to the original one, except that strings are viewed cyclically. First, the array is subdivided into so-called buckets, one for each character. For , let denote the total number of occurrences of the character in the strings in . Then the buckets are , i.e., the -th bucket will contain all conjugates starting with character . The procedure Induced Sorting first inserts all LMS-positions at the end of their respective buckets, then induces the L-type positions in a left-to-right scan of , and finally, induces the S-type positions in a right-to-left scan of , possibly overwriting previously inserted positions. We need two pointers for each bucket , and , pointing to the current first resp. last free position of the bucket.
Procedure Induced Sorting [NongZC2011]
insert all LMS-positions at the end of their respective buckets; initialize head(b), tail(b) to the first resp. last position of the bucket, for all buckets
induce the L-type positions in a left-to-right scan of : for from to , if then ; increment
induce the S-type positions in a right-to-left scan of : for from to , if then ; decrement
At the end of this procedure, the LMS-substrings are listed in correct relative LMS-order (see Lemma 3.1), and they can be named according to their rank. For the recursive step, we define, for , a new string , where each LMS-substring of is replaced by its rank. The algorithm is called recursively on (Step 5).
Finally (Step 6), the array from the recursive step is mapped back into the original array, resulting in the placement of the LMS-substrings in their correct relative order. This is then used to induce the full array . All length-1 strings which were removed in Step 1 can now be inserted between the L- and S-type positions in their bucket (Lemma 3.1). See Figure 1 for a full example.
3.1 Correctness and running time
The following lemma shows that the individual steps of Induced Sorting are applicable for the -order on conjugates of a multiset (part 1), that L-type conjugates (of all strings) come before the S-type conjugates within the same bucket (part 2), and that length-1 strings are placed between S-type and L-type conjugates (part 3). The second property was originally proved for the lexicographic order between suffixes in [KA2005]:
[Induced sorting for multisets] Let .
If , then for all , .
If , is an L-type position, and an S-type position, then .
If , is an L-type position, and an S-type position, then .
Next, we show that after applying procedure Induced Sorting, the conjugates will appear in such that they are correctly sorted w.r.t. to the LMS-order of their LMS-prefixes, while the order in which conjugates with identical LMS-prefixes appear in is determined by the input order of the LMS-positions.
[Extension of Thm. 3.12 of [NongZC2011]] Let , let be the LMS-prefix of , with the last position of ; let be the LMS-prefix of , and the last position of . Let be the position of in array after the procedure Induced Sorting, and that of .
If , then .
If , then if and only if was placed before at the start of the procedure.
Both claims follow from Lemma 3.1, and the fact that from one LMS-position to the previous one, there is exactly one run of L-type positions, preceded by one run of S-type positions. ∎
The next lemma shows that the LMS-order of the LMS-prefixes respects the -order.
Let , let be the LMS-prefix of and the LMS-prefix of . If then .
If neither nor is a proper prefix one of the other, then there exists an index s.t. , and therefore, . Otherwise, is a proper prefix of . Let and . Since both and are LMS-prefixes, with being the last position of but not of , this implies that is of type S, while is of type L. Let be the next character in s.t. , and be the next character in s.t. . By Lemma 3, , , and by definition of all characters inbetween equal . Then for , we have , with being the first position where and differ. Therefore, . ∎
Algorithm SAIS-for- correctly computes the and of a multiset of strings in time , where is the total length of the strings in .
By Lemma 3, Step 2 correctly assigns the types. Step 3 correctly sorts the LMS-substrings by Lemma 3.1. It follows from Lemma 3.1 that the order of the conjugates of the new strings coincides with the relative order of the LMS-conjugates. In Step 6, the LMS-conjugates are placed in in correct relative order from the recursion; by Lemmas 3.1 and 3.1, this results in the correct placement of all conjugates of strings of length , while the positioning of the length-1 strings is given by Lemma 3.1.
For the running time, note that Step 1 takes time at most . The Induced Sorting procedure also runs in linear time . Finally, since no two LMS-positions are consecutive, and we remove strings of length , the problem size in the recursion step is reduced to at most . ∎
3.2 Computing the BWT for one single string
The special case where consists of one single string leads to a new algorithm for computing the , since for a singleton set, the coincides with the . To the best of our knowledge, this is the first linear-time algorithm for computing the of a string without an end-of-string character that uses neither Lyndon rotations nor end-of-string characters.
We demonstrate the algorithm on a well-known example, . We get the following types, from left to right: , and all three S-type positions are LMS. We insert into the array ; after the left-to-right pass, indices are in the order , and after the right-to-left pass, in the order . The LMS-substring aba (pos. 6) gets the name , and the LMS-substring ana (pos. 2,4) gets the name . In the recursive step, the new string , with types and only one LMS-position , the gets induced in just one pass: . This maps back to the original string: , and one more pass over the array results in and the nnbaaa. See Figure 2.
4 and prefix-free parsing
In this section, we show how to extend the prefix-free parsing to build the . We define the cyclic prefix-free parse for a multiset of strings (with , ) as the multiset of parses with dictionary , where we consider as circular, and is the parse of . We denote by the length of the parse .
Next, given a positive integer , let be a set of strings of length called trigger strings. We assume that each string has length at least and at least one cyclic factor in .
We divide each string into overlapping phrases as follows: a phrase is a circular factor of of length that starts and ends with a trigger string and has no internal occurrences of a trigger string. The set of phrases obtained from strings in is the dictionary . The parse can be computed from the string by replacing each occurrence of a phrase in with its lexicographic rank in .
Let and let . The dictionary of the multiset of parses of is and , where means that the parsing of is given by the second and fifth phrases of the dictionary. Note that the string has a trigger string AC that spans the first position of .
We denote by the set of suffixes of having length greater than . The first important property of the dictionary is that the set prefix-free, i.e., no string in is prefix of another string of . This follows directly from [DBLP:journals/almob/BoucherGKLMM19].
Continuing Example 4, we have that
The computation of from the prefix-free parse consists of three steps: computing the cyclic prefix-free parse of (denoted as ), computing the of by using the algorithm described in Section 3; and lastly, computing the of from the of using the lexicographically sorted dictionary and its prefix-free suffix set . We now describe the last step as follows. We define as the function that uniquely maps each character of to the pair , where with , , and corresponds to the -th character of the -th phrase of . We call and the position and the offset of , respectively. Furthermore, we define as the function that uniquely associates to each conjugate the element such that is the -th suffix of the -th element of , where . By extension, and are also called the position and the offset of the suffix .
In Example 4, since is the second character (offset 2) of the phrase , which is the first phrase (position 1) of . Moreover, since is the suffix of , which is prefix of .
Given two strings , if it follows that .
It follows from the definition of that and are prefixes of and , respectively. ∎
Given two strings . Let and be the -th and -th conjugates of and , respectively, and let and . Then if and only if either , or , i.e., precedes in .
By definition of , and , where