lyndon-array
Computing the Lyndon Array in linear time [JDA 2018, arXiv'19]
view repo
In this paper we present an algorithm to compute the Lyndon array of a string T of length n as a byproduct of the inversion of the Burrows-Wheeler transform of T. Our algorithm runs in linear time using only a stack in addition to the data structures used for Burrows-Wheeler inversion. We compare our algorithm with two other linear-time algorithms for Lyndon array construction and show that computing the Burrows-Wheeler transform and then constructing the Lyndon array is competitive compared to the known approaches. We also propose a new balanced parenthesis representation for the Lyndon array that uses 2n+o(n) bits of space and supports constant time access. This representation can be built in linear time using O(n) words of space, or in O(n n/ n) time using asymptotically the same space as T.
READ FULL TEXT VIEW PDF
We present a simple algorithm for computing the document array given the...
read it
The Burrows-Wheeler transform (BWT) is a permutation whose applications ...
read it
We assume the permutation π is given by an n-element array in which the
...
read it
Simon's congruence ∼_k is defined as follows: two words are
∼_k-equivale...
read it
We present the first worst-case linear time algorithm that directly comp...
read it
The inversion distance, that is the distance between two unichromosomal
...
read it
We propose a B tree representation storing n keys, each of k bits, in
ei...
read it
Computing the Lyndon Array in linear time [JDA 2018, arXiv'19]
Lyndon words were introduced to find bases of the free Lie algebra CFL58 , and have been extensively applied in algebra and combinatorics. The term “Lyndon array” was apparently introduced in Franek2016 , essentially equivalent to the “Lyndon tree” of Hohlweg & Reutenauer Hohlweg2003 . Interest in Lyndon arrays has been sparked by the surprising characterization of runs through Lyndon words by Bannai et al. Bannai2014 , who were thus able to resolve the long-standing conjecture that the number of runs (maximal periodicities) in any string of length is less than .
The Burrows-Wheeler transform (BWT) Burrows1994 plays a fundamental role in data compression and in text indexing Ohlebusch2013 ; Makinen2015 ; Navarro2016 . Embedded into a wavelet tree, the BWT is a self-index with a remarkable time/space tradeoff Ferragina2000 ; Grossi2003 .
In this article we introduce a linear time algorithm to construct the Lyndon array of a string of length , from an ordered alphabet of size , as a byproduct of Burrows-Wheeler inversion, thus establishing an apparently unremarked connection between BWT and Lyndon array construction. We compare our algorithm to others in the literature that also compute the Lyndon array in worst-case linear time. We find that the new algorithm performs well in practice with a small memory footprint.
Inspired by the inner working of our new algorithm, we propose a representation of the Lyndon array consisting of a balanced parenthesis string of length . Such representation leads to a data structure of size bits, supporting the computation of each entry of the Lyndon array in constant time. We also show that such representation is theoretically appealing since it can be computed from in time using words of space, or in time using bits of space.
This article is organized as follows. Section 2 introduces concepts, notation and related work. Section 3 presents our algorithm and Section 4 shows experimental results. Section 5 describes our balanced parenthesis representation of the Lyndon array and two construction algorithms with different time/space tradeoffs. Section 6 summarizes our conclusions.
Let be a string of length over an ordered alphabet of size . The -th symbol of is denoted by and the substring is denoted by , for . We assume that always ends with a special symbol , that doesn’t appear elsewhere in and precedes every symbol in . A prefix of is a substring of the form and a suffix is a substring of the form , which will be denoted by . We use the symbol for the lexicographic order relation between strings.
The suffix array () Manber1993 ; Gonnet1992 of a string is an array of integers in the range that gives the lexicographic order of all suffixes of , such that . We denote the inverse of as , . The suffix array can be constructed in linear time using additional space Nong2013 .
The next smaller value array () defined for an array of integers stores in the position of the next value in that is smaller than . If there is no value in smaller than then . Formally, . may be constructed in linear time using additional memory for an auxiliary stack Goto2013 .
A string of length is called a Lyndon word if it is lexicographically strictly smaller than its rotations CFL58 . Alternatively, if is a Lyndon word and is any factorization of into non-empty strings, then . The Lyndon array of a string , denoted or simply when is understood, has length and stores at each position the length of the longest Lyndon word starting at .
Following Hohlweg2003 , Franek et al. Franek2016 have recently shown that the Lyndon array can be easily computed in linear time by applying the computation to the inverse suffix array (), such that , for . Also, in a recent talk surveying Lyndon array construction, Franek and Smyth Smyth2017 quote an unpublished observation by Cristoph Diegelmann Diegelmann2016 that, in its first phase, the linear-time suffix array construction algorithm by Baier Baier2016 computes a permuted version of the Lyndon array. This permuted version, called , stores in the length of the longest Lyndon word starting at position of . Thus, including the BWT-based algorithm proposed here, there are apparently three algorithms that compute the Lyndon array in worst-case time. In addition, in (Bannai2014, , Lemma 23) a linear-time algorithm is suggested that uses lca/rmq techniques to compute the Lyndon tree. The same paper also gives an algorithm for Lyndon tree calculation described as being “in essence” the same as .
The Burrows-Wheeler transform (BWT) Burrows1994 ; Adjeroh2008 is a reversible transformation that produces a permutation of the original string such that equal symbols of tend to be clustered in . The BWT can be obtained by adding each circular shift of as a row of a conceptual matrix , lexicographically sorting the rows of producing , and concatenating the symbols in the last column of to form . Alternatively, the BWT can be obtained from the suffix array through the application of the relation if or otherwise.
Burrows-Wheeler inversion, the processing of to obtain , is based on the LF-mapping (last-to-first mapping). Let and be the first and the last columns of the conceptual matrix mentioned above. We have such that if is the occurrence of a symbol in , then corresponds to the position of the occurrence of in .
sorted | sorted | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
circular shifts | circular shifts | suffixes | ||||||||
1 | banana$ | $banana | 7 | 5 | 2 | 2 | 1 | 1 | a | $ |
2 | anana$b | a$banan | 6 | 4 | 4 | 6 | 2 | 1 | n | a$ |
3 | nana$ba | ana$ban | 4 | 7 | 4 | 7 | 1 | 2 | n | ana$ |
4 | ana$ban | anana$b | 2 | 3 | 6 | 5 | 2 | 2 | b | anana$ |
5 | na$bana | banana$ | 1 | 6 | 6 | 1 | 1 | 1 | $ | banana$ |
6 | a$banan | na$bana | 5 | 2 | 7 | 3 | 1 | 1 | a | na$ |
7 | $banana | nana$ba | 3 | 1 | 8 | 4 | 1 | 1 | a | nana$ |
The -mapping can be pre-computed in an array of integers in the range . Given an array of integers of length that stores in the number of symbols in strictly smaller than , can be computed in linear time using bits of additional space (Ohlebusch2013, , Alg. 7.2). Alternatively, can be computed on-the-fly in time querying a wavelet tree Grossi2003 constructed for . Given the and the array, the Burrows-Wheeler inversion can be performed in linear time (Ohlebusch2013, , Alg. 7.3).
Our starting point is the following characterization of the Lyndon array.
Let be the smallest position in after position such that suffix is lexicographically smaller than suffix ; that is, . Then the length of the longest Lyndon word starting at position is . If then .
For let be defined as above and . If then , , such that and . Since it follows that , hence and is a Lyndon word. In addition, , hence and is not a Lyndon word.
The above lemma is at the basis of the algorithm by Franek et al. Franek2016 computing as . Since is the lexicographic rank of , is precisely the value used in the lemma. In this section, we use the relationship between -mapping and to design alternative algorithms for Lyndon array construction. Since , and it follows that where denotes the map iterated times.
Given the and the mapping our algorithm computes and the Lyndon array from right to left. Briefly, our algorithm finds, during the Burrows-Wheeler inversion, for each position , the first suffix that is smaller than and using Lemma 1 it computes .
Starting with and an index in the BWT, the algorithm decodes the BWT according to , keeping the visited positions whose indexes are smaller than pos in a stack. The visited positions indicate the suffix ordering: a suffix visited at position is lexicographically smaller than all suffixes visited at positions . The stack stores pairs of integers corresponding to each visited position in iteration . The stack is initialized by pushing . The complete pseudo-code appears in Algorithm 1. Note that lines 1, 2, 7, 8, 15 and 16 are exactly the lines from the Burrows-Wheeler inversion presented in (Ohlebusch2013, , Alg. 7.3).
An element in the stack represents the suffix visited in iteration . At iteration the algorithm pops suffixes that are lexicographically larger than the current suffix . Consequently, at the end of the while loop, the top element represents the next suffix (in text order) that is smaller than and is computed at line 1.
Figure 2 shows a running example of our algorithm to compute the Lyndon array for string during its Burrows-Wheeler inversion. Before step is set to (lines 1–6) is decoded at position and the stack is initialized with the end-of-stack marker . The first loop iteration (lines 7–15) decodes a and finds out that the stack is empty. Then , the pair is pushed on the stack and .
At the second iteration n is decoded and the algorithm checks if the suffix at the top of the stack (a$) is larger then the current suffix (na$). The algorithm does not pop the stack because there is no suffix lexicographically larger than the current one. Then . The pair is pushed on the stack.
At the third iteration a is decoded. The top element, representing suffix na$, is popped since it is larger then the current suffix ana$. Then and the pair is pushed. The next iterations proceed in a similar fashion.
Algorithm 1 computes the text and its Lyndon array in time using words of space.
Since each instruction takes constant time, the running time is proportional to the number of stack operations, which is since each text position is added to the stack exactly once. The space usage is dominated by the arrays , , and by the stack that use words in total.
In this section we compare our algorithm with the linear time algorithms of Hohlweg and Reutenauer Hohlweg2003 ; Franek2016 (NSV-Lyndon) and Baier Baier2016 (Baier-Lyndon). All algorithms were adapted to compute only the Lyndon array for an input string . In order to compare our solution with the others, we compute the suffix array for the input string , then we obtain and the array, and finally we construct the Lyndon array during Burrows-Wheeler inversion (Algorithm 1). This procedure will be called BWT-Lyndon. We used algorithm SACA-K Nong2013 to construct in time using working space. was computed in the same space as (overwriting the values) both in BWT-Lyndon and in NSV-Lyndon.
We implemented all algorithms in ANSI C. The source code is publicly available at https://github.com/felipelouza/lyndon-array. The experiments were executed on a 64-bit Debian GNU/Linux 8 (kernel 3.16.0-4) system with an Intel Xeon Processor E5-2630 v3 20M Cache -GHz, GB of internal memory and a TB SATA storage. The sources were compiled by GNU GCC version 4.9.2, with the optimizing option -O3 for all algorithms. The time was measured using the clock() function of C standard libraries and the peak memory usage was measured using malloc_count library^{1}^{1}1http://panthema.net/2013/malloc_count.
We used string datasets from Pizza & Chili^{2}^{2}2https://pizzachili.dcc.uchile.cl/ as shown in the first three columns of Tables 1 and 2. The datasets einstein-de, kernel, fib41 and cere are highly repetitive texts The dataset english.1gb is the first 1GB of the original english dataset. In our experiments, each integer array of length is stored using bytes, and each string of length is stored using bytes.
Table 1 shows the running time (in seconds), the peak space memory (in bytes per input symbol) and the working space (in GB) of each algorithm.
The fastest algorithm was Baier-Lyndon, which overall spent about two-thirds of the time required by BWT-Lyndon, though the timings were much closer for larger alphabets. NSV-Lyndon was slightly faster than BWT-Lyndon, requiring about of the time spent by BWT-Lyndon on average.
The smallest peak space was obtained by BWT-Lyndon and NSV-Lyndon, which both use slightly more than bytes. BWT-Lyndon uses bytes to store the string and the integer arrays and , plus the space used by the stack, which occupied about KB in all experiments, except for dataset cere, in which the stack used KB. The strings and are computed and decoded in the same space. NSV-Lyndon also requires bytes to store the string and the integer arrays and , that plus the space of the stack used to compute Goto2013 , which used exactly the same amount of memory used by the stack of BWT-Lyndon. The array is computed in the same space as . Baier-Lyndon uses bytes to store , and three auxiliary integer arrays of size .
The working space is the peak space not counting the space used by the input string and the output array ( bytes). The working space of BWT-Lyndon and NSV-Lyndon were by far the smallest in all experiments. Both algorithms use about of the working space used by Baier-Lyndon. For dataset proteins, BWT-Lyndon and NSV-Lyndon use GB less memory than Baier-Lyndon.
running time | peak space | working space | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
[secs] | [bytes/n] | [GB] | |||||||||
BWT-Lyndon |
NSV-Lyndon |
Baier-Lyndon |
BWT-Lyndon |
NSV-Lyndon |
Baier-Lyndon |
BWT-Lyndon |
NSV-Lyndon |
Baier-Lyndon |
|||
sources | 230 | 201 | 68 | 55 | 57 | 9 | 9 | 17 | 0.79 | 0.79 | 2.36 |
dblp | 97 | 282 | 104 | 87 | 90 | 9 | 9 | 17 | 1.10 | 1.10 | 3.31 |
dna | 16 | 385 | 198 | 160 | 113 | 9 | 9 | 17 | 1.50 | 1.50 | 4.51 |
english.1gb | 239 | 1,047 | 614 | 504 | 427 | 9 | 9 | 17 | 4.09 | 4.09 | 12.27 |
proteins | 27 | 1,129 | 631 | 524 | 477 | 9 | 9 | 17 | 4.41 | 4.41 | 13.23 |
einstein.de | 117 | 88 | 36 | 32 | 25 | 9 | 9 | 17 | 0.35 | 0.35 | 1.04 |
kernel | 160 | 246 | 100 | 75 | 73 | 9 | 9 | 17 | 0.96 | 0.96 | 2.88 |
fib41 | 2 | 256 | 120 | 93 | 18 | 9 | 9 | 17 | 1.00 | 1.00 | 2.99 |
cere | 5 | 440 | 215 | 169 | 114 | 9 | 9 | 17 | 1.72 | 1.72 | 5.16 |
Table 2 shows the running time (in seconds) for each step of algorithms BWT-Lyndon and NSV-Lyndon. Step 1, constructing , is the most time-consuming part of both algorithms, taking about of the total time. Incidentally, this means that if the input consists of the BWT rather than , our algorithm would clearly be the fastest. In Step 2, computing is faster than computing since is more cache-efficient than . Similarly in Step 3, computing is more efficient than computing Goto2013 . However, Step 4 of BWT-Lyndon, which computes during the Burrows-Wheeler inversion, is sufficiently slower (by a factor of ) than computing from and , so that the overall time of BWT-Lyndon is larger than NSV-Lyndon, as shown in Table 1.
Step 1 | Step 2 | Step 3 | Step 4 | ||||||
BWT-Lyndon |
NSV-Lyndon |
BWT-Lyndon |
NSV-Lyndon |
BWT-Lyndon |
NSV-Lyndon |
||||
sources | 230 | 201 | 50.27 | 2.28 | 3.49 | 0.97 | 1.65 | 14.12 | 0.13 |
dblp | 97 | 282 | 79.83 | 4.02 | 5.51 | 1.46 | 1.61 | 18.80 | 0.18 |
dna | 16 | 385 | 145.02 | 8.07 | 9.99 | 1.48 | 4.04 | 43.39 | 0.25 |
english.1gb | 239 | 1,047 | 459.29 | 21.86 | 34.35 | 4.70 | 10.31 | 127.65 | 0.72 |
proteins | 27 | 1,129 | 478.13 | 21.96 | 34.99 | 4.71 | 10.47 | 125.73 | 0.75 |
einstein-de | 117 | 88 | 28.79 | 1.27 | 1.91 | 0.48 | 0.85 | 5.10 | 0.06 |
kernel | 160 | 246 | 68.19 | 3.52 | 4.92 | 1.29 | 2.18 | 27.21 | 0.18 |
fib41 | 2 | 256 | 85.94 | 5.94 | 6.55 | 1.38 | 0.60 | 26.48 | 0.18 |
cere | 5 | 440 | 153.89 | 9.30 | 11.20 | 2.24 | 4.38 | 49.38 | 0.42 |
In this section we introduce a new representation for the Lyndon array of consisting of a balanced parenthesis string of length . The existence of this representation is not surprising in view of Observation 3 in Franek2016 stating that Lyndon words do not overlap (see also the bracketing algorithm in SawadaR03 ). Algorithm 2 gives an operational strategy for building such a representation, and the next lemma shows how to use it to retrieve individual values of . In the following, given a balanced parenthesis string , we write (resp. ) to denote the position in of the -th open parenthesis (resp. the position in of the closed parenthesis closing the -th open parenthesis).
First note that at the -th iteration we append an open parenthesis to and add the value to the stack. The value is removed from the stack as soon as a smaller element is encountered. Since the last value is the smallest element, at the end of the for loop the stack only contains the value 1, which is removed at the exit of the loop. Observing that we append a closed parenthesis to every time a value is removed from the stack, at the end of the algorithm indeed contains open and closed parentheses. Because of the use of the stack, the closing parenthesis follow a first-in last-out logic so the parenthesis are balanced.
By construction, for , the closed parenthesis corresponding to is written immediately before the open parenthesis corresponding to . Hence, between the open and closed parenthesis corresponding to there is a pair of open/closed parenthesis for each entry , . Hence, using the notation (1) and Lemma 1 it is
which implies (2). Finally, for we have and , so and the lemma follows.
Using the range min-max tree from NStalg12 we can represent in bits of space and support , and in time. We have therefore established the following result.
It is possible to represent the Lyndon array for a text in bits such that we can retrive every value in time.
Since the new representation takes bits, it is desirable to build it without storing explicitly , which takes words. In Section 3 we used the map to generate the values right-to-left (from to ). Since in Algorithm 2 we need to generate the values left-to-right, we use the inverse permutation of the map, known in the literature as the map. Formally, for is defined as
(3) |
Assume we have a data structure supporting the operation on the BWT in time. Then, we can generate the values in time using additional bits of space.
By (3) it follows that and, for , . To prove the lemma we need to show how to compute each in time. By definition, is the position in of the character prefixing row in the conceptual matrix defining the BWT. Let denote the binary array such that iff row is the first row of the BWT matrix prefixed by some character . Then, the character prefixing row is given by and
The thesis follows observing that using talg/RamanRS07 we can represent in bits supporting constant time queries.
Using Algorithm 2 we can compute from the BWT in time using words of space.
We represent using one of the many available data structures taking bits and supporting constant time queries (see BCGNNalgor13 and references therein). By Lemma 4 we can generate the values in overall time using auxiliary space. Since every other operations takes constant time, the running time is proportional to the number of stack operations which is since each is inserted only once in the stack.
Note that the space usage of Algorithm 2 is dominated by the stack, which uses words in the worst case. Since at any given time the stack contains an increasing subsequence of , if we can assume that is a random permutation the average stack size is words (see AD99 ).
We now present an alternative representation for the stack that only uses bits in the worst case and supports pop and push operations in time. We represent the stack with a binary array such that iff the value is currently in the stack. Since the values in the stack are always in increasing order, is sufficient to represent the current status of the stack. In Algorithm 2 when a new element is added to the stack we must first delete the elements larger than . This can be accomplished using rank/select operations. If the elements to be deleted are those returned by for . Summing up, the binary array must support the rank/select operations in addition to changing the value of a single bit. To this end we use the dynamic array representation described in spire/HeM10 which takes bits and support the above operations in (optimal) time. We have therefore established, this new time/space tradeoff for Lyndon array construction.
Using Algorithm 2 we can compute from the BWT in time using bits of space.
Finally, we point out that if the input consists of the text the asymptotic costs do not change, since we can build the BWT from in time and bits of space soda/MunroNN17 .
Given we can compute in time using words of space, or in time using bits of space.
In this paper we have described a previously unknown connection between the Burrows-Wheeler transform and the Lyndon array, and proposed a corresponding algorithm to construct the latter during Burrows-Wheeler inversion. The algorithm is guaranteed linear-time and simple, resulting in the good practical performance shown by the experiments.
Although not faster than other linear algorithms, our solution was one of the most space-efficient. In addition, if the input is stored in a BWT-based self index, our algorithm would have a clear advantage in both working space and running time, since it is the only one that uses the LF-map rather than the suffix array.
We also introduced a new balanced parenthesis representation for the Lyndon array using bits supporting time access. We have shown how to build this representation in linear time using words of space, and in time using asymptotically the same space as .
Over all the known algorithms surveyed in Smyth2017
, probably the fastest for real world datasets and the most space-efficient is the folklore
MaxLyn algorithm described in Franek2016 , which makes no use of suffix arrays and requires only constant additional space, but which however requires time in the worst-case. We tested MaxLyn on a string consisting of symbols ‘a’. While the linear-time algorithms run in no more than seconds, MaxLyn takes about hours to compute the Lyndon array. Thus, the challenge that remains is to find a fast and “lightweight” worst-case linear-time algorithm for computing the Lyndon array that avoids the expense of suffix array construction.The authors thank Uwe Baier for kindly providing the source code of algorithm Baier-Lyndon, and Prof. Nalvo Almeida for granting access to the machine used for the experiments.
F.A.L. was supported by the grant 2017/09105-0 from the São Paulo Research Foundation (FAPESP). The work of W.F.S. was supported in part by a grant from the Natural Sciences & Engineering Research Council of Canada (NSERC). G.M. was supported by the PRIN grant 201534HNXC and INdAM-GNCS Project Efficient algorithms for the analysis of Big Data. G.P.T. acknowledges the support of CNPq.
D. Adjeroh, T. Bell, A. Mukherjee, The burrows-wheeler transform: data compression, suffix arrays, and pattern matching, Springer, Boston, MA, 2008.
Comments
There are no comments yet.