Inducing the Lyndon Array

05/30/2019 ∙ by Felipe A. Louza, et al. ∙ Universidade de São Paulo, Università del Piemonte Orientale, UniPa, University of Campinas

In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in O(n) time using σ + O(1) words of working space, where n is the length of the text and σ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use O(n) words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.


1 Introduction

The suffix array is a central data structure for string processing. Induced suffix sorting is a remarkably powerful technique for the construction of the suffix array. Induced sorting was introduced by Itoh and Tanaka [10] and later refined by Ko and Aluru [11] and by Nong et al. [18]. In 2013, Nong [17] proposed a space-efficient linear time algorithm based on induced sorting, called SACA-K, which uses only σ + O(1) words of working space, where σ is the alphabet size and the working space is the space used in addition to the input and the output. Since a small working space is a very desirable feature, there have been many algorithms adapting induced suffix sorting to the computation of data structures related to the suffix array, such as the Burrows-Wheeler transform [20], the Φ-array [8], the LCP array [4, 14], and the document array [13].

The Lyndon array of a string is a powerful tool that generalizes the idea of Lyndon factorization. In the Lyndon array (LA) of a string T = T[1]…T[n] over the alphabet Σ, each entry LA[i], with 1 ≤ i ≤ n, stores the length of the longest Lyndon factor of T starting at position i. Bannai et al. [2] used Lyndon arrays to prove the conjecture by Kolpakov and Kucherov [12] that the number of runs (maximal periodicities) in a string of length n is smaller than n. In [3] the authors have shown that the computation of the Lyndon array of T is strictly related to the construction of the Lyndon tree [9] of the string $T (where the symbol $ is smaller than any symbol of the alphabet Σ).

In this paper we address the problem of designing a space-economical linear time algorithm for the computation of the Lyndon array. As described in [5, 15], there are several algorithms to compute the Lyndon array. It is noteworthy that the ones that run in linear time (cf. [1, 3, 5, 6, 15]) use the sorting of the suffixes (or a partial sorting of suffixes) of the input string as a preprocessing step. Among the linear time algorithms, the most space-economical is the one in [5] which, in addition to the n log σ bits for the input string plus 2n words for the Lyndon array and suffix array, uses a stack whose size depends on the structure of the input. Such a stack is relatively small for non-pathological texts, but in the worst case its size can be up to n words. Therefore, the overall space in the worst case can be up to n log σ bits plus 3n words.

In this paper we propose a variant of the algorithm SACA-K that computes in linear time the Lyndon array as a by-product of suffix array construction. Our algorithm uses overall n log σ bits plus 2n + σ + O(1) words of space. This bound makes our algorithm the one with the best worst-case space bound among the linear time algorithms. Note that the σ + O(1) words of working space of our algorithm are optimal for strings from alphabets of constant size. Our experiments show that our algorithm is competitive in practice compared to the other linear time solutions to compute the Lyndon array.

2 Background

Let T = T[1]T[2]…T[n] be a string of length n over a fixed ordered alphabet Σ of size σ, where T[i] denotes the i-th symbol of T. We denote by T[i, j] the factor of T starting at the i-th symbol and ending at the j-th symbol. A suffix of T is a factor of the form T[i, n] and is also denoted as T_i. In the following we assume that any integer array of length n with values in the range [1, n] takes n words (n log n bits) of space.

Given T = T[1]…T[n], the i-th rotation of T is the string T[i+1, n]T[1, i]. Note that a string of length n has n possible rotations. A string T is a repetition if there exists a string S and an integer k > 1 such that T = S^k, otherwise it is called primitive. If a string is primitive, all of its rotations are different.

A primitive string T is called a Lyndon word if it is the lexicographically least among its rotations. For instance, the string is not a Lyndon word, while it is its rotation . A Lyndon factor of a string T is a factor of T that is a Lyndon word. Note that is a Lyndon factor of .
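As an illustration of the definition, the following minimal Python sketch tests whether a string is a Lyndon word by comparing it with all of its proper rotations. The helper names `rotations` and `is_lyndon` are hypothetical and introduced here only for illustration; the quadratic-time check is chosen for clarity, not efficiency.

```python
def rotations(s):
    """Return all len(s) rotations of s, starting with s itself."""
    return [s[i:] + s[:i] for i in range(len(s))]


def is_lyndon(s):
    """True if s is non-empty and strictly smaller than each of its other rotations.

    Strict inequality also rules out repetitions, because a non-primitive
    string is equal to at least one of its proper rotations.
    """
    return len(s) > 0 and all(s < r for r in rotations(s)[1:])


print(is_lyndon("ab"), is_lyndon("ba"), is_lyndon("abab"))  # True False False
```

Here `ab` is a Lyndon word, its rotation `ba` is not, and `abab` fails because it is a repetition.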

Definition 1

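Given a string T = T[1]…T[n], the Lyndon array (LA) of T is an array of integers in the range [1, n] that, at each position i = 1, …, n, stores the length of the longest Lyndon factor of T starting at i:

\[ \mathsf{LA}[i] = \max \{ \ell \mid T[i, i+\ell-1] \text{ is a Lyndon word} \}. \]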
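Definition 1 can be evaluated directly, which is useful as a reference when testing faster algorithms. The sketch below is a brute-force illustration only (cubic time or worse); the function names are hypothetical, and positions are 0-based, unlike the 1-based indexing used in the text.

```python
def is_lyndon(s):
    """True if s is strictly smaller than all of its proper rotations."""
    return all(s < s[k:] + s[:k] for k in range(1, len(s)))


def lyndon_array_naive(t):
    """LA[i] = length of the longest Lyndon factor of t starting at i (0-based)."""
    n = len(t)
    return [
        max(l for l in range(1, n - i + 1) if is_lyndon(t[i:i + l]))
        for i in range(n)
    ]


print(lyndon_array_naive("banana$"))  # [1, 2, 1, 2, 1, 1, 1]
```

The example string `banana$` is an assumption chosen for brevity; it is not the running example of the figures.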

The suffix array (SA) [16] of a string T = T[1]…T[n] is an array of integers in the range [1, n] that gives the lexicographic order of all suffixes of T, that is, T_{SA[1]} < T_{SA[2]} < … < T_{SA[n]}. The inverse suffix array (ISA) stores the inverse permutation of SA, such that ISA[SA[i]] = i. The suffix array can be computed in O(n) time using σ + O(1) words of working space [17].
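For small inputs, the two arrays can be obtained by plain sorting. The sketch below is a naive stand-in for a real construction algorithm such as SACA-K (it is not linear time), uses 0-based positions, and relies on hypothetical function names.

```python
def suffix_array(t):
    """Start positions of the suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def inverse_suffix_array(sa):
    """isa[pos] = rank of the suffix starting at pos, so isa[sa[r]] == r."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


sa = suffix_array("banana$")
print(sa)                        # [6, 5, 3, 1, 0, 4, 2]
print(inverse_suffix_array(sa))  # [4, 3, 6, 2, 5, 1, 0]
```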

Usually when dealing with suffix arrays it is convenient to append to the string T a special end-marker symbol $ (called sentinel) that does not occur elsewhere in T and is smaller than any other symbol in Σ. Here we assume that T[n] = $. Note that the values LA[i], for 1 ≤ i ≤ n−1, do not change when the symbol $ is appended at position n. Also, the string T = T[1, n−1]$ is always primitive.

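Given an array of integers A of size n, the next smaller value (NSV) array of A, denoted NSV_A, is an array of size n such that NSV_A[i] contains the smallest position j > i such that A[j] < A[i], or n + 1 if no such position exists. Formally

\[ \mathsf{NSV}_A[i] = \min \big( \{ n+1 \} \cup \{ j \mid i < j \le n \text{ and } A[j] < A[i] \} \big). \]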
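A next smaller value array is commonly computed with a single left-to-right pass and a stack of positions still waiting for their answer. The following sketch shows that standard technique under two assumptions: positions are 0-based, and the value n (the array length) plays the role of "no smaller value to the right". The function name is hypothetical.

```python
def next_smaller_values(a):
    """nsv[i] = smallest j > i with a[j] < a[i], or len(a) if there is none."""
    n = len(a)
    nsv = [n] * n
    stack = []  # positions whose next smaller value is not yet known
    for j, v in enumerate(a):
        # a[j] is the next smaller value for every larger entry on the stack
        while stack and a[stack[-1]] > v:
            nsv[stack.pop()] = j
        stack.append(j)
    return nsv


print(next_smaller_values([4, 3, 6, 2, 5, 1, 0]))  # [1, 3, 3, 5, 5, 6, 7]
```

Each position is pushed and popped at most once, so the pass is linear; the stack can hold up to n positions when the input is increasing.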

As an example, Figure 1 shows, for a sample string T, its Suffix Array (SA), Inverse Suffix Array (ISA), Next Smaller Value array of the ISA (NSV_ISA), and Lyndon Array (LA). We also show all the Lyndon factors starting at each position of T.

Figure 1: SA, ISA, NSV_ISA, LA and all Lyndon factors for the example string T.

If the SA of T is known, the Lyndon array LA can be computed in linear time thanks to the following lemma that rephrases a result in [9]:

Lemma 1

The factor T[i, i+ℓ−1] is the longest Lyndon factor of T starting at i iff T_i < T_{i+k}, for 1 ≤ k < ℓ, and T_i > T_{i+ℓ}. Therefore, LA[i] = ℓ.∎

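Lemma 1 can be reformulated in terms of the inverse suffix array [5], such that LA[i] = ℓ iff ISA[i] < ISA[i+k], for 1 ≤ k < ℓ, and ISA[i] > ISA[i+ℓ]. In other words, i + ℓ = NSV_ISA[i]. Since given ISA we can compute NSV_ISA in linear time using an auxiliary stack [7, 19] of size O(n) words, we can then derive LA, in the same space of NSV_ISA, in linear time using the formula:

\[ \mathsf{LA}[i] = \mathsf{NSV}_{\mathsf{ISA}}[i] - i, \quad \text{for } 1 \le i \le n. \tag{1} \]

Overall, this approach uses n log σ bits for T plus 2n words for LA and ISA, and the space for the auxiliary stack.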
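The approach based on Equation (1) can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in Section 4: the suffix array comes from naive sorting, positions are 0-based, and the function name is hypothetical. The sketch also reports the largest stack depth reached, since the stack is the input-dependent part of the space usage.

```python
def lyndon_array_via_nsv(t):
    """Return (LA, max_stack_depth), with LA[i] = NSV_ISA[i] - i (0-based)."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # naive stand-in for suffix sorting
    isa = [0] * n
    for rank, pos in enumerate(sa):
        isa[pos] = rank

    la = [0] * n
    stack, max_depth = [], 0
    for j in range(n):
        # suffix j is the next smaller suffix for every stacked i with isa[i] > isa[j]
        while stack and isa[stack[-1]] > isa[j]:
            i = stack.pop()
            la[i] = j - i
        stack.append(j)
        max_depth = max(max_depth, len(stack))
    for i in stack:  # no smaller suffix to the right: NSV is n
        la[i] = n - i
    return la, max_depth


print(lyndon_array_via_nsv("banana$"))  # ([1, 2, 1, 2, 1, 1, 1], 2)
print(lyndon_array_via_nsv("abcd$"))    # ([4, 3, 2, 1, 1], 4)
```

On `abcd$` the ranks of the first four suffixes increase from left to right, so all four positions sit on the stack at once; this is the kind of input on which the stack grows linearly.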

Alternatively, LA can be computed in linear time from the Cartesian tree [21] built for ISA [3]. Recently, Franek et al. [6] observed that LA can be computed in linear time during the suffix array construction algorithm by Baier [1] using overall n log σ bits plus 2n words for LA and SA plus 2n words for auxiliary integer arrays. Finally, Louza et al. [15] introduced an algorithm that computes LA in linear time during the Burrows-Wheeler inversion, using n log σ bits for T plus 2n words for LA and an auxiliary integer array, plus a stack with twice the size of the one used to compute NSV_ISA (see Section 4).

Summing up, the most economical linear time solution for computing the Lyndon array is the one based on (1) that requires, in addition to T and LA, n words of working space plus an auxiliary stack. The stack size is small for non-pathological inputs but can use n words in the worst case (see also Section 4). Therefore, considering only LA as output, the working space is 2n words in the worst case.

2.1 Induced Suffix Sorting

The algorithm SACA-K [17] uses a technique called induced suffix sorting to compute SA in linear time using only σ + O(1) words of working space. In this technique each suffix T_i of T[1, n] is classified according to its lexicographical rank relative to T_{i+1}.

Definition 2

A suffix T_i is S-type if T_i < T_{i+1}, otherwise T_i is L-type. We define T_n as S-type. A suffix T_i is LMS-type (leftmost S-type) if T_i is S-type and T_{i−1} is L-type.

The type of each suffix can be computed with a right-to-left scan of T [18], or otherwise it can be computed on-the-fly in constant time during Nong’s algorithm [17, Section 3]. By extension, the type of each symbol in T can be classified according to the type of the suffix starting with such symbol. In particular T[i] is LMS-type if and only if T_i is LMS-type.
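The right-to-left scan relies on a standard observation: T_i is L-type when T[i] > T[i+1], or when T[i] = T[i+1] and T_{i+1} is L-type. A minimal sketch, assuming 0-based positions, a unique smallest sentinel as last symbol, and a hypothetical function name:

```python
def classify_suffixes(t):
    """Return (types, lms): a string of 'S'/'L' per position and the LMS positions."""
    n = len(t)
    types = ["S"] * n  # the last suffix (the sentinel) is S-type by definition
    for i in range(n - 2, -1, -1):
        if t[i] > t[i + 1] or (t[i] == t[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    # LMS-type: an S-type position whose left neighbour is L-type
    lms = [i for i in range(1, n) if types[i] == "S" and types[i - 1] == "L"]
    return "".join(types), lms


print(classify_suffixes("banana$"))  # ('LSLSLLS', [1, 3, 6])
```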

Definition 3

An LMS-factor of T is a factor that begins with an LMS-type symbol and ends with the following LMS-type symbol.

We remark that LMS-factors do not establish a factorization of T since each of them overlaps with the following one by one symbol. By convention, T[n, n] = $ is always an LMS-factor. The LMS-factors of the example string are shown in Figure 2, where the type of each symbol is also reported. The LMS types are the grey entries. Notice that in SA all suffixes starting with the same symbol c ∈ Σ can be partitioned into a c-bucket. We will keep an integer array C[1, σ] where C[c] gives either the first (head) or last (tail) available position of the c-bucket. Then, whenever we insert a value into the head (or tail) of a c-bucket, we increase (or decrease) C[c] by one. An important remark is that within each c-bucket S-type suffixes are larger than L-type suffixes. Figure 2 shows a running example of algorithm SACA-K.

Figure 2: Induced suffix sorting steps (SACA-K) for the example string.
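The bucket boundaries follow from symbol counts alone. The sketch below computes the head and tail position of every c-bucket; it uses dictionaries instead of the integer array C of the text, 0-based positions, and a hypothetical function name, all of which are assumptions made for readability.

```python
from collections import Counter


def bucket_bounds(t):
    """Return (heads, tails): first and last SA position of each symbol's bucket."""
    counts = Counter(t)
    heads, tails, pos = {}, {}, 0
    for c in sorted(counts):
        heads[c] = pos          # first position of the c-bucket
        pos += counts[c]
        tails[c] = pos - 1      # last position of the c-bucket
    return heads, tails


heads, tails = bucket_bounds("banana$")
print(heads)  # {'$': 0, 'a': 1, 'b': 4, 'n': 5}
print(tails)  # {'$': 0, 'a': 3, 'b': 4, 'n': 6}
```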

Given all LMS-type suffixes of T, the suffix array can be computed as follows:

Steps:
1. Sort all LMS-type suffixes recursively into SA^1, stored in SA[1, n/2].
2. Scan SA^1 from right-to-left, and insert the LMS-suffixes into the tail of their corresponding c-buckets in SA.
3. Induce L-type suffixes by scanning SA left-to-right: for each suffix SA[i], if T_{SA[i]−1} is L-type, insert SA[i]−1 into the head of its bucket.
4. Induce S-type suffixes by scanning SA right-to-left: for each suffix SA[i], if T_{SA[i]−1} is S-type, insert SA[i]−1 into the tail of its bucket.

Step 1 considers the string T^1 obtained by concatenating the lexicographic names of all the consecutive LMS-factors (each different string is associated with a symbol that represents its lexicographic rank). Note that T^1 is defined over an alphabet of size O(n) and that its length is at most n/2. The SACA-K algorithm is applied recursively to sort the suffixes of T^1 into SA^1, which is stored in the first half of SA. Nong et al. [18] showed that sorting the suffixes of T^1 is equivalent to sorting the LMS-type suffixes of T. We will omit details of this step, since our algorithm will not modify it.

Step 2 obtains the sorted order of all LMS-type suffixes from SA^1, scanning it from right-to-left and bucket sorting them into the tail of their corresponding c-buckets in SA. Step 3 induces the order of all L-type suffixes by scanning SA from left-to-right. Whenever suffix T_{SA[i]−1} is L-type, SA[i]−1 is inserted in its final (correct) position in SA.

Finally, Step 4 induces the order of all S-type suffixes by scanning SA from right-to-left. Whenever suffix T_{SA[i]−1} is S-type, SA[i]−1 is inserted in its final (correct) position in SA.

Theoretical costs.

Overall, algorithm SACA-K runs in linear time using only an additional array of size σ + O(1) words to store the bucket array [17].

3 Inducing the Lyndon array

In this section we show how to compute the Lyndon array (LA) during Step 4 of algorithm SACA-K described in Section 2.1. Initially, we set all positions LA[i] = 0, for 1 ≤ i ≤ n. In Step 4, when SA is scanned from right-to-left, each value SA[i], corresponding to T_{SA[i]}, is read in its final (correct) position i in SA. In other words, we read the suffixes in decreasing lexicographic order, from SA[n] down to SA[1]. We now show how to compute, during iteration i = n, n−1, …, 1, the value of LA[SA[i]].

By Lemma 1, we know that the length of the longest Lyndon factor starting at position SA[i] in T, that is LA[SA[i]], is equal to ℓ, where T_{SA[i]+ℓ} is the next suffix (in text order) that is smaller than T_{SA[i]}. In this case, T_{SA[i]+ℓ} will be the first suffix in T_{SA[i]+1}, T_{SA[i]+2}, …, T_n that was still not read in SA, which means that T_{SA[i]+ℓ} < T_{SA[i]}. Therefore, during Step 4, whenever we read SA[i], we compute LA[SA[i]] by scanning LA[SA[i]+1, n] to the right up to the first position with LA[SA[i]+ℓ] = 0, and we set LA[SA[i]] = ℓ.

The correctness of this procedure follows from the fact that every position in LA is initialized with zero, and if LA[SA[i]+1], LA[SA[i]+2], …, LA[SA[i]+ℓ−1] are no longer equal to zero, their corresponding suffixes have already been read in positions larger than i in SA, and such suffixes are larger (lexicographically) than T_{SA[i]}. Then, the first position we find with LA[SA[i]+ℓ] = 0 corresponds to a suffix T_{SA[i]+ℓ} that is smaller than T_{SA[i]}, which was still not read in SA. Also, T_{SA[i]+ℓ} is the next smaller suffix (in text order) because we read LA[SA[i]+1, n] from left-to-right.
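The scanning procedure can be sketched in a few lines. To keep the example short, the suffix array is built beforehand by naive sorting and then traversed from right to left; in the algorithm described above the same reads happen inside Step 4 of SACA-K, so this separation is an assumption of the sketch. Positions are 0-based and the function name is hypothetical.

```python
def lyndon_array_by_scanning(t):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # stand-in for SACA-K
    la = [0] * n                                # 0 marks "suffix not read yet"
    for i in range(n - 1, -1, -1):              # suffixes in decreasing order
        j = sa[i]
        l = 1
        # skip positions whose (larger) suffixes were already read
        while j + l < n and la[j + l] != 0:
            l += 1
        la[j] = l
    return la


print(lyndon_array_by_scanning("banana$"))  # [1, 2, 1, 2, 1, 1, 1]
```

The inner loop performs LA[j] steps for position j, so the total extra work is the sum of all Lyndon array entries.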

Figure 3 illustrates iterations , , and of our algorithm for . For example, at iteration , the suffix is read at position , and the corresponding value is computed by scanning up to find the first empty position, which occurs at . Therefore, .

Figure 3: Running example.

At each iteration i = n, n−1, …, 1, the value of LA[SA[i]] is computed in LA[SA[i]] additional steps, that is, our algorithm adds O(LA[SA[i]]) time to each iteration of SACA-K.

Therefore, our algorithm runs in O(n · avelyn) time, where avelyn = (LA[1] + … + LA[n]) / n. Note that computing LA does not need extra memory on top of the space for LA[1, n]. Thus, the working space is the same as SACA-K, which is σ + O(1) words.

Lemma 2

The Lyndon array and the suffix array of a string T[1, n] over an alphabet of size σ can be computed simultaneously in O(n · avelyn) time using σ + O(1) words of working space, where avelyn is equal to the average value in LA[1, n].∎

In the next sections we show how to modify the above algorithm to reduce both its running time and its working space.

3.1 Reducing the running time to O(n)

We now show how to modify the above algorithm to compute each LA entry in constant time. To this end, we store for each position i the next position ℓ > i such that LA[ℓ] = 0. We define two additional pointer arrays NEXT[1, n] and PREV[1, n]:

Definition 4

For i = 1, …, n−1, NEXT[i] = min{ℓ | i < ℓ ≤ n and LA[ℓ] = 0}. In addition, we define NEXT[n] = n + 1.

Definition 5

For i = 2, …, n, PREV[i] = ℓ, such that NEXT[ℓ] = i and LA[ℓ] = 0. In addition, we define PREV[1] = 0.

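The above definitions depend on LA and therefore NEXT and PREV are updated as we compute additional LA entries. Initially, we set NEXT[i] = i + 1 and PREV[i] = i − 1, for 1 ≤ i ≤ n. Then, at each iteration i = n, n−1, …, 1, when we compute LA[j] with j = SA[i] setting:

\[ \mathsf{LA}[j] = \mathsf{NEXT}[j] - j \tag{2} \]

we update the pointer arrays as follows:

\[ \mathsf{NEXT}[\mathsf{PREV}[j]] = \mathsf{NEXT}[j], \quad \text{if } \mathsf{PREV}[j] > 0 \tag{3} \]
\[ \mathsf{PREV}[\mathsf{NEXT}[j]] = \mathsf{PREV}[j], \quad \text{if } \mathsf{NEXT}[j] < n + 1 \tag{4} \]

The cost of computing each LA entry is now constant, since only two additional computations (Equations 3 and 4) are needed. Because of the use of the arrays NEXT and PREV the working space of our algorithm is now 2n + σ + O(1) words.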
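The two pointer arrays behave like a doubly linked list over the positions whose suffixes have not been read yet: reading a suffix gives its Lyndon array entry from the next pointer and then unlinks the position. The sketch below illustrates this with 0-based positions, where -1 and n stand for the "no position" values 0 and n + 1 of the text; the naive suffix array and the names are assumptions of the example.

```python
def lyndon_array_linked(t):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # stand-in for SACA-K
    nxt = [i + 1 for i in range(n)]             # next unread position (n = none)
    prv = [i - 1 for i in range(n)]             # previous unread position (-1 = none)
    la = [0] * n
    for i in range(n - 1, -1, -1):
        j = sa[i]
        la[j] = nxt[j] - j                      # Equation (2)
        if prv[j] >= 0:
            nxt[prv[j]] = nxt[j]                # Equation (3)
        if nxt[j] < n:
            prv[nxt[j]] = prv[j]                # Equation (4)
    return la


print(lyndon_array_linked("banana$"))  # [1, 2, 1, 2, 1, 1, 1]
```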

Theorem 3.1

The Lyndon array and the suffix array of a string T[1, n] over an alphabet of size σ can be computed simultaneously in O(n) time using 2n + σ + O(1) words of working space.∎

3.2 Getting rid of a pointer array

We now show how to reduce the working space of Section 3.1 by storing only one array, say A[1, n], keeping NEXT/PREV information together. At a glance, we initially store NEXT into the space of A[1, n], then we reuse A[1, n] to store the (useful) entries of PREV.

Note that, whenever we write LA[j] = ℓ, the value in A[j], that is NEXT[j], is no longer used by the algorithm. Then, we can reuse A[j] to store PREV[j+1]. Also, we know that if LA[j−1] = 0 then PREV[j] = j − 1. Therefore, we can redefine PREV in terms of A:

\[ \mathsf{PREV}[j] = \begin{cases} j - 1 & \text{if } \mathsf{LA}[j-1] = 0 \\ A[j-1] & \text{otherwise.} \end{cases} \tag{5} \]

The running time of our algorithm remains the same since we have added only one extra verification to obtain PREV[j] (Equation 5). Observe that whenever NEXT[j] is overwritten the algorithm does not need it anymore. The working space is therefore reduced to n + σ + O(1) words.
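The single-array variant can be sketched as follows. It is an illustration under assumptions: positions are 0-based, -1 and n stand for the "no position" values, the suffix array comes from naive sorting, and the names are hypothetical. The slot a[x] holds NEXT[x] while position x is unread, and is reused for PREV[x + 1] once x has been read.

```python
def lyndon_array_one_array(t):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # stand-in for SACA-K
    a = [i + 1 for i in range(n)]               # initially NEXT
    la = [0] * n
    for i in range(n - 1, -1, -1):
        j = sa[i]
        nxt = a[j]                              # NEXT[j], read before any overwrite
        # Equation (5): recover PREV[j] from LA and A
        if j == 0:
            prv = -1
        elif la[j - 1] == 0:
            prv = j - 1
        else:
            prv = a[j - 1]
        la[j] = nxt - j                         # Equation (2)
        if prv >= 0:
            a[prv] = nxt                        # Equation (3): prv is still unread
        if nxt < n:
            a[nxt - 1] = prv                    # Equation (4): PREV[nxt] lives in a[nxt - 1]
    return la


print(lyndon_array_one_array("banana$"))  # [1, 2, 1, 2, 1, 1, 1]
```

The write to a[nxt - 1] always targets a position that has already been read (possibly j itself), so no NEXT value that is still needed gets destroyed.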

Theorem 3.2

The Lyndon array and the suffix array of a string T[1, n] over an alphabet of size σ can be computed simultaneously in O(n) time using n + σ + O(1) words of working space.∎

3.3 Getting rid of both pointer arrays

Finally, we show how to use the space of LA[1, n] to store both the auxiliary array A[1, n] and the final values of LA. First we observe that it is easy to compute NEXT[j] when T_j is an L-type suffix.

Lemma 3

NEXT[j] = j + 1 iff T_j is an L-type suffix, or j = n.

Proof

If T_j is an L-type suffix, then T_j > T_{j+1} and LA[j] = 1, so NEXT[j] = j + 1. By definition NEXT[n] = n + 1.∎

Notice that at Step 4 during iteration i = n, n−1, …, 1, whenever we read an S-type suffix T_j, with j = SA[i], its succeeding suffix (in text order) T_{j+1} has already been read in some position in the interval SA[i+1, n] (T_{j+1} has induced the order of T_j). Therefore, the LA-entries corresponding to S-type suffixes are always inserted on the left of a block (possibly of size one) of non-zero entries in LA.

Moreover, whenever we are computing LA[j] and we have NEXT[j] = j + k (stored in A[j]), we know the following entries LA[j+1], …, LA[j+k−1] are no longer zero, and we have to update A[j+k−1], corresponding to PREV[j+k] (Equation 5). In other words, we update PREV information only for the right-most entry of each block of non-empty entries, which corresponds to a position of an L-type suffix because S-type suffixes are always inserted on the left of a block.

Then, at the end of the modified Step 4, if A[j] < j then T_j is an L-type suffix, and we know that LA[j] = 1. On the other hand, the values with A[j] > j remain equal to NEXT[j] at the end of the algorithm, and we can use them to compute LA[j] = A[j] − j (Equation 2).

Thus, after the completion of Step 4, we sequentially scan A[1, n] overwriting its values with LA as follows:

\[ \mathsf{LA}[j] = \begin{cases} 1 & \text{if } A[j] < j \\ A[j] - j & \text{otherwise.} \end{cases} \tag{6} \]

The running time of our algorithm is still linear, since we added only a linear scan over A[1, n] at the end of Step 4. On the other hand, the working space is reduced to σ + O(1) words, since we need to store only the bucket array C[1, σ].
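The in-place idea can be sketched with a single array that starts as NEXT, temporarily hosts PREV values, and is finally converted into the Lyndon array by Equation (6). This is a simplified illustration and not the authors' C implementation: the suffix array is precomputed by naive sorting, positions are 0-based, -1 and n stand for the "no position" values, and the names are hypothetical. A slot holding a value smaller than its own index is recognised as a PREV entry of an already read position.

```python
def lyndon_array_in_place(t):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # stand-in for SACA-K
    a = [i + 1 for i in range(n)]               # initially NEXT; becomes LA at the end
    for i in range(n - 1, -1, -1):
        j = sa[i]
        nxt = a[j]
        if j == 0:
            prv = -1
        elif a[j - 1] < j - 1:                  # slot j-1 was read and stores PREV[j]
            prv = a[j - 1]
        else:                                   # slot j-1 is unread, so PREV[j] = j-1
            prv = j - 1
        if prv >= 0:
            a[prv] = nxt                        # Equation (3)
        if nxt < n:
            a[nxt - 1] = prv                    # Equation (4), stored as in Equation (5)
    # Equation (6): final scan turning the array into LA
    for j in range(n):
        a[j] = 1 if a[j] < j else a[j] - j
    return a


print(lyndon_array_in_place("banana$"))  # [1, 2, 1, 2, 1, 1, 1]
```

Positions whose slot still exceeds its index at the end kept their NEXT value, so the final scan recovers their entry as a difference; every other position has Lyndon array entry 1.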

Theorem 3.3

The Lyndon array and the suffix array of a string T[1, n] of length n over an alphabet of size σ can be computed simultaneously in O(n) time using σ + O(1) words of working space.∎

Note that the bounds on the working space given in the above theorems assume that the output consists of SA and LA. If one is interested in LA only, then the working space of the algorithm is n + σ + O(1) words, which is still smaller than the working space of the other linear time algorithms that we discussed in Section 2.

4 Experiments

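| dataset | σ | n/2^20 | NSV-Lyndon [9] | Baier-LA [1, 6] | BWT-Lyndon [15] | Baier-LA+SA [1, 6] | SACA-K+LA-17n | SACA-K+LA-13n | SACA-K+LA-9n | SACA-K [17] |
|---|---|---|---|---|---|---|---|---|---|---|
| pitches | 133 | 53 | 0.15 | 0.20 | 0.20 | 0.26 | 0.26 | 0.22 | 0.18 | 0.13 |
| sources | 230 | 201 | 0.26 | 0.28 | 0.32 | 0.37 | 0.46 | 0.41 | 0.34 | 0.24 |
| xml | 97 | 282 | 0.29 | 0.31 | 0.35 | 0.42 | 0.52 | 0.47 | 0.38 | 0.27 |
| dna | 16 | 385 | 0.39 | 0.28 | 0.49 | 0.43 | 0.69 | 0.60 | 0.52 | 0.36 |
| english.1GB | 239 | 1,047 | 0.46 | 0.39 | 0.56 | 0.57 | 0.84 | 0.74 | 0.60 | 0.42 |
| proteins | 27 | 1,129 | 0.44 | 0.40 | 0.53 | 0.66 | 0.89 | 0.69 | 0.58 | 0.40 |
| einstein-de | 117 | 88 | 0.34 | 0.28 | 0.38 | 0.39 | 0.57 | 0.54 | 0.44 | 0.31 |
| kernel | 160 | 246 | 0.29 | 0.29 | 0.39 | 0.38 | 0.53 | 0.47 | 0.38 | 0.26 |
| fib41 | 2 | 256 | 0.34 | 0.07 | 0.45 | 0.18 | 0.66 | 0.57 | 0.46 | 0.32 |
| cere | 5 | 440 | 0.27 | 0.09 | 0.33 | 0.17 | 0.43 | 0.41 | 0.35 | 0.25 |
| bbba | 2 | 100 | 0.04 | 0.02 | 0.05 | 0.03 | 0.05 | 0.04 | 0.03 | 0.03 |

Table 1: Running time (µs/input byte).

| dataset | σ | n/2^20 | NSV-Lyndon [9] | Baier-LA [1, 6] | BWT-Lyndon [15] | Baier-LA+SA [1, 6] | SACA-K+LA-17n | SACA-K+LA-13n | SACA-K+LA-9n | SACA-K [17] |
|---|---|---|---|---|---|---|---|---|---|---|
| pitches | 133 | 53 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| sources | 230 | 201 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| xml | 97 | 282 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| dna | 16 | 385 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| english.1GB | 239 | 1,047 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| proteins | 27 | 1,129 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| einstein-de | 117 | 88 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| kernel | 160 | 246 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| fib41 | 2 | 256 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| cere | 5 | 440 | 9 | 17 | 9 | 17 | 17 | 13 | 9 | 5 |
| bbba | 2 | 100 | 13 | 17 | 17 | 17 | 17 | 13 | 9 | 5 |

Table 2: Peak space (bytes/input size).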

We compared the performance of our algorithm, called SACA-K+LA, with algorithms to compute LA in linear time by Franek et al. [5, 9] (NSV-Lyndon), Baier [1, 6] (Baier-LA), and Louza et al. [15] (BWT-Lyndon). We also compared a version of Baier’s algorithm that computes LA and SA together (Baier-LA+SA). We considered the three linear time alternatives of our algorithm described in Sections 3.1 to 3.3. We used 4 bytes for each computer word, so the total space usage of our algorithms was respectively 17n, 13n and 9n bytes. We included the performance of SACA-K [17] to evaluate the overhead added by the computation of LA in addition to the SA.

The experiments were conducted on a machine with an Intel Xeon Processor E5-2630 v3 20M Cache 2.40-GHz, 384 GB of internal memory and a 13 TB SATA storage, under a 64-bit Debian GNU/Linux 8 (kernel 3.16.0-4) OS. We implemented our algorithms in ANSI C. The time was measured with the clock() function of the C standard library and the memory was measured using the malloc_count library (https://github.com/bingmann/malloc_count). The source code is publicly available at https://github.com/felipelouza/lyndon-array/.

We used string collections from the Pizza & Chili dataset (http://pizzachili.dcc.uchile.cl/texts.html). In particular, the datasets einstein-de, kernel, fib41 and cere are highly repetitive texts (http://pizzachili.dcc.uchile.cl/repcorpus.html), and english.1GB is the first 1GB of the original english dataset. We also created an artificial repetitive dataset, called bbba, consisting of a run of copies of b followed by one occurrence of a. This dataset represents a worst-case input for the algorithms that use a stack (NSV-Lyndon and BWT-Lyndon).

Table 1 shows the running time of each algorithm in µs/input byte. The results show that our algorithm is competitive in practice. In particular, the version SACA-K+LA-9n was only about times slower than the fastest algorithm (Baier-LA) for non-repetitive datasets, and times slower for repetitive datasets. Also, the performance of SACA-K+LA-9n and Baier-LA+SA were very similar. Finally, the overhead of computing LA in addition to SA was small: SACA-K+LA-9n was times slower than SACA-K, whereas Baier-LA+SA was times slower than Baier-LA, on average.

Table 2 shows the peak space consumed by each algorithm given in bytes per input symbol. The smallest values were obtained by NSV-Lyndon, BWT-Lyndon and SACA-K+LA-9n. In detail, the space used by NSV-Lyndon and BWT-Lyndon was 9n bytes plus the space used by the stack. The stack space was negligible (about 10KB) for almost all datasets, except for bbba where the stack used 4n bytes for NSV-Lyndon and 8n bytes for BWT-Lyndon (the number of stack entries is the same, but each stack entry consists of a pair of integers). On the other hand, our algorithm, SACA-K+LA-9n, used 9 bytes per input symbol for all datasets (Table 2).

5 Conclusions

We have introduced an algorithm for computing simultaneously the suffix array and Lyndon array (LA) of a text using induced suffix sorting. The most space-economical variant of our algorithm uses only σ + O(1) words of working space, making it the most space-economical algorithm among the ones running in linear time; this includes both the algorithms computing the SA and LA and the ones computing only the LA. The experiments have shown our algorithm is only slightly slower than the available alternatives, and that computing the SA is usually the most expensive step of all linear time LA construction algorithms. A natural open problem is to devise a linear time algorithm to construct only the LA using o(n) words of working space.

Acknowledgments

The authors thank Uwe Baier for kindly providing the source codes of algorithms Baier-LA and Baier-LA+SA, and Prof. Nalvo Almeida for granting access to the machine used for the experiments.

Funding:

F.A.L. was supported by the grant 2017/09105-0 from the São Paulo Research Foundation (FAPESP). G.M. was partially supported by PRIN grant 2017WR7SHH and by INdAM-GNCS Project 2018 Innovative methods for the solution of medical and biological big data. S.M. and M.S. are partially supported by MIUR-SIR project CMACBioSeq Combinatorial methods for analysis and compression of biological sequences grant n. RBSI146R5L. G.P.T. acknowledges the support of Brazilian agencies Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

References

  • [1] Baier, U.: Linear-time suffix sorting - a new approach for suffix array construction. In: Proc. Annual Symposium on Combinatorial Pattern Matching (CPM). pp. 23:1–23:12 (2016)

  • [2] Bannai, H., I, T., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The "runs" theorem. SIAM J. Comput. 46(5), 1501–1514 (2017)
  • [3] Crochemore, M., Russo, L.M.: Cartesian and Lyndon trees. Theoretical Computer Science (2018). https://doi.org/10.1016/j.tcs.2018.08.011
  • [4] Fischer, J.: Inducing the LCP-Array. In: Proc. Workshop on Algorithms and Data Structures (WADS). pp. 374–385 (2011)
  • [5] Franek, F., Islam, A.S.M.S., Rahman, M.S., Smyth, W.F.: Algorithms to compute the Lyndon array. In: Proc. PSC. pp. 172–184 (2016)
  • [6] Franek, F., Paracha, A., Smyth, W.F.: The linear equivalence of the suffix array and the partially sorted Lyndon array. In: Proc. PSC. pp. 77–84 (2017)
  • [7] Goto, K., Bannai, H.: Simpler and faster Lempel Ziv factorization. In: 2013 Data Compression Conference, DCC 2013, Snowbird, UT, USA, March 20-22, 2013. pp. 133–142 (2013)
  • [8] Goto, K., Bannai, H.: Space efficient linear time Lempel-Ziv factorization for small alphabets. In: Proc. IEEE Data Compression Conference (DCC). pp. 163–172 (2014)
  • [9] Hohlweg, C., Reutenauer, C.: Lyndon words, permutations and trees. Theor. Comput. Sci. 307(1), 173–178 (2003)
  • [10] Itoh, H., Tanaka, H.: An efficient method for in memory construction of suffix arrays. In: Proceedings of the sixth Symposium on String Processing and Information Retrieval (SPIRE ’99). pp. 81–88. IEEE Computer Society Press (1999)
  • [11] Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Proc. 14th Symposium on Combinatorial Pattern Matching (CPM ’03). pp. 200–210. Springer-Verlag LNCS n. 2676 (2003)
  • [12] Kolpakov, R.M., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: Proc. FOCS. pp. 596–604 (1999)
  • [13] Louza, F.A., Gog, S., Telles, G.P.: Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678, 22–39 (2017)
  • [14] Louza, F.A., Gog, S., Telles, G.P.: Optimal suffix sorting and LCP array construction for constant alphabets. Inf. Process. Lett. 118, 30–34 (2017)
  • [15] Louza, F.A., Smyth, W.F., Manzini, G., Telles, G.P.: Lyndon array construction during Burrows-Wheeler inversion. J. Discrete Algorithms 50, 2–9 (2018)
  • [16] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
  • [17] Nong, G.: Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 15 (2013)
  • [18] Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
  • [19] Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag (2013)
  • [20] Okanohara, D., Sadakane, K.: A linear-time Burrows-Wheeler transform using induced sorting. In: Proc. International Symposium on String Processing and Information Retrieval (SPIRE). pp. 90–101 (2009)
  • [21] Vuillemin, J.: A unifying look at data structures. Commun. ACM 23(4), 229–239 (1980)