A Simple Algorithm for Computing the Document Array

12/21/2018
by   Felipe A. Louza, et al.
Universidade de São Paulo

We present a simple algorithm for computing the document array given the string collection and its suffix array as input. Our algorithm runs in linear time using constant workspace for large collections of short strings.


1 Introduction

The suffix array Manber1993 is a fundamental data structure in string processing that is commonly accompanied by the document array Muthukrishnan2002 when indexing string collections (e.g. ValimakiM07; Mantaci2007; Makinen2010; Bauer2013; BelazzouguiNV13; KopelowitzKNS14; Pantaleoni2014; Gog2015b; LouzaTHC17; GagieHKKNPS17; SirenGNPD18). Given a collection of d strings of total length N, the document array DA is an array of N integers in the range [1, d] that gives, for each suffix in the suffix array, the document it belongs to.

It is well known that DA can be represented in compact form by a bitvector with support for rank operations, requiring N + o(N) bits of space Sadakane07. However, there are applications where DA must be accessed sequentially (e.g. Ohlebusch2010; Arnold2011; Tustumi2016; Louza2018a; EgidiLMT18), and having the array computed explicitly is paramount.

In this paper we show how to compute DA in linear time, given the string collection and its suffix array as input. Our algorithm needs only a constant number of additional variables of workspace, that is, extra space used in addition to the input and output, provided that DA is stored in an integer array of the same width as SA, since that array is also used to hold intermediate values. In practice, this makes the workspace constant when indexing large collections of very short strings, for which the number of documents d is close to N and the entries of DA need nearly as many bits as those of SA anyway.

2 Background

Let T be a string of length n, over an alphabet Σ of size σ, such that T[n] = $ is an end-marker symbol that does not occur elsewhere in T and precedes every symbol of Σ. T[i, j] denotes the substring from T[i] to T[j], inclusive, for 1 ≤ i ≤ j ≤ n. A suffix of T is a substring T[i, n]. We define rank_c(T, i) as the number of occurrences of symbol c in T[1, i]. The string T needs n log σ bits of space.

The suffix array (SA) Manber1993 for T[1, n] is an array of integers in the interval [1, n] that provides the lexicographical order of all suffixes of T. The inverse permutation of SA, denoted ISA, is defined as ISA[SA[i]] = i. SA can be computed in O(n) time using O(σ log n) bits of workspace Nong2013. The arrays SA and ISA use n log n bits of space each.
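For concreteness, a minimal C sketch of the ISA computation (0-based indices, whereas the text uses 1-based positions); the function name is illustrative:

    #include <stdlib.h>

    /* Inverse suffix array: isa[sa[i]] = i.
     * Assumes sa[] is a permutation of 0..n-1 (0-based indexing). */
    int *compute_isa(const int *sa, int n) {
        int *isa = malloc(n * sizeof *isa);
        if (isa == NULL) return NULL;
        for (int i = 0; i < n; i++)
            isa[sa[i]] = i;
        return isa;
    }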

The Burrows-Wheeler transform (BWT) Burrows1994 of T is obtained by sorting all rotations of T in a conceptual matrix M and taking its last column as the BWT. It can also be defined by SA and T through the relation

(1)   BWT[i] = T[SA[i] − 1] if SA[i] > 1, and BWT[i] = T[n] = $ otherwise.
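A short C sketch of Equation 1, again with 0-based indices, so the wrap case SA[i] = 1 of the text becomes sa[i] == 0 here:

    #include <stdlib.h>

    /* Equation 1: bwt[i] = T[sa[i] - 1], wrapping to the end-marker
     * T[n-1] when sa[i] == 0. */
    unsigned char *compute_bwt(const unsigned char *T, const int *sa, int n) {
        unsigned char *bwt = malloc(n);
        if (bwt == NULL) return NULL;
        for (int i = 0; i < n; i++)
            bwt[i] = (sa[i] > 0) ? T[sa[i] - 1] : T[n - 1];
        return bwt;
    }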

The Last-to-First (LF) mapping states that the i-th occurrence of a symbol c in the last column L of M (the BWT) and the i-th occurrence of c in the first column F correspond to the same symbol of T. Let C[c] be the number of symbols in T that are strictly smaller than c. We define

(2)   LF(i) = C[BWT[i]] + rank_{BWT[i]}(BWT, i).

We use rank_c(i) as shorthand for rank_c(BWT, i). LF may be computed on-the-fly in O(log σ) time per query using a wavelet tree Grossi2003 for the rank queries of Equation 2. The wavelet tree requires additional n log σ (1 + o(1)) bits of space.

The LF-mapping allows us to navigate T backwards: given SA[i] = j, then SA[LF(i)] = j − 1. Therefore T can be reconstructed backwards from its BWT, starting with i = ISA[n] = 1 (the position of the suffix T[n, n] = $) and repeatedly applying LF for n steps.
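The following C sketch illustrates this backward reconstruction, assuming the LF values are already available in a plain integer array (rather than computed on the fly with a wavelet tree) and that the end-marker suffix sits at SA position 0, as in the 0-based setting of the previous sketches:

    /* Rebuild T right to left from its BWT, given LF as a plain array.
     * Assumes the end-marker suffix "$" is the smallest one, i.e. it
     * sits at SA position 0. */
    void reconstruct_backwards(const unsigned char *bwt, const int *lf,
                               int n, unsigned char *T_out) {
        int i = 0;                  /* SA position of the suffix "$" */
        T_out[n - 1] = '$';         /* the end-marker itself */
        for (int k = n - 2; k >= 0; k--) {
            T_out[k] = bwt[i];      /* symbol preceding the current suffix */
            i = lf[i];
        }
    }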

Let T_1, T_2, ..., T_d be a collection of d strings of lengths n_1, n_2, ..., n_d. The suffix array of the collection is the SA built for the concatenation of all strings, T^cat = T_1 T_2 ⋯ T_d, with total size N, where a new end-marker $ terminates each string. SA can be computed in O(N) time using O(σ log N) bits of workspace Louza2017c, such that an end-marker $ from string T_i will be smaller than a $ from string T_j iff i < j, which is equivalent to using different end-markers as separators.

The BWT may also be generalized to string collections. The BWT of T^cat is obtained from SA and T^cat through Equation 1. However, the LF-mapping of Equation 2 does not work for the $ symbols, since the i-th $ in column L does not (necessarily) correspond to the i-th $ in column F; in this case LF is undefined SirenGNPD18.

The LF-mapping can instead be pre-computed in an array LF[1, N] through Equation 3, such that it still works for the $-symbols.

Given ISA and SA, we have

(3)   LF[i] = ISA[SA[i] − 1] if SA[i] > 1, and LF[i] = 0 otherwise,

where the value at the single position with SA[i] = 1 is never followed by the traversal of Section 3. The array LF uses N log N bits of space.
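A C sketch of the precomputation in Equation 3, 0-based as before; the unused entry (where sa[i] == 0) is simply set to 0:

    #include <stdlib.h>

    /* Equation 3: lf[i] = isa[sa[i] - 1]; the entry with sa[i] == 0
     * is never followed by the traversal, so it is set to 0. */
    int *compute_lf(const int *sa, const int *isa, int n) {
        int *lf = malloc(n * sizeof *lf);
        if (lf == NULL) return NULL;
        for (int i = 0; i < n; i++)
            lf[i] = (sa[i] > 0) ? isa[sa[i] - 1] : 0;
        return lf;
    }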

The document array (DA) is an array of N integers in the interval [1, d] that tells us which document each suffix belongs to Muthukrishnan2002. We define DA[i] = j iff the suffix T^cat[SA[i], N] came from string T_j; in particular, DA[ISA[N]] = d for the last suffix T^cat[N, N]. DA uses N log d bits of space.

The DA array can also be represented using wavelet trees Grossi2003, within the same number of bits (up to lower-order terms) but with support for more operations ValimakiM07. DA can be further compressed using grammars when the string collection is repetitive NavarroPV11.

2.1 Related work

Given T^cat and SA, DA can be constructed in O(N) time using N log N additional bits to store ISA: scanning the documents from left to right, we set DA[ISA[j]] = i for every text position j that belongs to string T_i, with 1 ≤ i ≤ d; see (Ohlebusch2013, Alg. 5.29).
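A C sketch of this ISA-based construction, under the assumption that T^cat consists of d documents each terminated by the byte '$' and that documents are numbered 0..d-1 (the text uses 1..d):

    /* DA from ISA, in the spirit of (Ohlebusch 2013, Alg. 5.29):
     * scan T left to right keeping the current document number and
     * write it at the SA position of each suffix.
     * Documents are numbered 0..d-1; each one ends with '$'. */
    void da_from_isa(const unsigned char *T, const int *isa, int n, int *da) {
        int doc = 0;
        for (int j = 0; j < n; j++) {
            da[isa[j]] = doc;       /* suffix starting at j belongs to doc */
            if (T[j] == '$')        /* the separator closes document doc */
                doc++;
        }
    }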

DA can also be computed in the same fashion as T^cat is reconstructed from the BWT. Given T^cat and SA, we can compute the array LF (Equation 3) and obtain DA in O(N) time using N log N bits of workspace; see (Ohlebusch2013, Alg. 7.30). In particular, in Section 3 we show how to reduce the workspace of this algorithm.

Alternatively, DA can be computed using a compact data structure composed of a bitvector B[1, N] with rank support. B is built over T^cat, such that

(4)   B[j] = 1 if T^cat[j] is an end-marker, and B[j] = 0 otherwise.

DA can be obtained using B and SA as follows (Ohlebusch2013, Alg. 7.29):

(5)   DA[i] = rank_1(B, SA[i] − 1) + 1,

where rank_1(B, i) is the number of 1s in B[1, i]. B can be preprocessed in O(N) time so that rank queries are answered in O(1) time using o(N) additional bits Munro1996. This algorithm computes DA in O(N) time using N + o(N) bits of workspace.
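The following C sketch illustrates Equations 4 and 5; for simplicity, rank_1 is materialized as a full array of prefix counts instead of the N + o(N)-bit bitvector with constant-time rank used in the text, and documents are numbered 0..d-1 with '$' as the separator byte:

    #include <stdlib.h>

    /* Equations 4 and 5 with rank_1 materialized as prefix counts:
     * rank1[j] = number of '$' in T[0..j-1]. With 0-based document
     * numbers, da[i] is the number of separators strictly before
     * text position sa[i]. */
    int da_from_bitvector(const unsigned char *T, const int *sa, int n, int *da) {
        int *rank1 = malloc((n + 1) * sizeof *rank1);
        if (rank1 == NULL) return -1;
        rank1[0] = 0;
        for (int j = 0; j < n; j++)
            rank1[j + 1] = rank1[j] + (T[j] == '$');
        for (int i = 0; i < n; i++)
            da[i] = rank1[sa[i]];
        free(rank1);
        return 0;
    }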

3 Computing the document array

In this section we show how to compute DA from T^cat and SA in O(N) time using only a constant number of additional variables as workspace.

At a glance, we traverse T^cat from right to left applying the LF-mapping N times, computing one entry of DA per visited position. Starting with i = ISA[N] and a document counter doc = d, each visited DA[i] receives doc, and whenever BWT[i] is an end-marker, doc is decremented by one.

Recall that we cannot traverse T^cat with the LF-mapping given on-the-fly by Equation 2, since it is not well defined at the $-symbols. Instead, we pre-compute each LF(i) in an array from SA and ISA, as in Equation 3.

Algorithm 1

The algorithm uses only two integer arrays, SA and DA, with DA initially used to store ISA. First, we compute ISA in the space of DA (Lines 1-3). Then, we overwrite SA with the array LF given by Equation 3 (Lines 4-6). Next, DA is computed in its own space, overwriting ISA, while SA is restored. Initially, i = ISA[N], the position of the last suffix, and the document counter doc is set to d (Lines 7-8). At each step (Lines 9-16), the value in SA[i] (corresponding to LF(i)) is stored in a temporary variable (Line 10) and replaced by the current text position j, which restores SA[i] (Line 11); then the current document number doc is stored in DA[i] (Line 12). Whenever T^cat[j − 1] = $, we reach a $ symbol in the BWT, and doc is decremented by one (Lines 13-15). The next step visits position LF(i) (Line 16). At the end, DA is completely computed in the space that held ISA, and SA is reconstructed in the same space that held LF.

 1  for i = 1 to N do
 2      DA[SA[i]] = i                                  // DA temporarily holds ISA
 3  end for
 4  for i = 1 to N do
 5      if SA[i] > 1 then SA[i] = DA[SA[i] - 1] else SA[i] = 0   // SA now holds LF (Equation 3)
 6  end for
 7  i = DA[N]                                          // position of the last suffix (ISA[N])
 8  doc = d
 9  for j = N down to 1 do
10      tmp = SA[i]                                    // tmp = LF(i)
11      SA[i] = j                                      // restore SA[i] = j
12      DA[i] = doc                                    // document of the suffix at position i
13      if j > 1 and T^cat[j - 1] = $ then             // BWT[i] is an end-marker
14          doc = doc - 1
15      end if
16      i = tmp                                        // next step visits LF(i)
17  end for

Algorithm 1: Computing DA from T^cat and SA.
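The following C sketch follows the steps of Algorithm 1 as described above, with 0-based indices, documents numbered 0..d-1 and '$' as the separator byte; as in the pseudocode, da temporarily holds ISA, and sa temporarily holds LF and is restored before the function returns:

    /* Algorithm 1 (0-based sketch): compute da[0..n-1] from T and sa.
     * T is the concatenation of d documents, each terminated by '$'.
     * da temporarily holds ISA; sa is temporarily overwritten with LF
     * (Equation 3) and restored before returning. Only the variables
     * i, j, doc and tmp are used in addition to the two arrays. */
    void compute_da(const unsigned char *T, int *sa, int n, int d, int *da) {
        /* Lines 1-3: da[sa[i]] = i, i.e., da holds ISA. */
        for (int i = 0; i < n; i++)
            da[sa[i]] = i;

        /* Lines 4-6: overwrite sa with LF; the entry with sa[i] == 0
         * is never followed. */
        for (int i = 0; i < n; i++)
            sa[i] = (sa[i] > 0) ? da[sa[i] - 1] : 0;

        /* Lines 7-8: start at the SA position of the last suffix,
         * which belongs to the last document. */
        int i = da[n - 1];          /* i = ISA[n-1] */
        int doc = d - 1;

        /* Lines 9-16: walk the text right to left via LF. */
        for (int j = n - 1; j >= 0; j--) {
            int tmp = sa[i];        /* save LF(i) */
            sa[i] = j;              /* restore sa[i] = j */
            da[i] = doc;            /* the suffix at SA position i is in doc */
            if (j > 0 && T[j - 1] == '$')
                doc--;              /* about to cross an end-marker */
            i = tmp;                /* next position is LF(i) */
        }
    }

Each entry of sa is read exactly once (as an LF value) before being overwritten with the restored suffix-array value, which is what makes the in-place computation possible.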

Theoretical costs

The number of steps is 3N and only four additional variables (i, j, doc and tmp) are needed. Therefore, the algorithm runs in O(N) time. The document array is stored in N log N bits, using integers of the same width as those of SA, so the workspace is O(log N) bits, which is constant when counted in machine words.

In practice, the workspace of Algorithm 1 is constant when indexing large collections of short strings, since the array used to store DA takes the same space as the array used for SA.

4 Experimental results

We compared our algorithm with the lightweight alternative based on a bitvector described in Section 2.1. We evaluated two versions of this procedure, using a compressed bitvector (bit_sd) and a plain bitvector (bit_plain). We used C++ and the SDSL library Gog2014a, version 2.0. The algorithms receive as input the concatenated string T^cat and its suffix array SA, which was computed using gSACA-K Louza2017c. Our algorithm was implemented in ANSI C. The source codes are available at https://github.com/felipelouza/gsa-is/.

The experiments were conducted on a machine running Debian GNU/Linux 8 64-bit (kernel 3.16.0-4) with an Intel Xeon E5-2630 v3 processor (20 MB cache, 2.40 GHz). We used the real data collections described in Table 1.

Dataset      σ     N (GiB)   d            N/d         longest string
pages        205   3.74      1,000        4,019,585   362,724,758
revision     203   0.39      20,433       20,527      2,000,452
influenza    15    0.56      394,217      1,516       2,867
wikipedia    208   8.32      3,903,703    2,288       224,488
reads        4     2.87      32,621,862   94          101
proteins     25    15.77     50,825,784   333         36,805

Table 1: Datasets. σ is the alphabet size, N the total length, d the number of documents, and N/d the average document length in symbols.
pages:

repetitive collection from a snapshot of the Finnish-language Wikipedia. Each document is composed of one page and all of its revisions (http://jltsiren.kapsi.fi/data/fiwiki.bz2).

revision:

the same as pages, except that each revision is a separate document.

influenza:

repetitive collection of the genomes of influenza viruses (ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz).

wikipedia:

collection of pages from the English-language Wikipedia (http://algo2.iti.kit.edu/gog/projects/ALENEX15/collections/ENWIKIBIG/).

reads:

collection of DNA reads from Human Chromosome 14, library 1 (http://gage.cbcb.umd.edu/data/index.html).

proteins:

collection of protein sequences from Uniprot/TrEMBL 2015_09 (http://www.ebi.ac.uk/uniprot/download-center/).

Table 2 shows the running time (in seconds) and the workspace (in KB) of each algorithm. The workspace is the peak memory used minus the space of the input, T^cat and SA, and of the output, DA. We used 32-bit integers to store the integer arrays when N < 2^31, and 64-bit integers otherwise. Each symbol of T^cat uses 1 byte.

Results

bit_plain was the fastest algorithm in all tests except pages, where bit_sd was marginally faster; both bitvector versions were considerably faster than Alg. 1 on the average (see Table 2), which shows that Alg. 1 is not competitive in running time. On the other hand, Alg. 1 was the only algorithm that kept the workspace constant, namely the few bytes occupied by its auxiliary variables, stored as 32-bit integers for inputs smaller than 2^31 bytes (2 GB) and as 64-bit integers otherwise. The workspace of bit_plain and bit_sd was much larger: bit_plain pays for a plain bitvector of N bits plus its rank support regardless of the collection, whereas the compressed bitvector of bit_sd grows mainly with the number of documents d (see Table 2).

Dataset Time (seconds) Workspace (KB)
Alg. 1 bit_plain bit_sd Alg. 1 bit_plain bit_sd
pages 739.14 141.99 141.25 0 613,341 4
revision 44.46 11.74 20.37 0 64,002 44
influenza 81.44 20.48 41.24 0 91,168 704
wikipedia 3,454.75 450.64 1,054.08 0 1,363,147 7,096
reads 1,263.72 150.40 549.65 0 470,389 38,980
proteins 6,953.01 1,211.13 2,899.63 0 2,583,532 69,423
Table 2: Running time and workspace.

Acknowledgments

FAL was supported by the grant 2017/09105-0 from the São Paulo Research Foundation (FAPESP).

References

  • (1) U. Manber, E. W. Myers, Suffix arrays: A new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.
  • (2) S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM/SIAM, 2002, pp. 657–666.
  • (3) N. Välimäki, V. Mäkinen, Space-efficient algorithms for document retrieval, in: Proc. Annual Symposium on Combinatorial Pattern Matching (CPM), 2007, pp. 205–215.
  • (4) S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, An extension of the Burrows-Wheeler transform, Theor. Comput. Sci. 387 (3) (2007) 298–312.
  • (5) V. Mäkinen, G. Navarro, J. Sirén, N. Välimäki, Storage and retrieval of highly repetitive sequence collections, Journal of Computational Biology 17 (3) (2010) 281–308.
  • (6) M. J. Bauer, A. J. Cox, G. Rosone, Lightweight algorithms for constructing and inverting the BWT of string collections, Theor. Comput. Sci. 483 (2013) 134–148.
  • (7) D. Belazzougui, G. Navarro, D. Valenzuela, Improved compressed indexes for full-text document retrieval, J. Discrete Algorithms 18 (2013) 3–13.
  • (8) T. Kopelowitz, G. Kucherov, Y. Nekrich, T. A. Starikovskaya, Cross-document pattern matching, J. Discrete Algorithms 24 (2014) 40–47.
  • (9) J. Pantaleoni, A massively parallel algorithm for constructing the BWT of large string sets, CoRR abs/1410.0562.
  • (10) S. Gog, G. Navarro, Improved single-term top-k document retrieval, in: Proc. Workshop on Algorithm Engineering and Experimentation (ALENEX), 2015, pp. 24–32.
  • (11) F. A. Louza, G. P. Telles, S. Hoffmann, C. D. A. Ciferri, Generalized enhanced suffix array construction in external memory, Algorithms for Molecular Biology 12 (1) (2017) 26:1–26:16.
  • (12) T. Gagie, A. Hartikainen, K. Karhu, J. Kärkkäinen, G. Navarro, S. J. Puglisi, J. Sirén, Document retrieval on repetitive string collections, Inf. Retr. Journal 20 (3) (2017) 253–291.
  • (13) J. Sirén, E. Garrison, A. M. Novak, B. Paten, R. Durbin, Haplotype-aware graph indexes, in: Proc. International Workshop on Algorithms in Bioinformatics (WABI), 2018, pp. 4:1–4:13.
  • (14) K. Sadakane, Succinct data structures for flexible text retrieval systems, J. Discrete Algorithms 5 (1) (2007) 12–22.
  • (15) E. Ohlebusch, S. Gog, Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem, Information Processing Letters 110 (3) (2010) 123–128.
  • (16) M. Arnold, E. Ohlebusch, Linear Time Algorithms for Generalizations of the Longest Common Substring Problem, Algorithmica 60 (4) (2011) 806–818.
  • (17) W. H. Tustumi, S. Gog, G. P. Telles, F. A. Louza, An improved algorithm for the all-pairs suffix-prefix problem, J. Discret. Algorithms 37 (2016) 34–43.
  • (18) F. A. Louza, G. P. Telles, S. Gog, Z. Liang, Computing Burrows-Wheeler Similarity Distributions for String Collections, in: Proc. International Symposium on String Processing and Information Retrieval (SPIRE), 2018, pp. 285–296.
  • (19) L. Egidi, F. A. Louza, G. Manzini, G. P. Telles, External memory BWT and LCP computation for sequence collections with applications, in: Proc. International Workshop on Algorithms in Bioinformatics (WABI), 2018, pp. 10:1–10:14.
  • (20) G. Nong, Practical linear-time O(1)-workspace suffix sorting for constant alphabets, ACM Trans. Inform. Syst. 31 (3) (2013) 1–15.
  • (21) M. Burrows, D. J. Wheeler, A block-sorting lossless data compression algorithm, Tech. rep., Digital SRC Research Report (1994).
  • (22) R. Grossi, A. Gupta, J. S. Vitter, High-order entropy-compressed text indexes, in: Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM/SIAM, 2003, pp. 841–850.
  • (23) F. A. Louza, S. Gog, G. P. Telles, Inducing enhanced suffix arrays for string collections, Theor. Comput. Sci. 678 (2017) 22–39.
  • (24) G. Navarro, S. J. Puglisi, D. Valenzuela, Practical compressed document retrieval, in: Proc. Symposium on Experimental and Efficient Algorithms (SEA), 2011, pp. 193–205.
  • (25) E. Ohlebusch, Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, Oldenbusch Verlag, 2013.
  • (26) J. I. Munro, Tables, in: Proc. of Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Vol. 1180 of LNCS, Springer, 1996, pp. 37–42.
  • (27) S. Gog, T. Beller, A. Moffat, M. Petri, From theory to practice: Plug and play with succinct data structures, in: Proc. Symposium on Experimental and Efficient Algorithms (SEA), Vol. 8504 of LNCS, Springer, 2014, pp. 326–337.