Prefix-Free Parsing for Building Big BWTs
High-throughput sequencing technologies have led to explosive growth of genomic databases, the largest of which will soon be hundreds of terabytes or more. For many applications we want indexes of these databases, but building them is a challenge. Fortunately, many genomic databases are highly repetitive, and we should be able to exploit that repetitiveness to compute the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper we describe a preprocessing step, prefix-free parsing, that takes a text T and in one pass generates a dictionary D and a parse P of T, with the property that we can compute the BWT of T from D and P alone, using workspace proportional only to their total size and O(|T|) time when we can work in internal memory. Our experiments show that D and P are often significantly smaller than T in practice and so may fit in a reasonable internal memory even when T is very large.
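The abstract does not spell out how D and P are produced, so the following is only a minimal sketch of one way to build such a parse in a single pass. It assumes a sliding window of length `w` whose hash modulo `p` decides where phrases end, with consecutive phrases overlapping by `w` characters. The names `prefix_free_parse` and `rebuild`, the sentinel characters, the default parameters, and the use of `zlib.crc32` as the window hash are all illustrative assumptions, and the sketch stops short of computing the BWT itself.

```python
import zlib

# Hypothetical sentinels, assumed not to occur in the input text.
START, END = "\x01", "\x02"


def prefix_free_parse(text, w=4, p=5):
    """Return (dictionary, parse): sorted distinct phrases and their ranks."""
    padded = START + text + END * w
    phrases, start = [], 0
    for i in range(w + 1, len(padded) + 1):
        window = padded[i - w:i]
        at_end = i == len(padded)
        # A phrase ends where the window hash is 0 mod p (assumed rule),
        # or at the very end of the padded text.
        if at_end or zlib.crc32(window.encode()) % p == 0:
            phrases.append(padded[start:i])
            start = i - w  # next phrase overlaps this one by w characters
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    return dictionary, [rank[phrase] for phrase in phrases]


def rebuild(dictionary, parse, w=4):
    """Recover the original text from the dictionary and the parse."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]  # skip the w characters shared with the previous phrase
    return out.strip(START + END)
```

In this sketch, calling `prefix_free_parse` on a highly repetitive string tends to yield few distinct phrases, because the same windows trigger phrase boundaries at the same places in every repeat; `rebuild` confirms that the dictionary and parse together still determine the text.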