Matching reads to many genomes with the r-index

08/04/2019
by   Taher Mun, et al.
Johns Hopkins University
0

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align. Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index .

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

10/07/2019

ER-index: a referential index for encrypted genomic databases

Huge DBMSs storing genomic information are being created and engineerize...
05/25/2019

EPCI: A New Tool for Predicting Absolute Permeability from CT images

A new and fast Matlab algorithm for predicting absolute permeability is ...
06/07/2022

Fast Exact String to D-Texts Alignments

In recent years, aligning a sequence to a pangenome has become a central...
06/14/2021

MetaCache-GPU: Ultra-Fast Metagenomic Classification

The cost of DNA sequencing has dropped exponentially over the past decad...
10/10/2019

E2FM: an encrypted and compressed full-text index for collections of genomic sequences

Next Generation Sequencing (NGS) platforms and, more generally, high-thr...
12/12/2017

Learning a Complete Image Indexing Pipeline

To work at scale, a complete image indexing system comprises two compone...
02/28/2021

An Efficient Indexing and Searching Technique for Information Retrieval for Urdu Language

Indexing techniques are used to improve retrieval of data in response to...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Background

The Burrows-Wheeler Transform (BWT) [2] and FM-index [3] are central to the most popular short-read aligners, such as BWA [9] and Bowtie [8, 7], but until recently it was not known how to apply these concepts effectively to whole genomic databases. Building on previous authors’ work [11], Gagie, Navarro and Prezza [4] described how a fully functional variant of the FM-index for such a database could be stored in reasonable space: their variant takes machine words, where is the number of runs in the BWT of the database, and thus is called the -index. Prezza [14] gave a preliminary implementation, which was significantly extended by Boucher et al. [1] and Kuhnle et al. [6]. This paper is meant as a brief guide to the extended implementation. For help troubleshooting or to provide feedback, please submit an issue to our GitHub page, which also has more documentation.

2 Installation

In this section we give installation instructions, assuming users are using Ubuntu with c++17-compliant compilers and are familiar with the Unix command line. If they are using another Unix-like system then they should substitute the appropriate package manager for apt .

Users should first download some prerequisite packages, and the source code from the github repository:

$   apt-get update
$   apt-get install -y build-essential cmake git python3 zlib1g-dev
$   git clone --recursive https://github.com/alshai/r-index

They should then compile the code, as follows; any missing dependencies will be automatically downloaded and compiled locally.

$   cd r-index
$   mkdir build
$   cd build
$   cmake ..
$   make
$   make install
$   cd ..

These commands will install the binaries ri-buildfasta and ri-align in the system’s default bin location (e.g., /usr/local/bin for Ubuntu users), together with bigbwt [1] and the SDSL library [5] (if it is not already present). If users want the binaries elsewhere, then they should use

$   cmake -DCMAKE_INSTALL_PREFIX=<dest> ..

instead of  cmake ..  in the sequence of commands above, where <dest> is the desired destination directory.

3 Construction

Users can call ri-buildfasta to build the -index for a collection of genomic sequences stored in a FASTA file, which can be gzipped. For example, to index the 2.7 MB file data/dengue/genome.fa.gz, which contains 2042 Dengue Type 1 genomes downloaded from the Virus Pathogen Database and Analysis Resource (ViPR) [13] and occupies 22 MB uncompressed, they can use the following command:

$   ri-buildfasta -o dengue data/dengue/genome.fa.gz

This command saves an -index as the two files dengue.ri , which contains the main data structures for the index, and dengue.1.ri , which contains mappings from offsets in the indexed text to the names of the sequences starting at those offsets. If users want the .ri files to have different names, they should change the -o parameter in the command. In this example, the .ri files have total size 2.4 MB, or 11% and 89% of the sizes of the indexed text when uncompressed and gzipped, respectively; once they have been built, data/dengue/genome.fa.gz is no longer necessary.

By default, ri-buildfasta uses bigbwt, which efficiently handles most large repetitive text collections. For very small or insufficiently repetitive collections, however, the sais algorithm [12] might be faster, so users can change the construction algorithm using the -b parameter, as follows:

$   ri-buildfasta -b sais -o dengue data/dengue/genome.fa.gz

On an Intel i7-7700HQ 2.80 GHz laptop, building the index for data/dengue/genome.fa.gz with bigbwt takes about 3 seconds and 47 MB of memory while building it with sais takes about 10 seconds and 104 MB of memory. With bigbwt we have built a 665 MB -index for a collection of 2000 copies of human chromosome 19 from different individuals, which take 110 GB uncompressed, in under 5 hours on a server using 41 GB of RAM and one thread. At the time of writing the use of multiple processors does not significantly reduce the construction time, but that should change in the near future.

4 Alignment

Users can call ri-align to search for a collection of patterns stored in a FASTQ file. For example, to use the -index from the previous section to count the occurrences of the 1000 simulated 100-bp reads in the 219 KB file data/dengue/reads.fq, which were taken from the first genome in the dengue collection, they should use the following command:

$   ri-align count dengue data/dengue/reads.fq

Note that the index should be specified without the .ri extension. The output is piped to the standard output by default, but can of course be redirected to a file. If users actually want the positions of the reads’ occurrences, they can call ri-align locate :

$   ri-align locate dengue data/dengue/reads.fq

By default, ri-align locate returns the positions of all the occurrences of each read in the indexed text. This can be very slow, especially if the read occurs many times, so users can include --max-hits <k> to limit the number of occurrences that are reported, where <k> is the desired number of occurrences:

$   ri-align --max-hits <k> locate dengue data/dengue/reads.fq

Users can include --max-range <k> to obtain the occurrences only of reads that occur at most <k> times:

$   ri-align --max-range <k> locate dengue data/dengue/reads.fq

On the same Intel i7-7700HQ 2.80 GHz laptop and with the output redirected to a file, counting the occurrences of each read takes a total of about 0.06 seconds and locating one occurrence of each read takes a total of about 0.1 seconds.

5 Interpreting Counts

The output of the ri-align count command consists of a series of lines with each one containing the read name; a fraction with the length of the longest suffix of the read that occurs in the database as the numerator, and the length of the read as the denominator; and the number of times it occurs. The fields are separated by tabs. For the example in the previous section, the output is as follows:

$   ri-align count dengue data/dengue/reads.fq
>   simulated.0  100/100  22
>   simulated.1  100/100  2
>   simulated.2  100/100  2
>   simulated.3  100/100  2
...

With some errors in the reads, the output changes as follows:

$   ri-align count dengue data/dengue/reads_w_errors.fq
>   simulated.0.3edits  44/100  492
>   simulated.1.2edits  78/100  1
>   simulated.2.1edits  81/100  2
>   simulated.3.0edits  100/100 2
...

This means, for example, that after we add errors to simulated.0 to create simlulated.0.3edits , the last 44 characters of the edited read occur as a substring 492 times in the the database, but the last 45 characters never occur as a substring.

6 Interpreting Locations

The output of the ri-align locate is in the SAM format [10]. The SAM format begins with a line specifying the name and length of each sequence in the database (2042 for dengue), and follows with a line for each read consisting of the following tab-separated fields: read name, flag, reference name, position, MAPQ score, CIGAR string, reference name for the next read in the read-pair, position of the next read in the read-pair, observed template length, read sequence, read quality scores, and miscellaneous tags.

The most important fields are the reference name, which identifies the sequence in the database containing the occurrence, and the position, which specifies the location of the occurrence of the read is in that sequence. Since the -index does not currently support mapping-quality calculation or read-pair/mate-pair alignment, the MAPQ field, the read-pair fields and the observed template length field will all contain their respective “unknown”-type values. (For more details on the SAM format, see https://samtools.github.io/hts-specs/SAMv1.pdf .)

For our running example without errors, the output (limited to one occurrence per read) is

$   ri-align --max-hits 1 locate dengue data/dengue/reads.fq
>   @HD     VN:1.6  SO:unknown
>   @SQ     SN:gb:KY474305|Organism:Dengue  LN:10676
>   @SQ     SN:gb:JN638344|Organism:Dengue  LN:10735
...
>   @SQ     SN:gb:AY835999|Organism:Dengue  LN:10727
>   simulated.0Ψ0 gb:GQ868562|Organism:DengueΨ2525Ψ255Ψ100MΨ*Ψ0Ψ0
    TGG...CTAΨ~~~...~~~ΨNH:i:22
>   simulated.1Ψ0Ψgb:KY474306|Organism:DengueΨ10001Ψ255Ψ100MΨ*Ψ0Ψ0
    CAT...AGCΨ~~~...~~~ΨNH:i:2
>   simulated.2Ψ0Ψgb:KY474306|Organism:DengueΨ8833Ψ255Ψ100MΨ*Ψ0Ψ0
    GAT...TTGΨ~~~...~~~ΨNH:i:2
>   simulated.3Ψ0Ψgb:KY474306|Organism:DengueΨ8149Ψ255Ψ100MΨ*Ψ0Ψ0
    CTA...GAAΨ~~~...~~~ΨNH:i:2
...

with ellipses shortening the 100-character reads and strings of ~ symbols.

At the moment,

ri-align locate reports only on reads that match perfectly; of course, users have the option to use ri-align count to find the lengths of the longest suffixes that occur in the database, then search with ri-align locate only for those suffixes. This can be used to generate seeds for approximate pattern matching of reads with errors or unseen variations. Since Illumina reads tend to have more errors towards the end, users may want to truncate their reads to obtain longer matching suffixes.

7 Acknowledgements

AK and CB were funded by the NIH though NIAID grant R01AI141810-01 and by the NSF through grant IIS-1618814. TM and BL were funded by the NIH through NIGMS grant R01GM118568 and by the NSF through IIS grant 1349906. TG was funded by Fondecyt grant 1171058 and a start-up grant from Dalhousie University. GM was partially funded by PRIN grant 201534HNXC and by INdAM-GNCS Project 2019 “Innovative methods for the solution of medical and biological big data”.

References

  • [1] Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13, 2019.
  • [2] Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  • [3] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
  • [4] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1459–1477. SIAM, 2018.
  • [5] Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Proceedings of the 13th Symposium on Experimental Algorithms (SEA), pages 326–337. Springer, 2014.
  • [6] Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Efficient construction of a complete index for pan-genomics read alignment. In Proceedings of the 23rd Conference on Research in Computational Molecular Biology (RECOMB), pages 158–173. Springer, 2019.
  • [7] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357, 2012.
  • [8] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology, 10(3):R25, 2009.
  • [9] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
  • [10] Heng Li, Bob Handsaker, Alec Wysoker, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
  • [11] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
  • [12] Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471–1484, 2010.
  • [13] Brett E Pickett, Eva L Sadat, Yun Zhang, et al. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Research, 40(D1):D593–D598, 2012.
  • [14] Nicola Prezza. r-index. https://github.com/nicolaprezza/r-index, 2018.