copMEM: Finding maximal exact matches via sampling both genomes

05/22/2018
by Szymon Grabowski, et al.
Lodz University of Technology

Genome-to-genome comparisons require designating anchor points, which are given by Maximal Exact Matches (MEMs) between their sequences. For large genomes this is a challenging problem and the performance of existing solutions, even in parallel regimes, is not quite satisfactory. We present a new algorithm, copMEM, that sparsely samples both input genomes, with sampling steps being coprime. Despite being a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of RAM.



Code Repositories

copmem

A tool for maximal exact matches (MEM) computation (in genomes).



1 Introduction

Maximal exact matches (MEMs) are exact matches between two strings (genomes) that cannot be extended to the left or right without producing a mismatch. For high-throughput sequencing data, finding MEMs has two basic applications: seeding alignments of sequencing reads for genome assembly, and designating anchor points for genome-genome comparisons.

Early algorithms for MEM finding were based on a suffix tree (Kurtz et al., 2004) or an enhanced suffix array (Abouelhoda et al., 2004). The space occupancy of these data structures spurred researchers to devise more succinct (and possibly also faster) solutions, including essaMEM (Vyverman et al., 2013), based on a sparse suffix array, and E-MEM (Khiste and Ilie, 2015), which employs a hash table for sampled k-mers. E-MEM is the most succinct and also often the fastest solution. Moreover, it allows processing the data in several passes, trading speed for a reduction in working memory.

Recently, Almutairy and Torng (2018) compared two approaches to sampling k-mers in order to find MEMs: fixed sampling and minimizer sampling. The former, which is the approach of E-MEM, extracts k-mers from one of the input sequences at fixed sampling steps and inserts them into a dictionary. Then, all the k-mers from the other genome are compared against the seeds in the dictionary, in order to extend the matches. The latter approach involves so-called minimizers and allows for sampling both genomes, but the number of shared k-mer occurrences is greater than with fixed sampling. Although minimizer sampling may be slightly faster, it needs more space, hence the authors concluded that fixed sampling was the right way to go.

In this article, we propose a novel approach to sampling k-mers in both sequences, which combines speed and simplicity. Our idea is based on simple properties of coprime numbers.

MEM results. Times in seconds, memory (RAM) usage in GB (G = 2^30).

MEM alg.             H. sapiens vs           H. sapiens vs           T. aestivum vs
                     M. musculus             P. troglodytes          T. durum
                     L=100       L=300       L=100       L=300       L=100       L=300
                     Time  RAM   Time  RAM   Time  RAM   Time  RAM   Time  RAM   Time  RAM
essaMEM -t 10 -k 4   1779 14.03  1030 13.59  3474 13.93  1544 13.97     —     —     —     —
eMEM -t 1             976  3.95   575  2.18  2038  4.10   906  2.34   728  5.65   576  3.17
eMEM -t 10            216  4.03   146  2.26   509  4.18   205  2.41   219  5.72   169  3.24
copMEM                116  9.70    55  8.92   382 10.92    93  9.52   223 17.88   107 15.65

Test platform: Intel i7-4930K 3.4 GHz CPU, 64 GB of DDR3-1333 RAM, SSD 512 GB, Ubuntu 17.10 64-bit OS. All codes written in C++ and compiled with g++ 7.2.0 -march=native -O3 -funroll-loops. I/O times are included. In the cases denoted with '—' essaMEM produced a multi-GB temp file and we had to stop it after more than an hour.

2 Methods

Given two sequences, T1 and T2, the task is to find all MEMs of length at least L symbols. Following E-MEM, we sample K-mers (k-mers of length K) from the reference genome T1 with a fixed sampling step. E-MEM sets the sampling step to L − K + 1 and inserts all sampled K-mers (seeds) in a hash table; the step choice guarantees that for any window of size L one (and only one) K-mer will be sampled. Then the query genome T2 is scanned, with the step equal to 1; all matching sampled K-mer occurrences from T1 are found in the hash table and the tentative matches are left- and right-extended, to check if they are long enough.

Unlike E-MEM, our solution samples both genomes with a step greater than 1. To this end, we choose two positive integer parameters, k1 and k2, such that gcd(k1, k2) = 1 (where gcd stands for the greatest common divisor) and k1 · k2 ≤ L − K + 1. The K-mers from genome T1 are extracted with step k1 and inserted in a hash table. The genome T2 is scanned with step k2 and similarly its K-mers are extracted, candidate seeds are found in the hash table and then left- and right-extended. As the key mechanism is based on the coprimality of the parameters k1 and k2, we denote our algorithm as copMEM.

The correctness of our procedure follows from the following proposition.

Proposition 1.

Let k1 and k2 be two positive integers that are coprime. Let i be an arbitrary non-negative integer. For any r ∈ {0, 1, …, k1 − 1}, the set {i, i + k2, i + 2k2, …, i + (k1 − 1)k2} contains an element of the form j · k1 + r, for some integer j.

Proof.

Let us create S = {i mod k1, (i + k2) mod k1, (i + 2k2) mod k1, …, (i + (k1 − 1)k2) mod k1}. All elements of S are distinct. Indeed, were it not the case, we would have (i + a·k2) ≡ (i + b·k2) (mod k1), i.e., (b − a)·k2 ≡ 0 (mod k1), for some a, b such that 0 ≤ a < b ≤ k1 − 1, which, in light of the coprimality of k1 and k2, implies that k1 divides b − a, a contradiction, since 0 < b − a < k1. As the size of S is k1, all residues r, 0 ≤ r ≤ k1 − 1, occur in it. ∎

Let P1 = T1[i1 .. i1 + ℓ − 1] and P2 = T2[i2 .. i2 + ℓ − 1], where ℓ ≥ L, form a MEM. If P2 fully contains at least k1 K-mers sampled with step k2 (which is guaranteed by the condition k1 · k2 ≤ L − K + 1), then, by the proposition above, for (at least) one of them, starting at some position i2 + d, we have (i1 + d) mod k1 = 0. As P1 and P2 are equal, and all K-mers sampled from T1 start at positions p such that p mod k1 = 0, the K-mer T1[i1 + d .. i1 + d + K − 1] is among the sampled K-mers and it is equal to T2[i2 + d .. i2 + d + K − 1]. We thus showed that our sampling scheme cannot miss a seed for a matching window, provided an appropriate choice of the parameters. To this end, and to minimize the number of sampled positions (assuming also that the input genomes are of similar length, which is usually the case), we set k1 and k2 to two nearly equal coprime integers (e.g., consecutive ones), as large as allowed by k1 · k2 ≤ L − K + 1.

3 Results

To evaluate the performance of copMEM, we chose the same large real datasets as used in the E-MEM paper. The datasets are in multi-FASTA format, with sizes ranging from 2.7 GB to 4.5 GB. The Supplementary Material contains the dataset URLs and characteristics, as well as details on the test methodology. MEM finding times and RAM usages are given in Table 1.

The parameter -t for E-MEM and essaMEM is the number of threads. We can see that although copMEM in the current implementation is single-threaded, it is usually much faster than its competitors (running them with one thread yields close to an order of magnitude difference). In memory use, E-MEM is more frugal. The amount of space that copMEM occupies in RAM is, roughly, the two genomes plus the hash table: about n1 + n2 bytes for the sequences, plus space proportional to H + n1/k1 for the hash-table counters and the array of sampled positions, where n1 and n2 are the genome lengths and H is the number of slots in the hash table.

Acknowledgement

Funding

This work was financed by the Lodz University of Technology, Faculty of Electrical, Electronic, Computer and Control Engineering as a part of statutory activity (project no. 501/12-24-1-5418).

References

  • Abouelhoda et al. (2004) Abouelhoda, M. I., Kurtz, S. and Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1): 53–86.
  • Almutairy, M. and Torng, E. (2018). Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLoS One, 13(2): e0189960.
  • Khiste and Ilie (2015) Khiste, N., Ilie, L. (2015). E-MEM: Efficient computation of Maximal Exact Matches for very large genomes. Bioinformatics, 31(4): 509–514.
  • Kurtz et al. (2004) Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C. and Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biology, 5(2): R12.
  • Vyverman et al. (2013) Vyverman, M., De Baets, B., Fack, V. and Dawyndt, P. (2013). essaMEM: finding Maximal Exact Matches using enhanced sparse suffix arrays. Bioinformatics, 29(6): 802–804.

Used datasets

Please download the following datasets and extract them (each to a single directory). If an archive contains multiple files, they have to be concatenated into one file, to be used as one of the two inputs for copMEM (or other MEM tools in our test procedure).

Basic characteristics of the datasets are presented in Table 1.

How to run copMEM

Assuming that the input genomes are named human.all.fa and mus.all.fa, where the former is the reference and the latter the query genome, an example command line may look like:

./copmem -o hm.mems -l 300 human.all.fa mus.all.fa >hm_times.txt

You can also use the switch -v (verbose), which prints more details.

Experiments

The tests were conducted on two machines. One was equipped with an Intel i7-4930K 3.4 GHz (max turbo frequency 3.9 GHz) CPU, 64 GB of DDR3-1333 RAM, SSD 512 GB (Samsung 840 Pro), plus two HDDs, and was running Linux Ubuntu 17.10 64-bit OS. The sources were compiled with g++ 7.2.0 -march=native -O3 -funroll-loops.

The other was a 64-bit Windows 10 machine, with an Intel i5-6600 3.3 GHz (max turbo frequency 3.9 GHz) CPU, 32 GB of DDR4-2133 RAM and a fast 512 GB SSD disk (Samsung 950 PRO NVMe M.2). This computer was used only for running copMEM, and the sources were compiled with Microsoft Visual Studio 2017, in the 64-bit release mode and using the following switches: /O2 /arch:AVX2.

The presented timings in the main part of the paper, as well as those given in Table 2 below, are median times of 3 runs in a particular setting (pair of datasets, chosen L, and possibly the number of threads). Before each run, the memory of the Linux machine was filled with dummy data, to flush disk caches. In other words, all tests were run with a "cold cache".

Dataset Size (MB) Sequences
Homo sapiens (human) 3,137 93
Mus musculus (mouse) 2,731 66
Pan troglodytes (chimp) 3,218 24,132
Triticum aestivum (common wheat) 4,391 731,921
Triticum durum (durum wheat) 3,229 5,671,204
Table 1: Datasets used in the experiments.
Task                        default     xx32     xx64  Metro64   City64
H.sap vs. M.mus, L=100        70.65    79.90    82.13    80.20    72.39
H.sap vs. M.mus, L=300        32.18    36.65    37.65    38.09    34.06
H.sap vs. P.tro, L=100       299.35   305.22   304.88   300.91   291.96
H.sap vs. P.tro, L=300        62.00    65.64    66.39    65.61    62.16
T.aest vs. T.durum, L=100    151.30   172.70   176.37   167.96   156.31
T.aest vs. T.durum, L=300     75.46    83.52    85.28    83.99    75.89
Table 2: The impact of the used hash function. Times given in seconds. Experiments were run on the Windows machine.

Table 2 presents the impact of the chosen hash function on the overall performance of copMEM. Of course, the memory usage is unaffected, yet the speeds vary somewhat. The tested hash functions include the default one and the four alternatives listed in the Table 2 header (xx32, xx64, Metro64, City64).

The default hash function (used in the "final" copMEM version) is taken from http://www.amsoftware.narod.ru/algo2.html (with slight changes in the code) and is denoted there as maRushPrime1Hash.

Final notes

Apart from the main algorithmic idea, several factors allowed us to achieve high performance. It is important to use a large enough value of K (set to 44 in copMEM) and an efficient hash function. A standard hash table (with chaining), being a dynamic data structure, is not a good choice, due to millions of small memory allocations. Instead, we scan in two passes, first counting the number of occurrences of each hash value and then writing the sampled positions into the appropriate locations of an array. This strategy, resembling the textbook counting sort, avoids memory fragmentation. Finally, in a few places we applied software prefetching and other low-level optimizations.