Smooth q-Gram, and Its Applications to Detection of Overlaps among Long, Error-Prone Sequencing Reads

02/04/2018 ∙ by Haoyu Zhang, et al. ∙ Indiana University Bloomington Indiana University 0

We propose smooth q-gram, the first variant of q-gram that captures q-gram pair within a small edit distance. We apply smooth q-gram to the problem of detecting overlapping pairs of error-prone reads produced by single molecule real time sequencing (SMRT), which is the first and most critical step of the de novo fragment assembly of SMRT reads. We have implemented and tested our algorithm on a set of real world benchmarks. Our empirical results demonstrated the significant superiority of our algorithm over the existing q-gram based algorithms in accuracy.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

-gram, also called -gram, -mer/shingle, has been used extensively in the areas of bioinformatics (Altschul et al., 1990; Pevzner et al., 2001; Berlin et al., 2015; Li, 2016; Myers, 2014), databases (Xiao et al., 2008; Wang et al., 2012; Qin et al., 2011)

, natural language processing

(Manning and Schütze, 1999), etc. In particular, -gram was used to construct the de Bruijn graph (Pevzner et al., 2001; Compeau et al., 2011), a data structure commonly exploited for fragment assembly in genome sequencing, especially for short reads obtained using next-generation sequencing (NGS) technologies  (Miller et al., 2010). Another important application of -gram in bioinformatics is in sequence alignment, which aims to detect highly similar regions between long strings (e.g., genomic sequences). Following the seed-extension approach, many sequence alignment algorithms (including the popular BLAST (Altschul et al., 1990) and more recent algorithms  (Brudno et al., 2003; Schwartz et al., 2003; Kurtz et al., 2004)) first search for -gram matches (i.e., seeds) between each pair of input strings, and then extend these matches into full-length alignment by using dynamic programming algorithms. Recently, this approach was adopted for detecting overlaps between long, error-prone reads (Berlin et al., 2015; Li, 2016; Myers, 2014) generated by single molecule (also called the third generation) sequencing technologies, including the single molecule real time sequencing (SMRT) (Roberts et al., 2013) and the MinION sequencers (Mikheyev and Tin, 2014). Comparing with the NGS reads, the single molecule technologies generate reads much longer and more error-prone. As a result, two overlapping reads contain highly similar but not identical substrings (with a relatively small edit distance111The edit distance between two strings and is defined to be the minimum number of letter insertions, deletions and substitutions needed to transfer to . due to sequencing errors), which should be addressed by an overlap detection algorithm.

A straightforward application of the seed-extension approach to overlap detection may be hurdled by an inherent limitation: two strings sharing a highly similar substring may share only a small number of, or even zero, matched -gram pairs (seeds), due to the pattern of sequencing errors within the shared substring. Consequently, a seed-extension algorithm may fail to detect such overlaps because of the lack of seeds between the reads. Let us illustrate this point by an example. Consider the following two input strings:


Their edit distance is , however, they share no matched -gram pairs (seeds).

To address this issue, in this paper, we propose a variant of -gram called the smooth -gram, using which we can identify not only those exactly matched -gram pairs (with certainty), but also those

-gram pairs that have small edit distances (each with a high probability). Our smooth

-gram construction is based on a recent advance in metric embedding (Chakraborty et al., 2016) that maps a string from the edit distance space to the Hamming distance space while (approximately) preserving the distance; we will illustrate the details of this embedding in Section 2.1. For the example mentioned above, our smooth -gram based approach can, with a very high probability, find most pairs of -grams of the two input strings whose edit distances are at most .

Application in SMRT data

We applied the smooth -gram to the overlap detection among sequencing reads produced by SMRT, which is the first and most critical step of the de novo fragment assembly of SMRT reads. Notably, SMRT sequencers generate reads of 1,000-100,000 bps long with 12-18% sequencing errors (including most insertions/deletions and some substitutions); in comparison, Illumina sequencers (a common NGS platform) generate reads of 100-300 bps long with errors . We have evaluated our approach using real-world SMRT datasets.

We formalize the overlap detection problem as follows. Given a collection of strings , the goal is to output all overlapping string pairs and their shared substrings and , such that the lengths of the substrings are above a threshold , and their edit distance is below a threshold .

For long, error-prone reads produced by SMRT, finding a good number of exactly matched -grams between reads could be difficult (or just impossible). Thus the overlap detection problem becomes challenging for conventional “seed-extension” approaches. For example, in Figure 1 we plotted the average number of matched -grams between overlapping SMRT reads sampled from a real word dataset E.coli under different matching thresholds (edit distance threshold being , , and , respectively). We can see that when , the number of -gram pairs with edit distance (ED) no more than is times of that of exactly matched -gram pairs. Obviously, for a pair of reads, with more matched -grams (seeds) detected, the sensitivity increases for detecting putative overlaps between error-prone reads. Therefore, the smooth -gram approach proposed in this paper can outperform the existing -gram based “seed-extension” approaches. Indeed, our evaluation showed that, for the overlap detection in the real-world datasets that we have tested, the smooth -gram based algorithm always achieved

scores (i.e., the harmonic average of the precision and recall) above

, while the score achieved by the best -gram based algorithm can be as low as .

Figure 1. Average number of matched -gram pairs for overlapping read pairs in the E.coli dataset.

Our Contribution

We summarize our contribution below.

  1. We proposed smooth -gram, the first variant of -gram that captures -gram pair within a small edit distance.

  2. We applied smooth -gram to the problem of detecting overlapping pairs of error-prone reads produced by single molecule sequencing technologies, such as SMRT.

  3. We implemented our smooth -gram based algorithm and tested it on a set of real world benchmarks. Our empirical results demonstrated the significant superiority of our algorithm over the existing -gram based algorithms in precision, recall and scores.

Related Work

Since it is proposed recently, the problem of detecting overlaps among long, error-prone reads from SMRT has drawn a significant attention in bioinformatics (Berlin et al., 2015; Li, 2016; Myers, 2014). All existing overlap detection algorithms follow the “seed-extension” approach, in which the seeds are defined based on -grams.

The only line of work, as far as we have concerned, that has a similar spirit as ours is the gapped -gram (Burkhardt and Kärkkäinen, 2001, 2002, 2003) (also referred to as the spaced seeds in bioinformatics applications (Ma et al., 2002; Keich et al., 2004)). The idea of gapped -gram is to take substrings of each string of a specific pattern. For example, the gapped -grams of the string “ACGTACGT” with pattern “XX-X” are {ACT, CGA, GTC, TAG, ACT}. That is, instead of taking the contiguous substrings as that in the traditional -gram approach, the gapped -gram breaks the adjacency dependencies between the characters. Now if we are allowed to choose multiple gapped -gram patterns, then one will need more edits to make all gapped -grams between two strings mismatched. However, the optimal pattern of gapped -gram is difficult to find: it needs an exhaustive search on all possible patterns, and the running time for the search has an exponential dependency on length of the pattern (Keich et al., 2004). This might be the reason why there is no previous work applying gapped -gram to solve the overlap detection problem for SMRT data. In contrast, our smooth -grams are systematically generated, and always have the same theoretical guarantees on all datasets.

2. Smooth -Gram

As mentioned, the major innovation of this paper is to replace the standard -gram based approach for overlap detection with the smooth -gram based approach. The advantage of smooth -gram is that it tolerates a small edit distance between matched -grams and is thus able to identify similar strings at higher sensitivity. In this section, we discuss the details of smooth -gram construction and discuss its properties.

We will use to denote the length of a smooth -gram, and to denote the length of a -gram after CGK-embedding.

2.1. The CGK-Embedding

The key tool that we will use in our construction of smooth -gram is the CGK-embedding, which convert a string to for a value using a random string , where is the alphabet (for nucleotides, ).

More precisely, let denote the time steps of the embedding. We also maintain a pointer to the string , initialized to be . At each step , we first copy to , and set . We then determine whether we should increment or not. We sort characters in in an arbitrary but fixed order. For a character , let denote the index of in this order. We set

When reaches while

, we simply pad

copies of ‘’ to to make its length equal to , where is an arbitrary character.

Denote the CGK-embedding as a function for a fixed string (sampled randomly from ). Given , let and . It has been shown in (Chakraborty et al., 2016) that for any for some large enough constant , we have with probability that

where and denote the edit distance and the Hamming distance respectively.

It is easy to see that after the CGK-embedding, -grams with small edit distance will likely have small Hamming distance, and those with large edit distance will likely have large Hamming distance. In particular, if , then we have with certainty.

The CGK-embedding has recently been used for sketching edit distance (Belazzougui and Zhang, 2016) and performing edit similarity joins (Zhang and Zhang, 2017).

2.2. From -Gram to Smooth -Gram

We show how to construct a smooth -gram from a standard -gram using random string . For convenience we will write “smooth -gram” instead of “smooth -gram” although the resulting smooth -gram will have length . Our algorithm is very simple. Given a -gram , we first perform the CGK-embedding on to get a string of length , and then construct a substring of length by picking the coordinates in where . The algorithm is depicted in Algorithm 1.

1:: -gram ; : random string from ; : random string from under the constraint that there are -bit;
2:: smooth -gram of of size
3: CGK(, )
4: is generated by removing coordinates in s.t.
Algorithm 1 Generate-Smooth--Gram()

The motivation of introducing smooth -gram is that we hope that the corresponding smooth -grams of two -grams  and for which is small, can be identical with a good probability. More precisely, let , and let and . By the property of the CGK-embedding, we know that . Let . If we randomly sample without replacement bits from two -bit strings and at the same indices, the probability that all the sampled bits are the same is


In our experiments we typically choose for a constant , and we are only interested in being at most . In this case we can approximate (1) as for some constant . In other words, for a non-trivial fraction of pairs of -gram, their corresponding smooth -gram will be matched. Finally, we note that when , with fixed and we must have with certainty.

We note that our construction of smooth -gram is very different from just a subsampling of the original -grams. Indeed, given two -grams  and where is obtained by a cyclic shift of by one coordinate (that is, we move the first coordinate of to the end of , and ), if we just sample say a constant fraction of coordinates from and using common randomness, getting and , then and will be different with very high probability.

As mentioned in the introduction, if we are able to match near-identical -grams (under edit distance), then we are able to catch similar pairs of strings which will otherwise be missed by standard -gram approaches. In this way we can significantly improve the recall of the algorithm. Of course, by allowing approximate matching we may also increase the number of false positives, that is, dissimilar pairs of strings may have many identical smooth -grams, and will thus be considered as similar pairs. To maintain a good precision we may need to perform a verification step on the candidate pairs of similar strings, which will increase the running time. Therefore, for a particular application, one needs to select a good tradeoff between the accuracy improvement and the extra running time cost.

We also comment that we can further enhance the precision by performing multiple CGK-embeddings (say, times), and/or multiple subsamplings (say, times), so that for each -gram we will create smooth -grams. However, these operations will increase the number of false positives as well, and consequently the running time. In our experiments in Section 4 we have computed the number of matching -grams on various datasets when varying the number of CGK-embeddings and subsamplings. But for our application of detecting overlapping error-prone sequencing reads, we have noticed that a single run of CGK-embedding and subsampling already gives satisfactory accuracy.

3. Applications to Overlap Detection among Long, Error-Prone Sequencing Reads

In this section we show how to use smooth -gram to solve the overlap detection problem for long, error-prone sequence reads. We approach the problem in two steps. In the first step (Section 3.1), we show how to use smooth -gram to detect putative pairs of overlapping strings. And then for each of such pairs, we design an efficient verification procedure to reduce the number of false positives (Section 3.2).

In Table 1 we have listed a set of global parameters/notations that will be used in our algorithms. Let denote the set .

length of smooth -gram
length of -gram after CGK-embedding
signature selection rate
frequency filtering threshold
edit distance threshold
threshold for #matched signatures
targeting overlap length
error tolerance rate
a random hash function
random string from
random string from
Table 1. List of Global Parameters

3.1. Detecting Putative Pairs of Overlapping Strings

Our algorithm for detecting overlapping pairs of strings is presented in Algorithm 2.

We will use the following data structure to store useful information of a -gram.

Definition 3.1 (-gram signature).

Let be a signature for a -gram; the parameters are interpreted as follows:

  • is the -gram;

  • , which can be seen as a hash rank of ;

  • denote that is taken from the -th input string from the position , that is, .

It is easy to see that and are fully determined by given the randomness and , but for convenience we still include them as parameters in the definition of the signature.

1:: set of input strings;
2: {overlapping pair and their shared substrings and
3:Initialize an empty table
4:for each  do
6:     for each  do
8:          Generate-Smooth--Gram()
12:     end for
13:end for
14:Count for all the number of signatures in the form of in , and remove all in with for all
15:for each  do
16:     Construct from by keeping signatures in with the smallest of hash ranks .
18:     for each  in  do
19:          Search-Similar--Grams(, )
20:     end for
21:     for each  do
23:     end for
24:end for
25:for each  do
26:      Verify(, , )
27:     if  then
28:          Find-Shared-Substrings(, , , , , )
30:     end if
31:end for
Algorithm 2 Find-overlapping-Strings()

We now describe Algorithm 2 in words. The algorithm can be divided into three stages. The first stage (Line 3 -13) is the initialization: for each input string , and for each of its -gram, we generate the corresponding -gram signature. In the second stage (Line 14 - 24) we try to find a set of candidate overlapping pairs of input strings. We will explain how this works in the rest of this section. The last stage (Line 25-31) is a verification step and will be illustrated in Section 3.2.

In the second stage of the algorithm, the first step is to filter out those smooth -grams whose frequency is above a certain threshold (Line 14). This is a common practice, and has been used in a number of previous algorithms, such as MHAP(Berlin et al., 2015), Minimap(Li, 2016), DALIGNER(Myers, 2014). The motivation of this pruning step is that frequent smooth -grams often correspond to frequent -grams, which do not carry much important features/information about the sequence (similar to the frequent words like ‘a’, ‘the’ in English sentences). On the other hand, these common smooth -grams will contribute to many false positives and consequently increase the running time of subsequent steps. It is inevitable that in this pruning procedure some true positives are also filtered out. However, we have observed that by appropriately choosing the filtering threshold ,222In our experiments, we choose for E.coli dataset, and for Human and S.cerevisiae datasets. we can significantly reduce the number of false positives at the cost of introducing a small number of false negatives.

After the filtering step we perform a subsampling of an -fraction of -grams using the random hash function (Line 16). We then only focus on these sampled -grams when measuring the string similarity. The purpose of performing such a subsampling is to reduce the total running time of the verification step (Line 26) by producing a set of smaller matching lists (Line 22). On the other hand, it will not affect the accuracy of the algorithm by much. This is because in the verification step we will consider a pair of input strings who have at least matched -gram pairs, and subsampling -grams by a ratio of corresponds to subsampling the matched -gram pairs by a ratio of . Therefore we can scale the threshold correspondingly to obtain a similar set of candidate string pairs.

1:: a signature for -gram  (see Definition 3.1 for detailed explanation of the parameters); : a table with buckets indexed by ;
4:for each  stored in  do
5:     if  then
7:     end if
8:end for
9:Add to the
Algorithm 3 Search-Similar--Grams(, )

We next try to find for each pair of input strings , their set of matching -grams (Line 15-24). This is done by calling a subroutine Algorithm 3 to find for each -gram, a list of its matching -grams (with edit distances less than or equal to ). More precisely, in Algorithm 3 we try to find for a -gram  an (incomplete) list of matching -grams  by considering all -gram  such that the corresponding smooth -grams of and fall into the same bucket in table (Line 4). We then perform a brute-force edit distance computation (Line 5) to make sure that ; if this holds then we record the pair and the positions of the match into . Finally at Line 9 we add the signature of into table to build the table gradually while performing the search.

3.2. Verification

In this section we discuss how to verify whether a pair of input strings overlap at a significant length given a list of their matching -grams, and if it is the case, what are the shared substrings in the respective strings. For this purpose we employ two subroutines: Algorithm 4 performs a basic verification, and outputs a pair of positions on and inside the shared substrings if is considered as an overlapping pair. We then use Algorithm 5 to recover the actual shared substrings.

1:: two input strings; : set of pairs of matched -gram positions in and ;
2:: reference offset : reference position
4:for each  do
6:end for
7:Find a value s.t. is maximized
8:Remove all pairs s.t. or
10:for each  do
12:end for
13:Find a value s.t. is maximized
14:Remove all pairs s.t. or
15:if  then
16:     return
18:     return
19:end if
Algorithm 4 Verify(, , )
1:: two input strings; : reference offset; : reference position; : sets of -gram signatures of and
2:: and are shared substrings in and
6:Remove s.t. from
7:Sort matches using in the increasing order
8:for each   do
9:     if   then
11:     end if
12:     if  then
14:     end if
15:end for
Algorithm 5 Find-Shared-Substrings(, , , , , )

We now describe Algorithm 4 and Algorithm 5 in words. Let be the list of starting positions of the matching pairs of -grams of input strings and . We construct bipartite graph with characters of as nodes on the left side, and characters of as nodes on the right side. For each matching pair , there is an edge connecting and . For convenience, we slightly abuse the notation by using to denote the edge between and , and call the shift of the edge.

It is not hard to imagine that if and overlap, there must be a large cluster of edges of similar shifts in . Algorithm 4 consists of two filtering steps. In the first step we try to identify a good reference shift (Line 3-7), and remove all the edges whose shifts are far away from (Line 8) (more precisely, those pairs with ). According to the previous literature, SMRT sequencing reads have accuracy (Koren and Phillippy, 2015). We thus set the error tolerance rate to be .

After finding a good reference shift, we try to find a dense area (or simply, a reference position in ) which contains many edges whose shifts are close to (Line 9-13). We then remove all the edges that are not in this dense area (Line 14). Finally, we count the number of edges in the dense areas; if the number is at least , then we consider an overlapping pair and return the reference edge (determined by and ); otherwise we simply return (Line 15-19).

We should note that all of these operations are performed on a subset of matched -gram pairs in and . By “subset” we mean that is constructed after the subsampling step at Line 16 in Algorithm 2. As mentioned above, the purpose of the subsampling is to reduce the running time in the verification step. In contrast, when the actual shared substrings between and are found by Algorithm 5, we exploit the complete set of matched -gram pairs, which will not significantly increase the overall running time because after verification, the number of input string pairs becomes much smaller.

Now, we turn to the details of the algorithm for determining the actual shared substrings between and (Algorithm 5). We again first construct the list of matching -grams. This can be done by a synchronized linear scan on the two sets and , after sorting the tuples by their values. Next, starting from the reference edge determined by and , we first locate the corresponding dense areas (Line 4-6). We then try to extend this dense area by adding one by one the matching edges outside this dense areas but still within a distance of from the dense area, in the increasing order of the distances between these matching edges to the dense area (Line 7-15). Finally the algorithm returns the extended area as the shared substrings between and .

4. Experiments

In this section we present experimental studies of smooth -gram and its application to detect overlaps among SMRT sequencing reads.

4.1. Tested Algorithms

We have implemented our algorithms presented in previous sections in C++, and complied them using GCC 5.4.0 with O3 flag.

To facilitate the investigation of properties of smooth -grams, we introduce an additional algorithm named Find-Similar--Gram-Pairs, which uses the smooth -gram technique to find pairs of input -grams whose edit distances are at most for a given distance threshold . The algorithm is depicted in Algorithm 6. Let us describe it in words briefly. Essentially, Find-Similar--Gram-Pairs can be seen as running Search-Similar--Grams (Algorithm 3) for each input -gram. Of course in this investigation we do not need to carry the data structure for each -gram  that we used in Algorithm 3 (for the application of overlap detection). Moreover, as mentioned at the end of Section 2, we can choose to repeat the CGK-embedding and the subsampling for and times respectively, so that for each -gram  we create smooth -grams. By doing this we can generate more similar -gram pairs which can be used to potentially boost the accuracy of our application. We will test Algorithm 6 for various and values. While in our applications in Section 4.4 we only perform the embedding and the subsampling once, which is enough for obtaining good accuracy.

1:: set of -grams; : number of CGK-embeddings; : number of subsamplings;
4:for each  do
5:     Pick a random string from ,
6:     for each  do
7:         Pick a random string from under the constraint that it contains -bit
8:         Initialize a new table
9:         for each  do
10:               Generate-Smooth--Gram (, , )
11:         end for
12:         Count for each distinct smooth -gram its frequency
13:         for each  do
14:              if frequency of is less than  then
15:                  for each -gram stored in the  do
17:                  end for
18:                  Store in
19:              end if
20:         end for
21:     end for
22:end for
23:Remove duplicate pairs in
24:for each  do
25:     if  then
27:     end if
28:end for
Algorithm 6 Find-Similar--Gram-Pairs(, , )

In Section 4.4 we compared Algorithm 2 with existing overlap detection algorithms. For convenience, we call our algorithm SmoothQGram. We briefly describe each of the competitors below.

MHAP(Berlin et al., 2015)333Implementation obtained from this algorithm generates -grams of all sequences and then filters out those with frequencies greater than times the total number of -grams. Next, it uses multiple Minhash (Broder et al., 1998) functions to find matching -grams between sequences, and then select pairs of sequences that have at least matching -grams as candidate pairs. For each candidate pair, it uses a modified sort-merge algorithm to find more accurate

-gram matches, and then computes the boundary of the overlap region using a uniformly minimum-variance unbiased (UMVU) estimator

(Cheng and Amin, 1983).

Minimap(Li, 2016)444Implementation obtained from this algorithm generates -grams of all sequences and then filters out the top fraction of the most frequent ones. Next, it hashes each -gram to a value in , and selects -grams with the smallest hash values in every consecutive -grams as signatures of the input sequence. It then find all matching signatures between input sequences; pairs of sequences that have at least one shared signature are identified as candidate pairs. Minimap then calculates a cluster of -gram matches for each candidate pair, and then finds a maximum colinear subset of matches by solving a longest increasing sequence problem. If the size of the subset is larger than , then Minimap computes and outputs the overlap region using the subset of matches.

DALIGNER(Chaisson and Tesler, 2012)555Implementation obtained from this algorithm generates -grams of all sequences and then filters out those that occur more than times. It then considers all the remaining -grams directly and computes all the matching -grams between pairs of sequences. Pairs of sequences with at least one shared -gram are identified as a candidate. Next, for each candidate, it uses a linear time difference algorithm (Myers, 1986) to compute a local alignment between the two sequences, and outputs the pair if the alignment length is greater than the given threshold.

We note that all these algorithms are under the same “seed-extension” framework as ours: they first find all the matched -grams between input sequence pairs, and then extend seeds to potential overlaps. The major difference between our SmoothQGram and the existing tools is that we have relaxed the strict -gram matches to approximate -gram matches (via smooth -gram) to improve the accuracy of the output. Our specifics for overlap detection and determination of shared substrings are also different from the existing algorithms.

In our experiments we run SmoothQGram with parameters , and (except for Figure 2, where we have tested different values) . We choose for E.coli, and for Human and S.cerevisiae. We run MHAP with parameters “-num-hashes 1256”, Minimap with parameters “-k 15 -Sw5 -L100 -m0 -t8”, and DALIGNER with parameter “-H500”. All other parameters were selected as default settings.

We note that the performance of Minimap is sensitive to the filter threshold (“-f”) it uses. Thus, besides the default parameter, we also choose an alternative parameter “” (according to (Koren et al., 2017)’s recommendation), which essentially means that almost no -gram will be filtered out. Intuitively, such a change will lead to better recall values (but possibly worse precision values) at the cost of greater memory usage and running time. We call the original version Minimap-default and the new version Minimap-alternative.

4.2. The Setup


We test algorithms using real world datasets from PacBio SMRT sequencing.666The data was downloaded from MHAP’s supporting data website: The statistics of these datasets are described in Table 2. In Section 4.3 we test Algorithm 6 using the dateset E.coli-small; the number of q-grams for ecoli-small is . In Section 4.4 we compare different algorithms using much larger datasets E.coli, S.cerevisiae and Human.

Datasets number of strings Average Length
E.coli-small 100 4781
E.coli 46960 4221
S.cerevisiae 48279 3032
Human 47535 3100
Table 2. Statistics of Tested Datasets


In Section 4.3 we report the number of matching -gram pairs detected by the Algorithm 6. Each result is an average of runs. In Section 4.4, we report four types of measurements in our experiments: recall, precision, memory usage and running time. We choose the evaluation program used by MHAP to calculate precision and recall; all parameters are selected as default except that the evaluation overlap length threshold is set to be or . The evaluation program learns the ground truths from the reads to references mappings obtained by Blasr (Chaisson and Tesler, 2012). We note that the evaluation algorithm does not simply use edit distance as the criteria to compute precise and recall. Instead, it maps all the reads to the reference sequences, and computes for each pair of reads their overlap positions and lengths from the mapping results. This evaluation method is widely used in overlap detection because it considers biological meanings of overlaps between reads.

We also present the score:

which is an integrated metric evaluating both precision and recall.

All algorithms use multiple threads in execution; we thus measure the CPU time for comparison. The memory usage we report is the maximum memory usage of a program during its execution. We note that although all the tested algorithms are randomized, we use a fixed random seed for all of them to guarantee the consistency among outputs.

Computing Environment

All experiments are conducted on a Dell PowerEdge T630 server with 2 Intel Xeon E5-2667 v4 3.2GHz CPU with 8 cores each, and 256GB memory.

4.3. Finding -gram Pairs with Small Edit Distance

In this section we present the performance of Find-Similar--Gram-Pairs (Algorithm 6). We choose the parameters for Algorithm 6 from Table 3; parameters underlined are default values.

Parameter Values
Table 3. Parameters for Algorithm 6

Matching Pairs of -grams

We first study how different parameter values (the length of the smooth -gram), (the number of CGK-embeddings), and (the number of samplings) influence the number of almost matching -gram pairs that we can find. Our results are presented in Figures 2, 3 and 4.

Figure 2. Number of matching -gram pairs vs smooth -gram size ; on E.coli-small
Figure 3. Number of matching -gram pairs vs number of embeddings ; on E.coli-small
Figure 4. Number of matching -gram pairs vs number of subsamplings ; on E.coli-small

We observe that the number of matching -grams increases when decreases, and significantly increases when and increase. For example, fix to be . For , we can detect times -gram matches with edit distance being at most of that of exact matches, using only one subsampling and one embedding. With , we could detect times (distinct) -gram matches with edit distance being at most of that of exact matches; and with subsamplings, we could detect times (distinct) -gram matches.

In the rest of this section we will simply set , mainly for the sake of time/space saving. We found that by setting we can already obtain very good accuracy, though higher and values can potentially lead to better accuracy.

True and False Positives

We next study how different parameters influence the number of false positives. We call the pairs of -grams whose edit distances are at most true positives, and those with edit distances larger than false positives. Our results are presented in Figures 5 and 6.

Figure 5. Number of true/false positive -gram pairs vs smooth -gram size ; on E.coli-small
Figure 6. Number of true/false positive -gram pairs vs filter threshold ; on E.coli-small

We note that Figure 5 and Figure 2 come from the same set of experiments, but with different edit distance ranges and scales recorded. We observed that the number of false positives increases with , and decreases sharply with .

Figure 2 and Figure 5 also guide us on how to choose to balance the number of true positives and false positives. Under the condition that we get a good number of true positives (i.e., -gram pairs whose edit distance is at most ), and we do not have too many false positives, it seems that is a good choice and we set it as the default parameter.

When , and (i.e., no filter), we can detect times true positives while introduce times false positives (of the number of exact matches). By setting the filter threshold , we can detect times true positives while only introduce times false positives. This convinces us that removing frequent smooth -grams have a greater impact on reducing false positives than true positives, and is thus very useful for our purpose (i.e., to save the verification time at a minimal cost on the accuracy).

4.4. Finding Overlapping Sequencing Reads

In this section we present the experimental results on detecting overlapping sequencing reads with Algorithm 2.


We study the precision, recall and scores of all tested algorithms. The results are presented in Table 4 and Table 5.

78.0% 99.8% 0.87 75.1% 92.5% 0.83 74.1% 83.7% 0.79
92.4% 100.0% 0.96 15.4% 99.9% 0.26 68.9% 98.8% 0.81
94.1% 99.8% 0.97 89.5% 93.1% 0.91 71.4% 40.4% 0.52
86.1% 97.8% 0.92 82.8% 94.8% 0.88 80.6% 67.1% 0.73
95.1% 100.0% 0.97 90.9% 99.2% 0.95 84.7% 99.2% 0.91
Table 4. Accuracy for pairs with overlaps of lengths
66.3% 99.8% 0.80 65.8% 94.3% 0.77 77.1% 84.8% 0.81
77.2% 99.9% 0.78 12.0% 99.8% 0.21 50.0% 99.2% 0.66
79.8% 99.8% 0.89 72.4% 99.6% 0.84 58.2% 57.7% 0.58
79.7% 94.3% 0.86 71.8% 90.9% 0.80 61.5% 63.7% 0.63
89.9% 100.0% 0.95 85.1% 98.6% 0.91 84.7% 95.5% 0.90
Table 5. Accuracy for pairs with overlaps of lengths

Based on our results, SmoothQGram has the best recall values at all times, the best precision values in most cases, and the best scores (the harmonic average of precision and recall) at all times. Its scores are always greater than . While the lower bound of the score of the best competitor is only (Minimap on S.cerevisiae). The performance of SmoothQGram is also robust on data from different species and different overlap lengths .

We note again that we can further improve the accuracy of SmoothQGram by using multiple embeddings and subsamplings, at the cost of larger space and time.

Comparing the results for the three species, we found that E.coli is generally easier to deal with than S.cerevisiae and Human, which may be due to the fact that S.cerevisiae and Human genome contain more repeats. For different overlap lengths , we notice that all algorithms generally perform better on the greater length than the smaller one, which is reasonable because longer overlaps are generally easier to be detected.

9476 68.1 8025 75.2 7472 74.6
103 6.4 61 4.8 56 4.7
666 14.9 2836 15.5 2550 13.3
2072 19.7 6376 17.4 8821 17.6
6734 85.5 7400 63.5 6736 63.1
Table 6. Running time and memory usage

Time and Space

Finally, we study the running time and memory usage of tested algorithms. Our results are presented in Table 6. We observe that Minimap has the best time and memory performance among all algorithms. DALIGNER spends similar running time as SmoothQGram, but smaller amount of memory. SmoothQGram has the similar (slightly better) memory and time performance than MHAP. The reason why SmoothQGram uses relatively large time and memory is that SmoothQGram considers smooth -gram instead of -gram, which captures more matching information between sequences, and thus needs more time to verify candidate sequence pairs and uses more space. On the other hand, this is also why SmoothQGram significantly improved the accuracy for overlap detection.

We note that in our experimental studies, we mainly focused on accuracy which we think is the most important; our codes were not fully optimized for space and running time.

4.5. Summary

In this section we have performed an extensive experimental study on smooth -gram and its application to overlap detection. We observed that the smooth -gram based approach achieved much better accuracy than the conventional -gram based approaches for overlap detection, which due to the fact that smooth -gram is capable of capturing near-matches between subsequences. Employing smooth -gram may introduce a larger number of false positives, but the number can be greatly reduced by applying a frequency-based filter. The performance of our algorithm is stable and robust on genome sequences from various species that we have tested, and using different overlap lengths .


  • (1)
  • Altschul et al. (1990) Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. 1990. Basic local alignment search tool. Journal of molecular biology 215, 3 (1990), 403–410.
  • Belazzougui and Zhang (2016) Djamal Belazzougui and Qin Zhang. 2016. Edit Distance: Sketching, Streaming, and Document Exchange. In FOCS. 51–60.
  • Berlin et al. (2015) Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology 33, 6 (2015), 623–630.
  • Broder et al. (1998) Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. 1998. Min-wise independent permutations. In STOC. ACM, 327–336.
  • Brudno et al. (2003) Michael Brudno, Chuong B Do, Gregory M Cooper, Michael F Kim, Eugene Davydov, Eric D Green, Arend Sidow, Serafim Batzoglou, NISC Comparative Sequencing Program, and others. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome research 13, 4 (2003), 721–731.
  • Burkhardt and Kärkkäinen (2001) Stefan Burkhardt and Juha Kärkkäinen. 2001. Better filtering with gapped q-grams. In CPM. 73–85.
  • Burkhardt and Kärkkäinen (2002) Stefan Burkhardt and Juha Kärkkäinen. 2002. One-gapped q-gram filters for Levenshtein distance. In CPM. 225–234.
  • Burkhardt and Kärkkäinen (2003) Stefan Burkhardt and Juha Kärkkäinen. 2003. Better filtering with gapped q-grams. Fundamenta informaticae 56, 1-2 (2003), 51–70.
  • Chaisson and Tesler (2012) Mark Chaisson and Glenn Tesler. 2012. Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application. BMC Bioinformatics 13 (2012), 238.
  • Chakraborty et al. (2016) Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. 2016. Streaming algorithms for embedding and computing edit distance in the low distance regime. In STOC. 712–725.
  • Cheng and Amin (1983) R. C. H. Cheng and N. A. K. Amin. 1983. Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin. Journal of the Royal Statistical Society 45, 3 (1983), 394–403.
  • Compeau et al. (2011) Phillip EC Compeau, Pavel A Pevzner, and Glenn Tesler. 2011. How to apply de Bruijn graphs to genome assembly. Nature biotechnology 29, 11 (2011), 987–991.
  • Keich et al. (2004) Uri Keich, Ming Li, Bin Ma, and John Tromp. 2004. On spaced seeds for similarity search. Discrete Applied Mathematics 138, 3 (2004), 253–263.
  • Koren and Phillippy (2015) Sergey Koren and Adam M Phillippy. 2015. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current opinion in microbiology 23 (2015), 110–120.
  • Koren et al. (2017) Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and Adam M Phillippy. 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research 27, 5 (2017), 722–736.
  • Kurtz et al. (2004) Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L Salzberg. 2004. Versatile and open software for comparing large genomes. Genome biology 5, 2 (2004), R12.
  • Li (2016) Heng Li. 2016. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 14 (2016), 2103–2110.
  • Ma et al. (2002) Bin Ma, John Tromp, and Ming Li. 2002. PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 3 (2002), 440–445.
  • Manning and Schütze (1999) Christopher D Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT press.
  • Mikheyev and Tin (2014) Alexander S Mikheyev and Mandy MY Tin. 2014. A first look at the Oxford Nanopore MinION sequencer. Molecular ecology resources 14, 6 (2014), 1097–1102.
  • Miller et al. (2010) Jason R Miller, Sergey Koren, and Granger Sutton. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95, 6 (2010), 315–327.
  • Myers (1986) Eugene W. Myers. 1986. An O ( ND ) difference algorithm and its variations. Algorithmica 1, 1-4 (1986), 251–266.
  • Myers (2014) Gene Myers. 2014. Efficient Local Alignment Discovery amongst Noisy Long Reads. In WABI. 52–67.
  • Pevzner et al. (2001) Pavel A Pevzner, Haixu Tang, and Michael S Waterman. 2001. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98, 17 (2001), 9748–9753.
  • Qin et al. (2011) Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, and Xuemin Lin. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD. 1033–1044.
  • Roberts et al. (2013) Richard J Roberts, Mauricio O Carneiro, and Michael C Schatz. 2013. The advantages of SMRT sequencing. Genome biology 14, 6 (2013), 405.
  • Schwartz et al. (2003) Scott Schwartz, W James Kent, Arian Smit, Zheng Zhang, Robert Baertsch, Ross C Hardison, David Haussler, and Webb Miller. 2003. Human–mouse alignments with BLASTZ. Genome research 13, 1 (2003), 103–107.
  • Wang et al. (2012) Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD. 85–96.
  • Xiao et al. (2008) Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1, 1 (2008), 933–944.
  • Zhang and Zhang (2017) Haoyu Zhang and Qin Zhang. 2017. EmbedJoin: Efficient Edit Similarity Joins via Embeddings. In SIGKDD. 585–594.