eBWTclust
Positional clustering of the extended Burrows-Wheeler transform
view repo
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.
READ FULL TEXT VIEW PDFPositional clustering of the extended Burrows-Wheeler transform
The cheap and fast Next Generation Sequencing (NGS) technologies are producing huge amounts of data that put a growing pressure on data structures designed to store raw sequencing information, as well as on efficient analysis algorithms: collections of billion of DNA fragments (reads) need to be efficiently indexed for downstream analysis. After a sequencing experiment, the most traditional analysis pipeline begins with an error-prone and lossy mapping of the collection of reads onto a reference genome. Among the most widespread tools to align reads on a reference genome we can mention BWA [LiDurbin09], Bowtie2 [LangmeadSalzberg12], SOAP2 [LiYLLYKW09]. These methods share the use of the FM-index [FerraginaM00], an indexing machinery based on the Burrows-Wheeler Transform (BWT) [bwt94]. Other approaches [KKbmc15, KKbioinf15] combine an index of the reference genome with the BWT of the reads collection in order to boost efficiency and accuracy. In some applications, however, aligning reads on a reference genome presents limitations mainly due to the difficulty of mapping highly repetitive regions, especially in the event of a low-quality reference genome (not to mention the cases in which the reference genome is not even available).
For this reason, indices of reads collections have also been suggested as a lossless dictionary of sequencing data, where sensitive analysis methods can be directly applied without mapping the reads to a reference genome (thus without needing one), nor assembling [kissnp, kissplice12, Leggett2014, Iqbal-CORTEX]. In [Dolle2017] the BWT, or more specifically its extension to string collections (named eBWT [MantaciRRS07, BauerCoxRosoneTCS2013]), is used to index reads from the 1000 Genomes Project [1000GP] in order to support -mer search queries. An eBWT-based compressed index of sets of reads has also been suggested as a basis for both RNA-Seq [CoxJakobiRosoneST2012] and metagenomic [Ander2013] analyses. There exist also suffix array based data structures devised for indexing reads collections: the array [PhilippeSLLCR11, ISBRA-2013] and the PgSA [KGD15]. Finally, there are several tools ([kissnp, kissplice12, takeabreak14, bubble-spire, Leggett-2013, Leggett2014, Iqbal-CORTEX]) that share the idea of using the de Bruijn graph (dBG) of the reads’ -mers. The advantages of dBG-based indices include allowing therein the characterization of several biologically-interesting features of the data as suitably shaped and sized bubbles^{2}^{2}2A bubble in a graph is a pair of disjoint paths sharing the same source node and target node. (e.g. SNPs, INDELs, Alternative Splicing events on RNA-Seq data, sequencing errors can all be modeled as bubbles in the dBG of sequencing data [kissnp, kissplice12, takeabreak14, bubble-spire, Leggett-2013]). The drawback of these dBG representation, as well as those of suffix array based indices such as the ones introduced in [PhilippeSLLCR11, ISBRA-2013], is the lossy aspect of getting down to -mers rather than representing the actual whole collection of reads. Also [KKbmc15, KKbioinf15] have this drawback as they index -mers. An eBWT-based indexing method for reads collections, instead, has the advantages to be easy to compress and, at the same time, lossless: (e)BWT indexes support querying -mers without the need to build different indexes for different values of .
Here we introduce the Positional Clustering framework: an eBWT-based index of reads collections where we give characterizations of (i) positions sharing a context as clusters in the eBWT, and (ii) the onset of these clusters by means of the LCP. This clustering allows to locate and investigate, in a lossless index of reads collections, genome positions possibly equivalent to bubbles in the dBG [takeabreak14, kissnp] independently from the k-mer length (a major drawback of dBG-based strategies). We thus gain the advantages of dBG-based indices while maintaining those of (e)BWT-based ones. Besides, the eBWT index also contains abundance data (useful to distinguish errors from variants, as well as distinct variant types) and does not need the demanding read-coherency check at post processing as no micro-assembly has been performed. With these promising advantages with respect to dBG-based strategies, the positional clustering framework allows likewise reference-free and assembly-free detection of SNPs [kissnp, discoSNP2017bioRxiv, Iqbal-CORTEX], small INDELs [SOAPindel, Iqbal-CORTEX], sequencing errors [SR2014, SWRU17, LFP17], alternative splicing events [kissplice12], rearrangements breakpoints [takeabreak14] on raw reads collections.
As a proof-of-concept, we test our theoretical framework with a prototype tool named eBWTclust designed to detect positional clusters and post-process them for assembly-free and reference-free SNPs detection directly on the eBWT of reads collection. Among several reference-free SNPs finding tools in the literature [kissnp, discoSNP2015, discoSNP2017bioRxiv, Iqbal-CORTEX], the state-of-the-art is represented by the well documented and maintained KisSNP and DiscoSnp suite [kissnp, discoSNP2015, kSNP2biblio], where DiscoSnp++ [discoSNP2017bioRxiv] is the latest and best performing tool. In order to validate the accuracy of positional clustering for finding SNPs, we compared DiscoSnp++ sensitivity and precision to those of our prototype eBWTclust. Preliminary results on human chromosomes show that, even when using relatively low coverages (22x), our tool is able to find of all SNPs (vs of DiscoSnp++) with an accuracy of (vs of DiscoSnp++).
Let be a finite ordered alphabet with , where denotes the standard lexicographic order. For , we denote its letters by , where is the length of , denoted by . We append to an end-marker symbol that satisfies . Note that, for , and . A substring of is denoted as , with being called a prefix and a suffix of .
We denote by a collection of strings (reads), and by the end-marker appended to (for ), with if . Let us denote by the sum of the lengths of all strings in .The generalized suffix array GSA of the collection (see [Shi:1996, CGRS_JDA_2016, Louza2017]) is an array containing pairs of integers , corresponding to the lexicographically sorted suffixes , where and . In particular, (for ) if the suffix is the -th smallest suffix of the strings in . Such a notion is a natural extension of the suffix array of a string (see [Manber:1990]). The Burrows-Wheeler Transform (BWT) [bwt94], a well known text transformation largely used for data compression and self-indexing compressed data structure, has also been extended to a collection of strings (see [MantaciRRS07]). Such an extension, known as extended Burrows-Wheeler Transform (eBWT) or multi-string BWT, is a reversible transformation that produces a string that is a permutation of the letters of all strings in : is obtained by concatenating the symbols cyclically preceding each suffix in the list of lexicographically sorted suffixes of all strings in . The eBWT applied to can also be defined in terms of the generalized suffix array of ([BauerCoxRosoneTCS2013]): if then ; when , then . For gsa,ebwt, and , the LF mapping (resp. FL) is a function that associates to each (e)BWT symbol the position preceding (resp. following) it on the text.
The longest common prefix (LCP) array of a collection of strings (see [CGRS_JDA_2016, Louza2017, Manzini2017]), denoted by , is an array storing the length of the longest common prefixes between two consecutive suffixes of in lexicographic order. For each , if and , is the length of the longest common prefix of suffixes starting at positions and of the strings and , respectively. We set .
For gsa, ebwt, and lcp, the set will be omitted when clear from the context.
Let be a read sequenced from a genome . We say that is a read-copy of iff is copied from during the sequencing process (and then possibly changed due to sequencing errors). Let us consider the eBWT of a set of reads of length^{3}^{3}3For simplicity of exposition, here we assume that all the reads have the same length . With little more effort, it can be shown that our results hold also when is the average read length. , sequenced from a genome . Assuming that is the coverage of , let us denote with the read-copies of . Should not there be any sequencing error, if we consider such that the genome fragment occurs only once in (that is, nowhere else than right after ) and if
is large enough so that with high probability each
is followed by at least nucleotides, then we observe that the read copies of would appear contiguously in the eBWT of the reads. We call this phenomenon eBWT positional clustering. Due to sequencing errors, and to the presence of repetitions with mutations in real genomes, a clean eBWT positional clustering is not realistic. However, in this section we show that, even in the event of sequencing errors, in the eBWT of a collection of reads sequenced from a genome , the read-copies of still tend to be clusteredtogether according to a suitable Poisson distribution.
We make the following assumptions: (i) the sequencing process is uniform, i.e. the positions from where each read is sequenced are uniform and independent random variables, (ii) the probability
that a base is subject to a sequencing error is a constant^{4}^{4}4We assume this to simplify the theoretical framework. In Section 4 we will see that our framework works even on real data with sequencing simulated using realistic errors distribution., (iii) a sequencing error changes a base to a different one uniformly (i.e. with probability for each of the three possible variants), and (iv) the number of reads is large (hence, in our theoretical analysis we can assume ).[eBWT cluster] The eBWT cluster of , with being a position on , is the substring such that is the range of read suffixes prefixed by , where is the smallest value for which appears only once in . If no such value of exists, we take and say that the cluster is ambiguous.
If no value guarantees that appears only once in , then the eBWT cluster of does not contain only read-copies of but also those of other characters . We call the multiplicity of the eBWT cluster. Note that for non-ambiguous clusters.
[eBWT positional clustering] Let be the read-copies of . An expected number of these read copies will appear in the eBWT cluster of , where is a Poisson random variable with mean
and where is defined as in Definition 3.
The probability that a read covers is . However, we are interested only in those reads such that, if is a read-copy of , then the suffix contains at least nucleotides, i.e. . In this way, the suffix will appear in the GSA range of suffixes prefixed by or, equivalently, will appear in . The probability that a random read from the set is uniformly sampled from such a position is . If the read contains a sequencing error inside , however, the suffix will not appear in the GSA range . The probability that this event does not happen is . Since we assume that these events are independent, the probability of their intersection is therefore
This is a Bernoullian event, and the number of read-copies of falling in is the sum of independent events of this kind. Then, follows a Poisson distribution with mean . ∎
Theorem 3 states that, if there exists a value such that appears only once in (i.e. if the cluster of is not ambiguous), then of the letters in are read-copies of . The remaining letters are noise introduced by suffixes that mistakenly end up inside due to sequencing errors. It is not hard to show that this noise is extremely small under the assumption that is a uniform text; we are aware that this assumption is not realistic, but we will experimentally show in Section 4 that also on real genomes we can safely assume without affecting results.
Note that the expected coverage of position is also a Poisson random variable, with mean equal to the average coverage. On expectation, the size of non-ambiguous ebwt clusters is thus times the average coverage. E.g., with , (the study [schirmer2016illumina] reports this maximum average substitution rate for Illumina HiSeq platforms), and the expected cluster size is the average coverage.
Finally, it is not hard to prove, following the proof of Theorem 3, that in the general case with multiplicity the expected cluster size follows a Poisson distribution with mean (because the read-copies of positions are clustered together). This observation will allow us to detect and discard ambiguous clusters using a significance test.
So far, we have demonstrated the eBWT positional clustering property but we don’t have a way for identifying the eBWT clusters. A naive strategy could be to fix a value of and define clusters to be ranges of -mers in the gsa. This solution, however, fails to separate read suffixes differing after positions (this is, indeed, a drawback of all -mer-based strategies). The aim of Theorem 3 is precisely to fill this gap, allowing us to move from theory to practice. Intuitively, we show that clusters lie between local minima in the LCP array. This strategy automatically detects the value satisfying Definition 3 in a data-driven approach.
Our result holds only if two conditions on the eBWT cluster under investigation are satisfied (see Proposition 3 for an analysis of the success probability). More specifically, we require that the eBWT cluster of position satisfies:
The cluster does not have noise, i.e. , and
Let . For any such that , if both and contain their leftmost sequencing errors in and , then .
Proposition 3 will show that with probability high enough Condition (2) is satisfied in practice. E.g., with , (see [schirmer2016illumina]), mean coverage (the coverage of one of our experiments in Section 4.4), and for any , Proposition 3 shows that the condition holds (in expectation) on at least of the non-ambiguous clusters.
Let be the eBWT cluster of a position meeting Conditions (1) and (2). Then, there exists a value such that is a non-decreasing sequence and is a non-increasing sequence.
Let us denote by the largest index in such that , where is the maximum value of LCP in (if occurs multiple times, take the rightmost occurrence). We claim the theorem holds for . Let us denote by and , , , the positive integers such that . This means, by using Condition (2), that the read contains the longest prefix without sequencing errors, . Consider any other suffix in the range and let be the leftmost mismatch letter in , with . Note that if does not contain sequencing errors, then , otherwise has been mutated with an error. We suppose that the mutation at position generated a letter lexicographically smaller than that of the genome (the other case is symmetric). By Condition (2), and no other suffix in the range satisfies . Then, falls right after a suffix such that either properly prefixes or the leftmost mismatch occurs at position . Similarly, falls right before a suffix such that and whose leftmost mismatch position satisfies . This shows that, before suffix , other suffixes are ordered by increasing position of their leftmost mismatch letter (since ) and, then, by the lexicographic order among mismatch letters, which in particular implies that before suffix the lcp values are non-decreasing. Symmetrically, with a similar reasoning one can easily prove that after suffix the lcp values are non-increasing. ∎
Given a eBWT cluster with multiplicity , Condition (2) holds with probability at least
for any integer and where: is the value defined in Definition 3, is the mean of the Poisson distribution of Theorem 3,
is the cumulative distribution function of a Poisson random variable with mean
evaluated in , and , , .First note that, by Theorem 3, the number of suffixes in the cluster sharing their first bases is at most with probability . We first analyze the probability that the condition holds for a fixed offset (i.e. looking at the -th base of all suffixes in the range sharing their first characters), and then compute the joint probability for all .
Note that the condition cannot fail if the number of bases that have been subject to errors is either or . The condition does not fail also if we have or distinct errors, while it always fails for (since at least two errors will produce the same base). On a generic number of suffixes (those sharing their first bases), one can verify that the probability that the condition does not fail for a fixed number of errors is , where , , . Since these events are disjoint, we can sum these probabilities to get a lower bound to the probability that condition (2) holds on a specific offset on suffixes. Since this probability decreases as increases, we get a lower bound by taking the maximum , and obtain probability .
To get a lower bound to the probability that the condition holds simultaneously on all , we take the product of these probabilities (note that errors at different offsets are independent events). This reasoning holds only if there are at most suffixes sharing their first bases — which happens with probability — so we must include this multiplicative correction factor to get our final lower bound. ∎
Note: to get a lower bound that is independent from in Proposition 3, use the that maximizes the expression (since the lower bound holds for any ).
According to Theorem 3, clusters are delimited by local minima in the LCP array of the read set. This gives us a strategy for finding clusters that is independent from . Importantly, the proof of Theorem 3 also gives us the suffix in the range (the -th suffix) whose longest prefix without sequencing errors is maximized. This will be useful in the next section to efficiently compute a consensus of the reads in the cluster.
When the reads dataset contains variations (e.g. two allele of the same individual, or two or more distinct individuals, or different isoforms of the same gene in RNA-Seq data, or different reads covering the same genome fragment in a sequencing process, etc.), the eBWT positional clustering described in the previous section can be used to detect, directly from the raw reads (hence, without assembly and without the need of a reference genome), positions exhibiting possibly different values, but followed by the same context: they will be in a cluster delimited by LCP minima and containing possibly different letters (corresponding to the read copies of the variants of in the read set). This general idea can be used in several applications: error correction, assembly (the strategy finds overlaps between reads, so it can be used for assembly purposes), haplotype discovery (if fed with reads from just one diploid sample, this strategy can find heterozygous sites), and so on.
In this section, with the purpose of experimentally validating the theoretical framework of Section 3, we describe a new alignment-free and reference-free method that, with a simple scan of the eBWT and LCP array, detects SNPs in read collections.
Since (averagely) half of the reads comes from the forward (F) strand, and half from the reverse-complement (RC) strand, we denote with the term right (resp. left) breakpoint those variants found in a cluster formed by reads coming from the F (resp. RC) strand, and therefore sharing the right (resp. left) context adjacent to the variant. A non-isolated SNP [discoSNP2015] is a variant at position such that the closest variant is within bases from , for some fixed (we use in our validation procedure, see below). The SNP is isolated otherwise. Note that, while isolated SNPs are found twice with our method (one as a right breakpoint and one as a left breakpoint), this is not true for non-isolated SNPs: variants at the sides of a group of non-isolated SNPs are found as either left or right breakpoint, while SNPs inside the group will be found with positional clustering plus a partial local assembly of the reads in the cluster. In the next two subsections we give all the details of our strategy.
Since we do not aim at finding matches between corresponding pairs of clusters on the forward and reverse strands, we augment the input adding the reverse-complement of the reads: for a reads set , we add as well. Hence, given two reads sets and , in the pre-processing phase we compute , , and , for [Louza2017, CGRS_JDA_2016, LouzaGT17b]. We also compute because we will need it (see Subsection 4.2) to extract left and right contexts of the SNP. Though this could be achieved by performing (in external memory) multiple steps of LF- and FL-mappings on the eBWT, this would significantly slow-down our tool. Note that our approach can also be generalized to more than two reads collections.
Our SNPs calling approach takes as input , , and and outputs SNPs in KisSNP2 format [kSNP2biblio]: a fasta file containing a pair of sequences per SNP (one per sample, containing the SNP and its context). The SNP calling is divided in two main steps.
Build clusters. First, we scan and , find clusters using Theorem 3, and store them to file as a sequence of ranges on the eBWT. In addition, while computing clusters we also apply a threshold of minimum LCP (by default, 16): we cut clusters’ tails containing LCP values smaller than the threshold. This additional filtering drastically reduces the number of clusters saved to file (and hence memory usage and running time), since the original strategy would otherwise output many short clusters containing small LCP values corresponding to noise.
Call SNPs. The second step takes as input the clusters file, , , , and , and processes clusters from first to last as follows:
We test the cluster’s length using the Poisson distribution predicted by Theorem 3; if the cluster’s length falls in one of the two tails at the sides of the distribution (by default, the two tails summing up to of the distribution), then the cluster is discarded; Moreover, due to -mers that are not present in the genome but appear in the reads because of sequencing errors (which introduce noise around cluster length equal to 1), we also fix a minimum value of length for the clusters (by default, 4 letters per sample).
In the remaining clusters, we find the most frequent nucleotides and of samples 1 and 2, respectively, and check whether ; if so, then we have a candidate SNP: for each sample, we use the GSA to retrieve the coordinate of the read containing the longest right-context without errors (see explanation after Proposition 3); moreover, we retrieve, and temporarily store in a buffer, the coordinates of the remaining reads in the cluster.
After processing all events, we scan the fasta file storing to retrieve the reads of interest (those whose coordinates are in the buffer); for each one of them, we compute a partial assembly of the read prefixes preceding the SNP, for each of the two samples. This allows us to compute a left-context for each SNP (by default, of length 20), and represents a further validation step: if the assembly cannot be built because a consensus cannot be found, then the cluster is discarded. Note that these left-contexts preceding SNPs (which are actually right-contexts if the cluster is formed by reads from the RC strand) allow us to capture non-isolated SNPs.
Complexity In the clustering step, we process the eBWT and LCP and on-the-fly output clusters to disk. The SNP-calling step performs one scan of the eBWT, GSA, and clusters file to detect interesting clusters, plus one additional scan of the read set to retrieve contexts surrounding SNPs. Both these phases take linear time in the size of the input and do not use disk space in addition to the input and output. Due to the fact that we store in a buffer the coordinates of reads inside interesting clusters, this step uses an amount of RAM proportional to the number of SNPs times the average cluster size times the read length (e.g. a few hundred MB in our case study of Section 4.4). Notice that our method is very easy to parallelize, as the analysis of each cluster is independent from the others.
Here we describe the validation tool we designed to measure the sensitivity and precision of any tool outputting SNPs in KisSNP2 format. Note that we output SNPs as pairs of reads containing the actual SNPs plus their contexts (one sequence per sample). This can be formalized as follows: the output is a series of pairs of triples (we call them calls) where , , , are the left/right contexts of the SNP in the two samples, and letters , are the actual variant. Given a .vcf file (Variant Call Format) containing the ground truth, the most precise way to validate this kind of output is to check that the triples actually match contexts surrounding true SNPs on the reference genome (used here just for accuracy validation purposes). That is, for each pair in the output calls:
If there is a SNP in the .vcf that is surrounded in the first sample by contexts (or their RC), then is a true positive (TP).
Any pair that is not matched with any SNP in the ground truth (as described above) is a false positive (FP).
Any SNP in the ground truth that is not matched with any call is a false negative (FN).
We implemented the above validation strategy with a (quite standard) reduction of the problem to the 2D range reporting problem: we insert in a two-dimensional grid two points per SNP (from the .vcf) using as coordinates the ranks of its right and (reversed) left contexts among the sorted right and (reversed) left contexts of all SNPs (contexts from the first sample) on the F and RC strands. Given a pair , we find the two-dimensional range corresponding to all SNPs in the ground truth whose right and (reversed) left contexts are prefixed by and (the reversed) , respectively. If there is at least one point in the range matching the variation , then the call is a TP^{5}^{5}5Since other tools such as DiscoSnp++ do not preserve the order of samples in the output, we actually check also the variant and also search the range corresponding to and (case 1 above; note: to be a TP, a SNP can be found either on the F or on the RC strand, or both). Otherwise, it is a FP (case 2 above). Finally, pairs of points (same SNP on the F/RC strands) that have not been found by any call are marked as FN (case 3 above). We repeat the procedure for any other SNP found between the two strings and to find non-isolated SNPs.
In order to valuate our method, we compare eBWTclust with DiscoSnp++, that is a revisiting of the DiscoSnp algorithm: while DiscoSnp detects both heterozygous and homozygous isolated SNPs from any number of read datasets without a reference genome, DiscoSnp++ is designed for detecting and ranking all kinds of SNPs and small indels from raw read set(s). As shown in [discoSNP2017bioRxiv], DiscoSnp++ performs better than state-of-the-art methods in terms of both computational resources and quality of the results.
DiscoSnp++ is composed of several independent tools. As a preprocessing step, the dBG of the input datasets is built, by also removing erroneous -mers. Then, DiscoSnp++ detects bubbles generated by the presence of SNPs (isolated or not) and indels and it outputs a fasta file containing the variant sequences (KisSnp2 module). A final step (kissreads2) maps back the reads from all input read sets on the variant sequences, mainly in order to determine the read coverage per allele and per read set of each variant. This module also computes a rank per variant, indicating whether it exhibits discriminant allele frequencies in the datasets. The last module generates a .vcf of the predicted variants. If no reference genome is provided this step is a change of format from fasta to .vcf (VCFcreator module).
We propose two experiments simulating two human chromosomes haploid read sets obtained mutating (with real .vcf files) real reference chromosomes^{6}^{6}6ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz. The final goal of the experiments is to reconstruct the variations contained in the original (ground truth) .vcf files. We generated the mutated chromosomes using the 1000 genome project (phase ) .vcf files^{7}^{7}7ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ related to chromosomes and , suitably filtered to keep only SNPs of individuals HG00100 (ch.) and HG00096 (ch.). From these files, we simulated Illumina sequencing with SimSeq [SimSeq2011], both for reference and mutated chromosomes: individual HG00096 (ch.) at a x getting of -bp reads, and individual HG00100 (ch.) a x getting of -bp reads.
Our framework has been implemented in C++ and is available at https://github.com/nicolaprezza/eBWTclust. All tests were done on a DELL PowerEdge R630 machine, used in non exclusive mode. Our platform is a -core machine with Intel(R) Xeon(R) CPU E5-2620 v3 at GHz, with GB of shared memory. The system is Ubuntu 14.04.2 LTS. Note that, unlike DiscoSnp++, our tool is currently able to use one core only.
We experimentally observed that the pre-processing step is more computationally expensive than the actual SNP calling step. The problem of computing the eBWT is being intensively studied, and improving its efficiency is out of the aim of this paper. However, a recent work [Dolle2017] suggests that direct storing of read data with a compressed eBWT leads to considerable space savings, and could therefore become the standard in the future. Our strategy can be easily adapted to directly take as input these compressed formats. For both tools, we thus omit time/space requirements of the preprocessing steps: computing the data structures described in Section 4.1 for eBWTclust, and constructing the dBG and removing erroneous -mers for DiscoSnp++. Building the dBG requires a few minutes and, in order to keep the RAM usage very low, no other information other than -mer presence is stored in the dBG used by DiscoSnp++. On the other hand, the construction of eBWT, LCP and GSA can take a few hours (around 90 minutes by using Egsa [Louza2017]). So, overall DiscoSnp++ is faster than eBWTclust when considering both pre-processing and post-processing.
We run DiscoSnp++ with default parameters (includes -mers size ) except for (it searches up to SNPs per bubble) and ( forbids variants for which any of the two paths is branching; imposes no limitation on branching; is inbetween).
eBWTclust takes as input few main parameters, among which the most important are the lengths of right and left contexts surrounding SNPs in the output (-L and -R), the minimum cluster size (-m), and (-v) the maximum number of non-isolated SNPs to seek in the left contexts (like parameter of DiscoSnp++). We decided to output 20 nucleotides preceding (and including) the SNP (-L 20), 30 following the SNP (-L 30), a minimum cluster size of -m 6 and -m 4, and -v 3.
In Table 1, we show the number of TP, FP and FN as well as sensitivity (SEN), precision (PREC), and the number of non-isolated SNPs found by the tools. The outcome is that eBWTclust is always more precise and sensitive than DiscoSnp++. Moreover, while in our case precision is stable and always quite high (% with -m 6 for HG00096 and -m 4 for HG00100, and % with -m 6 HG00100), for DiscoSnp++ precision is much lower in general, and even drops with , especially with lower coverage, when inversely sensitivity grows. Sensitivity of DiscoSnp++ gets close to that of eBWTclust only in case , when its precision drops and memory and time get worse than ours. Note that precision and sensitivity of DiscoSnp++ are consistent with those reported in [discoSNP2017bioRxiv].
Individual HG00096 vs reference (chromosome 22, bp), coverage per sample | |||||||||
Tool | Param. | Wall | RAM | TP | FP | FN | SEN | PREC | Non-isolated |
Clock | in MB | SNP | |||||||
DiscoSnp++ | b=0 | 5:07 | 101 | 32773 | 3719 | 13274 | 71.17% | 89.81% | 4707/8658 |
DiscoSnp++ | b=1 | 16:39 | 124 | 37155 | 10599 | 8892 | 80.69% | 77.80% | 5770/8658 |
DiscoSnp++ | b=2 | 20:42 | 551 | 40177 | 58227 | 5870 | 87.25% | 40.83% | 6325/8658 |
eBWTclust | m=4 | 19:43 | 415 | 42973 | 2639 | 3074 | 93.32% | 94.21% | 7268/8658 |
eBWTclust | m=6 | 24:58 | 411 | 41972 | 630 | 4075 | 91.15% | 98.52% | 6940/8658 |
Individual HG00100 vs reference (chromosome 16, bp), coverage per sample | |||||||||
Tool | Param. | Wall | RAM | TP | FP | FN | SEN | PREC | Non-isolated |
Clock | in MB | SNP | |||||||
DiscoSnp++ | b=0 | 6:20 | 200 | 48119 | 10226 | 18001 | 72.78% | 82.47% | 6625/11055 |
DiscoSnp++ | b=1 | 31:57 | 208 | 53456 | 24696 | 12664 | 80.85% | 68.40% | 7637/11055 |
DiscoSnp++ | b=2 | 51:45 | 1256 | 57767 | 124429 | 8353 | 87.37% | 31.71% | 8307/11055 |
eBWTclust | m=4 | 41:12 | 423 | 61264 | 943 | 4856 | 92.66% | 98.48% | 9314/11055 |
eBWTclust | m=6 | 43:51 | 419 | 58085 | 391 | 8035 | 87.85% | 99.33% | 8637/11055 |
We introduced a positional clustering framework for the characterization of breakpoints in the eBWT, paving the way to several possible applications in assembly-free and reference-free analysis of NGS data. The experiments proved the feasibility and potential of our approach. Further work will focus on improving the prediction in highly repeated genomic regions and using our framework to predict SNPs, predict INDELs, haplotyping, correcting sequencing errors, detecting Alternative Splicing events in RNA-Seq data, and sequence assembly.