The colored longest common prefix array computed via sequential scans

07/19/2018
by F. Garofalo, et al.
University of Palermo (UniPa)
University of Pisa

Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics of the sequences in the dataset, and such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows us to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. Using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists of the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the effectiveness of our approach.


1 Introduction

The rapid increase in the availability of large sets of biological sequences observed in the last two decades, particularly triggered by the Human Genome Project, has posed several challenges in the analysis of such data. Traditional methods based on sequence alignment work well for small and closely related sequences, but scaling these approaches up to multiple divergent sequences, especially of large genomes and proteomes, is a difficult task. To keep pace with this growth, several algorithms that go beyond the concept of sequence alignment, called alignment-free, have been developed [Zielezinski2017]. Alignment-free approaches have been explored in several large-scale biological applications, ranging, for instance, from DNA sequence comparison [Chang1994, GogMS2010, VerzottoIrredundant, CoxJakobiRosoneST2012, MantaciRRS08] to whole-genome phylogeny construction [ACS2, VerzottoUA, Chor2012, kmacs2014, ALFRED2016] and the classification of protein sequences [VerzottoIrredundant]. Most of the above-mentioned alignment-free approaches require, each with its own specific method and appropriate data structures, the computation of statistics of the sequences in the analyzed collections. However, the increasing number of completely sequenced genomes has made the computation of many such statistics impractical in internal memory, creating the need for lightweight strategies for the comparative analysis of very large collections of long sequences.

In this paper, we propose a new conceptual data structure, the colored longest common prefix array (cLCP), that implicitly stores all the information necessary to compute statistics on distinguishing, repeating, or matching strings within a collection or across different collections. Loosely speaking, given a collection S, in which each string (or subset of strings) can be identified by a specific color, we can define cLCP as an integer array representing the longest common prefix between any specific suffix of a string and the nearest suffix of a specific string in the sorted list of suffixes of S. Here, we assume that S is partitioned into two subsets and consider the suffixes of strings belonging to different subsets, but we remark that one can consider any other setting, and note also that the definition can be easily adapted to more than two sets. We also show that cLCP can be computed via sequential scans and therefore has the characteristics of an appropriate structure for analyzing large collections of strings stored in external memory.

Such a data structure can be used in several applicative contexts. In this paper we explore the multi-string Average Common Substring (ACS) [ACS2] problem. More specifically, the ACS measure is a simple and effective alignment-free method for pairwise string comparison [Chor2012, ALFRED2016], based on the concept of matching statistics (MS) [Chang1994, Gusfield97, GogMS2010]. Given two strings, it can be defined via a pair of arrays, one per ordering of the two strings, that store, at each position i, the length of the longest substring starting at position i of the string given as first parameter that is also a substring of the string given as second parameter.
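To make the definition concrete, here is a deliberately naive quadratic sketch of matching statistics in Python (the function name and implementation are ours, for illustration only; practical approaches use suffix-based indexes, as discussed below):

```python
def matching_statistics(a, b):
    """ms[i] = length of the longest prefix of a[i:] that also
    occurs as a substring of b (naive illustration)."""
    ms = []
    for i in range(len(a)):
        length = 0
        # extend the match while a[i:i+length+1] still occurs in b
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        ms.append(length)
    return ms
```

For example, matching_statistics("abab", "bab") yields [2, 3, 2, 1].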

The ACS approach has been employed in several biological applications [ACS2, VerzottoUnic, kmacs2014, VerzottoUA, VerzottoBeyond]. Generalizations of measures based on longest matches with mismatches have been proposed in [ApostolicoGP14], also with distributed approaches [PetrilloGP17]. Similarly to [SunReinert2013], we define the multi-string ACS problem as the pairwise comparison of a single string against a set of m strings, considering all the strings together, in order to obtain m ACS induced distances at once. A major bottleneck in the computation (and application) of ACS and MS was initially the construction of a suffix tree. More recent approaches use efficient indexing structures [BelazzouguiCunial_SPIRE2014], the CDAWG [BelazzouguiCunial_SPIRE2017], backward search [GogMS2010], or enhanced suffix arrays [kmacs2014]. However, to our knowledge, the above-mentioned approaches would require a great effort, especially in terms of RAM space, if applied to very large collections of long strings.

In this paper we use cLCP to efficiently solve the above-mentioned multi-string ACS problem. Preliminary experimental results show that our algorithm is a competitive tool for the lightweight simultaneous computation of the pairwise distances of a string versus each string in a collection, suggesting that this data structure and its computational strategy can be used for more general versions of the multi-string ACS problem.

2 Preliminaries

Let Σ be a finite ordered alphabet, where < denotes the standard lexicographic order. We consider finite strings such as w = w[1]w[2]...w[n], where w[i] denotes the i-th character and n the length of w. A substring of w is written as w[i, j] = w[i]...w[j]; a substring w[1, j] is called a prefix, while a substring w[i, n] is referred to as a suffix. A range is delimited by a square bracket if the corresponding endpoint is included, whereas a parenthesis means that the endpoint is excluded.

The BWT [bwt94] is a well-known reversible string transformation widely used in data compression. The BWT can be extended to a collection of strings S. Such an extension, known as EBWT or multi-string BWT, is a reversible transformation that produces a string (denoted by ebwt(S)) that is a permutation of the characters of all strings in S [MantaciRRS07]. Lightweight implementations of the EBWT have been proposed [BauerCoxRosoneTCS2013, Louza2017, EgidiLouzaManziniTelles2018_arxiv]. We append to each string of S a distinct end-marker symbol, smaller than any other character (for implementation purposes, we could simply use a unique end-marker for all strings in S). The output string ebwt(S) is the concatenation of the symbols (circularly) preceding each suffix of the strings in S, sorted according to the lexicographic order. More in detail, if N denotes the length of ebwt(S), then ebwt(S)[r] = c if c circularly precedes the suffix of rank r in the lexicographic sorting of the suffixes of all strings in S. In this case we say that this suffix is associated with the position r in ebwt(S). We can associate to each string of S a color. The output string ebwt(S), enhanced with the array id of length N where id[r] = j if ebwt(S)[r] is a symbol of the string with color j, is called colored EBWT.
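To make the definition concrete, the following brute-force Python sketch builds an EBWT together with its id (color) array by explicit suffix sorting (our illustrative code under a shared-end-marker assumption, not the paper's lightweight construction; real tools such as those cited above avoid materializing the suffixes):

```python
def multi_string_bwt(strings, sep="$"):
    """Naive EBWT sketch: append an end-marker to each string, pair every
    suffix with the character (circularly) preceding it and with the color
    of its source string, then sort the suffixes lexicographically.
    Ties between equal suffixes are broken by string order (stable sort)."""
    entries = []
    for color, s in enumerate(strings):
        t = s + sep
        for i in range(len(t)):
            # t[i - 1] is the character circularly preceding suffix t[i:]
            entries.append((t[i:], t[i - 1], color))
    entries.sort(key=lambda e: e[0])  # stable: preserves string order on ties
    ebwt = "".join(c for _, c, _ in entries)
    ids = [color for _, _, color in entries]
    return ebwt, ids
```

For instance, multi_string_bwt(["ab", "b"]) returns ("bb$a$", [0, 1, 0, 0, 1]).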

The longest common prefix (LCP) array of the collection S [PuglisiTurpin2008, CGRS_JDA_2016, Louza2017] is the array lcp of length N, such that lcp[r], with 2 ≤ r ≤ N, is the length of the longest common prefix between the suffixes associated with positions r − 1 and r in ebwt(S), with lcp[1] = 0 set by default. We denote by LCP(r, s) the length of the longest common prefix between the suffixes associated with positions r and s, i.e. the minimum of lcp over the range (r, s]. An interval [i, j], with i < j, is a k-lcp interval if lcp[i] < k, lcp[t] ≥ k for all i < t ≤ j, lcp[j + 1] < k, and LCP(i, j) = k. The collection S will later be omitted from the notation if it is clear from the context.
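A matching brute-force sketch of the LCP array (our illustrative code, 0-indexed unlike the 1-indexed arrays in the text, again under the shared-end-marker assumption):

```python
def lcp_of(u, v):
    """Length of the longest common prefix of u and v."""
    length = 0
    while length < len(u) and length < len(v) and u[length] == v[length]:
        length += 1
    return length

def lcp_array(strings, sep="$"):
    """Naive LCP array of a collection: sort all suffixes (including the
    lone end-marker suffix of each string) and compare neighbours.
    By convention lcp[0] = 0."""
    suffixes = sorted(
        (s + sep)[i:] for s in strings for i in range(len(s) + 1)
    )
    return [0] + [
        lcp_of(suffixes[j - 1], suffixes[j]) for j in range(1, len(suffixes))
    ]
```

For the collection ["ab", "b"] the sorted suffixes are "$", "$", "ab$", "b$", "b$", giving lcp = [0, 1, 0, 0, 2].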

3 Colored Longest Common Prefix Array

In this section we present a novel data structure, the colored longest common prefix array (cLCP). Loosely speaking, the cLCP array represents the longest common prefix between a suffix that belongs to a string of the collection S and the nearest suffix belonging to another string of S, in the list of sorted suffixes of S. In this paper, for simplicity of description, we assume that S is partitioned into two subsets S1 and S2 and consider the suffixes of strings belonging to different subsets, but we remark that one can consider any other setting, and note also that the definition can be easily adapted to more than two sets.

For a position q and a string β, let prev(q, β) denote the greatest position p < q associated with a suffix of β, and next(q, β) the smallest position p > q associated with a suffix of β (if such a p exists, and null otherwise).

In order to give the notion of the cLCP array, we first define the Upper colored LCP array (UcLCP) and the Lower colored LCP array (LcLCP), as follows.

Definition 1

The upper (resp. lower) colored longest common prefix array UcLCP (resp. LcLCP) is an integer array where, for each position q, with 1 ≤ q ≤ N, and each string β, UcLCP[q][β] = LCP(prev(q, β), q) (resp. LcLCP[q][β] = LCP(q, next(q, β))). Both values are set equal to 0 when the corresponding position does not exist.

Definition 2

The colored longest common prefix array (cLCP) is an integer array where, for each position q, with 1 ≤ q ≤ N, and each string β, cLCP[q][β] = max(UcLCP[q][β], LcLCP[q][β]).

Table 1: Running example showing the arrays UcLCP, LcLCP, and cLCP for a collection partitioned into two subsets. In bold are all positions associated with suffixes of the single string α of the second subset (i.e. the limits of the α-intervals).

For simplicity, UcLCP, LcLCP, and cLCP are also defined when q is associated with a suffix of β itself; in this case the value coincides with the corresponding value in the usual lcp array. As mentioned before, note that the notions of UcLCP, LcLCP, and cLCP can also be given for a pair of disjoint collections of strings, by obtaining an array defined for the pairs (q, β) such that the suffix associated with q and the string β belong to different collections.

A given string α implicitly induces a partition of the positions of the sorted suffix list into open intervals delimited by consecutive suffixes of α (or by the first and last positions of lcp), called α-intervals. Let us consider a position q contained within an α-interval (p1, p2), so that id[q] ≠ α. Then, we can use a procedure similar to the one employed in [kmacs2014], such that

    UcLCP[q][α] = LCP(p1, q) = min{ lcp[t] : p1 < t ≤ q }      (1)
    LcLCP[q][α] = LCP(q, p2) = min{ lcp[t] : q < t ≤ p2 }      (2)

Additionally, we notice that there exists a relationship between the values of UcLCP related to the suffixes of the strings of S1 and the values of LcLCP related to the suffixes of α. Indeed, if q is the first position with id[q] = β inside the α-interval (p1, p2), then

    LcLCP[p1][β] = UcLCP[q][α]      (3)

Similarly, there exists a relationship between the values of UcLCP related to the suffixes of α and the values of LcLCP related to the suffixes of the strings of S1. In particular, if q is the last position with id[q] = β inside (p1, p2),

    UcLCP[p2][β] = LcLCP[q][α]      (4)

Table 1 shows the values of UcLCP, LcLCP, and cLCP of the running example, in which the collection S is partitioned into two subsets S1 and S2.

4 Lightweight computation of the cLCP

In this section we describe a lightweight strategy to compute the colored longest common prefix array cLCP. For the sake of simplicity, we consider the case in which the collection S is partitioned into two subsets S1 and S2, and S2 consists of a single string α. The general case can be treated analogously.

Definition 3

A colored k-lcp interval is a k-lcp interval [i, j] such that, among all the suffixes associated with the range [i, j], at least one suffix belongs to a string of S1 and at least one suffix belongs to a string of S2.

Definition 4

Let D denote an integer array of length N such that D[i] = k if a colored k-lcp interval starts at position i and, for every colored k'-lcp interval starting at position i, k' ≤ k; D[i] = 0 if no colored lcp interval starts at position i.

Table 1 highlights the conceptual blocks of suffixes that are associated with the positions i of D such that D[i] > 0.

Note that the array D can be easily computed by linearly scanning the arrays lcp and id and by using a stack that simulates the computation of the colored k-lcp intervals. During the sequential scan, each element is inserted into and removed from the stack at most once. Furthermore, considering that each suffix can take part in a bounded number of nested blocks, the stack space requirement is correspondingly bounded. We note that this upper bound on space is unlikely to be reached in practice, especially since the stack is emptied whenever two consecutive values of id corresponding to strings of different subsets are found. It is important to specify that the above-mentioned stack could be stored in external memory.
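For intuition, the definition of D can also be checked by brute force; the following quadratic Python sketch (ours, 0-indexed, with subset[r] giving the subset, 0 or 1, owning the suffix of rank r) enumerates the colored lcp intervals explicitly, whereas the algorithm above needs a single stack-based scan:

```python
def compute_D(lcp, subset):
    """Brute-force D: D[i] is the largest k such that a colored
    k-lcp interval starts at position i, and 0 if none does."""
    n = len(lcp)
    D = [0] * n
    for i in range(n):
        value = None
        for j in range(i + 1, n):
            # value of the candidate interval [i, j] = min of interior lcp
            value = lcp[j] if value is None else min(value, lcp[j])
            if value == 0:
                break  # the running minimum can never grow back
            left_maximal = (i == 0) or (lcp[i] < value)
            right_maximal = (j == n - 1) or (lcp[j + 1] < value)
            colored = len(set(subset[i:j + 1])) == 2  # both subsets present
            if left_maximal and right_maximal and colored:
                D[i] = max(D[i], value)
    return D
```

On lcp = [0, 2, 1, 0] with subsets [0, 1, 0, 1], the only colored intervals start at position 0, with maximum value 2, so D = [2, 0, 0, 0].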

In the following we describe a sequential strategy to construct the cLCP of the collection S by using ebwt(S), lcp, id, and D.

Without loss of generality, let us consider a generic string β ∈ S1 together with the single string α ∈ S2, and assume that the position q is associated with a suffix lying inside the α-interval (p1, p2). Moreover, for simplicity, let UcLCP_α (resp. LcLCP_α) denote the UcLCP (resp. LcLCP) of the strings of S1 versus α, i.e. the values UcLCP[q][α] (resp. LcLCP[q][α]) for all q with id[q] ≠ α; and let LcLCP_S1 (resp. UcLCP_S1) denote the LcLCP (resp. UcLCP) of α versus the strings of S1, i.e. the values LcLCP[q][β] (resp. UcLCP[q][β]) for all q such that id[q] = α.

UcLCP_α computation —

This is the easiest case, since Equation 1 allows us to directly compute UcLCP_α sequentially, in time linear in the total size of lcp. This enables us to scan lcp forward only once for all suffixes of all strings in S1, by keeping track of the minimum value found since the beginning of each conceptual α-interval (see Table 1). If we consider the α-interval (p1, p2), by employing a single variable we can iteratively compute the minimum value among consecutive elements of lcp and determine, for every p1 < q < p2, the LCP between the suffix associated with position q and the suffix associated with position p1: UcLCP[q][α] = min{ lcp[t] : p1 < t ≤ q }.

Example 1 (Running example)

If (p1, p2) is an α-interval of Table 1 and q is a position inside it, then UcLCP[q][α] is the minimum of the lcp values over the range (p1, q].
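The forward scan just described can be sketched as follows (our illustrative code, under the assumed shape of Equation 1: is_alpha marks the positions of suffixes of the base string α, and the output holds the running minimum of lcp since the last such position, with None before the first one):

```python
def forward_min_scan(lcp, is_alpha):
    """For each position q, the minimum of lcp over the range (p, q],
    where p is the nearest preceding position with is_alpha[p] True."""
    out = [None] * len(lcp)
    running = None
    for q in range(len(lcp)):
        if running is not None:
            running = min(running, lcp[q])
            out[q] = running
        if is_alpha[q]:
            running = float("inf")  # reset at every suffix of the base string
    return out
```

A single variable per scan suffices, which is what makes the computation sequential and memory-light.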

LcLCP_S1 computation —

Since LcLCP_S1 is strictly related to UcLCP_α by Equation 3, we would like to compute it sequentially and linearly as well. Suppose that we have just computed UcLCP[q][α], and that q is the first suffix of β encountered since the beginning of the α-interval (p1, p2). Then, by Equation 3, LcLCP[p1][β] = UcLCP[q][α]. To keep track of the first instance of every string β in the interval, we can resort to a bit-array of m elements.

Nevertheless, this is not sufficient to complete the construction of LcLCP_S1, because there might be no suffixes of a particular string β within (p1, p2), while other suffixes of β might exist at positions greater than p2. To tackle this issue, once we have thoroughly read lcp and filled LcLCP_S1 using the above procedure, we can propagate the computed values of LcLCP_S1 backward, from lower to higher lexicographically ranked suffixes of α, in order to complete LcLCP_S1. For example, to propagate the information from position p2 to position p1, we must compute:

    LcLCP[p1][β] = min( LcLCP[p2][β], LCP(p1, p2) )      (5)

Thus, iteratively, we can propagate the information backward, from the lowest ranked suffixes of α up to the top of LcLCP_S1.

Example 2 (Running example)

After the first scan of lcp in the example of Table 1, the value of LcLCP for the suffix of α in row 12 versus the string β would be 0, whereas by propagating the information back from the suffix of α in row 14 we obtain the correct value shown in Table 1.

LcLCP_α computation —

The most interesting part is computing LcLCP_α in such a way as to avoid the backward scan of id and lcp suggested by Equation 2 and, concomitantly, to reduce the memory footprint required for keeping both UcLCP_α and LcLCP_α to a somehow negligible one. Thus, we show how to sequentially determine, for every p1 < q < p2, the LCP between the suffix associated with position q and the suffix associated with position p2.

Let us consider the array D introduced in Definition 4. Intuitively, D provides interlacing forward information that can be exploited to compute LcLCP_α sequentially, as soon as we reach position q. Firstly, observe the following properties of the colored lcp intervals starting within an α-interval.

Remark 1

For any , with and , .

Lemma 1

For any , if then there exists an , with , such that if and only if .

Moreover, it follows that would be (the only) maximum in the range and its value is . Hence, we can determine .

Theorem 4.1

For any such that , if then , otherwise .

By using Theorem 4.1 we need to keep track of the maximum value (decreased by ) among consecutive D values since the beginning of each conceptual -interval (see column in Table 1). An immediate example of Theorem 4.1 is given in column of Table 1, which provides the final values of , where is computed using Equation 1 and through (or, shortly, ).

Example 3 (Running example)

Let (with and ) such that ; then, . Conversely, by considering (with and , as before), ; therefore, . Furthermore, we consider the third case of Theorem 4.1 such that, for (where and ), and thus .

UcLCP_S1 computation —

Similarly to LcLCP_S1, we can compute UcLCP_S1 by exploiting Equation 4 and the previously computed values within each α-interval (see Table 1). To complete the construction of UcLCP_S1, we then need to propagate the information forward, from higher to lower lexicographically ranked suffixes of α. For example, to propagate the information from position p1 to position p2, we must compute UcLCP[p2][β] = min( UcLCP[p1][β], LCP(p1, p2) ).

To reduce the memory footprint, we can use a single matrix (initialized with all 0s) to keep track of the maximum values between the corresponding positions of the upper and lower arrays, which can then be refined at most twice by propagation. Observe that UcLCP_α, alone, can be directly computed sequentially, eventually reducing the additional space to a negligible one, as seen before.

Example 4 (Running example)

After the first scan of lcp, the value of UcLCP for the suffix of α in row 22 versus the string β would be incomplete, whereas by propagating the information forward from the suffix of α at row 19 we obtain the correct value shown in Table 1.

Computational complexity —

The first phase of the algorithm consists of the computation of the array D via a linear scan. Notice that UcLCP_α and LcLCP_α can be determined sequentially (forward), requiring nothing but a few variables to update, keeping track respectively of the minimum among consecutive lcp values and of the maximum among consecutive D values since the last α-suffix encountered. Moreover, one can observe that also in the computation of LcLCP_S1 and UcLCP_S1 both lcp and D are actually accessed either sequentially forward or sequentially backward, up to one position before or after the currently processed one, allowing them to reside in external memory too. This means that only a small amount of additional space in RAM is needed. In order to optimally use the available RAM, assuming K is the number of rows of the matrix that we can accommodate in RAM, at any moment we can keep in memory and process only a single block of UcLCP_S1 and LcLCP_S1 of size proportional to K. Such a block, together with the bit-array required in the first part of the computation, yields the overall required space (with K a configurable parameter). Further, since the values can be refined at most twice by propagation, a global linear cost in time is deduced. Note that, instead, a straightforward approach that just uses Equations 1 and 2 would require processing in RAM at least three data structures, each of length N (without propagation). In order to evaluate the disk operations, we denote by B the disk block size and we assume that both the RAM size and B are measured in units of words. The overall complexity of the algorithm, including the disk operations, is synthesized in the following theorem. The number of disk operations is due to the fact that the files storing the values of ebwt(S), lcp, id, and D are accessed sequentially.

Theorem 4.2

Let S be a collection of strings. Given ebwt(S), lcp, and id, the array cLCP can be computed by sequential scans in time and additional space as discussed above; the total number of disk operations is dominated by the sequential scans of the files involved.

5 Multi-string ACS Computation by cLCP

The cLCP is a data structure that implicitly stores information useful to compute distinguishing and repeating strings in different collections. Its lightweight computation, described in the previous section, enables the use of cLCP in several contexts in which large collections of long strings are considered.

Here, we describe its use for computing the matching statistics MS and the average common substring (ACS) measure. The ACS induced distance is based on the concept of matching statistics [Chang1994, Gusfield97] and is typically computed in two steps. Let us first consider two strings A and B. In the first step, ACS asymmetrically computes the longest match lengths of A versus B, where A is the base string: MS is an integer array such that, for any position i of A, MS[i] is the length of the longest prefix of the suffix of A starting at position i that is also a substring of B (see Table 2). In the second step, ACS takes the average of these scores, normalizes it by the lengths of A and B, and finally makes the measure symmetrical, in order to obtain an induced distance. We observe that ACS is not a metric, because the triangle inequality might not hold in general. Nevertheless, if we assume that A and B are generated by finite-state Markovian probability distributions, it follows that ACS is a natural distance measure between these distributions [ACS2].
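The two steps can be put together in a short self-contained sketch (our Python; the exact normalization constants follow our reading of the ACS literature and should be treated as an assumption, and the naive inner loop assumes the two strings share at least one character):

```python
import math

def acs_distance(a, b):
    """ACS induced distance sketch: average longest-match lengths in both
    directions, normalized and symmetrized."""
    def avg_ms(x, y):
        # step 1: average of the matching statistics of x versus y (naive)
        total = 0
        for i in range(len(x)):
            length = 0
            while i + length < len(x) and x[i:i + length + 1] in y:
                length += 1
            total += length
        return total / len(x)

    def one_sided(x, y):
        # step 2: normalize; the subtracted term makes d(x, x) close to 0
        return math.log10(len(y)) / avg_ms(x, y) - 2 * math.log10(len(x)) / len(x)

    # symmetrize the two asymmetric scores
    return (one_sided(a, b) + one_sided(b, a)) / 2
```

By construction the result is symmetric in its two arguments, even though each one-sided score is not.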

Table 2: Matching statistics MS for two example strings, from which the corresponding ACS values and the induced distance are derived.

For simplicity we assume that we have a set S2 consisting of only one string α and a set S1 of m strings, and that we want to compute the pairwise ACS induced distances between α and every string in S1 simultaneously, as in the multi-string ACS problem. Our approach can also be applied to more general cases.

There is a clear correspondence between the cLCP of α versus all strings in S1 and the MS arrays used by ACS. More precisely:

Proposition 1

Given any two strings α ∈ S2 and β ∈ S1, the values of cLCP related to the suffixes of β (the base string) versus α form a permutation of the matching statistics of β versus α, where each value cLCP[q][α], with id[q] = β, is placed at the starting position in β of the suffix associated with q.

Indeed, for each suffix of every string β, associated with a position q, cLCP[q][α] accounts for the longest prefix of that suffix that is also a substring of α, and this must correspond to one of the closest suffixes belonging to α immediately above or below row q in the sorted suffix list.

We can thus exploit the above proposition to compute MS using cLCP, by means of the strategy described in the previous section. In fact, computing MS directly from Equations 1 and 2 would require to explicitly keep track of cLCP for each α-interval, which could be large in the worst case. In this section we show that this additional space can be controlled and reduced by using our lightweight computation of cLCP.

Using the construction described in Section 4, we can determine UcLCP_α and LcLCP_α sequentially (forward), and these values are definitive (they are not subject to refinement by propagation). Therefore, we can reduce the multi-string ACS memory footprint by summing up the maximum values between the respective positions of the two arrays, for every specific string β and for every position, and storing them into an array of size m as they are computed during the forward phase, without explicitly maintaining the cLCP values in internal or external memory. On the other hand, since LcLCP_S1 and UcLCP_S1 require propagation to be completed, we need to maintain (a portion of) the corresponding matrix and similarly accumulate the values, for every position and for every string β, into an array of size m as these values become definitive during the backward phase. Accordingly, the multi-string ACS computation does not add to the cLCP construction more than a small amount of space and time. Note that in a typical application, m can be assumed small and negligible compared to the internal memory available. For simplicity, we index the files as arrays, but the reader can note that we only access them sequentially. We need to keep in memory the lengths of the strings for the ACS scores.

6 Preliminary Experiments

As a proof of concept, we test our new data structure (cLCP) with a prototype tool named cLCP-mACS [cLcpMultiACStool], designed to solve the multi-string ACS problem. All tests were done on a DELL PowerEdge R630 machine with Intel(R) Xeon(R) CPU E5-2620 v3 processors and shared memory, used in non-exclusive mode. The operating system is Ubuntu 14.04.2 LTS.

Our preliminary experiments confirm the effectiveness of our approach for the multi-string ACS problem, which consists of the pairwise comparison of a single string against a set of strings simultaneously, in order to obtain the corresponding ACS induced distances. This focus is not a limitation: the computation of all pairwise distances between the strings of a collection can be treated analogously, in the sense that one can execute our tool multiple times without recomputing the data structures of the preprocessing step.

We experimentally observed that the preprocessing step is more computationally expensive than the step computing the ACS distances via cLCP. The problem of computing these preprocessing data structures is being intensively studied, and improving its efficiency is beyond the aim of this paper. Thus, we omit the time/space requirements of the preprocessing step, since (i) these data structures can be reused and (ii) different programs [BEETLTool, extLCPtool, egsaTool, eGAPTool] can be used to construct them. We solely focus on the phase that computes the distance matrix.

To assess the performance of our algorithm we consider the two collections of genomes listed in [cLcpMultiACStool] and described in Table 3.

Dataset  Size (Gbytes)  Min length  Max length  Max lcp    Program    Wall clock (mm:ss)  Memory (kbytes)
1        3.434          1,080,084   10,657,107  1,711,194  cLCP-mACS  27:42               6,160
                                                           kmacs      18:50               3,723,728
2        9.258          744         14,782,125  5,714,157  cLCP-mACS  53:03               6,372
                                                           kmacs      58:08               14,717,332
Table 3: The two collections of genomes used in our experiments. In both cases the reported maximum lcp values are greater than the average length of the strings in the respective collection. Wall clock is the amount of time elapsed from the start to the completion of the instance; the memory column is the peak Resident Set Size (RSS). Both values were taken with the /usr/bin/time command.

Some auxiliary disk space is required for each collection. Note, however, that the D file is sparse, in the sense that it contains many zero values. One can reduce its size by storing only the non-zero values or by using a different encoding of these values, for instance Sadakane's encoding [Sadakane:2007]. However, this could slow down the computation.

Notice that an entirely like-for-like comparison between our implementation and existing implementations is not possible since, to the best of our knowledge, our tool is the first lightweight tool for this problem.

The ACS computation for two strings was addressed in [ACS2] and has been implemented in the kmacs (k-mismatch average common substring) tool [kmacsTool]. For k = 0, kmacs computes the ACS exactly. Other algorithms besides kmacs [ALFRED2016, Pizzi16] have been designed to compute alignment-free distances based on longest matches with mismatches, but for the special case k = 0, kmacs is the software with the best chance to scale with the dataset size. The implementation of kmacs works in internal memory and not in a sequential way. More in detail, it works in steps: at each step it builds the suffix array [Manber:1990] and the lcp array of two strings in order to compute the ACS distance between them. In our experiments we have fixed one string and compared it with every other string of the collection. Note that the time performance of kmacs could be improved by separately accounting for the computation of the auxiliary data structures; however, the RAM usage would remain almost the same.

The experimental results in Table 3 show that our algorithm is a competitive tool for the lightweight simultaneous computation of a pairwise distance of a string versus each string in a collection, in order to obtain an efficient solution for the multi-string ACS problem.

7 Conclusions and Further works

We have introduced the colored longest common prefix array (cLCP) that, given a collection of strings S, stores the length of the longest common prefix between the suffix of any string in S and the nearest suffix of another string of S, according to the lexicographically sorted list of suffixes. This notion extends in a natural way to two collections of strings, in order to consider the longest common prefixes between pairs of strings in different collections. We have also provided a lightweight method that computes cLCP via sequential scans when one of the subsets consists of a single string, but the strategy can be applied to the general case. This fact makes the data structure appropriate for computing several types of statistics on large collections of long strings. In particular, we have proved that cLCP yields a permutation of the matching statistics (MS) for the strings of the collection. Based on this result, we have used cLCP to efficiently solve the multi-string ACS problem, i.e. computing pairwise MS between a single string and all strings of a collection simultaneously, which is nowadays crucial in many practical applications but demanding for large string comparisons. This is also supported by our experimental results.

It is interesting to note that the data structure proposed in this paper, and its sequential strategy of computation, are intrinsically dynamic, i.e. cLCP can be efficiently updated when the collection is modified by inserting or removing a string. In particular, after the removal of a string, cLCP can be updated by exploiting the mathematical properties of the permutation associated with the EBWT. The insertion of a new string in the collection can be managed by using the merging strategy proposed in [EgidiLouzaManziniTelles2018_arxiv], which works in semi-external memory; in this case the array D can be constructed during the merging phase. Finally, we plan to extend our framework to solve the many-to-many pairwise ACS problem on a collection of sequences, or between all strings of one collection versus all strings of another collection at the same time.

References