Constructing Antidictionaries in Output-Sensitive Space

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1,y_2,...,y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y_1#...#y_k} of minimal absent words of length at most ℓ of the word y=y_1#y_2#...#y_k, #∉Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the many available O(n)-time algorithms, because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y_1#...#y_N}||=o(n), for all N∈[1,k]. For instance, in the human genome, n ≈ 3× 10^9 but ||M^12_{y_1#...#y_k}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ_{y_1},...,M^ℓ_{y_1#...#y_k} can be computed in O(kn+∑^k_{N=1}||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in {y_1,...,y_k} and MaxOut=max{||M^ℓ_{y_1#...#y_N}||:N∈[1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.



1 Introduction

A word x is an absent word of the word y if it does not occur in y. The absent word x of y is called minimal if and only if all its proper factors occur in y. The set of all minimal absent words for a word y is denoted by M_y. The set of all minimal absent words of length at most ℓ of a word y is denoted by M^ℓ_y. For example, if y = abaab, then M_y = {aaa, aaba, bab, bb} and M^3_y = {aaa, bab, bb}. The upper bound on the number of minimal absent words is O(σn) [10], where σ is the size of the alphabet and n is the length of y, and this is tight for integer alphabets [6]; in fact, for large alphabets, this bound is tight even for minimal absent words having the same length [1].
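The definition can be checked mechanically on small inputs. The following brute-force C++ sketch (ours, for validating examples only; it is exponential in ℓ and unrelated to the efficient algorithms discussed below) enumerates M^ℓ_y by extending occurring words letter by letter, using the fact that all proper factors of a word occur in y exactly when its longest proper prefix and its longest proper suffix do:

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Brute-force enumeration of M^ell_y, directly from the definition.
// Exponential in ell in general; for checking small examples only.
std::set<std::string> maws(const std::string& y, std::size_t ell,
                           const std::string& sigma) {
    std::set<std::string> factors;              // all factors of y up to length ell
    for (std::size_t i = 0; i < y.size(); ++i)
        for (std::size_t l = 1; l <= ell && i + l <= y.size(); ++l)
            factors.insert(y.substr(i, l));
    std::set<std::string> result;
    std::vector<std::string> frontier{""};      // occurring words of current length
    for (std::size_t len = 1; len <= ell; ++len) {
        std::vector<std::string> next;
        for (const std::string& w : frontier)
            for (char c : sigma) {
                std::string x = w + c;          // longest proper prefix w occurs
                if (len > 1 && !factors.count(x.substr(1)))
                    continue;                   // longest proper suffix is absent
                if (factors.count(x)) next.push_back(x);  // x occurs: extend later
                else result.insert(x);          // x is absent and minimal: a MAW
            }
        frontier.swap(next);
    }
    return result;
}

int main() {
    for (const std::string& m : maws("abaab", 4, "ab"))
        std::cout << m << '\n';                 // prints: aaa aaba bab bb
}

Running it on y = abaab confirms the example above.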

State-of-the-art algorithms compute all minimal absent words of y in O(σn) time [10, 2] or in O(n) time for integer alphabets [16, 7]. There also exist space-efficient data structures based on the Burrows-Wheeler transform of y that can be applied for this computation [5, 4]. In many real-world applications of minimal absent words, such as in data compression [11, 12, 14, 22], in sequence comparison [6, 7], in on-line pattern matching [9], or in identifying pathogen-specific signatures [23], only a subset of minimal absent words may be considered, and, in particular, the minimal absent words of length at most ℓ. Since, in the worst case, the number of minimal absent words of y is Θ(σn), Ω(σn) space is required to represent them explicitly. In [7], the authors presented an O(n)-sized data structure for outputting minimal absent words of a specific length in optimal time for integer alphabets.

The problem with existing algorithms for computing minimal absent words is that they make use of Ω(n) space; and the same amount is required even if one is merely interested in the minimal absent words of length at most ℓ. This is because all of these algorithms construct global data structures, such as the suffix array [2]. In theory, this problem can be addressed by using the external memory algorithm for computing minimal absent words presented in [19]. The I/O-optimal version of this algorithm, however, requires a lot of external memory to build the global data structures for the input [20]. One could also use the algorithm of [15] that computes M^ℓ_y in output-sensitive time using space related to the size of the LZ77 factorisation of y. This algorithm also requires constructing the truncated DAWG, a type of global data structure which could take Θ(n) space. Thus, in this paper, we investigate whether M^ℓ_y can be computed efficiently in output-sensitive space. As y can be "decomposed" into a collection of words (with a suitable overlap of length ℓ-1 so as not to lose information), we consider the following, general, computational problem.

Problem

Given k words y_1, y_2, ..., y_k over an alphabet Σ and an integer ℓ > 0, compute the set M^ℓ_{y_1#...#y_k} of minimal absent words of length at most ℓ of y = y_1#y_2#...#y_k, # ∉ Σ.

In data compression, this scenario corresponds to computing the antidictionary of k documents [11, 12]. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. As discussed above, this computation generally requires Ω(n) space for n = |y|. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y_1#...#y_N}|| = o(n), for all N ∈ [1, k]. In the human genome, n ≈ 3 × 10^9 but ||M^12_{y_1#...#y_k}|| ≈ 10^6, where k is the total number of chromosomes.

Our Results

Antidictionary-based compressors work on Σ = {0, 1}, and in bioinformatics we typically have Σ = {A, C, G, T}; we thus consider a constant-sized alphabet for stating our results. We show that all M^ℓ_{y_1}, M^ℓ_{y_1#y_2}, ..., M^ℓ_{y_1#...#y_k} can be computed in O(kn + ∑^k_{N=1} ||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y_1, ..., y_k} and MaxOut = max{||M^ℓ_{y_1#...#y_N}|| : N ∈ [1, k]}. Proof-of-concept experimental results are provided confirming our theoretical findings and justifying our contribution.

2 Preliminaries

We generally follow [8]. An alphabet Σ is a finite ordered non-empty set of elements called letters. A word is a sequence of elements of Σ. The set of all words over Σ of length at most ℓ is denoted by Σ^{≤ℓ}, and the cumulative length of the words in a set S is denoted by ||S||. We fix a constant-sized alphabet Σ, i.e., |Σ| = O(1). Given a word y = uxv over Σ, we say that u is a prefix of y, x is a factor (or subword) of y, and v is a suffix of y. We also say that y is a superword of x. A factor x of y is called proper if x ≠ y.

Given a word y over Σ, the set of minimal absent words (MAWs) of y is defined as

M_y = {x ∈ Σ* : x is absent from y and all proper factors of x occur in y},

and M^ℓ_y = M_y ∩ Σ^{≤ℓ}. For instance, over Σ = {a, b}, for y = abaab we have M^4_y = {aaa, aaba, bab, bb}. MAWs of length 1 for y can be found in O(|y| + |Σ|) time using O(|Σ|) working space, and so, in what follows, we focus on the computation of MAWs of length at least 2.
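For completeness, the length-1 case just mentioned admits a one-scan sketch (the function name is ours): the MAWs of length 1 are exactly the letters of Σ that do not occur in y, found with a presence table.

#include <array>
#include <string>
#include <vector>

// Letters of sigma that do not occur in y: the MAWs of length 1.
// O(|y| + |sigma|) time, O(|sigma|) working space.
std::vector<char> length_one_maws(const std::string& y, const std::string& sigma) {
    std::array<bool, 256> seen{};               // presence table over byte values
    for (unsigned char c : y) seen[c] = true;   // one scan of y
    std::vector<char> out;
    for (char c : sigma)
        if (!seen[static_cast<unsigned char>(c)]) out.push_back(c);
    return out;
}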

The suffix tree T(y) of a non-empty word y of length n is the compact trie representing all suffixes of y [8]. The branching nodes of the trie as well as the terminal nodes, which correspond to non-empty suffixes of y, become explicit nodes of the suffix tree, while the other nodes are implicit. We let L(v) denote the path-label of node v; i.e., the concatenation of the edge labels along the path from the root node to v. We say that node v is path-labeled L(v). Additionally, D(v) = |L(v)| is used to denote the word-depth of node v. A node v such that the path-label L(v) = y[i..n-1], for some 0 ≤ i < n, is terminal and is also labeled with index i. Each factor of y is uniquely represented by either an explicit or an implicit node of T(y) called its locus. The suffix-link of a node v with path-label L(v) = αw is a pointer to the node path-labeled w, where α ∈ Σ is a single letter and w is a word. The suffix-link of v exists by construction if v is a non-root branching node of T(y). The matching statistics of a word x with respect to a word y is an array MS_x[0..|x|-1], where MS_x[i] is a pair (f_i, p_i) such that (i) x[i..i+f_i-1] is the longest prefix of x[i..|x|-1] that is a factor of y; and (ii) y[p_i..p_i+f_i-1] = x[i..i+f_i-1] [18]. T(y) is constructible in O(n) time, and, given T(y), we can compute MS_x in O(|x|) time [18].
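The algorithm of [18] computes MS_x with the suffix tree of y; as a self-contained illustration of what the array stores, the following sketch (ours) computes the lengths f_i with a suffix automaton instead, a standard alternative technique (the occurrence positions p_i are omitted). It uses the observation that the longest prefix of x[i..|x|-1] that is a factor of y corresponds, after reversing both words, to the longest match ending at the mirrored position:

#include <map>
#include <string>
#include <vector>

// Suffix automaton of a word: recognises exactly the factors of the word.
struct Sam {
    struct State { int len = 0, link = -1; std::map<char, int> next; };
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) : st(1) { for (char c : s) extend(c); }
    void extend(char c) {
        int cur = static_cast<int>(st.size());
        st.push_back({});
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) { st[p].next[c] = cur; p = st[p].link; }
        if (p == -1) st[cur].link = 0;
        else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) st[cur].link = q;
            else {
                State cl = st[q];               // clone q with reduced length
                cl.len = st[p].len + 1;
                int cid = static_cast<int>(st.size());
                st.push_back(cl);
                while (p != -1 && st[p].next[c] == q) { st[p].next[c] = cid; p = st[p].link; }
                st[q].link = st[cur].link = cid;
            }
        }
        last = cur;
    }
};

// f[i] = length of the longest prefix of x[i..] that is a factor of y.
std::vector<int> matching_statistics(const std::string& x, const std::string& y) {
    std::string rx(x.rbegin(), x.rend()), ry(y.rbegin(), y.rend());
    Sam a(ry);
    std::vector<int> f(x.size());
    int v = 0, l = 0;                           // current state and match length
    for (std::size_t i = 0; i < rx.size(); ++i) {
        while (v != 0 && !a.st[v].next.count(rx[i])) { v = a.st[v].link; l = a.st[v].len; }
        auto it = a.st[v].next.find(rx[i]);
        if (it != a.st[v].next.end()) { v = it->second; ++l; }
        else l = 0;                             // no match: stay at the root
        f[x.size() - 1 - i] = l;                // mirror back to a starting position
    }
    return f;
}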

3 Combinatorial Properties

For convenience, we consider the following setting. Let x and y be words over the alphabet Σ and let w = x#y, with # ∉ Σ. Let ℓ be a positive integer. We want to construct M^ℓ_{x#y} from M^ℓ_x and M^ℓ_y. Let u ∈ M^ℓ_{x#y}. We have two cases:

Case 1: u ∈ M^ℓ_x ∪ M^ℓ_y;

Case 2: u ∉ M^ℓ_x ∪ M^ℓ_y.

The following auxiliary fact follows directly from the minimality property.

Fact 1.

A word u is absent from a word y if and only if u is a superword of a MAW of y.

For Case 1, we prove the following lemma.

Lemma 1 (Case 1).

A word u ∈ M^ℓ_x (resp. u ∈ M^ℓ_y) belongs to M^ℓ_{x#y} if and only if u is a superword of a word in M^ℓ_y (resp. in M^ℓ_x).

Proof.

Let u ∈ M^ℓ_x (the case u ∈ M^ℓ_y is symmetric). Suppose first that u is a superword of a word in M^ℓ_y, that is, there exists v ∈ M^ℓ_y such that v is a factor of u. If v = u, then u ∈ M^ℓ_y and therefore, using the definition of MAW, u ∈ M^ℓ_{x#y}. If v is a proper factor of u, then u is an absent word of x#y (it is absent from x and, being a superword of v, also from y) and again, by definition of MAW, u ∈ M^ℓ_{x#y}.

Suppose now that u is not a superword of any word in M^ℓ_y; note that, since |u| ≤ ℓ, any MAW of y that is a factor of u would belong to M^ℓ_y. Then u is not absent from y by Fact 1, and hence it occurs in x#y; thus u cannot belong to M^ℓ_{x#y}. ∎

It should be clear that the statement of Lemma 1 implies, in particular, that all words in M^ℓ_x ∩ M^ℓ_y belong to M^ℓ_{x#y}. Furthermore, Lemma 1 motivates us to introduce the reduced set R_x of MAWs of x with respect to y as the set obtained from M^ℓ_x after removing those words that are superwords of words in M^ℓ_y; by Fact 1, R_x consists exactly of the words of M^ℓ_x that occur in y. The set R_y is defined analogously.

Example 1.

Let ℓ = 4, x = abaab and y = bbaab. We have M^4_x = {aaa, aaba, bab, bb} and M^4_y = {aaa, aba, abb, bab, bbb}. The words aaa and bab are contained in M^4_x ∩ M^4_y, so they belong to M^4_{x#y}. The word aaba ∈ M^4_x is a superword of aba ∈ M^4_y, hence aaba ∈ M^4_{x#y}. On the other hand, the words abb and bbb of M^4_y are superwords of bb ∈ M^4_x, hence they also belong to M^4_{x#y}. The remaining MAWs, bb and aba, are not superwords of MAWs of the other word. The reduced sets are therefore R_x = {bb} and R_y = {aba}. In conclusion, we have for Case 1 that the words aaa, aaba, abb, bab and bbb belong to M^4_{x#y}. ∎
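In code, the reduced sets are immediate from the equivalence just noted. The following brute-force sketch (ours), combined with the maws helper from the Introduction, reproduces R_x = {bb} and R_y = {aba} of Example 1:

#include <set>
#include <string>

// R_x: the words of M^ell_x that occur in y, i.e., those that are not
// superwords of any word of M^ell_y (Fact 1). Brute force.
std::set<std::string> reduced(const std::set<std::string>& mx, const std::string& y) {
    std::set<std::string> r;
    for (const std::string& u : mx)
        if (y.find(u) != std::string::npos) r.insert(u);  // u occurs in y
    return r;
}

// Example 1: reduced(maws("abaab", 4, "ab"), "bbaab") yields {bb};
//            reduced(maws("bbaab", 4, "ab"), "abaab") yields {aba}.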

We now investigate the set M^ℓ_{x#y} \ (M^ℓ_x ∪ M^ℓ_y) (Case 2).

Fact 2.

Let u = aνb, with a, b ∈ Σ and ν ∈ Σ*, be such that u ∈ M^ℓ_{x#y} and u ∉ M^ℓ_x ∪ M^ℓ_y. Then aν occurs in x but not in y and νb occurs in y but not in x, or vice versa.

The rationale for generating the reduced sets should become clear with the next lemma.

Lemma 2 (Case 2).

Let u ∈ M^ℓ_{x#y} \ (M^ℓ_x ∪ M^ℓ_y). Then u has a prefix in R_y and a suffix in R_x, or a prefix in R_x and a suffix in R_y.

Proof.

Let u = aνb, a, b ∈ Σ, be a word in M^ℓ_{x#y} \ (M^ℓ_x ∪ M^ℓ_y). By Fact 2, aν occurs in x but not in y and νb occurs in y but not in x, or vice versa. Let us assume the first case holds (the other case is symmetric). Since aν does not occur in y, there is a MAW m_y of y that is a factor of aν. Since νb occurs in y, m_y is not a factor of νb. Consequently, m_y is a prefix of aν; and since aν occurs in x, so does m_y, whence m_y ∈ R_y.

Analogously, there is a MAW m_x of x that is a suffix of νb; it occurs in y, whence m_x ∈ R_x. Furthermore, m_x and m_y cannot be factors one of another. Inspect Figure 1 in this regard. ∎

Figure 1: aν occurs in x but not in y; νb occurs in y but not in x; therefore aνb does not occur in x#y. By construction, aν and νb occur in x#y; therefore aνb is a Case 2 MAW.
Example 2.

Let ℓ = 5, x = abaab and y = bbaaab. Consider u = abaaa ∈ M^5_{x#y} (a Case 2 MAW). We have that abaa occurs in x but not in y and baaa occurs in y but not in x. Since abaa does not occur in y, there is a MAW m_y of y that is a factor of abaa. Since baaa occurs in y, m_y is not a factor of baaa. So m_y is a prefix of abaa, and this is aba. Analogously, there is a MAW m_x of x that is a suffix of baaa, and this is aaa. ∎
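The example can be replayed with the brute-force maws sketch from the Introduction (assumed to be in scope here):

#include <iostream>
#include <set>
#include <string>

int main() {
    std::string x = "abaab", y = "bbaaab";
    std::set<std::string> mw = maws(x + "#" + y, 5, "ab");
    bool case2 = mw.count("abaaa") > 0                   // MAW of x#y ...
              && maws(x, 5, "ab").count("abaaa") == 0    // ... but not of x
              && maws(y, 5, "ab").count("abaaa") == 0;   // ... nor of y
    std::cout << (case2 ? "abaaa is a Case 2 MAW\n" : "check failed\n");
}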

As a consequence of Lemma 2, in order to construct the set M^ℓ_{x#y} \ (M^ℓ_x ∪ M^ℓ_y), we should consider all pairs (m, m′) with m in R_x and m′ in R_y, or vice versa. In order to construct the final set M^ℓ_{y_1#...#y_k}, we use Lemmas 1 and 2 incrementally. We summarise the whole approach in the following general theorem, which forms the theoretical basis of our technique.
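Before the formal statement, the following brute-force sketch (ours) spells out the filter that Lemma 2 induces on a Case 2 candidate u: keep u only if it has a prefix in one reduced set and a suffix in the other. The algorithm of Section 4 realises this test with position pointers over suffix trees rather than with explicit strings:

#include <set>
#include <string>

// Keep a candidate u (a MAW of a concatenation, outside both MAW sets)
// only if it has a prefix in one reduced set and a suffix in the other.
bool case2_keep(const std::string& u,
                const std::set<std::string>& rx,
                const std::set<std::string>& ry) {
    auto has_prefix = [&](const std::set<std::string>& r) {
        for (const std::string& m : r)
            if (u.compare(0, m.size(), m) == 0) return true;
        return false;
    };
    auto has_suffix = [&](const std::set<std::string>& r) {
        for (const std::string& m : r)
            if (m.size() <= u.size() &&
                u.compare(u.size() - m.size(), m.size(), m) == 0) return true;
        return false;
    };
    return (has_prefix(rx) && has_suffix(ry)) || (has_prefix(ry) && has_suffix(rx));
}

// Example 2 above: case2_keep("abaaa", {"aaa"}, {"aba"}) is true.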

Theorem 1.

Let x, y ∈ Σ*, and let u ∈ M^ℓ_{x#y}. Then, either u ∈ M^ℓ_x ∪ M^ℓ_y (Case 1 MAWs) or, otherwise, u = aνb for some a, b ∈ Σ and ν ∈ Σ*. Moreover, in this latter case, u has a prefix in R_y and a suffix in R_x, or the converse, i.e., u has a prefix in R_x and a suffix in R_y (Case 2 MAWs).

Proof.

Let u ∈ M^ℓ_{x#y} and suppose that u ∉ M^ℓ_x ∪ M^ℓ_y. Then u ∉ M^ℓ_x and u ∉ M^ℓ_y.

Let u = aνb be a word of length |u| ≥ 2. By the definition of MAW, aν and νb must both be factors of x#y. However, both cannot be factors of x and both cannot be factors of y; otherwise u would belong to M^ℓ_x or to M^ℓ_y. Therefore, we have one of the two cases:

Case 1: aν is a factor of x but not of y, and νb is a factor of y but not of x.

Case 2: aν is a factor of y but not of x, and νb is a factor of x but not of y.

These two cases are symmetric, thus only the proof of Case 1 will be presented here. If a does not occur in y then a ∈ R_y: the letter a is a MAW of y, and it occurs in x. Otherwise, let p be the longest prefix of u that is a factor of y.

Because aν is not a factor of y, we have |p| < |aν|, and then the prefix m_y of u of length |p|+1 is a factor of aν. Therefore, m_y occurs in x. In addition, all proper factors of m_y occur in y: its longest proper prefix p occurs in y by maximality of p, and its longest proper suffix is a factor of νb, which occurs in y. Hence m_y ∈ M^ℓ_y, and since m_y occurs in x, we have m_y ∈ R_y.

Now, νb does not occur in x, so either b does not occur in x, which means that b ∈ R_x, or we let s be the longest suffix of u that occurs in x.

Because |s| < |νb|, the suffix m_x of u of length |s|+1 is a factor of νb, and hence m_x occurs in y. Since all proper factors of m_x occur in x, we have m_x ∈ M^ℓ_x, and thus m_x ∈ R_x. ∎
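The constructive step used in the proofs above is easy to isolate: when aν occurs in x but not in y while νb occurs in y, the shortest prefix of u = aνb that is absent from y is a MAW of y. A brute-force sketch (ours; the actual algorithm never materialises u explicitly):

#include <string>

// Shortest prefix of u that is absent from y. Under the assumptions above
// it is a MAW of y: its longest proper prefix occurs in y by minimality,
// and its longest proper suffix is a factor of vb, which occurs in y.
std::string prefix_maw(const std::string& u, const std::string& y) {
    for (std::size_t len = 1; len <= u.size(); ++len)
        if (y.find(u.substr(0, len)) == std::string::npos)
            return u.substr(0, len);
    return "";                                  // u itself occurs in y
}

// Example 2 above: prefix_maw("abaaa", "bbaaab") returns "aba".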

4 Algorithm

Let us first introduce an algorithmic tool. In the weighted ancestor problem, introduced in [13], we consider a rooted tree T with an integer weight function μ defined on the nodes. We require that the weight of the root is zero and the weight of any other node is strictly larger than the weight of its parent. A weighted ancestor query, given a node v and an integer value k ≤ μ(v), asks for the highest ancestor u of v such that μ(u) ≥ k, i.e., such an ancestor u that μ(u) ≥ k and μ(u) is the smallest possible. When T is the suffix tree of a word y of length n, we can locate the locus of any factor y[i..j] of y using a weighted ancestor query. We define the weight of a node of the suffix tree as the length of the word it represents. Thus a weighted ancestor query can be used for the terminal node decorated with index i to create (if necessary) and mark the node that corresponds to y[i..j].
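Theorem 2 below answers such queries off-line in linear total time; for intuition, a simple on-line alternative based on binary lifting answers each query in O(log n) time. A minimal sketch (ours), assuming, as required above, that weights strictly increase away from the root:

#include <vector>

struct WeightedAncestor {
    std::vector<std::vector<int>> up;   // up[j][v]: the 2^j-th ancestor of v
    std::vector<long long> w;           // node weights, increasing on root-to-node paths
    WeightedAncestor(const std::vector<int>& parent, const std::vector<long long>& weight)
        : w(weight) {
        int n = static_cast<int>(parent.size()), lg = 1;
        while ((1 << lg) < n) ++lg;
        up.assign(lg, std::vector<int>(n));
        up[0] = parent;                 // convention: the parent of the root is the root
        for (int j = 1; j < lg; ++j)
            for (int v = 0; v < n; ++v)
                up[j][v] = up[j - 1][up[j - 1][v]];
    }
    // Highest ancestor u of v with w[u] >= k (assumes 0 < k <= w[v]).
    int query(int v, long long k) const {
        for (int j = static_cast<int>(up.size()) - 1; j >= 0; --j)
            if (w[up[j][v]] >= k) v = up[j][v];  // jump while the weight stays >= k
        return v;
    }
};
// For a suffix tree, take w[v] = |L(v)|, the word-depth of node v.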

Theorem 2 ([3]).

Given a collection Q of weighted ancestor queries on a weighted tree T of n nodes with integer weights up to n^{O(1)}, all the queries in Q can be answered off-line in O(n + |Q|) time.

4.1 The Algorithm

At the Nth step, we have in memory the set M^ℓ_{y_1#...#y_{N-1}}. Our algorithm works as follows (a set-level sketch of the whole loop appears right after this list):

  1. We read word y_N from the disk and compute M^ℓ_{y_N} in O(|y_N|) time. We output the words in the following constant-space form: O(1) machine words per word [2]; namely, a tuple ⟨i, j, α⟩ such that the corresponding MAW is y_N[i..j]α.

  2. Here we compute Case 1 MAWs. We apply Lemma 1 to construct the Case 1 MAWs of M^ℓ_{y_1#...#y_N} and the reduced sets R_x and R_{y_N}, where x = y_1#...#y_{N-1}, as follows.

    (a) We first want to find the elements of M^ℓ_x that are superwords of some word v ∈ M^ℓ_{y_N}. We build the generalised suffix tree T of y_N and of the words of M^ℓ_x [18]. We find the locus in T of the longest proper prefix y_N[i..j] of each element v of M^ℓ_{y_N} via answering off-line a batch of weighted ancestor queries on the terminal nodes decorated with the indices i (Theorem 2). From there on, we spell the last letter of v and mark the corresponding node on T, if any. After processing all v in the same manner, we traverse T to gather all occurrences (starting positions) of the words v inside the elements of M^ℓ_x, thus finding the elements of M^ℓ_x that are superwords of some v. By definition, no MAW v is a prefix of another MAW v′, thus the marked nodes form pairwise disjoint subtrees, and the whole process takes time linear in the size of T.

    (b) We next want to check whether the words of M^ℓ_{y_N} are superwords of some element of M^ℓ_x. We first sort all tuples ⟨i, j, α⟩ using radixsort, and we then check this by running the matching statistics algorithm for y_N with respect to the words of M^ℓ_x, considering the tuples in ascending order (from left to right) at the same time. By definition, no element of M^ℓ_x is a factor of another element of the same set. Thus if a factor of y_N corresponds to an element of M^ℓ_x, this is easily located while running the matching statistics algorithm. The whole process takes O(MaxIn + MaxOut) time: O(MaxOut) time to construct the suffix tree of the words of M^ℓ_x and a further O(MaxIn + MaxOut) time for the matching statistics algorithm and for processing the tuples.

    We create the set R_x explicitly, since it is a subset of M^ℓ_x, which is already stored explicitly. We create the set R_{y_N} implicitly: every element is stored as a tuple ⟨i, j, α⟩ such that the element is y_N[i..j]α. We store every element of M^ℓ_{y_N} with the same representation. All other elements of M^ℓ_{y_1#...#y_N} are stored explicitly.

  3. Construct the suffix tree T(y_N) of y_N and use it to locate all occurrences of the words of R_x in y_N, and store these occurrences as pairs (starting position, ending position). This step can be done in time linear in |y_N| + ||R_x|| plus the number of occurrences reported. By definition, no element of R_x is a prefix of another element of R_x, and thus this can be done within the claimed time complexity.

  4. For every i ∈ [1, N-1], we perform the following to compute Case 2 MAWs:

    (a) Read word y_i from the disk. Construct the suffix tree T(y_i) of word y_i in O(|y_i|) time. Use T(y_i) to locate all occurrences of the elements of R_{y_N} in y_i, and store these occurrences as pairs (starting position, ending position). This step can be done similarly to Step 3. By definition, no element of R_{y_N} is a prefix of another element of R_{y_N}, and thus this can be done within the claimed time complexity.

    (b) During a bottom-up traversal of T(y_i), mark, at each explicit node of T(y_i), the smallest starting position of an occurrence of the subword represented by that node, and the largest starting position of the same subword. This can be done in O(|y_i|) time by propagating upwards the labels of the terminal nodes (starting positions of suffixes) and updating the smallest and largest positions accordingly.

    (c) Compute the set M^ℓ_{y_i#y_N} and output the words in the following constant-space form: O(1) machine words per word; a tuple ⟨i′, j′, α⟩ such that (y_i#y_N)[i′..j′]α is a MAW. This can be done in O(|y_i| + |y_N|) time.

    (d) For each element u = aνb of M^ℓ_{y_i#y_N}, we need to locate the node representing the word aν (its longest proper prefix) and the node representing the word νb (its longest proper suffix). This can be done via answering off-line a batch of weighted ancestor queries (Theorem 2). At this point, we have located the two nodes on the suffix trees. We assign a pointer from the stored starting position of aν to the ending position of νb, only if an element of R_{y_N} starts where aν starts and an element of R_x ends where νb ends (the ending position of νb can be trivially computed using the stored starting position of aν and the length of u). Conversely, we assign a pointer from the ending position of νb to the stored starting position of aν, only if an element of R_x starts where aν starts and an element of R_{y_N} ends where νb ends.

    (e) Suppose aν occurs in y_i and νb occurs in y_N. We make use of the pointers as follows. Recall Steps 3 and 4(a) and check whether u starts where a word of R_{y_N} starts and ends where a word of R_x ends. If this is the case and |u| ≤ ℓ, then by Theorem 1 u is added to our output set M^ℓ_{y_1#...#y_N}; otherwise we discard it. Inspect Figure 2 in this regard. Conversely, if aν occurs in y_N and νb occurs in y_i, we check whether u starts where a word of R_x starts and whether it ends where a word of R_{y_N} ends. If this is the case and |u| ≤ ℓ, then u is added to M^ℓ_{y_1#...#y_N}; otherwise we discard it.

Figure 2: u starts where a word of R_{y_N} starts in y_i and ends where a word of R_x ends in y_N. Moreover, if |u| ≤ ℓ, then u is a Case 2 MAW.

Finally, we set M^ℓ_{y_1#...#y_N} as the output of the Nth step. Let MaxIn be the length of the longest word in {y_1, ..., y_k} and MaxOut = max{||M^ℓ_{y_1#...#y_N}|| : N ∈ [1, k]}.
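As promised before the list, here is a set-level, brute-force sketch (ours) of the whole incremental loop, written with the maws, reduced and case2_keep helpers from the earlier sketches (assumed in scope). It mirrors the structure of Steps 1 to 4, but it replaces every suffix-tree computation with explicit string operations and, unlike the actual algorithm, it is neither time- nor space-efficient:

#include <set>
#include <string>
#include <vector>

// Incrementally computes M^ell of y_1#...#y_N for N = 1..k (k >= 1) and
// returns the final set. Case 1 words are kept via the reduced sets
// (Lemma 1); Case 2 words are gathered from every pair y_i#y_N and
// filtered according to Theorem 1.
std::set<std::string> antidictionary(const std::vector<std::string>& ys,
                                     std::size_t ell, const std::string& sigma) {
    std::set<std::string> M = maws(ys.at(0), ell, sigma);
    for (std::size_t N = 1; N < ys.size(); ++N) {
        std::set<std::string> My = maws(ys[N], ell, sigma);
        // R_x: MAWs kept so far that occur in y_N; R_{y_N}: MAWs of y_N
        // that occur in some previous word, i.e., in y_1#...#y_{N-1}.
        std::set<std::string> rx = reduced(M, ys[N]), ry;
        for (const std::string& u : My)
            for (std::size_t i = 0; i < N && !ry.count(u); ++i)
                if (ys[i].find(u) != std::string::npos) ry.insert(u);
        std::set<std::string> W;                       // Case 1 MAWs (Lemma 1)
        for (const std::string& u : M)  if (!rx.count(u)) W.insert(u);
        for (const std::string& u : My) if (!ry.count(u)) W.insert(u);
        for (std::size_t i = 0; i < N; ++i)            // Case 2 MAWs (Theorem 1)
            for (const std::string& u : maws(ys[i] + "#" + ys[N], ell, sigma))
                if (!M.count(u) && !My.count(u) && case2_keep(u, rx, ry))
                    W.insert(u);
        M.swap(W);
    }
    return M;
}

// antidictionary({"abaab", "bbaaab"}, 5, "ab") contains abaaa (Example 2).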

Theorem 3.

Given k words y_1, ..., y_k over an alphabet Σ and an integer ℓ > 0, all M^ℓ_{y_1}, M^ℓ_{y_1#y_2}, ..., M^ℓ_{y_1#...#y_k} can be computed in O(kn + ∑^k_{N=1} ||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn + MaxOut) space, where n = |y_1#...#y_k|.

Proof.

From the above discussion, the time is bounded by O(∑^k_{N=1} (∑^{N-1}_{i=1} |y_i| + |y_N| + ||M^ℓ_{y_1#...#y_N}||)). We can bound the first term as follows:

∑^k_{N=1} ∑^{N-1}_{i=1} |y_i| ≤ ∑^k_{N=1} n = kn.

Therefore the time is bounded by O(kn + ∑^k_{N=1} ||M^ℓ_{y_1#...#y_N}||).

The space is bounded by the maximum space used at a single step; namely, the length of the longest word in the collection plus the maximum total size of the sets kept in memory at any step, i.e., O(MaxIn + MaxOut). Note that the total output size of the algorithm is the sum of the sizes of all its output sets, that is, ∑^k_{N=1} ||M^ℓ_{y_1#...#y_N}||, and MaxOut could come from any intermediate set.

The correctness of the algorithm follows from Lemma 1 and Theorem 1. ∎

5 Proof-of-Concept Experiments

In this section, we do not directly compare against the fastest internal [2] or external [19] memory implementations because the former assumes that we have the required amount of internal memory, and the latter assumes that we have the required amount of external memory to construct and store the global data structures for a given input dataset. If the memory for constructing and storing the data structures is available, these linear-time algorithms are surely faster than the method proposed here. In what follows, we rather show that our output-sensitive technique offers a space-time tradeoff, which can be usefully exploited for specific values of ℓ, the maximal length of MAWs we wish to compute.

The algorithm discussed in Section 4 (with the exception of storing and searching the reduced-set words explicitly rather than in the constant-space form previously described) has been implemented in the C++ programming language; the implementation can be made available upon request. The correctness of our implementation has been confirmed against that of [2]. As input dataset here we used the entire human genome (version hg38) [21], which has an approximate size of 3.1GB. The following experiment was conducted on a machine with an Intel Core i5-4690 CPU at 3.50 GHz and 128GB of memory running GNU/Linux. We ran the program by splitting the genome into k blocks, for several values of k and ℓ. Figure 3 depicts the change in elapsed time and peak memory usage as k and ℓ increase (space-time tradeoff).

Graph (a) shows an increase in time as k and ℓ increase; and graph (b) shows a decrease in memory as k increases (as proved in Theorem 3). Notice that the space needed to construct the block-wise data structures bounds the total space used for the specific ℓ values, which is why the memory peak is essentially the same for the ℓ values used. This can specifically be seen for small values of ℓ, where all words of length ℓ are present in the genome. The same dataset was used to run the fastest internal memory implementation for computing MAWs [2] on the same machine. It computed all MAWs considerably faster, but with a far higher peak memory usage, as it constructs Ω(n)-sized global data structures. The results confirm our theoretical findings and justify our contribution.

Figure 3: (a) Elapsed time and (b) peak memory usage as functions of the number of blocks k, for the values of ℓ used.