Classical and Quantum Algorithms for Constructing Text from Dictionary Problem

05/28/2020 ∙ by Kamil Khadiev, et al. ∙ 0

We study algorithms for constructing a text (a long string) from a dictionary (a sequence of small strings). The problem has an application in bioinformatics and is connected with the sequence assembly method for reconstructing a long DNA sequence from small fragments. The task is to construct a string t of length n from strings s^1,..., s^m with possible intersections. We provide a classical algorithm with running time O(n+L +m(log n)^2)=Õ(n+L), where L is the sum of the lengths of s^1,...,s^m. We provide a quantum algorithm with running time O(n +log n·(log m+loglog n)·√(m· L))=Õ(n +√(m· L)). Additionally, we show that the lower bound for classical algorithms is Ω(n+L). Thus, our classical algorithm is optimal up to a log factor, and our quantum algorithm shows a speed-up compared to any classical algorithm when the strings in the dictionary have non-constant length.

1 Introduction

Quantum computing [nc2010, a2017, aazksw2019part1] has been one of the hot topics in computer science in the last decades. There are many problems where quantum algorithms outperform the best known classical algorithms [dw2001, quantumzoo, ks2019, kks2019].

Problems on strings are one such class. Researchers show the power of quantum algorithms for such problems in [m2017, bbbv1997, rv2003, ki2019].

In this paper, we consider the problem of constructing a text from dictionary strings with possible intersections. We have a text $t$ of length $n$ and a dictionary $s = \{s^1, \dots, s^m\}$. The problem is constructing $t$ only from strings of $s$ with possible intersections. The problem is connected with the sequence assembly method for reconstructing a long DNA sequence from small fragments [msdd2000, bt2013].

We suggest a classical algorithm with running time $O(n + L + m(\log n)^2) = \tilde{O}(n + L)$, where $L = |s^1| + \dots + |s^m|$, $|s^i|$ is the length of $s^i$, and $\tilde{O}$ does not consider log factors. The algorithm uses the segment tree [l2017guide] and suffix array [mm90] data structures, the concept of comparing strings using a rolling hash [kr87, Fre79], and the idea of prefix sums [cormen2001].

The second algorithm is quantum. It uses similar ideas and a quantum algorithm for comparing two strings with a quadratic speed-up compared to classical counterparts [ki2019]. The running time of our quantum algorithm is

$O\left(n + \log n \cdot (\log m + \log\log n) \cdot \sqrt{m \cdot L}\right)$.

Additionally, we show the lower bound in the classical case, which is $\Omega(n + L)$. Thus, we get the optimal classical algorithm in the case of $m(\log n)^2 = O(L + n)$. It is true, for example, when the strings from $s$ have length at least $\Omega((\log n)^2)$ or when $m = O(n/(\log n)^2)$. In the general case, the algorithm is optimal up to a log factor. The quantum algorithm is better than any classical counterpart in the case of $\log n \cdot (\log m + \log\log n) \cdot \sqrt{m \cdot L} = o(L)$. It happens if the strings from $s$ have length at least $\omega\left((\log n)^2 (\log m + \log\log n)^2\right)$.

Our algorithm uses some quantum algorithms as subroutines, and the remaining part is classical. We investigate the problems in terms of query complexity. The query model is one of the most popular in the case of quantum algorithms. Such algorithms can do a query to a black box that has access to the sequence of strings. By the running time of an algorithm, we mean the number of queries to the black box.

The structure of the paper is the following. We present tools in Section 2. Then, we discuss the classical algorithm in Section 3.1 and the quantum algorithm in Section 3.2. Section 4 contains the lower bound.

2 Tools

Our algorithms use several data structures and algorithmic ideas, such as the segment tree [l2017guide], the suffix array [mm90], the rolling hash [kr87] and prefix sums [cormen2001]. Let us describe them in this section.

2.1 Preliminaries

Let us consider a string $u = (u_1, \dots, u_l)$ for some integer $l$. Then, $|u| = l$ is the length of the string, and $u[i, j] = (u_i, \dots, u_j)$ is a substring of $u$.

In the paper, we compare strings in lexicographical order. For two strings $u$ and $v$, the notation $u < v$ means that $u$ precedes $v$ in lexicographical order.

2.2 Rolling Hash for Strings Comparing

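The rolling hash was presented in [kr87]. It is a hash function

$h_p(u) = \left(\sum_{i=1}^{|u|} index(u_i) \cdot K^{i-1}\right) \bmod p,$

where $p$ is some prime integer, $K = |\Sigma|$ is the size of the alphabet, and $index(u_i) \in \{0, \dots, K-1\}$ is the index of the symbol $u_i$ in the alphabet. For simplicity, we consider the binary alphabet. So, $K = 2$ and $index(u_i) = u_i$.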

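We can use the rolling hash and the fingerprinting method [Fre79] for comparing two strings $u, v$. Let us randomly choose $p$ from the set of the first $r$ primes, such that $r \ge \frac{\max(|u|,|v|)}{\varepsilon}$ for some $\varepsilon > 0$. Due to the Chinese Remainder Theorem and [Fre79], the statements $h_p(u) = h_p(v)$ and $u = v$ are equivalent with error probability at most $\varepsilon$. If we have $\delta$ invocations of the comparing procedure, then we should choose $p$ from $\frac{\delta \cdot \max(|u|,|v|)}{\varepsilon}$ primes. Due to Chebyshev's theorem, the $r$-th prime number is $p_r \approx r \ln r$. So, if our data type for integers is enough for storing $\frac{\delta \cdot \max(|u|,|v|)}{\varepsilon} \cdot \left(\ln \delta + \ln \max(|u|,|v|) - \ln \varepsilon\right)$, then it is enough for computing the rolling hash.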

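Additionally, for a string $u$, we can compute the prefix rolling hashes, that is, $h_p(u[1, i])$ for $i \in \{1, \dots, |u|\}$. They can be computed in $O(|u|)$ running time using the formula

$h_p(u[1, i]) = \left(h_p(u[1, i-1]) + (2^{i-1} \bmod p) \cdot u_i\right) \bmod p$, where $h_p(u[1, 0]) = 0$ by definition.

Assume that we store $\mathcal{K}_i = 2^{i-1} \bmod p$. We can compute all of them in $O(|u|)$ running time using the formula $\mathcal{K}_i = (\mathcal{K}_{i-1} \cdot 2) \bmod p$.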

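Using the precomputed prefix rolling hashes for a string $u$, we can compute the rolling hash for any substring $u[i, j]$ in $O(1)$ running time by the formula

$h_p(u[i, j]) = \left(\sum_{q=i}^{j} u_q \cdot 2^{q-i}\right) \bmod p = \left(\left(h_p(u[1, j]) - h_p(u[1, i-1])\right) \cdot 2^{-(i-1)}\right) \bmod p.$

For computing the formula in $O(1)$, we should precompute $\mathcal{I}_i = 2^{-i} \bmod p$. We can compute it in $O(\log p + |u|)$ by the formulas $\mathcal{I}_i = (\mathcal{I}_{i-1} \cdot 2^{-1}) \bmod p$ and $\mathcal{I}_0 = 1$. Due to Fermat's little theorem, $2^{-1} \bmod p = 2^{p-2} \bmod p$. We can compute it with $O(\log p)$ running time using the exponentiation by squaring algorithm.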

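Let ComputeKI($\beta$, $p$) be a procedure that computes $\mathcal{K}$ and $\mathcal{I}$ up to the power $\beta$ with $O(\beta + \log p)$ running time. Let ComputePrefixRollingHashes($u$, $p$) be a procedure that computes all prefix rolling hashes for a string $u$ and stores them.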
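The precomputation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: indices are 0-based (so `K[i]` holds $2^{i}$ rather than $2^{i-1}$), strings are Python strings over '0' and '1', the names `compute_ki`, `prefix_hashes` and `substring_hash` are hypothetical, and the prime is fixed for the example instead of being chosen at random.

```python
def compute_ki(beta, p):
    """Powers of 2 and of 2^{-1} modulo an odd prime p, up to exponent beta."""
    inv2 = pow(2, p - 2, p)          # Fermat's little theorem: 2^{-1} = 2^{p-2} mod p
    K = [1] * (beta + 1)             # K[i] = 2^i mod p
    I = [1] * (beta + 1)             # I[i] = 2^{-i} mod p
    for i in range(1, beta + 1):
        K[i] = K[i - 1] * 2 % p
        I[i] = I[i - 1] * inv2 % p
    return K, I


def prefix_hashes(u, K, p):
    """H[i] is the hash of the first i symbols of the binary string u (H[0] = 0)."""
    H = [0] * (len(u) + 1)
    for i, c in enumerate(u):
        H[i + 1] = (H[i] + int(c) * K[i]) % p
    return H


def substring_hash(H, I, p, i, j):
    """Hash of u[i:j] (0-based, half-open) in O(1) from the prefix hashes."""
    return (H[j] - H[i]) * I[i] % p


p = 1_000_000_007                    # assumption: a fixed prime, for illustration only
u = "1011001"
K, I = compute_ki(len(u), p)
H = prefix_hashes(u, K, p)
```

Here `substring_hash(H, I, p, 2, 5)` agrees with the hash of the string `u[2:5]` computed from scratch, which is the property the substring formula relies on.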

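Assume that we have two strings $u$ and $v$ and that their prefix rolling hashes are already computed. Then, we can compare these strings in lexicographical order in $O(\log \min(|u|,|v|))$ running time. The algorithm is the following. We search for the longest common prefix of $u$ and $v$, that is, $lcp(u,v)$. We can do it using binary search.

  • If $mid \le lcp(u,v)$, then $h_p(u[1,mid]) = h_p(v[1,mid])$.

  • If $mid > lcp(u,v)$, then $h_p(u[1,mid]) \ne h_p(v[1,mid])$.

Using binary search, we find the last index $x$ such that $h_p(u[1,x]) = h_p(v[1,x])$ and $h_p(u[1,x+1]) \ne h_p(v[1,x+1])$. In that case, $lcp(u,v) = x$.

After that, we compare $u_i$ and $v_i$ for $i = lcp(u,v)+1$. If $u_i < v_i$, then $u < v$; if $u_i > v_i$, then $u > v$; if $|u| = |v| = lcp(u,v)$, then $u = v$. When one string is a proper prefix of the other, the shorter string precedes the longer one.

Binary search works with $O(\log \min(|u|,|v|))$ running time because we have computed all prefix rolling hashes already.

Let Compare($u$, $v$) be a procedure that compares $u$ and $v$ and returns $-1$ if $u < v$; $1$ if $u > v$; and $0$ if $u = v$.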
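The comparison procedure can be sketched as follows. This is an illustration under assumptions: 0-based indices, binary strings given as Python strings, a fixed prime `P` instead of a randomly chosen one, and hypothetical names `prefix_hashes`, `lcp` and `compare`. Since both prefixes start at the first symbol, the sketch compares prefix hashes directly and needs no modular inverses.

```python
P = 1_000_000_007  # assumption: fixed prime for the example; the text chooses p at random


def prefix_hashes(u):
    """H[i] is the rolling hash of the first i symbols of the binary string u."""
    H, k = [0], 1
    for c in u:
        H.append((H[-1] + int(c) * k) % P)
        k = k * 2 % P
    return H


def lcp(Hu, Hv):
    """Length of the longest common prefix, found by binary search on prefix hashes."""
    lo, hi = 0, min(len(Hu), len(Hv)) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if Hu[mid] == Hv[mid]:       # prefixes of length mid agree (up to hash collisions)
            lo = mid
        else:
            hi = mid - 1
    return lo


def compare(u, v, Hu, Hv):
    """Return -1 if u < v, 1 if u > v, 0 if u = v (lexicographical order)."""
    x = lcp(Hu, Hv)
    if x == len(u) and x == len(v):
        return 0
    if x == len(u):
        return -1                    # u is a proper prefix of v
    if x == len(v):
        return 1                     # v is a proper prefix of u
    return -1 if u[x] < v[x] else 1
```

The binary search performs $O(\log \min(|u|,|v|))$ hash comparisons, and only one pair of symbols is read after it.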

2.3 Segment Tree with Range Updates

We consider a standard segment tree data structure [l2017guide] for an array $b = (b_1, \dots, b_l)$ for some integer $l$. A segment tree for an array $b$ can be constructed in $O(l)$ running time. The data structure allows us to invoke the following requests in $O(\log l)$ running time.

  • Update. Parameters are three integers $i, j, x$ ($1 \le i \le j \le l$). We assign $b_q \gets \max(b_q, x)$ for $i \le q \le j$.

  • Push. We push all existing range updates.

  • Request. For an integer $i$ ($1 \le i \le l$), we should return $b_i$.

Let ConstructSegmentTree($b$) be a function that constructs and returns a segment tree for an array $b$ in $O(l)$ running time.

Let Update($st$, $i$, $j$, $x$) be a procedure that updates a segment tree $st$ in $O(\log l)$ running time.

Let Push($st$) be a procedure that pushes all existing range updates for a segment tree $st$ in $O(l)$ running time.

Let Request($st$, $i$) be a function that returns $b_i$ from a segment tree $st$. The running time of the procedure is $O(\log l)$. At the same time, if we invoke the Push procedure and after that do not invoke the Update procedure, then the running time of Request is $O(1)$.
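One possible way to realize these three operations is an iterative segment tree that stores a pending maximum in every node. The sketch below is an assumption about the implementation, not the structure from [l2017guide] verbatim: indices are 0-based, ranges are inclusive, and the class name `MaxSegmentTree` and the `neutral` argument are hypothetical.

```python
class MaxSegmentTree:
    """Range update b[q] = max(b[q], x) with point requests."""

    def __init__(self, b, neutral):
        self.n = len(b)
        # tag[n + i] is the leaf of b[i]; tag[1..n-1] are internal nodes
        self.tag = [neutral] * self.n + list(b)
        self.pushed = False

    def update(self, i, j, x):
        """Apply b[q] = max(b[q], x) for i <= q <= j in O(log n)."""
        self.pushed = False
        lo, hi = i + self.n, j + self.n + 1
        while lo < hi:
            if lo & 1:
                self.tag[lo] = max(self.tag[lo], x)
                lo += 1
            if hi & 1:
                hi -= 1
                self.tag[hi] = max(self.tag[hi], x)
            lo //= 2
            hi //= 2

    def push(self):
        """Propagate every pending value down to the leaves in O(n)."""
        for node in range(1, self.n):
            for child in (2 * node, 2 * node + 1):
                self.tag[child] = max(self.tag[child], self.tag[node])
        self.pushed = True

    def request(self, i):
        """Return b[i]: O(1) right after push, O(log n) otherwise."""
        node = i + self.n
        best = self.tag[node]
        if self.pushed:
            return best
        while node > 1:
            node //= 2
            best = max(best, self.tag[node])
        return best
```

Because `max` is idempotent, values left in internal nodes after `push` do no harm if further updates follow. The same class works for pairs, since Python compares tuples lexicographically.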

2.4 Suffix Array

Suffix array [mm90] is an array $suf = (suf_1, \dots, suf_l)$ for a string $u$ and $l = |u|$. The suffix array is a lexicographical order of all suffixes of $u$. Formally, $u[suf_i, l] < u[suf_{i+1}, l]$ for any $i \in \{1, \dots, l-1\}$.

The suffix array can be computed in $O(l)$ running time.

Lemma 1 ([llh2018])

A suffix array for a string $u$ can be constructed in $O(|u|)$ running time.

Let ConstructSuffixArray($u$) be a procedure that constructs a suffix array for a string $u$.
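For small examples, the definition can be checked with a direct construction. The sketch below sorts the suffixes explicitly, so it is much slower than the linear-time algorithm of Lemma 1 and only illustrates what the array contains; the function name and the 0-based positions are assumptions.

```python
def construct_suffix_array(u):
    """Start positions (0-based) of the suffixes of u in lexicographical order.

    Naive construction by sorting suffixes; not the linear-time algorithm.
    """
    return sorted(range(len(u)), key=lambda i: u[i:])
```

For $u = 0110$, the suffixes in sorted order are $0$, $0110$, $10$, $110$, which start at 0-based positions 3, 0, 2, 1.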

3 Algorithms

Let us formally present the problem.

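Problem. For some positive integers $n$ and $m$, we have a sequence of strings $s = (s^1, \dots, s^m)$. Each $s^i = (s^i_1, \dots, s^i_{l_i}) \in \Sigma^{l_i}$, where $\Sigma$ is some finite size alphabet and $l_i = |s^i|$. We call $s$ a dictionary. Additionally, we have a string $t$ of length $n$, where $t = (t_1, \dots, t_n) \in \Sigma^n$. We call $t$ a text. The problem is searching for a subsequence $s^{i_1}, \dots, s^{i_z}$ and positions $q_1, \dots, q_z$ such that $q_1 = 1$, $q_z = n - |s^{i_z}| + 1$, $q_j \le q_{j-1} + |s^{i_{j-1}}|$ for $j \in \{2, \dots, z\}$. Additionally, $t[q_j, q_j + |s^{i_j}| - 1] = s^{i_j}$ for $j \in \{1, \dots, z\}$.

For simplicity, we assume that $\Sigma = \{0, 1\}$, but all results hold for any finite alphabet.

Informally, we want to construct $t$ from strings of $s$ with possible intersections.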

Firstly, let us present a classical algorithm.

3.1 A Classical Algorithm

Let us present the algorithm. Let $long_i$ be an index of a longest string from $s$ that can start in position $i$. Formally, $long_i = j$ if $s^j$ is a longest string from $s$ such that $t[i, i + |s^j| - 1] = s^j$. Let $long_i = -1$ if there is no such string $s^j$. If we construct such an array, then we can construct $Q = (q_1, \dots, q_z)$ and $I = (i_1, \dots, i_z)$ that form a solution of the problem in $O(n)$. A procedure ConstructQI($long$) from Algorithm 1 shows it. If there is no such decomposition of $t$, then the procedure returns $NULL$.

while  do
     
     
     if  then
         
     end if
     for  do
         if  and  then
              
              
         end if
     end for
     if  or  then
         Break the While loop and return We cannot construct other part of the string
     end if
     
     
     
     
     
end while
return
Algorithm 1 ConstructQI($long$). Constructing $Q$ and $I$ from $long$
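The greedy idea behind this procedure can be sketched in Python as follows. This is one possible reading of the procedure, written under assumptions: positions are 0-based, `lengths[j]` is the length of the $j$-th dictionary string, `None` plays the role of the failure value, and the names `construct_qi`, `covered` and `scan` are hypothetical. Every position is examined once, which gives $O(n)$ steps.

```python
def construct_qi(long_, lengths, n):
    """Greedily cover positions 0..n-1 of the text.

    long_[i]   index of a longest dictionary string matching at position i, or -1
    lengths[j] length of dictionary string j
    Returns (Q, I) with start positions and string indices, or None.
    """
    Q, I = [], []
    covered = 0   # positions 0..covered-1 are already covered
    scan = 0      # next start position that has not been examined yet
    while covered < n:
        best_end, best_q = covered, -1
        # the next string may start at any position up to the first uncovered one
        while scan <= covered:
            j = long_[scan]
            if j != -1 and scan + lengths[j] > best_end:
                best_end, best_q = scan + lengths[j], scan
            scan += 1
        if best_q == -1:
            return None   # the remaining part of the text cannot be constructed
        Q.append(best_q)
        I.append(long_[best_q])
        covered = best_end
    return Q, I
```

For the text $01101$ and the dictionary $(011, 110, 01)$, the array is `[0, 1, -1, 2, -1]` and the sketch selects the strings $011$ at position 0 and $01$ at position 3.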

Let us discuss how to construct the $long$ array.

As a first step, we choose a prime $p$ that is used for the rolling hash. We choose $p$ randomly from the first $r = \frac{n \cdot m(\log n)^2}{\varepsilon}$ primes. In that case, due to results from Section 2.2, the probability of error is at most $\varepsilon$ in the case of at most $m(\log n)^2$ string comparing invocations.
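For moderate values of $r$, the random choice of a prime can be sketched with a sieve. The code below is an illustration only: the function names are hypothetical, and the sieve limit uses the known bound $p_r < r(\ln r + \ln\ln r)$ for $r \ge 6$. For the large values of $r$ required by the analysis, a sieve of this size may be impractical.

```python
import math
import random


def first_primes(r):
    """Return the first r primes using the sieve of Eratosthenes."""
    if r < 6:
        return [2, 3, 5, 7, 11][:r]
    # for r >= 6 the r-th prime is below r * (ln r + ln ln r)
    limit = int(r * (math.log(r) + math.log(math.log(r)))) + 1
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, is_prime in enumerate(sieve) if is_prime][:r]


def choose_prime(r, rng=random):
    """Pick p uniformly from the first r primes.

    Note: p = 2 has no inverse of 2, so an implementation that needs
    2^{-1} mod p would have to skip it (an assumption, not stated in the text).
    """
    return rng.choice(first_primes(r))
```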

As a second step, we construct a suffix array $suf$ for $t$. Then, we consider an array $a = (a_1, \dots, a_n)$ of pairs $a_i = (len_i, ind_i)$, where $len_i$ is the length of a dictionary string and $ind_i$ is its index. One element of $a$ corresponds to one element of the suffix array $suf$. After that, we construct a segment tree $st$ for $a$ and use the $len$ parameter of a pair for the maximum.

As a next step, we consider strings $s^j$ for $j \in \{1, \dots, m\}$. For each string $s^j$, we find the smallest index $left$ and the biggest index $right$ such that all suffixes $t[suf_i, n]$ for $left \le i \le right$ have $s^j$ as a prefix. We can use binary search for this action. Because of the sorted order of suffixes in the suffix array, all suffixes with the prefix $s^j$ are situated sequentially. As a comparator for strings, we use the Compare procedure. Let us present this action as the procedure SearchSegment in Algorithm 2. The algorithm returns $NULL$ if no suffix of $t$ contains the string $s^j$ as a prefix.

,
while  and  do
     
     
     
     
     if  and  then
         
         
     end if
     if  then
         
     end if
     if  then
         
     end if
end while
if  then
     
     
     
     while  and  do
         
         
         
         
         if  and  then
              
              
         end if
         if  then
              
         end if
         if  then
              
         end if
     end while
end if
return
Algorithm 2 SearchSegment($s^j$). Searching for the segment of indexes of suffixes of $t$ that have $s^j$ as a prefix
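The two binary searches of this procedure can be sketched as follows. The sketch is an illustration under assumptions: it compares string slices directly instead of calling the hash-based Compare procedure, uses 0-based ranks in the suffix array, returns `None` as the failure value, and the name `search_segment` is hypothetical.

```python
def search_segment(t, suf, s):
    """Inclusive range (left, right) of ranks in suf whose suffixes start with s.

    Returns None if no suffix of t has s as a prefix.
    """
    def prefix_at(k):
        # first |s| symbols of the k-th smallest suffix (shorter near the end of t)
        return t[suf[k]:suf[k] + len(s)]

    lo, hi = 0, len(suf)             # leftmost rank with prefix_at(rank) >= s
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) < s:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    if left == len(suf) or prefix_at(left) != s:
        return None
    lo, hi = left, len(suf)          # leftmost rank with prefix_at(rank) > s
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) <= s:
            lo = mid + 1
        else:
            hi = mid
    return left, lo - 1
```

The searches are valid because truncating lexicographically sorted suffixes to their first $|s|$ symbols keeps them in non-decreasing order.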

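Then, we update values in the segment tree on the segment $[left, right]$ by the pair $(|s^j|, j)$.

After processing all strings from $s$, the array $a$ is constructed. We can construct the $long$ array using $a$ and the suffix array $suf$. We know that the $i$-th element of $a$ stores the longest possible string that starts from position $suf_i$. It is almost the definition of the $long$ array. So we can put $long_{suf_i} = ind_i$, if $a_i = (len_i, ind_i)$.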
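The relation between ranks in the suffix array and positions in the text can be illustrated with a deliberately naive sketch. It replaces the suffix array construction, SearchSegment and the segment tree by direct loops, so it does not have the running time discussed here; the name `build_long` and the 0-based positions are assumptions.

```python
def build_long(t, dictionary):
    """Index of a longest dictionary string matching at each position of t, or -1."""
    n = len(t)
    suf = sorted(range(n), key=lambda i: t[i:])   # naive suffix array
    best = [(0, -1)] * n                          # pair (length, index) per rank
    for j, s in enumerate(dictionary):
        for rank in range(n):                     # stand-in for SearchSegment + Update
            if t.startswith(s, suf[rank]):
                best[rank] = max(best[rank], (len(s), j))
    long_ = [-1] * n
    for rank in range(n):
        long_[suf[rank]] = best[rank][1]          # rank in suf -> position in t
    return long_
```

Pairs are compared by length first, so the maximum keeps a longest matching string for every rank.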

Finally, we get the following Algorithm 3 for the text constructing from a dictionary problem.

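$p$ is randomly chosen from $\{p_1, \dots, p_r\}$
ComputeKI($n$, $p$)
ComputePrefixRollingHashes($t$, $p$)
for $j \in \{1, \dots, m\}$ do
     ComputePrefixRollingHashes($s^j$, $p$)
end for
$suf \gets$ ConstructSuffixArray($t$)
$a \gets ((0, -1), \dots, (0, -1))$ ▷ Initialization by a $(0, -1)$-array
$st \gets$ ConstructSegmentTree($a$)
for $j \in \{1, \dots, m\}$ do
     $(left, right) \gets$ SearchSegment($s^j$)
     if $(left, right) \ne NULL$ then
          Update($st$, $left$, $right$, $(|s^j|, j)$)
     end if
end for
Push($st$)
for $i \in \{1, \dots, n\}$ do
     $(len, ind) \gets$ Request($st$, $i$)
     $long_{suf_i} \gets ind$
end for
return ConstructQI($long$)
Algorithm 3 The classical algorithm for the text constructing from a dictionary problem for an error probability $\varepsilon$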

Let us discuss properties of Algorithm 3.

Theorem 3.1

Algorithm 3 solves the text constructing from a dictionary problem with running time and error probability for some , and . The running time is in a case of .

Proof

The correctness of the algorithm follows from construction.

Let us discuss running time of the algorithm. Note, that and for .

Due to results from Section 2.2, ComputeKI works with $O(n + \log p)$ running time. Let us convert this statement. Since $p \le p_r \approx r \ln r$ for $r = \frac{n \cdot m(\log n)^2}{\varepsilon}$, we have

$O(n + \log p) = O(n + \log n + \log m - \log \varepsilon) = O(n + \log m - \log \varepsilon)$.

Due to results from Section 2.2, ComputePrefixRollingHashes works in linear running time. Therefore, all invocations of the ComputePrefixRollingHashes procedure work in $O(n + L)$ running time.

Due to Lemma 1, ConstructSuffixArray works in $O(n)$ running time. Initializing of $a$ takes $O(n)$ steps. Due to results from Section 2.3, ConstructSegmentTree works in $O(n)$ running time.

The SearchSegment procedure invokes the Compare procedure $O(\log n)$ times due to binary search complexity. The Compare procedure works in $O(\log n)$ running time. Therefore, SearchSegment works in $O((\log n)^2)$ running time. Due to results from Section 2.3, the Update procedure works in $O(\log n)$ running time. Hence, the total complexity of processing all strings from the dictionary is $O(m(\log n)^2)$.

The invocation of Push works in $O(n)$ running time due to results from Section 2.3. The invocation of Request works in $O(1)$ running time because we do not invoke Update after Push. Therefore, constructing the array $long$ takes $O(n)$ steps.

The running time of ConstructQI is $O(n)$ because we pass each element only once.

So, the total complexity of the algorithm is

$O(n + \log p) + O(n + L) + O(n) + O(m(\log n)^2) + O(n) = O(n + L + m(\log n)^2 + \log p)$.

Let us discuss the error probability. We have $O(m \log n)$ invocations of the Compare procedure. Each invocation of the Compare procedure compares rolling hashes at most $O(\log n)$ times. Due to results from Section 2.2, if we compare strings of length at most $n$ using the rolling hash $O(m(\log n)^2)$ times and choose $p$ from $\frac{n \cdot m(\log n)^2}{\varepsilon}$ primes, then we get error probability at most $\varepsilon$.

3.2 A Quantum Algorithm

Firstly, let us discuss a quantum subroutine. There is a quantum algorithm for comparing two strings in a lexicographical order with the following property:

Lemma 2 ([ki2019])

There is a quantum algorithm that compares two strings of length in lexicographical order with query complexity and error probability for some positive integer .

Let be a quantum subroutine for comparing two strings and of length in lexicographical order. We choose . In fact, the procedure compares prefixes of and of length . returns if ; it returns if ; and it returns if .

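Next, we use a procedure QCompare($u$, $v$) that compares strings $u$ and $v$ in lexicographical order. Assume that $|u| < |v|$. Then, if $u$ is a prefix of $v$, then $u < v$. If $u$ is not a prefix of $v$, then the result is the same as for the base subroutine applied to the prefixes of length $|u|$. In the case of $|u| > |v|$, the algorithm is similar. The idea is presented in Algorithm 4.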

if  then
     
end if
if  then
     
     if  then
         
     end if
end if
if  then
     
     if  then
         
     end if
end if
return
Algorithm 4 The quantum algorithm for comparing two strings in lexicographical order
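The reduction from strings of different lengths to a comparison of equal-length prefixes can be sketched classically. In the sketch below, `compare_prefixes` is a purely classical stand-in for the quantum subroutine of Lemma 2 (it has no quantum speed-up), and both function names are hypothetical.

```python
def compare_prefixes(u, v, l):
    """Classical stand-in for the quantum subroutine: compares u[:l] and v[:l]."""
    a, b = u[:l], v[:l]
    return (a > b) - (a < b)


def qcompare(u, v):
    """Return -1 if u < v, 1 if u > v, 0 if u = v, using only prefix comparisons."""
    if len(u) == len(v):
        return compare_prefixes(u, v, len(u))
    if len(u) < len(v):
        res = compare_prefixes(u, v, len(u))
        return -1 if res == 0 else res   # u is a proper prefix of v, so u < v
    res = compare_prefixes(u, v, len(v))
    return 1 if res == 0 else res        # v is a proper prefix of u, so u > v
```

Only the call to `compare_prefixes` would be replaced by the quantum subroutine; the case analysis around it is classical.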

Let us present a quantum algorithm for the text constructing from a dictionary problem. For the algorithm, we use the same idea as in the classical case, but we replace Compare, which uses the rolling hash function, by QCompare. In that case, we do not need to construct rolling hashes. Let QSearchSegment be a quantum counterpart of SearchSegment that uses QCompare.

The quantum algorithm is presented as Algorithm 5.

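$suf \gets$ ConstructSuffixArray($t$)
$a \gets ((0, -1), \dots, (0, -1))$ ▷ Initialization by a $(0, -1)$-array
$st \gets$ ConstructSegmentTree($a$)
for $j \in \{1, \dots, m\}$ do
     $(left, right) \gets$ QSearchSegment($s^j$)
     Update($st$, $left$, $right$, $(|s^j|, j)$)
end for
Push($st$)
for $i \in \{1, \dots, n\}$ do
     $(len, ind) \gets$ Request($st$, $i$)
     $long_{suf_i} \gets ind$
end for
return ConstructQI($long$)
Algorithm 5 The quantum algorithm for the text constructing from a dictionary problem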

Let us discuss properties of Algorithm 5.

Theorem 3.2

Algorithm 5 solves the text constructing from a dictionary problem in $O\left(n + \log n \cdot (\log m + \log\log n) \cdot \sqrt{m \cdot L}\right)$ running time and error probability .

Proof

The algorithm does almost the same actions as the classical counterpart. That is why the correctness of the algorithm follows from Theorem 3.1.

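Let us discuss the running time. Due to Theorem 3.1, the running time of the procedure ConstructSuffixArray is $O(n)$, the running time of the procedure ConstructSegmentTree is $O(n)$, the running time of the procedure Push is $O(n)$, the running time of the array $long$ construction is $O(n)$, and the running time of ConstructQI is $O(n)$.

Due to Lemma 2, the running time of QCompare for $s^j$ is $O\left(\sqrt{|s^j|} \cdot (\log m + \log\log n)\right)$. The procedure QSearchSegment invokes the QCompare procedure $O(\log n)$ times for each string $s^j$. So, the complexity of processing all strings from $s$ is

$O\left(\log n \cdot (\log m + \log\log n) \cdot \sum_{j=1}^{m} \sqrt{|s^j|}\right)$.

Let us use the Cauchy-Bunyakovsky-Schwarz inequality and the equality $L = \sum_{j=1}^{m} |s^j|$ for simplifying the statement.

$\sum_{j=1}^{m} \sqrt{|s^j|} \le \sqrt{m \cdot \sum_{j=1}^{m} |s^j|} = \sqrt{m \cdot L}$.

The total running time is

$O\left(n + \log n \cdot (\log m + \log\log n) \cdot \sqrt{m \cdot L}\right)$.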
Let us discuss the error probability. The algorithm invokes QCompare procedure times. The success probability is


4 Lower Bound

Let us discuss the lower bound for the running time of classical algorithms.

Theorem 4.1

Any randomized algorithm for the text constructing from a dictionary problem works in $\Omega(n + L)$ running time, where $L = |s^1| + \dots + |s^m|$.

Proof

Assume . Let us consider such that and for all .

Let for each . Note, that in a general case . Therefore, we reduce the input data only at most twice. Assume that we have two options:

  • all contains only s;

  • there is such that we have two conditions:

    • all contains only s, for ;

    • for , and for .

In a case of all s, we cannot construct the text . In a case of existing in the first half, we can construct by putting on the position and we get of the required position. Then, we complete by other -strings.

Therefore, the solution of the problem is equivalent to the search of in unstructured data of size . The randomized complexity of this problem is due to [bbbv1997].

Assume . Let , and . Assume that we have two options:

  • contains only s;

  • there is such that and for all .

In the first case, we can construct from . In the second case, we cannot do it.

Here the problem is equivalent to a search among the $n$ symbols of $t$. Therefore, the problem's randomized complexity is $\Omega(n)$.

Finally, the total complexity is $\Omega(\max(n, L)) = \Omega(n + L)$.

References