Fast and linear-time string matching algorithms based on the distances of q-gram occurrences

02/19/2020
by Satoshi Kobayashi, et al.

Given a text T of length n and a pattern P of length m, the string matching problem is the task of finding all occurrences of P in T. In this study, we propose an algorithm that solves this problem in O((n + m)q) time by considering the distance between two adjacent occurrences of the same q-gram contained in P. We also propose a theoretical improvement of it that runs in O(n + m) time, though it is not necessarily faster in practice. We compare the execution times of our algorithms and existing ones on various real and artificial datasets, such as an English text, a genome sequence, and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in the text.


1 Introduction

The exact string matching problem is the task of finding all occurrences of P in T when given a text T of length n and a pattern P of length m. A brute-force solution of this problem is to compare P with all the substrings of T of length m, which takes O(nm) time. The Knuth-Morris-Pratt (KMP) algorithm [16] is well known as an algorithm that can solve the problem in O(n + m) time. However, it is not efficient in practice, because it scans every position of the text at least once. The Boyer-Moore algorithm [3] is famous as an algorithm that can perform string matching fast in practice by skipping many positions of the text, though it has O(nm) worst-case time complexity. Many efficient algorithms whose worst-case time complexity is the same as or even worse than that of the naive method have been proposed so far [15, 20, 18]. For example, the HASH algorithm [18] focuses on the substrings of length q in a pattern and obtains a larger shift amount. However, when such an algorithm is embedded in software and actually used, worst-case input strings may slow the software down. Therefore, an algorithm that operates fast both theoretically and practically is important. Franek et al. [13] proposed the Franek-Jennings-Smyth (FJS) algorithm, which is a hybrid of the KMP algorithm and the Sunday algorithm [20]. The worst-case time complexity of the FJS algorithm is O(n + m + σ), where σ is the alphabet size, and it works fast in practice. Kobayashi et al. [17] proposed an algorithm that improves the speed of the FJS algorithm by combining it with a method that extends the idea of the Quite-Naive algorithm [6]. This algorithm has the same worst-case time complexity as the FJS algorithm, and it runs faster than the FJS algorithm in many cases. The LWFR algorithm [5] is a practically fast algorithm that works in linear time; it uses a hash function to recognize substrings of the pattern quickly. See [12, 14] for recent surveys on exact string matching algorithms.

In this paper, we propose two new exact string matching algorithms based on the HASH algorithm and the KMP algorithm, incorporating a new idea based on the distances of occurrences of the same q-grams. The time complexity of the preprocessing phase of the first algorithm is O(mq) and the search phase runs in O(nq) time. The second algorithm improves the theoretical complexity of the first: its preprocessing and searching times are O(m) and O(n), respectively. Our algorithms are as fast as the state-of-the-art algorithms in many cases. Particularly, our algorithms work faster when a pattern frequently appears in a text.

This paper is organized as follows. Section 2 briefly reviews the KMP and HASH algorithms, which are the basis of the proposed algorithms. Section 3 proposes our algorithms. Section 4 shows experimental results comparing the proposed algorithms with several other algorithms using artificial and practical data. Section 5 draws our conclusions.

2 Preliminaries

2.1 Notation

Let Σ be a set of characters called an alphabet and σ = |Σ| be its size. Σ* denotes the set of all strings over Σ. The length of a string w is denoted by |w|. The empty string, denoted by ε, is the string of length zero. The i-th character of w is denoted by w[i] for each 1 ≤ i ≤ |w|. The substring of w starting at i and ending at j is denoted by w[i..j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i..j] = ε if i > j. A string w[1..i] is called a prefix of w and a string w[j..|w|] is called a suffix of w. In particular, a prefix w[1..i] (resp. suffix w[j..|w|]) of w is a proper prefix (resp. proper suffix) of w when i < |w| (resp. j > 1). A string x is a border of w if x is both a prefix and a suffix of w. Note that the empty string is a border of any string. Moreover, a border x of w is a proper border of w if x ≠ w. The length of the longest proper border of w[1..i] for 1 ≤ i ≤ |w| is given by

    Border_w(i) = max { k | 0 ≤ k < i and w[1..k] is a border of w[1..i] }.

Throughout this paper, we assume Σ is an integer alphabet.

2.2 The exact string matching problem

The exact string matching problem is defined as follows:

Input: A text T of length n and a pattern P of length m,
Output: All positions i such that T[i..i+m-1] = P, for 1 ≤ i ≤ n - m + 1.

We will use a text T of length n and a pattern P of length m throughout the paper.

Let us consider comparing P and T[i..i+m-1]. The naive method compares characters of the two strings from left to right. When a character mismatch occurs, the pattern is shifted to the right by one character; that is, we then compare P and T[i+1..i+m]. This naive method takes O(nm) time for matching. There are a number of ideas for shifting the pattern further, so that searching for P can be performed more quickly, using shift functions obtained by preprocessing the pattern.
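As a point of reference, the naive method just described can be sketched in a few lines of C; this is a minimal illustration of ours (0-based indices, 1-based positions reported), not code from the paper:

    #include <stdio.h>

    /* Naive matching: align P with T[i..i+m-1] for every i and compare
       left to right, shifting by one character on a mismatch. O(nm) time. */
    void naive_search(const char *T, int n, const char *P, int m) {
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && T[i + j] == P[j]) j++;
            if (j == m) printf("occurrence at %d\n", i + 1);
        }
    }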

2.3 Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt (KMP) algorithm [16] is well known as a string matching algorithm that has linear worst-case time complexity. When the KMP algorithm has confirmed that T[i+1..i+j-1] = P[1..j-1] and T[i+j] ≠ P[j] for some j, it shifts the pattern so that a suffix of T[i+1..i+j-1] matches a prefix of P and we do not have to re-scan any part of T[i+1..i+j-1] again. That is, the pattern can be shifted by j - 1 - Border_P(j-1) for j > 1. In addition, if P[Border_P(j-1) + 1] = P[j], the same mismatch will occur again after the shift. In order to avoid this kind of mismatch, we use the strong border Border'_P given by

    Border'_P(j) = max ({ -1 } ∪ { k | 0 ≤ k < j, P[1..k] is a border of P[1..j], and P[k+1] ≠ P[j+1] })

for 0 ≤ j < m, with Border'_P(m) = Border_P(m). The amount of the shift is given by

    kmpshift(j) = j - 1 - Border'_P(j - 1)   for 1 ≤ j ≤ m + 1.

This function has the domain { 1, 2, ..., m + 1 } and is implemented as an array in the algorithm. Hereafter, we identify some functions and the arrays that implement them.

Fact 1.

If T[i+1..i+j-1] = P[1..j-1] and T[i+j] ≠ P[j], then T[i+s+1..i+s+m] ≠ P holds for 0 < s < kmpshift(j). Moreover, there is no positive integer s < kmpshift(j) such that P[1..j-1-s] = P[s+1..j-1] and P[j-s] ≠ P[j].

Note that if the algorithm has confirmed T[i+1..i+m] = P, the shift is given by kmpshift(m+1) after reporting the occurrence of the pattern. The array kmpshift can be computed in O(m) time by the standard KMP preprocessing. By using kmpshift, the KMP algorithm finds all occurrences of P in T in O(n) time.
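As a concrete reference, the following is a minimal C sketch of this preprocessing and of the KMP search; it is our illustration, not the paper's pseudocode. Indices are 0-based, so fail[j] plays the role of Border'_P(j) and the shift for a mismatch at pattern position j is j - fail[j]:

    #include <stdio.h>

    /* fail[j]: length of the longest strong border of P[0..j-1], or -1.
       fail[m] is the ordinary longest border of the whole pattern. */
    void kmp_preprocess(const char *P, int m, int *fail) {
        fail[0] = -1;
        int b = -1;                       /* border length being extended */
        for (int j = 0; j < m; j++) {
            while (b >= 0 && P[b] != P[j]) b = fail[b];
            b++;
            /* strong border: skip a border followed by the same character */
            if (j + 1 < m && P[b] == P[j + 1]) fail[j + 1] = fail[b];
            else fail[j + 1] = b;
        }
    }

    void kmp_search(const char *T, int n, const char *P, int m, const int *fail) {
        for (int i = 0, j = 0; i < n; ) {
            while (j >= 0 && T[i] != P[j]) j = fail[j];
            i++; j++;
            if (j == m) {                 /* full match found */
                printf("occurrence at %d\n", i - m + 1);
                j = fail[m];
            }
        }
    }

For the pattern P = abaabbaaa of Example 1 below, this preprocessing yields exactly the shifts listed there, via kmpshift(j) = (j - 1) - fail[j - 1].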

2.4 HASH algorithm

The HASH algorithm [18] is an adaptation of the Wu-Manber multiple string matching algorithm [21] to the single string matching problem. Before comparing P and T[i+1..i+m], the HASH algorithm shifts the pattern so that the suffix q-gram T[i+m-q+1..i+m] of the text substring shall match the rightmost occurrence of the same q-gram in the pattern. For practical efficiency, we use a hash function h on q-grams, though it may result in aligning mismatching q-grams occasionally. The shift amount is given by d(h(T[i+m-q+1..i+m])), where

    d(c) = min ({ m - q + 1 } ∪ { m - j | q ≤ j ≤ m and h(P[j-q+1..j]) = c }).

We repeatedly shift the pattern till the suffix q-grams of the pattern and the considered text substring have a matching hash value, in which case the shift amount will be 0. We then compare the characters of the pattern and the text substring from left to right. If a character mismatch occurs during the comparison, the pattern is shifted by

    min ({ m - q + 1 } ∪ { m - j | q ≤ j < m and h(P[j-q+1..j]) = h(P[m-q+1..m]) }),    (1)

since we know that the q-gram suffixes of the pattern and the text substring have the same hash values. The time complexity of the preprocessing phase for computing the shift function d is O(mq). The searching phase has O(n(m + q)) worst-case time complexity, since each alignment may cost one O(q) hash computation plus up to m character comparisons. The worst-case time complexity is worse than that of the naive method, but it works fast in practice.

Fact 2.

If h(T[i+m-q+1..i+m]) = c, then T[i+k+1..i+k+m] ≠ P holds for 0 ≤ k < d(c). There is no positive integer k < d(c) such that h(P[m-k-q+1..m-k]) = c.
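A minimal C sketch of this preprocessing follows. The hash function here (2-bit character shifts into a power-of-two table) is one plausible choice, consistent with the hash values of Example 1 below, but not necessarily the exact function used in the paper's experiments:

    #define TABLE_SIZE 65536              /* hash table size, a power of two */

    /* h(s[0..q-1]) = (s[0]*4^(q-1) + ... + s[q-1]) mod TABLE_SIZE */
    unsigned hash_qgram(const char *s, int q) {
        unsigned h = 0;
        for (int k = 0; k < q; k++)
            h = (h << 2) + (unsigned char)s[k];
        return h & (TABLE_SIZE - 1);
    }

    /* d[c]: shift aligning the text's suffix q-gram with the rightmost
       q-gram of P that hashes to c; m - q + 1 if no q-gram of P does. */
    void compute_hash_shift(const char *P, int m, int q, int *d) {
        for (int c = 0; c < TABLE_SIZE; c++)
            d[c] = m - q + 1;
        for (int j = q; j <= m; j++)       /* q-gram ending at position j (1-based) */
            d[hash_qgram(P + j - q, q)] = m - j;   /* rightmost occurrence wins */
    }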

3 Proposed algorithms

3.1 DIST algorithm

Our proposed algorithm uses three kinds of shifting functions. The first one is essentially the same as d, the one used in the HASH algorithm, except for the hashing function. The second one, D, is based on the distance of the closest occurrences of the q-grams of the same hash value in the pattern. We involve kmpshift as the third one to guarantee the linear-time behavior.

Formally, the first shifting function is given as

    d(c) = min ({ m - q + 1 } ∪ { m - j | q ≤ j ≤ m and h(P[j-q+1..j]) = c }),

where the hash value of a q-gram w is

    h(w) = (4^(q-1) · w[1] + 4^(q-2) · w[2] + ... + 4^0 · w[q]) mod L

for a hash table of size L, identifying each character with its integer code. Fact 2 holds for this d as well.

The second shift function D is defined for q ≤ i ≤ m by

    D(i) = i - max ({ q - 1 } ∪ { j | q ≤ j < i and h(P[j-q+1..j]) = h(P[i-q+1..i]) }),

where h is the hash function above. This function is a generalization of the shift (Eq. 1) used in the HASH algorithm. We have D(i) = i - j if the q-gram ending at j and the one ending at i have the same hash value, while no q-grams ending between those have the same value. If no q-gram ending before i has the same hash value, then D(i) = i - q + 1. By using this, in the situation where the pattern q-gram ending at position pos is aligned with a text q-gram of the same hash value, when a mismatch occurs anywhere in the window, the pattern can be shifted by D(pos).

Fact 3.

Suppose that the pattern q-gram ending at position pos is aligned with a text q-gram of the same hash value, i.e., h(T[i+pos-q+1..i+pos]) = h(P[pos-q+1..pos]). Then T[i+k+1..i+k+m] ≠ P holds for 0 < k < D(pos), and after shifting the pattern by D(pos) the aligned q-grams still have the same hash value, unless D(pos) = pos - q + 1. Moreover, there is no positive integer k < D(pos) such that h(P[pos-k-q+1..pos-k]) = h(P[pos-q+1..pos]).

Those functions d, D, and kmpshift are computed in the preprocessing phase. The arrays d and D are both computed by scanning the q-grams of the pattern from left to right while recording, for each hash value, the last ending position at which it occurred.
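A minimal C sketch of computing D from its definition, reusing hash_qgram and TABLE_SIZE from the sketch in Section 2.4 (the helper array last, recording the last ending position seen for each hash value, is our naming):

    /* D[i] for q <= i <= m, with 1-based ending positions as in the text. */
    void compute_D(const char *P, int m, int q, int *D) {
        static int last[TABLE_SIZE];
        for (int c = 0; c < TABLE_SIZE; c++) last[c] = 0;   /* 0 = unseen */
        for (int i = q; i <= m; i++) {
            unsigned c = hash_qgram(P + i - q, q);
            D[i] = last[c] ? i - last[c] : i - q + 1;
            last[c] = i;
        }
    }

This runs in O(mq) time, or O(m) with the rolling-hash update of Section 3.2.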

Figure 1 shows examples of shifting the pattern using d and D.

Figure 1: Shifting a pattern using d and D

Both functions d and D shift the pattern using q-gram hash values, based on Facts 2 and 3, respectively. The latter can be used only when we know that the pattern and the text substring have aligned q-grams ending at position pos with the same hash value, and it may shift the pattern by at most pos - q + 1, while the former can be used anytime and the maximum possible shift is m - q + 1. The advantage of the function D lies in its computational cost. If we know that the premise of Fact 3 is satisfied, we can immediately perform the shift based on D, while computing d for the concerned q-gram in the text requires computing its hash value, which is not as cheap. Our algorithm exploits this advantage of the new shifting function D.

Next, we explain our searching algorithm. The searching phase is divided into three: the Alignment-phase, the Comparison-phase, and the KMP-phase. The goal of the Alignment-phase is to shift the pattern as far as possible without comparing every single character of the pattern and the text. The Alignment-phase ends when we align the pattern and a text substring that have (a) aligned q-grams of the same hash value and (b) the same first character. Suppose P and T[i+1..i+m] are aligned at the beginning of the Alignment-phase. If d(h(T[i+m-q+1..i+m])) = 0, the suffix q-grams are already aligned with the same hash value, namely pos = m. Otherwise, we shift the pattern by d(h(T[i+m-q+1..i+m])) repeatedly until we find aligned q-grams of the same hash value. When finding a position satisfying (a), that is, when the text q-gram is aligned with the pattern q-gram ending at some position pos, we simply check the condition (b). If P[1] and the corresponding text character match, we move to the Comparison-phase. Otherwise, we safely shift the pattern using D(pos). Note that although it is possible to use the function d rather than D, the computation would be more expensive. After shifting the pattern by D(pos), unless D(pos) = pos - q + 1, the pattern and the aligned text substring still satisfy (a). However, we do not repeat the D-shift any more, since the smaller pos becomes, the smaller the expected shift amount will be. We simply restart the Alignment-phase. Once the conditions (a) and (b) are satisfied, we move on to the Comparison-phase.

In the Comparison-phase, we check the characters from P[1] to P[m] against the aligned text substring. If a character mismatch occurs during the comparison, either the shift by D(pos) or by kmpshift is possible. Therefore, we select the one whose resumption position of the character comparison goes further to the right after shifting the pattern. If the resumption positions of the comparison are the same, we select the one with the larger shift amount. Recall that when the KMP algorithm finds that T[i+1..i+j-1] = P[1..j-1] and T[i+j] ≠ P[j], it resumes comparison by checking the match between T[i+j] and P[Border'_P(j-1) + 1] if Border'_P(j-1) ≥ 0, and between T[i+j+1] and P[1] otherwise. On the other hand, if we shift the pattern by D(pos), we simply resume by matching P[1] against the text character now aligned with it. Therefore, we should use kmpshift rather than D(pos) exactly when the KMP-shift resumes at least as far to the right and shifts at least as much; otherwise, we shift the pattern by D(pos). At this moment, we may have a "partial match" between the pattern and the aligned text substring: if we have performed the KMP-shift with Border'_P(j-1) > 0, then we have a match between the nonempty prefix of the pattern of length Border'_P(j-1) and the aligned text substring. In this case, we go to the KMP-phase, where we simply perform the KMP algorithm. The KMP-phase prevents the character comparison position from returning to the left and guarantees the linear time behavior of our algorithm. If we have no partial match, we return to the Alignment-phase.
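The following is a simplified C sketch of this search loop, reconstructed from the description above rather than taken from the paper's pseudocode. To stay short, it resolves every mismatch by taking the larger of the two safe shifts and re-entering the Alignment-phase, instead of implementing the resumption-position rule and the KMP-phase; it therefore reports the same occurrences, but without the linear-time guarantee. The arrays d, D, and fail come from the earlier sketches:

    void dist_search(const char *T, int n, const char *P, int m, int q,
                     const int *d, const int *D, const int *fail) {
        int p = 0;                            /* window is T[p..p+m-1], 0-based */
        while (p + m <= n) {
            /* Alignment-phase: establish (a) aligned equal-hash q-grams
               ending at pattern position pos and (b) a matching first char. */
            int pos = 0;
            while (p + m <= n) {
                int s = d[hash_qgram(T + p + m - q, q)];
                p += s;
                if (s > m - q) continue;      /* hash absent from P: shift again */
                pos = m - s;                  /* condition (a) holds at pos */
                if (p + m <= n && T[p] == P[0]) break;   /* condition (b) holds */
                p += D[pos];                  /* safe by Fact 3; then re-align */
            }
            if (p + m > n) return;
            /* Comparison-phase: P[0] already matched, scan left to right. */
            int j = 1;
            while (j < m && T[p + j] == P[j]) j++;
            if (j == m) printf("occurrence at %d\n", p + 1);   /* 1-based */
            int kmp = j - fail[j];            /* KMP shift (also after a match) */
            int ds = (j < m) ? D[pos] : 0;    /* D-shift applies on a mismatch */
            p += (ds > kmp) ? ds : kmp;
        }
    }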

Theorem 1.

The worst-case time complexity of the DIST algorithm is O((n + m)q).

Proof.

Since the proposed algorithm uses Fact 1 of the KMP algorithm to prevent the character comparison position from going back to the left, the number of character comparisons is at most 2n, as in the KMP algorithm. In addition, hash values of q-grams are calculated to perform the shifts using d. Since each hash value calculation requires O(q) time and it is performed at no more than n positions of the text, the hash value calculations take O(nq) time in total. Therefore, the worst-case time complexity of the searching phase is O(nq). In the preprocessing, O(mq) time is required to calculate the hash values of the q-grams ending at the m - q + 1 positions of the pattern.

Example 1.

Let P = abaabbaaa, m = 9, and q = 3. The shift functions kmpshift, D, and d are shown below. The hash values are calculated by treating each character as its ASCII code, e.g., a is treated as 97.

    i            1     2     3     4     5     6     7     8     9     10
    P[i]         a     b     a     a     b     b     a     a     a
    D(i)         -     -     1     2     3     4     5     4     7
    kmpshift(i)  1     1     3     2     4     3     7     6     7     8

    w            aba   baa   aab   abb   bba   aaa   others
    h(w)         2041  2053  2038  2042  2057  2037
    d(h(w))      6     1     4     3     2     0     7
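For instance, the value for aba in the table is consistent with the base-4 hash used in the sketches above: h(aba) = 16·97 + 4·98 + 97 = 2041, and likewise h(aaa) = 16·97 + 4·97 + 97 = 2037.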

Figure 2 illustrates an example run of the DIST algorithm (q = 3) for finding P = abaabbaaa in T = abbaabbaababbabbaaabaabaabbaaa.

Attempt 1

We shift the pattern by d(h(T[7..9])) = d(h(baa)) = 1 character. Since the position of the pattern q-gram aligned by this shift is 8, pos is updated to 8.

Attempt 2

We check whether the first character of the pattern matches the corresponding text character.

Since T[2] = b ≠ a = P[1], the pattern is shifted by D(8) = 4.

Attempt 3

We shift the pattern by d(h(T[12..14])) = d(h(bba)) = 2 and update pos to 7.

Attempt 4

We check whether the first character of the pattern matches the corresponding text character. From T[8] = a = P[1], we compare the characters of P and T[8..16] from left to right. Since T[9] = a ≠ b = P[2], the pattern is shifted by D(7) = 5 or kmpshift(2) = 1. The shift by D(7) is at least as large and resumes the comparison at least as far to the right. Therefore, we shift the pattern by D(7) = 5.

Attempt 5

We shift the pattern by d(h(T[19..21])) = d(h(aba)) = 6 and pos is updated to 3.

Attempt 6

We check whether the first character of the pattern matches the corresponding text character. T[19] = a = P[1] is satisfied, so the characters of P and T[19..27] are compared from left to right. Since T[24] = a ≠ b = P[6], the pattern is shifted by D(3) = 1 or kmpshift(6) = 3. From kmpshift(6) = 3 > 1 = D(3), the pattern is shifted by kmpshift(6) = 3.

Attempt 7

Attempt 6 shows that Border'_P(5) = 2 > 0, that is, there is a partial match, so we continue the comparison of P[3..9] and T[24..30]. Since they all match, the pattern occurrence position 22 is reported.

    Text       a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a
    Attempt 1  a b a a b b a a a
    Attempt 2    a b a a b b a a a
    Attempt 3            a b a a b b a a a
    Attempt 4                a b a a b b a a a
    Attempt 5                          a b a a b b a a a
    Attempt 6                                      a b a a b b a a a
    Attempt 7                                            a b a a b b a a a

Figure 2: An example run of the DIST algorithm for the pattern P = abaabbaaa and the text T = abbaabbaababbabbaaabaabaabbaaa. In Attempts 1-7, the pattern is aligned at text positions 1, 2, 6, 8, 13, 19, and 22, respectively; the occurrence at position 22 is reported in Attempt 7.

3.2 LDIST algorithm

The LDIST algorithm is a linear-time version of the DIST algorithm. In the DIST algorithm, strings such as T = aⁿ and P = aᵐ require O(nq) time in the searching phase, because the hash values of q-grams are calculated at almost every position of the text. Since the hash function defined in Section 3.1 is a rolling hash, if the hash value of w[i..i+q-1] has already been obtained for a string w, the hash value of w[i+1..i+q] can be computed in constant time by

    h(w[i+1..i+q]) = (4 · (h(w[i..i+q-1]) - 4^(q-1) · w[i]) + w[i+q]) mod L.

The LDIST algorithm modifies the hash computation step of the search algorithm so that we calculate the hash value of a q-gram from the previously calculated value of the other q-gram in this incremental way, whenever they overlap. Similarly, the time complexity of the preprocessing phase can be reduced.
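A short C sketch of this rolling update, matching hash_qgram from the earlier sketch (valid for moderate q, so that the shift widths stay within the word size; the function name is ours):

    /* Hash of w[i+1..i+q] from that of w[i..i+q-1]: drop w[i], append w[i+q]. */
    unsigned roll_hash(unsigned h, unsigned char out, unsigned char in, int q) {
        h -= (unsigned)out << (2 * (q - 1));        /* remove out * 4^(q-1) */
        return ((h << 2) + in) & (TABLE_SIZE - 1);  /* shift and append in */
    }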

Theorem 2.

The worst-case time complexity of the LDIST algorithm is O(n + m).

Proof.

Like the DIST algorithm, the number of character comparisons is at most 2n. In the calculation of the hash value of a q-gram, if the q-gram for which the hash value was calculated one step before overlaps the q-gram for which the hash value is to be calculated, the incremental update is performed using the rolling hash, at constant cost per text position passed. If they do not overlap, the O(q) cost of computing the hash from scratch is covered by the at least q positions the pattern has been shifted since the previous computation. Therefore, the calculation of the hash values of q-grams takes O(n) time in total, and the worst-case time complexity of the searching phase is O(n). In the preprocessing, we calculate the hash values of the q-grams of the pattern in the same way, so they can be calculated in O(m) time.

4 Experiments

In this section, we compare the execution times of the proposed algorithms with other algorithms. We used the following algorithms:

  • BNDM: the Backward Nondeterministic DAWG Matching algorithm using q-grams [19]

  • SBNDM: the simplified version of the Backward Nondeterministic DAWG Matching algorithm using q-grams [1]

  • KBNDM: the factorized variant of the BNDM algorithm [4]

  • BSDM: the Backward SNR DAWG Matching algorithm using condensed alphabets with groups of characters [10]

  • FJS: the Franek-Jennings-Smyth algorithm [13]

  • FJS (modified): the modification of the FJS algorithm proposed in [17]

  • HASH: the hashing algorithm using q-grams [18] (see Section 2.4)

  • FS: the Multiple Windows version [11] of the Fast Search algorithm [7], implemented using multiple sliding windows

  • IOM: the Improved Occurrence Matcher [8]

  • WOM: the Worst Occurrence Matcher [8]

  • LWFR: the Linear-Weak-Factor-Recognition algorithm [5], implemented with a chained loop

  • DIST: our algorithm proposed in Section 3.1

  • LDIST: our algorithm proposed in Section 3.2

[Tables: running times of the algorithms above for pattern lengths 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 on each dataset.]