1 Introduction
The exact string matching problem is the task of finding all occurrences of a pattern P of length m in a text T of length n. A brute-force solution compares P with every substring of T of length m, which takes O(nm) time. The Knuth-Morris-Pratt (KMP) algorithm [16] is well known as an algorithm that solves the problem in O(n + m) time. However, it is not efficient in practice, because it scans every position of the text at least once. The Boyer-Moore algorithm [3] is famous as an algorithm that performs string matching fast in practice by skipping many positions of the text, though it has O(nm) worst-case time complexity. Many efficient algorithms whose worst-case time complexity is the same as or even worse than that of the naive method have been proposed so far [15, 20, 18]. For example, the HASH algorithm [18] focuses on the substrings of length q (q-grams) of a pattern to obtain larger shift amounts. However, when such an algorithm is embedded in software and actually used, worst-case input strings may slow the software down considerably. Therefore, an algorithm that is fast both in theory and in practice is important. Franek et al. [13] proposed the Franek-Jennings-Smyth (FJS) algorithm, which is a hybrid of the KMP algorithm and the Sunday algorithm [20]. The worst-case time complexity of the FJS algorithm is O(n + m + σ), where σ is the alphabet size, and it works fast in practice. Kobayashi et al. [17] proposed an algorithm that improves the speed of the FJS algorithm by combining a method that extends the idea of the Quite-Naive algorithm [6]. This algorithm has the same worst-case time complexity as the FJS algorithm, and it runs faster than the FJS algorithm in many cases. The LWFR algorithm [5] is a practically fast algorithm that works in linear time. This algorithm uses a method of quickly recognizing substrings of a pattern using a hash function. See [12, 14] for recent surveys on exact string matching algorithms.
In this paper, we propose two new exact string matching algorithms, based on the HASH algorithm and the KMP algorithm, incorporating a new idea based on the distances between occurrences of q-grams with the same hash value. The time complexity of the preprocessing phase of the first algorithm is O(qm) and the search phase runs in O(qn) time. The second algorithm improves the theoretical complexity of the first one: its preprocessing and searching times are O(m) and O(n), respectively. Our algorithms are as fast as the state-of-the-art algorithms in many cases. Particularly, our algorithms work faster when a pattern frequently appears in a text.
This paper is organized as follows. Section 2 briefly reviews the KMP and HASH algorithms, which are the basis of the proposed algorithms. Section 3 proposes our algorithms. Section 4 shows experimental results comparing the proposed algorithms with several other algorithms using artificial and practical data. Section 5 draws our conclusions.
2 Preliminaries
2.1 Notation
Let Σ be a set of characters called an alphabet and σ = |Σ| be its size. Σ* denotes the set of all strings over Σ. The length of a string w ∈ Σ* is denoted by |w|. The empty string, denoted by ε, is the string of length zero. The i-th character of w is denoted by w[i] for each 1 ≤ i ≤ |w|. The substring of w starting at i and ending at j is denoted by w[i..j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i..j] = ε if i > j. A string w[1..i] is called a prefix of w and a string w[j..|w|] is called a suffix of w. In particular, a prefix u (resp. a suffix u) of w is a proper prefix (resp. proper suffix) of w when |u| < |w|. A string u is a border of w if u is both a prefix and a suffix of w. Note that the empty string is a border of any string. Moreover, a border u of w is a proper border if |u| < |w|. The length of the longest proper border of w[1..i] for 1 ≤ i ≤ |w| is given by

  Bord_w[i] = max{ k | w[1..k] is a proper border of w[1..i] }.
Throughout this paper, we assume is an integer alphabet.
2.2 The exact string matching problem
The exact string matching problem is defined as follows:
Input:  A text T ∈ Σ* of length n and a pattern P ∈ Σ* of length m.
Output:  All positions i such that T[i..i+m−1] = P, for 1 ≤ i ≤ n − m + 1.
We will use a text T of length n and a pattern P of length m throughout the paper.
Let us consider comparing T[i..i+m−1] and P. The naive method compares the characters of the two strings from left to right. When a character mismatch occurs, the pattern is shifted to the right by one character. That is, we then compare T[i+1..i+m] and P. This naive method takes O(nm) time for matching. There are a number of ideas for shifting the pattern further so that searching for P can be performed more quickly, using shift functions obtained by preprocessing the pattern.
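For concreteness, the naive method can be sketched as follows (0-based indices, unlike the paper's 1-based notation; the function name is ours):

```python
def naive_search(text, pattern):
    """Report all starting positions (0-based) where pattern occurs in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):          # align the pattern at position i
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare characters left to right
        if j == m:                      # all m characters matched
            occurrences.append(i)
    return occurrences
```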
2.3 Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt (KMP) algorithm [16] is well known as a string matching algorithm that has linear worst-case time complexity. When the KMP algorithm has confirmed that T[i..i+j−2] = P[1..j−1] and T[i+j−1] ≠ P[j] for some j, it shifts the pattern so that a suffix of T[i..i+j−2] matches a prefix of P and we do not have to rescan any part of T[i..i+j−2] again. That is, the pattern can be shifted by j − 1 − Bord_P[j−1] for j > 1. In addition, if P[Bord_P[j−1]+1] = P[j], the same mismatch will occur again after the shift. In order to avoid this kind of mismatch, we use the strong border array strBord_P given by

  strBord_P[i] = max{ k | P[1..k] is a proper border of P[1..i] and P[k+1] ≠ P[i+1] }

for 0 ≤ i < m, where strBord_P[i] = −1 if no such k exists, and strBord_P[m] = Bord_P[m].
The amount of the shift is given by

  kmp_shift(j) = j − 1 − strBord_P[j−1].

This function has the domain {1, 2, ..., m+1} and is implemented as an array in the algorithm. Hereafter, we identify some functions with the arrays that implement them.
Fact 1.
If T[i..i+j−2] = P[1..j−1] and T[i+j−1] ≠ P[j], then P ≠ T[i+s..i+s+m−1] holds for 0 ≤ s < kmp_shift(j). Moreover, there is no positive integer s < kmp_shift(j) such that P[1..j−1−s] = P[s+1..j−1] and P[j−s] ≠ P[j].
Note that if the algorithm has confirmed T[i..i+m−1] = P, the shift is given by kmp_shift(m+1) after reporting the occurrence of the pattern. Algorithm LABEL:alg:kmp_kmpshift shows a pseudocode to compute the array kmp_shift. Clearly, it can be done in O(m) time. By using kmp_shift, the KMP algorithm finds all occurrences of P in T in O(n) time.
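The pseudocode of Algorithm LABEL:alg:kmp_kmpshift is not reproduced here; a Python sketch of computing kmp_shift via the (strong) border array, consistent with the definitions above, might look as follows (array and function names are ours):

```python
def kmp_shift_table(pattern):
    """Shift amounts kmp_shift[1..m+1] (1-based; index 0 unused), derived
    from the strong border array; kmp_shift[j] is used on a mismatch at P[j]
    (j = m + 1 after reporting a full match)."""
    m = len(pattern)
    # bord[i]: length of the longest proper border of pattern[:i] (bord[0] = -1 sentinel)
    bord = [0] * (m + 1)
    bord[0] = -1
    k = -1
    for i in range(1, m + 1):
        while k >= 0 and pattern[k] != pattern[i - 1]:
            k = bord[k]
        k += 1
        bord[i] = k
    # strong border: skip borders whose next character would repeat the same mismatch
    str_bord = [0] * (m + 1)
    str_bord[0] = -1
    for i in range(1, m):
        b = bord[i]
        if b >= 0 and pattern[b] == pattern[i]:
            str_bord[i] = str_bord[b]   # same next character: fall back further
        else:
            str_bord[i] = b
    str_bord[m] = bord[m]
    return [0] + [j - 1 - str_bord[j - 1] for j in range(1, m + 2)]
```

For the pattern of Example 1 below, this reproduces the kmp_shift row of the table there.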
2.4 HASH algorithm
The HASH algorithm [18] is an adaptation of the Wu-Manber multiple string matching algorithm [21] to the single string matching problem. Before comparing T[i..i+m−1] and P, the HASH algorithm shifts the pattern so that the suffix q-gram T[i+m−q..i+m−1] of the text substring shall match the rightmost occurrence of a q-gram with the same hash value in the pattern. For practical efficiency, we use a hash function h, though it may result in aligning mismatching q-grams occasionally. The shift amount is given by shift_h(h(T[i+m−q..i+m−1])), where

  shift_h(v) = m − max({ i | h(P[i−q+1..i]) = v } ∪ { q − 1 }).
We repeatedly shift the pattern till the suffix q-grams of the pattern and the considered text substring have a matching hash value, in which case the shift amount will be 0. We then compare the characters of the pattern and the text substring from left to right. If a character mismatch occurs during the comparison, the pattern is shifted by

  m − max({ j | q ≤ j < m and h(P[j−q+1..j]) = h(P[m−q+1..m]) } ∪ { q − 1 }),   (1)

since we know that the q-gram suffixes of the pattern and the text substring have the same hash value. The time complexity of the preprocessing phase for computing the shift function is O(qm). The searching phase has O(n(m+q)) worst-case time complexity, which is worse than that of the naive method, but it works fast in practice.
Fact 2.
If s = shift_h(h(T[i+m−q..i+m−1])) > 0, then P ≠ T[i..i+m−1]. Moreover, there is no positive integer s′ < s such that h(P[m−s′−q+1..m−s′]) = h(T[i+m−q..i+m−1]).
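A sketch of the preprocessing that builds shift_h, assuming the shift-by-two-bits hash used in Example 1 below and a dictionary in place of a fixed-size hash table (function and variable names are ours):

```python
def grams_hash(s):
    """h(w) = 4^(q-1)*w[1] + ... + w[q], characters taken as code points."""
    v = 0
    for c in s:
        v = (v << 2) + ord(c)   # multiply by 4, add the next character
    return v

def build_hash_shift(pattern, q):
    """shift_h(v) = m minus the rightmost end position of a q-gram of the
    pattern with hash value v, defaulting to m - q + 1 for absent values."""
    m = len(pattern)
    default = m - q + 1
    table = {}
    # scan q-grams left to right so rightmost end positions overwrite earlier ones
    for end in range(q, m + 1):          # 1-based end positions q..m
        v = grams_hash(pattern[end - q:end])
        table[v] = m - end
    return table, default
```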
3 Proposed algorithms
3.1 DIST algorithm
Our proposed algorithm uses three kinds of shift functions. The first one, shift_h, is essentially the same as the one used in the HASH algorithm, except for the hashing function. The second one, d, is based on the distance of the closest occurrences of q-grams with the same hash value in the pattern. We involve kmp_shift as the third one to guarantee the linear-time behavior.
The second shift function d is defined for q ≤ i ≤ m by

  d(i) = i − max({ j | q ≤ j < i and h(P[j−q+1..j]) = h(P[i−q+1..i]) } ∪ { q − 1 }),

where h is the hash function. This function is a generalization of the shift (Eq. 1) used in the HASH algorithm. We have d(i) = i − j if the q-gram ending at position j and the one ending at position i in P have the same hash value, while no q-gram ending between them has the same value. If no q-gram ending before i has the same hash value, then d(i) = i − q + 1. By using this function, in the situation where the q-gram of P ending at position i is aligned with a text q-gram of the same hash value, when a mismatch occurs anywhere during the comparison, the pattern can be shifted by d(i).
Fact 3.
Suppose that the q-gram of P ending at position i is aligned with a q-gram of T having the same hash value. Then, after shifting the pattern by d(i), the q-gram of P ending at position i − d(i) is aligned with that text q-gram and has the same hash value, unless d(i) = i − q + 1. Moreover, there is no positive integer s < d(i) such that h(P[i−s−q+1..i−s]) = h(P[i−q+1..i]).
Those functions shift_h, d, and kmp_shift are computed in the preprocessing phase. Algorithms LABEL:alg:dist_compute_shift and LABEL:alg:dist_compute_darray compute the arrays shift_h and d, respectively.
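The pseudocode of Algorithm LABEL:alg:dist_compute_darray is not reproduced here; a sketch of computing the d array under the same assumptions as before (the hash function of Example 1, dictionary-based tables; names are ours) might look like this:

```python
def grams_hash(s):
    """h(w) = 4^(q-1)*w[1] + ... + w[q], characters taken as code points."""
    v = 0
    for c in s:
        v = (v << 2) + ord(c)
    return v

def build_dist(pattern, q):
    """d[i] = i minus the end position of the closest previous q-gram of the
    pattern with the same hash value, defaulting to i - q + 1 (1-based end
    positions i = q..m)."""
    m = len(pattern)
    last = {}                     # hash value -> most recent end position seen
    d = {}
    for i in range(q, m + 1):
        v = grams_hash(pattern[i - q:i])
        d[i] = i - last.get(v, q - 1)
        last[v] = i
    return d
```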
Figure 1 shows examples of shifting the pattern using shift_h and d.
Both functions shift_h and d shift the pattern using q-gram hash values, based on Facts 2 and 3, respectively. The latter can be used only when we know that the pattern and the text substring have aligned q-grams with the same hash value ending at pattern position i, and it may shift the pattern by at most i − q + 1, while the former can be used anytime and the maximum possible shift is m − q + 1. The advantage of the function d is in the computational cost. If we know that the premise of Fact 3 is satisfied, we can immediately perform the shift based on d, while computing shift_h for the concerned q-gram in the text requires computing its hash value, which is not as cheap. Our algorithm exploits this advantage of the new shifting function d.
Next, we explain our searching algorithm shown in Algorithm LABEL:alg:dist_search_algorithm. The searching phase is divided into three: the Alignment-phase, the Comparison-phase, and the KMP-phase. The goal of the Alignment-phase is to shift the pattern as far as possible without comparing each single character of the pattern and the text. The Alignment-phase ends when we align the pattern and a text substring that have (a) aligned q-grams with the same hash value and (b) the same first character. Suppose P and T[i..i+m−1] are aligned at the beginning of the Alignment-phase. If s = shift_h(h(T[i+m−q..i+m−1])) < m − q + 1, by shifting the pattern by s, we find the aligned q-grams of the same hash value. Namely, the q-gram of P ending at position m − s is aligned with the text q-gram ending at position i + m − 1. Otherwise, we shift the pattern by m − q + 1 repeatedly until we find aligned q-grams of the same hash value. When finding a position satisfying (a), i.e., when the q-gram of P ending at some position e is aligned with a text q-gram of the same hash value, we simply check the condition (b). If P[1] and the corresponding text character match, we move to the Comparison-phase. Otherwise, we safely shift the pattern by d(e). Note that although it is possible to use the function shift_h rather than d, the computation would be more expensive. After shifting the pattern by d(e), unless d(e) = e − q + 1, the pattern and the aligned text substring still satisfy (a). However, we do not repeat this shift any more, since the smaller e becomes, the smaller the expected shift amount will be. We simply restart the Alignment-phase. Once the conditions (a) and (b) are satisfied, we move on to the Comparison-phase.
In the Comparison-phase, we check the characters from P[1] to P[m] against the aligned text substring. If a character mismatch occurs at P[j] during the comparison, either the shift by kmp_shift(j) or by d(e) is possible. Therefore, we select the one where the resumption position of the character comparison goes further to the right after shifting the pattern. If the resumption position of the comparison is the same, we select the one with the larger shift amount. Recall that when the KMP algorithm finds that T[i..i+j−2] = P[1..j−1] and T[i+j−1] ≠ P[j], it resumes comparison from checking the match between P[j − kmp_shift(j)] and T[i+j−1] if kmp_shift(j) < j, and between P[1] and T[i+j] if kmp_shift(j) = j. On the other hand, if we shift the pattern by d(e), we simply resume matching P[1] and T[i+d(e)]. Therefore, we should use d(e) rather than kmp_shift(j) when the resumption position of the d-shift lies further to the right. Summarizing the discussion, let r = j − 1 if kmp_shift(j) < j and r = j otherwise. We shift the pattern by kmp_shift(j) if r > d(e), or if r = d(e) and kmp_shift(j) ≥ d(e). Otherwise, we shift the pattern by d(e). At this moment, we may have a "partial match" between the pattern and the aligned text substring. If we have performed the KMP-shift with kmp_shift(j) < j − 1, then we have a match between the non-empty prefixes of the pattern and the aligned text substring of length j − 1 − kmp_shift(j). In this case, we go to the KMP-phase, where we simply perform the KMP algorithm. The KMP-phase prevents the character comparison position from returning to the left and guarantees the linear-time behavior of our algorithm. If we have no partial match, we return to the Alignment-phase.

Theorem 1.
The worst-case time complexity of the DIST algorithm is O(q(n + m)).
Proof.
Since the proposed algorithm uses Fact 1 on the KMP algorithm to prevent the character comparison position from going back to the left, the number of character comparisons is at most 2n, like in the KMP algorithm. In addition, the hash value of a q-gram is calculated to perform the shift using shift_h. Since each hash value calculation requires O(q) time and it is performed at a maximum of n positions in the text, the hash value calculation takes O(qn) time in total. Therefore, the worst-case time complexity of the searching phase is O(qn). In the preprocessing, O(qm) time is required to calculate the hash values of the q-grams at the m − q + 1 positions of the pattern.
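To illustrate the control flow, here is a deliberately simplified sketch of the searching phase: it performs the Alignment-phase with shift_h, checks condition (b), runs the Comparison-phase, and always shifts by d keyed on the suffix q-gram (e = m). The kmp_shift alternative and the KMP-phase, which give the linear comparison bound, are omitted, so this is not the full DIST algorithm. Indices are 0-based and all names are ours; it assumes q ≤ m.

```python
def grams_hash(s):
    v = 0
    for c in s:
        v = (v << 2) + ord(c)
    return v

def dist_search_sketch(text, pattern, q):
    """Simplified DIST-style search: Alignment-phase, condition (b),
    Comparison-phase, and d-based mismatch shifts (suffix q-gram only)."""
    n, m = len(text), len(pattern)
    table, default = {}, m - q + 1   # shift_h table and its default value
    last, d = {}, {}                 # helpers for the d array
    for end in range(q, m + 1):      # 1-based end positions of pattern q-grams
        v = grams_hash(pattern[end - q:end])
        table[v] = m - end           # rightmost occurrence wins
        d[end] = end - last.get(v, q - 1)
        last[v] = end
    occ, i = [], 0
    while i <= n - m:
        s = table.get(grams_hash(text[i + m - q:i + m]), default)
        if s > 0:                    # Alignment-phase (safe by Fact 2)
            i += s
            continue
        if pattern[0] == text[i]:    # condition (b)
            j = 0                    # Comparison-phase
            while j < m and pattern[j] == text[i + j]:
                j += 1
            if j == m:
                occ.append(i)
        i += d[m]                    # safe shift by Fact 3, keyed on suffix q-gram
    return occ
```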
Example 1.
Let P = abaabbaaa and q = 3. The shift functions kmp_shift, d, and shift_h are shown below. The hash values are calculated by h(w) = 4^(q−1)·w[1] + 4^(q−2)·w[2] + ... + w[q], treating each character as its ASCII value, e.g., a is calculated as 97, so h(aba) = 16·97 + 4·98 + 97 = 2041.
 i             1   2   3   4   5   6   7   8   9   10
 P[i]          a   b   a   a   b   b   a   a   a
 d(i)                  1   2   3   4   5   4   7
 kmp_shift(i)  1   1   3   2   4   3   7   6   7   8
 w              aba   baa   aab   abb   bba   aaa   others
 h(w)           2041  2053  2038  2042  2057  2037
 shift_h(h(w))  6     1     4     3     2     0     7
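The hash values in the table can be reproduced with a small helper implementing h(w) = 4^(q−1)·w[1] + ... + w[q] (the function name is ours):

```python
def grams_hash(s):
    """Hash of a q-gram: multiply by 4 (shift by two bits) per character,
    taking each character as its code point ('a' = 97, 'b' = 98)."""
    v = 0
    for c in s:
        v = (v << 2) + ord(c)
    return v
```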
Figure 2 illustrates an example run of the DIST algorithm (q = 3) for finding P = abaabbaaa in T = abbaabbaababbabbaaabaabaabbaaa.
Attempt 1: We shift the pattern by shift_h(h(T[7..9])) = shift_h(h(baa)) = 1 character. Since the position of the pattern q-gram aligned by this shift is 8, e is updated to 8.

Attempt 2: We check whether the first character of the pattern matches the character of the corresponding text. P[1] = a ≠ b = T[2], so the pattern is shifted by d(8) = 4.

Attempt 3: We shift the pattern by shift_h(h(T[12..14])) = shift_h(h(bba)) = 2 and update e to 7.

Attempt 4: We check whether the first character of the pattern matches the character of the corresponding text. From P[1] = a = T[8], we compare the characters of P and T[8..16] from left to right. Since P[2] = b ≠ a = T[9], the pattern is shifted by kmp_shift(2) = 1 or d(7) = 5. The d-shift lets the comparison resume further to the right. Therefore, we shift the pattern by d(7) = 5.

Attempt 5: We shift the pattern by shift_h(h(T[19..21])) = shift_h(h(aba)) = 6 and e is updated to 3.

Attempt 6: We check whether the first character of the pattern matches the character of the corresponding text. P[1] = a = T[19] is satisfied, so the characters of P and T[19..27] are compared from left to right. Since P[6] = b ≠ a = T[24], the pattern is shifted by kmp_shift(6) = 3 or d(3) = 1. The KMP-shift lets the comparison resume further to the right, so the pattern is shifted by kmp_shift(6) = 3.

Attempt 7: Attempt 6 shows that T[22..23] = P[1..2], that is, there is a partial match, so we continue the comparison of P[3..9] and T[24..30]. Since T[22..30] = P, the pattern occurrence position 22 is reported.
T = abbaabbaababbabbaaabaabaabbaaa (positions 1..30). The pattern P = abaabbaaa is aligned at position 1 in Attempt 1, position 2 in Attempt 2, position 6 in Attempt 3, position 8 in Attempt 4, position 13 in Attempt 5, position 19 in Attempt 6, and position 22 in Attempt 7.
3.2 LDIST algorithm
The LDIST algorithm is a linear-time version of the DIST algorithm. In the DIST algorithm, some inputs require O(qn) time for the searching phase because hash values of Θ(n) q-grams of the text are calculated. Since the hash function defined in Section 3.1 is a rolling hash, if the hash value of w[i..i+q−1] has already been obtained for a string w, the hash value of w[i+1..i+q] can be computed in constant time by

  h(w[i+1..i+q]) = 4·(h(w[i..i+q−1]) − 4^(q−1)·w[i]) + w[i+q].

The LDIST algorithm modifies Line LABEL:ln:dist_hash of Algorithm LABEL:alg:dist_search_algorithm so that we calculate the hash value of a q-gram from the previously calculated value of the other q-gram in this incremental way, if they overlap. Similarly, the time complexity of the preprocessing phase can be reduced.
Theorem 2.
The worst-case time complexity of the LDIST algorithm is O(n + m).
Proof.
Like in the DIST algorithm, the number of character comparisons is at most 2n. In the calculation of the hash value of a q-gram, if the q-gram whose hash value was calculated one step before overlaps the q-gram whose hash value is to be calculated, the incremental update is performed using the rolling hash in constant time; otherwise, the hash value is computed from scratch in O(q) time, but then the pattern has been shifted by at least q positions, so the amortized cost is still constant. Therefore, the calculation of the hash values of q-grams takes O(n) time in total. Thus, the worst-case time complexity of matching is O(n). In the preprocessing, we calculate the hash values of the q-grams of the pattern in the same way, so they can be calculated in O(m) time.
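The constant-time rolling update can be sketched as follows (assuming the same hash form as in Example 1; function names are ours):

```python
def grams_hash(s):
    """h(w) = 4^(q-1)*w[1] + ... + w[q], characters taken as code points."""
    v = 0
    for c in s:
        v = (v << 2) + ord(c)
    return v

def roll(prev, old_char, new_char, q):
    """Hash of the q-gram one position to the right, in O(1) time:
    drop the leftmost character, shift by two bits, add the new character."""
    return ((prev - (ord(old_char) << 2 * (q - 1))) << 2) + ord(new_char)
```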
4 Experiments
In this section, we compare the execution times of the proposed algorithms with those of other algorithms. We used the following algorithms:

- BNDM: the Backward Nondeterministic DAWG Matching algorithm using q-grams [19]
- SBNDM: the simplified version of the Backward Nondeterministic DAWG Matching algorithm using q-grams [1]
- KBNDM: the factorized variant of the BNDM algorithm [4]
- BSDM: the Backward SNR DAWG Matching algorithm using condensed alphabets with groups of q characters [10]
- FJS: the Franek-Jennings-Smyth algorithm [13]
- FJS': a modification of the FJS algorithm [17]
- IOM: the Improved Occurrence Matcher [8]
- WOM: the Worst Occurrence Matcher [8]
- LWFR: the Linear-Weak-Factor-Recognition algorithm [5], implemented with a chained loop
- DIST: our algorithm proposed in Section 3.1 (Algorithm LABEL:alg:dist_search_algorithm)
- LDIST: our algorithm proposed in Section 3.2
[Tables: running times for pattern lengths m = 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024, comparing BNDM, SBNDM, KBNDM, BSDM, FJS, FJS', HASH, FS, IOM, WOM, LWFR, DIST, and LDIST on three datasets. The timing values were lost in extraction.]