Edit similarity join is a fundamental problem in the database literature. In this problem we are given a set of strings S and a distance threshold K, and asked to output all pairs of strings (x, y) such that ED(x, y) ≤ K, where ED is the edit distance function, defined as the minimum number of insertions, deletions and substitutions needed to transform one string into the other. There is a long line of research on edit similarity joins [9, 2, 4, 6, 11, 22, 19, 13, 21, 12, 20]. We will discuss a few state-of-the-art algorithms shortly, and cover more related work in Section 6.
A major challenge for most existing algorithms, as pointed out by recent work, is that they do not scale well to long strings and large edit thresholds. Long strings and large thresholds are critical for applications involving long sequence data such as big documents and DNA sequences, where a small threshold may just give zero output. For example, in genome sequence assembly, third generation sequencing technology such as single molecule real time sequencing (SMRT) generates reads of 1,000-100,000 bps long with 12-18% sequencing errors (i.e., percentage of insertions, deletions and substitutions). A large threshold was also identified as the main challenge in a recent string similarity search/join competition, where it was reported that “an error rate of 20%-25% pushes today’s techniques to the limit”.
Different from previous algorithms, which are deterministic and return exact answers, a recently proposed randomized algorithm named EmbedJoin is more efficient on long strings and large thresholds. However, the accuracy of EmbedJoin is only 95%-99% on a number of real-world datasets tested. This imperfect accuracy is inherent to EmbedJoin, as we shall explain shortly. The main question we are going to address in this paper is:
Can we solve edit similarity joins efficiently on long strings and large edit thresholds while achieving perfect accuracy?
We propose a novel algorithm named MinJoin to address the above question. The high level framework of MinJoin is simple: it first partitions each string into a set of substrings, and then uses hash join on these substrings to find all pairs of strings that share at least one common substring. At the end a verification step is used to remove all false positives.
We design several string partition schemes using local hash minimums. The idea of our partition schemes is as follows: we first assign each letter s[i] in the string s a value, which is a random hash value of the q-gram starting from s[i]. We then determine the anchors of string s using the following strategy: a letter s[i] is an anchor if and only if its value is the smallest among all letters in a certain neighborhood of s[i]. We will design different partition schemes with different neighborhood selections. At the end we simply partition s at all of its anchors. Via a rigorous mathematical analysis we show that under our partition schemes, with high probability, all pairs of strings with edit distance at most K will share at least one common partition (Theorem 1).
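The scheme above can be sketched in a few lines of Python. In this toy version the function name, the fixed radius r, and the use of Python's built-in hash are our own illustrative simplifications; the actual algorithm picks the radius per string and uses a random hash function.

```python
def partition(s, q=3, r=2, seed=0):
    """Cut string s at the strict local hash minimums of its q-grams.

    Position i gets the hash of the q-gram starting there; i is an
    anchor iff its value is strictly smaller than every value within
    distance r of i.
    """
    m = len(s) - q + 1                      # number of q-grams
    scores = [hash((seed, s[i:i + q])) for i in range(m)]
    anchors = [i for i in range(1, m)       # skip 0 to avoid an empty piece
               if all(scores[i] < scores[j]
                      for j in range(max(0, i - r), min(m, i + r + 1))
                      if j != i)]
    cuts = [0] + anchors + [len(s)]
    return [s[a:b] for a, b in zip(cuts, cuts[1:])]
```

Because an anchor depends only on a small neighborhood, two strings that share a long substring tend to place anchors at the same relative positions inside it, which is what makes independent partitioning work.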
We have verified the effectiveness of MinJoin by an extensive set of experiments. Our results show that MinJoin is able to achieve perfect accuracy on all tested datasets. Moreover, MinJoin is faster than all existing exact algorithms by orders of magnitude on datasets of long strings and large edit thresholds, and is also faster than EmbedJoin by a good margin.
Previous Work and the Advantages of MinJoin
Many of the existing algorithms for edit similarity joins also follow the string partition framework. The performance of such an algorithm is largely determined by the number of partitions generated for each string, and the number of queries made to the indices (e.g., hash tables) to search for similar strings. We discuss several state-of-the-art algorithms according to recent experimental studies.
QChunk is an exact edit similarity join algorithm based on string partition. QChunk first fixes a global order O of q-grams. It partitions each string into a set of chunks with starting positions 1, q+1, 2q+1, ..., and then stores the first K+1 chunks (according to the order O) in a hash table. Next, for each string the algorithm queries the hash table with its first qK+1 q-grams according to O to check if there is any match. Alternatively, for each string we can store the first qK+1 q-grams in the hash table, and make queries with the first K+1 chunks.
PassJoin is another exact algorithm based on string partition. The algorithm partitions each string into K+1 equal-length segments, and records the i-th segment into an inverted index L_i. Next, for each string the algorithm queries some of the inverted indices to find similar strings; the number of queries made for each string is O(K^3).
VChunk is the algorithm that is closest to MinJoin among all algorithms that we are aware of. In VChunk each string is partitioned into at least K+1 chunks of possibly different lengths, determined by a chunk boundary dictionary (CBD). More precisely, each string is cut at the positions of appearances of each word in the CBD to obtain its chunks. The CBD is data dependent and the optimal one is NP-hard to compute. The authors of VChunk proposed a greedy algorithm for computing a CBD in time superlinear in the input size, where the input consists of n strings of maximum length N.
The recently proposed algorithm EmbedJoin uses a very different approach. EmbedJoin first embeds each string from the edit distance metric space to the Hamming distance metric space, translating the original problem to finding all pairs of strings that are close under Hamming distance. It then uses Locality Sensitive Hashing to compute (approximate) similarity joins in the Hamming space. However, the embedding algorithm employed by EmbedJoin has a worst-case distance distortion of O(K), which can be very large. Although in practice the distortion is much smaller, it still contributes a non-negligible percentage of false negatives, which prevents perfect accuracy.
Compared with these existing algorithms, MinJoin has the following major advantages.
For each string, MinJoin only generates O(K) partitions and makes the same number of queries (for searching similar strings), which is significantly smaller than the corresponding quantities of QChunk and PassJoin.
MinJoin can compute the partitions of all strings in time linear in the input size, which is even faster than the computation of the CBD in VChunk.
MinJoin is able to reach perfect accuracy on all tested datasets, compared with the 95%-99% accuracy of EmbedJoin.
We have listed a set of notations to be used in this paper in Table 1. We will make use of the following tools in probability.
|K | edit distance threshold|
|S | set of input strings|
|s_i | i-th string in S|
|n | number of input strings, i.e., |S||
||s| | length of string s|
|s[i..j] | substring of s starting from the i-th letter to the j-th letter|
|N | maximum string length|
|Σ | alphabet of strings in S|
|q | length of q-gram|
|h | random hash function|
|k | number of targeted partitions|
|r | radius for computing local minimum|
Lemma 1 (Chebyshev’s inequality)
For any random variable X and a constant a > 0, we have Pr[|X − E[X]| ≥ a] ≤ Var[X]/a².
Lemma 2 (Chernoff bound)
Let X_1, ..., X_n be independent random variables taking values in [0, 1], and let X = X_1 + ··· + X_n. For any 0 < δ < 1, we have Pr[|X − E[X]| ≥ δE[X]] ≤ 2e^{−δ²E[X]/3}.
This paper is organized as follows: We first describe an oblivious local hash minimum based string partition scheme (Section 2), and use it as a preprocessing step in the MinJoin algorithm (Section 3). We then introduce an alternative string partition scheme which is also based on local hash minimums, but whose partition process is adaptive (Section 4). After that we present our experimental studies (Section 5). Finally we cover more related work (Section 6).
2 A Local Hash Minimum Based String Partition Scheme
We start by giving some high level ideas of our partition scheme. As mentioned, in MinJoin we first partition each string into a set of substrings, and then find pairs of strings with at least one common partition as candidates for verification. Consider a pair of strings (x, y) with edit distance ED(x, y) ≤ K, and let A be the optimal alignment from x to y, where A(i) = j means that either x[i] = y[j] or x[i] is substituted by y[j] in the optimal transformation, and A(i) = ⊥ means that x[i] is deleted in the optimal transformation. If we pick any K indices i_1 < ··· < i_K with A(i_t) ≠ ⊥ for all t, partition x at i_1, ..., i_K into K+1 substrings, and partition y at A(i_1), ..., A(i_K) into K+1 substrings, then, since the at most K edits fall into at most K of the K+1 aligned pieces, by the pigeonhole principle x and y must share at least one common partition.
Of course obtaining an optimal alignment between x and y before the partition is unrealistic. Our goal is to partition each string independently, while still guaranteeing that with a good probability, any pair of similar strings will share at least one common partition. The main idea of our partition scheme is as follows: for a string s, we first hash all its substrings of length q (i.e., all q-grams of s) into values. Now we have effectively transformed s into an array H of size |s| − q + 1, with each coordinate taking a hash value. We call a coordinate i in H a local minimum if its value is strictly smaller than all other coordinates within a distance of r (for a prespecified parameter r). We call the corresponding i-th letter in string s an anchor. We will use these anchors to partition s into a set of substrings. For convenience, in the rest of the paper we also call a local minimum coordinate in H an anchor.
We will show that for a pair of strings (x, y), if they share a common substring that is long enough, then there must be at least two positions i < j in that substring such that the corresponding letters are adjacent anchors in both x and y, which means that if we use anchors to partition x and y, then they must share at least one common partition. On the other hand, we know that for two strings with edit distance at most K, they must share at least one common substring of length at least (|x| − K)/(K + 1): cutting x at the at most K edited positions leaves unedited runs that cover at least |x| − K letters, so the longest such run, which also appears in y, has at least this length. Thus by properly choosing the distance parameter r, we can guarantee that two similar strings will share at least one common partition. We note that in the actual partition scheme the parameter r may be different for strings of different lengths, which will slightly complicate the above argument.
2.1 The Algorithm
We present our partition algorithm in Algorithm 1 and Algorithm 2. Let us briefly describe them in words. Algorithm 1 first calls Algorithm 2 to obtain all anchors of the input string s, and then cuts s at each anchor into a set of substrings. To compute all anchors, for each i = 1, ..., |s| − q + 1, Algorithm 2 hashes the i-th q-gram of s to a value H[i]. It then finds all the anchors using the array H; the naive implementation is to simply compare each H[i] with all its neighbors H[i − r], ..., H[i + r]. We will discuss a more efficient way to compute all the anchors in the analysis.
Before analyzing Algorithm 1 we first give a running example. Table 2 presents a collection of input strings s_1, ..., s_5 and their lengths. We want to find all pairs of strings with edit distance at most the threshold K. Table 3 presents the hash values of all 3-grams under the hash function h. Table 4 presents the partitions of the strings obtained by Algorithm 2. We also calculate the neighbor distance r for each string based on its string length and the parameters.
Consider string s_1 as an example: its 6-th 3-gram “CTA” has a smaller hash value than all its neighbors within distance r = 2 (i.e., “TGC”, “GCT”, “TAA”, “AAC”). Thus this occurrence of “CTA” is selected as an anchor of s_1. The same holds for the 14-th 3-gram “CTA”. We then partition s_1 into ACGTG, CTAACGTG, CTAACGTA. We next find that s_1 and s_2 share a common partition “CTAACGTG”, s_3 and s_4 share a common partition “TCGAAT”, and s_3, s_4, s_5 pairwise share a common partition “CGTCGAAT”, which gives the candidate pairs (s_1, s_2), (s_3, s_4), (s_3, s_5), (s_4, s_5). After computing the exact edit distance of each pair, we output the pairs whose edit distances are no more than K as the final answer.
|ID | Partitions of string|
|s_1 | ACGTG, CTAACGTG, CTAACGTA|
|s_2 | AAACGTG, CTAACGTG, CTAACCT|
|s_3 | TCGAAT, CGTCGAAT, CGTCGAA|
|s_4 | TCGAAT, CGTCGAAT, CGTGGAA|
|s_5 | GTGCGAAT, CGTCGAAT, CGTCG|
We would like to discuss two items in more detail. First, we require the value of an anchor in the hash array to be strictly smaller than its neighbors. The purpose of this is to reduce the number of false positives generated by periodic substrings with short periods; false positives increase the running time of the verification step of the MinJoin algorithm. In real-world datasets, periodic substrings are often caused by systematic errors, and may be shared among different strings. For example, consider a periodic substring such as “…AAAAAAAA…” produced by sequencing errors on genome data: if we allowed the value of an anchor to be equal to its neighbors, then we might have many anchors in this substring. Consequently, two strings both containing such a substring would be considered a candidate pair even though they are very different elsewhere.
Second, we use a different neighborhood distance r for strings of different lengths. More precisely, we set r to be proportional to |s|/k, where k is an input parameter standing for the number of targeted partitions. The purpose of doing this, instead of choosing a fixed r for all strings, is again to reduce false positives. Indeed, if we chose the same r for all strings, then long strings would generate many partitions, since in order to achieve perfect accuracy we cannot set r to be too large in the presence of short strings. Consequently, the large number of partitions generated by long strings would contribute many false positives. This is in contrast to VChunk, which cuts the string whenever it finds a word of the CBD appearing in the string; consequently two strings of very different lengths but sharing a relatively long substring are likely to be considered a candidate pair, producing a false positive for the verification.
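To make the expected number of anchors come out to roughly k, the radius can be derived from the probability that a position is the minimum of its window; the following sketch shows one such derivation (the exact rounding is our assumption, not the paper's formula):

```python
def radius(length, q, k):
    """Neighborhood radius r for a string of the given length.

    A position is the minimum of its (2r+1)-window with probability
    1/(2r+1), so E[#anchors] = (length - q + 1)/(2r + 1); solving
    E[#anchors] = k for r gives the expression below.
    """
    m = length - q + 1            # number of q-grams
    return max(1, (m // k - 1) // 2)
```

Longer strings thus get proportionally larger radii, which keeps the number of partitions near k regardless of string length.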
2.2 The Analysis
Properties of the Partition
We now analyze the properties of Algorithm 1. Our goal is to understand how many partitions Algorithm 1 will generate, and what the probability is for two similar strings to share a common partition. To keep the analysis clean, we assume that in any r-neighborhood of the array H all the coordinates are distinct, which is true if (1) all corresponding q-grams are different, and (2) the hash function h does not produce a collision when applied to q-grams. The latter can easily be satisfied if we keep an O(log N)-bit precision (N is the maximum string length) in the range of h, in which case there is no hash collision with high probability. We emphasize that this assumption is only used for the convenience of the analysis, and Algorithm 1 works without this constraint.
The following lemma states that the number of anchors produced by Algorithm 2 is concentrated around k, the number of targeted partitions.
Given an input string s and a parameter k, for any ε > 0, the number of anchors generated by Algorithm 2, denoted by Z, satisfies Pr[|Z − E[Z]| ≥ εE[Z]] = O(1/(ε²k)), where E[Z] ≈ k.
Consider the array H constructed in Algorithm 2; H[i] is the hash value of the i-th q-gram of s. Let m = |s| − q + 1. For i = 1, ..., m, define a random variable Z_i whose value is 1 if H[i] is the smallest coordinate in the window W_i = [i − r, i + r], and 0 otherwise. Let Z = Z_1 + ··· + Z_m, which is the total number of anchors generated by Algorithm 2. We now analyze the random variable Z. Since each coordinate in a window is equally likely to be its minimum, E[Z_i] = 1/(2r + 1) and E[Z] = m/(2r + 1).
We next compute the variance:

Var[Z] = Σ_i Var[Z_i] + Σ_{i≠j} Cov(Z_i, Z_j).   (2)

We compute the two terms of (2) separately. For the first term, Var[Z_i] ≤ E[Z_i²] = E[Z_i] = 1/(2r + 1), and thus Σ_i Var[Z_i] ≤ m/(2r + 1). For the second term of (2), by the definition of the covariance we have Cov(Z_i, Z_j) = E[Z_iZ_j] − E[Z_i]E[Z_j].
We analyze Cov(Z_i, Z_j) in three cases, according to d = |i − j|.
d > 2r. It is easy to see that in this case Z_i and Z_j are independent, since their corresponding windows W_i and W_j are disjoint. We thus have E[Z_iZ_j] = E[Z_i]E[Z_j], and consequently Cov(Z_i, Z_j) = 0.
d ≤ r. In this case, i is inside the window W_j, and symmetrically j is inside the window W_i. Thus if Z_i = 1 then we must have Z_j = 0, and if Z_j = 1 then we must have Z_i = 0. Therefore E[Z_iZ_j] = 0, and consequently Cov(Z_i, Z_j) = −E[Z_i]E[Z_j] ≤ 0.
r < d ≤ 2r. The analysis for this case is a bit more complicated. Consider the two windows W_i and W_j, which overlap. We divide their union into three areas; see Figure 1 for an illustration. Area 2 denotes the intersection of the two windows, and Area 1 and Area 3 denote the coordinates that are only in W_i and only in W_j, respectively. It is easy to see that the numbers of coordinates in Area 1 and Area 3 are equal; let c denote this number.
We thus only need to analyze E[Z_iZ_j]. Define a random variable U such that U = 1 if the central coordinate of W_i (i.e., H[i]) is smaller than all coordinates in Area 3, and U = 0 otherwise. We have

E[Z_iZ_j] = Pr[Z_i = 1, Z_j = 1, U = 1] + Pr[Z_i = 1, Z_j = 1, U = 0].
Note that Z_i = 1 together with U = 1 implies that the central coordinate of W_i is smaller than all coordinates in W_i ∪ W_j, which, however, does not give any information about the relative order among the coordinates in W_j. We thus have

Pr[Z_i = 1, Z_j = 1, U = 1] = Pr[Z_i = 1, U = 1] · 1/(2r + 1) = 1/((2r + 1 + c)(2r + 1)).
On the other hand, Z_i = 1 together with U = 0 implies that the central coordinate of W_i is smaller than all coordinates in Area 1 and Area 2, and is larger than some coordinate in Area 3. We thus know that the minimum coordinate of W_j must lie in Area 3. Therefore Z_j = 1 if and only if the central coordinate of W_j is smaller than all other coordinates in Area 3. We get

Pr[Z_i = 1, Z_j = 1, U = 0] = Pr[Z_i = 1, U = 0] · 1/c = (1/(2r + 1) − 1/(2r + 1 + c)) · 1/c = 1/((2r + 1)(2r + 1 + c)).
Consequently we have

E[Z_iZ_j] = 2/((2r + 1)(2r + 1 + c)), and thus Cov(Z_i, Z_j) = 2/((2r + 1)(2r + 1 + c)) − 1/(2r + 1)² ≤ 1/(2r + 1)².
Summing up, since for each i there are at most 2r indices j with r < |i − j| ≤ 2r, we have Var[Z] ≤ m/(2r + 1) + m · 2r/(2r + 1)² ≤ 2E[Z], and the lemma follows from Chebyshev’s inequality (Lemma 1).
We have empirically verified the concentration result in Lemma 3 on two real-world datasets (to be introduced in Section 5); see Figure 2. It is clear that the numbers of partitions Algorithm 1 generates are tightly concentrated around the number of targeted partitions k.
We next analyze another key property of our local minimum based partition: Given two similar strings, what is the probability that they share a common partition?
For two strings x, y with ED(x, y) ≤ K, let P_x and P_y be the sets of partitions output by Algorithm 1 (setting k = cK for a sufficiently large constant c) on x and y, respectively. Assume w.l.o.g. that |x| ≥ |y|. The probability that P_x and P_y share a common partition is at least 0.98.
Since ED(x, y) ≤ K, we have ||x| − |y|| ≤ K, and x and y must share a common substring of length at least (|x| − K)/(K + 1) in the optimal alignment.
Applying the concentration bound of Lemma 3 to this common substring, whose length is at least (|x| − K)/(K + 1) = Ω(|x|/K) while the expected distance between adjacent anchors is O(|x|/k) = O(|x|/(cK)), we have that with probability at least 0.99 there are at least four anchors on the common substring (when the constant c is chosen large enough).
Let a_1 < a_2 < a_3 < a_4 be four anchors on the common substring when processing x using Algorithm 2, and let r_x and r_y be the neighborhood radii used for x and y. Since ||x| − |y|| ≤ K and the radius is proportional to the string length, it holds that |r_x − r_y| is small. In the case that r_x = r_y, a_2 and a_3 must also be anchors when processing y using Algorithm 2, since an anchor is fully determined by a neighborhood of size 2r + 1, and the neighborhoods of a_2 and a_3 lie entirely inside the common substring.
For the case when r_x ≠ r_y, w.l.o.g. assume that r_y > r_x. Now the probability that a_2 is still an anchor when processing y, given the fact that a_2 is an anchor when processing x, is at least (2r_x + 1)/(2r_y + 1); the same argument holds for a_3. Thus with probability at least 0.99 (note that (2r_x + 1)/(2r_y + 1) is close to 1, given ||x| − |y|| ≤ K and the choice of the radius), a_2 and a_3 are also anchors when processing y.
Finally, observe that once x and y share two adjacent anchors a_2 and a_3, they must share at least one common partition (namely the substring between a_2 and a_3). ∎
Though the success probability in Lemma 4 is only 0.98, and it holds only for each pair of similar strings, we can easily boost it to high probability for all pairs of similar strings using parallel repetitions. We can repeat the partition process for each string t times using independent randomness, and then take the union of all the partitions of the string. Now for each pair of similar strings, the probability that they share a common partition is at least 1 − (1/50)^t. We then use a union bound over the at most n² pairs of similar strings, and get that the probability that all pairs of similar strings share at least one common partition is at least 1 − n² · 50^{−t}. We note that in our experiments we do not need this boosting procedure, since a single run of the partition process already achieves perfect accuracy.
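For concreteness, the number of repetitions demanded by this union-bound argument can be computed directly; the following sketch uses the per-pair failure probability 0.02 from Lemma 4 (the function name is ours):

```python
import math

def repetitions_needed(n, delta, p_fail=0.02):
    """Smallest t such that n**2 * p_fail**t <= delta.

    With t independent partition runs, a similar pair fails to share a
    common partition in every run with probability at most p_fail**t;
    a union bound over at most n**2 pairs gives total failure
    probability n**2 * p_fail**t.
    """
    return math.ceil(math.log(n ** 2 / delta) / math.log(1 / p_fail))
```

Since the failure probability decays geometrically in t, only t = O(log n) repetitions are needed even for very large inputs.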
If we apply Algorithm 1, augmented by the parallel repetition discussed above with t = O(log n) repetitions, to all input strings, then with probability 1 − 1/poly(n), all pairs of strings with edit distance at most K will share at least one common partition. The running time of the algorithm is O(log n) times the input size, and the space needed is linear in the input size.
The correctness follows directly from Lemma 4 and the discussion of parallel repetition above. In the rest of the proof we focus on the time and space. In fact, to show the claimed time and space usage it suffices to show that the time and space for partitioning one string s (by Algorithm 1) are linear in the string length |s|.
The running time of Algorithm 1 is dominated by that of its subroutine Algorithm 2. The hash values of all q-grams of s can be computed by the Rabin-Karp algorithm (the rolling hash) in O(|s|) time. Instead of implementing Lines 9-20 of Algorithm 2 in the straightforward manner, we can use the data structure of the sliding window sampling algorithm to find all anchors in O(|s|) time. Roughly speaking, this data structure always maintains a set Q of items of small size when scanning through the string s, such that for any sliding window, the coordinate with the minimum hash value in the window is always stored in Q. We can make use of two such data structures, for the windows [i − r, i − 1] and [i + 1, i + r], where i is the current coordinate that we are investigating. Let Q_1 and Q_2 be the sets maintained by these two data structures. Now to decide whether i is an anchor or not, we can simply check whether H[i] is smaller than the minimum values in Q_1 and Q_2, which can be done in O(1) amortized time.
Clearly, the space usage of Algorithm 1 is also O(|s|). ∎
We note that the straightforward way of computing all anchors in Lines 9-20 of Algorithm 2 takes time O(r|s|). In our experiments r is of order |s|/k, so this is O(|s|²/k). In practice, the linear time algorithm described in the proof of Theorem 1 gives little advantage when r is not very large. This is mainly because the linear time algorithm involves memory writes (to update Q_1 and Q_2 in the streaming fashion), which are much slower than the pure memory reads in Lines 9-20 of Algorithm 2.
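A linear-time anchor scan can also be realized with a standard monotonic-deque sliding-window minimum instead of the two-set data structure from the proof; the following sketch (our own variant, assuming, as in the analysis, that hash values within a window are distinct) reports the strict local minima of radius r in one pass:

```python
from collections import deque

def sliding_window_min_indices(H, r):
    """Return the anchor indices of H: positions i whose value is the
    minimum of the window H[i-r .. i+r].

    A deque of indices with increasing H values is maintained, so every
    index is pushed and popped at most once and the scan is O(len(H)).
    Assumes distinct values within any window (as in the analysis).
    """
    anchors = []
    dq = deque()                 # indices whose H values are increasing
    for j in range(len(H) + r):
        if j < len(H):
            while dq and H[dq[-1]] >= H[j]:
                dq.pop()         # dominated by the new, smaller value
            dq.append(j)
        i = j - r                # center whose window [i-r, i+r] is complete
        if i < 0:
            continue
        while dq[0] < i - r:     # drop indices that left the window
            dq.popleft()
        if dq[0] == i:           # H[i] is the window minimum
            anchors.append(i)
    return anchors
```

The deque only shrinks from the front as the window slides, which is why the amortized cost per position is constant.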
3 The MinJoin Algorithm
We now present our main algorithm MinJoin, depicted in Algorithm 3. We briefly explain it in words below.
The MinJoin algorithm has three stages: initialization (Lines 3-6), join and filtering (Lines 7-22) and verification (Lines 23-27). In the first stage, we initialize an empty set C for candidate pairs and an empty hash table T, generate a random hash function h, and sort all strings according to their lengths for the pruning.
In the join and filtering stage, we compute the partitions for each input string using Algorithm 1. For each partition (p, l) of a string s, which refers to the substring of s with length l such that p is the index of its first character in s, we find all tuples in the corresponding bucket of the hash table T (that is, we perform a hash join). We use two rules to prune the candidate pairs we have found. The first condition (Line 11) says that if the lengths of s and a matched string s′ differ by more than K, then it is impossible to have ED(s, s′) ≤ K. Consequently, since the strings are sorted by length, it is impossible to have ED(s, s″) ≤ K for any string s″ inserted before s′.
The second condition (Line 12) concerns the following scenario: suppose s and s′ match a common partition of length l at indices p and p′ respectively, which divides the two strings into the left parts s[1..p−1], s′[1..p′−1] and the right parts s[p+l..|s|], s′[p′+l..|s′|]. If the two partitions are indeed matched in the optimal alignment, then we must have

ED(s[1..p−1], s′[1..p′−1]) + ED(s[p+l..|s|], s′[p′+l..|s′|]) ≤ K,

in which case we have |p − p′| + |(|s| − p) − (|s′| − p′)| ≤ K, since the edit distance between two strings is at least the difference of their lengths.
We add all pairs of strings that pass the two filtering conditions to the candidate set C, and then perform a deduplication at the end, since each pair can potentially be added to C multiple times.
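Both pruning rules use only lengths and match positions, so each candidate can be checked in constant time; a sketch (the parameter names are our own):

```python
def passes_filters(len_x, len_y, pos_x, pos_y, part_len, K):
    """Two necessary conditions for ED(x, y) <= K, given that x and y
    share a partition of length part_len starting at pos_x and pos_y.

    Filter 1: the total lengths differ by at most K.
    Filter 2: the left remainders need at least |pos_x - pos_y| edits,
    the right remainders at least the difference of their lengths, and
    these edits are disjoint, so their sum must be at most K.
    """
    if abs(len_x - len_y) > K:                              # filter 1
        return False
    left = abs(pos_x - pos_y)                               # left parts
    right = abs((len_x - pos_x - part_len)
                - (len_y - pos_y - part_len))               # right parts
    return left + right <= K                                # filter 2
```

Because both bounds are only necessary conditions, survivors still go through exact verification; the filters merely keep the candidate set small.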
In the verification stage, we verify whether each pair of strings in C indeed has edit distance at most K, using the standard dynamic programming algorithm by Ukkonen. Due to this verification step, our algorithm will never output any false positive. On the other hand, by Theorem 1, if we augment the string partition scheme with parallel repetition, then MinJoin will not produce any false negative with probability 1 − 1/poly(n). Therefore MinJoin will achieve perfect accuracy with probability 1 − 1/poly(n).
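The verification of one pair can be done with a K-banded dynamic program in O(K · N) time, since cells further than K from the diagonal cannot lie on an alignment of cost at most K. A sketch of such an Ukkonen-style check (our own implementation, not the paper's code):

```python
def edit_distance_at_most(x, y, K):
    """Return ED(x, y) if it is <= K, else None, filling only DP cells
    within K of the diagonal (O(K * min(|x|, |y|)) time)."""
    if abs(len(x) - len(y)) > K:
        return None
    INF = K + 1
    prev = {j: j for j in range(K + 1)}          # row 0 of the DP table
    for i in range(1, len(x) + 1):
        cur = {}
        for j in range(max(0, i - K), min(len(y), i + K) + 1):
            if j == 0:
                cur[j] = i                        # delete the i-th prefix
                continue
            best = prev.get(j - 1, INF) + (x[i - 1] != y[j - 1])
            best = min(best, prev.get(j, INF) + 1, cur.get(j - 1, INF) + 1)
            cur[j] = best
        if min(cur.values()) > K:                 # whole band exceeded K
            return None
        prev = cur
    d = prev.get(len(y), INF)
    return d if d <= K else None
```

The early exit is sound because row minima of the edit-distance table never decrease, so once the band exceeds K the pair can safely be rejected.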
Time and Space Analysis
Let N be the maximum string length in the set of input strings S, and let L = Σ_{s∈S} |s| be the total input size. By Theorem 1 the running time of the partition step (without the parallel repetition) is bounded by O(L).
The total number of pairs that are fed into the filtering steps (Lines 11, 12) inherently depends on the concrete dataset. Suppose the partitions of all strings are evenly distributed into the buckets of the hash table (this is indeed what we have observed in our experiments); then we can upper bound this number in terms of the total number of partitions, which is O(nk) with probability 0.99. To see this, by the proof of Lemma 3 we know that the expected number of partitions of each string is O(k). By linearity of expectation, the expected number of partitions of all strings is O(nk). Therefore the total number of actual partitions is bounded by O(nk) with probability 0.99 by Markov's inequality. The verification step can be done in O(|C| · KN) time, where C is the set of the candidate pairs.
The space usage is clearly bounded by O(L), that is, the size of the input.
The MinJoin algorithm has the following theoretical properties. Consider the case where we augment the string partition procedure at Line 8 with t = O(log n) parallel repetitions.
It achieves perfect accuracy with probability 1 − 1/poly(n).
Assuming that the partitions of all strings are evenly distributed into the buckets of the hash table, the running time of MinJoin is bounded by O(L log n + |C| · KN) with probability 0.99, where C is the set of the candidate pairs MinJoin produces before the verification step.
The space usage of MinJoin is linear in terms of the size of input.
4 An Adaptive String Partition Scheme
In this section we introduce an alternative string partition scheme. We will first introduce the algorithm and analyze its properties, and then compare it with the partition scheme presented in Section 2.
4.1 The Algorithm
The high level idea of this new partition algorithm is as follows: for each string s we compute its anchors one by one adaptively. Given the corresponding array H containing the hash values of all of s's q-grams, the j-th anchor a_j is the coordinate with the minimum hash value in the window [a_{j−1} + 1, a_{j−1} + w], where a_{j−1} is the (j−1)-st anchor, and w is the parameter for controlling the length of the partitions. For the same reason as that for Algorithm 2 (i.e., to reduce false positives), we choose w to be proportional to the string length. We present the full algorithm in Algorithm 4. To use it we can simply replace Line 4 in Algorithm 1 with a call to Find-Anchor-Adaptive().
We note that unlike the oblivious partition, in which adjacent anchors must have a distance of at least r + 1, in the adaptive partition two adjacent anchors can be close to each other, which may result in short partitions and consequently increase the number of false positives. We make a few remarks on this issue. First, it is easy to see that in expectation the distance between two adjacent anchors in the adaptive partition is Θ(w), which means that there will not be many short partitions. Second, if there are coordinates with the same value in the window, we always choose the one with the larger index, to produce a longer partition. Third, we can perform a post-processing step to merge or delete short partitions, which we shall discuss in Section 4.3.
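The adaptive scheme can be sketched as follows; the window placement and the larger-index tie rule follow the remarks above, while the function name and exact indexing convention are our own assumptions:

```python
def find_anchors_adaptive(H, w):
    """Adaptive partition: after the last anchor p, the next anchor is
    the position of the minimum value in the next length-w window.

    On ties the larger index wins, producing a longer partition.
    """
    anchors = []
    p = 0                        # the next window starts at position p
    while p + w <= len(H):
        window = H[p:p + w]
        # smallest value; among equals, the latest offset
        best = max(range(w), key=lambda t: (-window[t], t))
        anchors.append(p + best)
        p = p + best + 1         # next window starts right after the anchor
    return anchors
```

Unlike the oblivious scheme, the next window's position depends on where the previous anchor landed, which is what makes the analysis of this variant rely on stochastic domination rather than direct independence.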
4.2 The Analysis
Properties of the Partition
Similar to that for the oblivious partition, we would like to understand how many partitions the adaptive partition scheme will make, and what is the probability for two similar strings to share a common partition. The former is needed in the analysis of the running time, and the latter gives the correctness of the algorithm. These two questions are addressed in the following two lemmas.
Given an input string s and a parameter k as the targeted number of partitions, the number of anchors produced by Algorithm 4 falls into the range [k/2, 2k] with probability 1 − e^{−Ω(k)}.
To facilitate the analysis we again assume that all coordinates in the array H have distinct hash values, and we ignore floor operations.
It is easy to see that the number of produced anchors is lower bounded by k/2 with certainty, since the largest possible distance between two adjacent anchors is w, and w is chosen so that (|s| − q + 1)/w ≥ k/2.
In the rest of the proof we focus on the upper bound. For each window W_j = [a_{j−1} + 1, a_{j−1} + w], define T_j to be the value such that a_{j−1} + T_j is the minimum coordinate of W_j; in other words, T_j is the distance between the (j−1)-st anchor and the j-th anchor. Let m = |s| − q + 1. We will show the following inequality, which finishes the proof:

Pr[T_1 + ··· + T_{2k} ≤ m] ≤ e^{−Ω(k)}.   (12)
To this end, define random variables Y_1, ..., Y_{2k} such that each Y_j is distributed independently and uniformly at random in {1, ..., w}. We thus have E[Y_j] = (w + 1)/2 for each j. Let Y = Y_1 + ··· + Y_{2k}; thus E[Y] = k(w + 1) ≈ 2m. We are going to show that for any τ,

Pr[T_1 + ··· + T_{2k} ≤ τ] ≤ Pr[Y_1 + ··· + Y_{2k} ≤ τ].   (13)
By a Chernoff bound (Lemma 2, applied to the scaled variables Y_j/w ∈ [0, 1]) we have Pr[Y ≤ E[Y]/2] ≤ e^{−Ω(k)}, which together with (13) proves (12).
In the rest of the proof we show (13). Consider two adjacent windows W_{j−1} and W_j, where a_{j−2} and a_{j−1} are anchors. We divide W_j into two areas: Area 1 corresponds to W_{j−1} ∩ W_j, and Area 2 corresponds to W_j \ W_{j−1}. Area 1 may be empty. See Figure 3 for an illustration.
Define an indicator random variable V such that V = 1 if H[a_{j−1}] is smaller than all coordinates in Area 2, and V = 0 otherwise. When V = 1, we have that H[a_{j−1}] is smaller than all coordinates in W_j, and thus the minimum coordinate in W_j is distributed uniformly at random over W_j, which indicates that T_j is chosen uniformly at random in {1, ..., w}. When V = 0, we have that the minimum coordinate in W_j is distributed uniformly at random over Area 2, which indicates that T_j is chosen uniformly at random from the largest |Area 2| values in {1, ..., w}. Summing up the two cases, it is clear that the random variable T_j stochastically dominates the random variable Y_j. Inequality (13) follows since the above argument holds for all j. ∎
For two strings x, y with ED(x, y) ≤ K, let P_x and P_y be the sets of partitions output by Algorithm 1 (with Find-Anchor-Adaptive() replacing Find-Anchor(), and setting k = cK for a sufficiently large constant c) on x and y, respectively. Assume w.l.o.g. that |x| ≥ |y|. The probability that P_x and P_y share a common partition is at least 0.98.
Similar to the oblivious partition scheme, we can use parallel repetition to boost the success probability to 1 − 1/poly(n) for all pairs of similar strings, and get a result similar to Theorem 1.
To prove Lemma 6, we will make use of the following claim.