It is increasingly important for data processing and analysis systems to be able to work with data that is imprecise, incomplete, or noisy. Similarity join has emerged as a fundamental primitive in data cleaning and entity resolution over the last decade [augsten2013similarity, Chaudhuri_ICDE06, sarawagi2004efficient]. In this paper we focus on set similarity join: Given collections S and R of sets, the task is to compute
S ⋈_λ R = {(x, y) ∈ S × R | sim(x, y) ≥ λ},
where sim is a similarity measure and λ is a threshold parameter. We deal with sets x ⊆ {1, …, d}, where the number d of distinct tokens can be naturally thought of as the dimensionality of the data.
Many measures of set similarity exist [choi2010survey], but perhaps the most well-known such measure is the Jaccard similarity,
J(x, y) = |x ∩ y| / |x ∪ y|.
For example, the sets {IT, University, Copenhagen} and {University, Copenhagen, Denmark} have Jaccard similarity 2/4 = 0.5, which could suggest that they both correspond to the same entity. In the context of entity resolution we want to find a set of pairs that contains (x, y) if and only if x and y correspond to the same entity. The quality of the result can be measured in terms of precision and recall, both of which should be as high as possible. We will be interested in methods that achieve 100% precision, but that might not have 100% recall. We refer to methods with 100% recall as exact, and others as approximate.
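As a concrete illustration, the Jaccard similarity of two token sets can be computed directly; the sketch below (with the set contents taken from the example above) is illustrative only and not part of the CPSJoin implementation.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Jaccard similarity J(x, y) = |x ∩ y| / |x ∪ y| of two token sets.
double jaccard(const std::set<std::string>& x, const std::set<std::string>& y) {
    std::set<std::string> inter, uni;
    std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                          std::inserter(inter, inter.begin()));
    std::set_union(x.begin(), x.end(), y.begin(), y.end(),
                   std::inserter(uni, uni.begin()));
    return static_cast<double>(inter.size()) / static_cast<double>(uni.size());
}
```

For the two example sets the intersection has size 2 and the union size 4, so the function returns 0.5.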
I-A Our Contributions
We present a new approximate set similarity join algorithm: Chosen Path Similarity Join (CPSJoin). We cover its theoretical underpinnings, and show experimentally that it achieves high recall with a substantial speedup compared to state-of-the-art exact techniques. The key ideas behind CPSJoin are:
A new recursive filtering technique inspired by the recently proposed ChosenPath index for set similarity search [christiani2017set], adding new ideas to make the method parameter-free, near-linear space, and adaptive to a given data set.
Efficient sketches for estimating set similarity [li2011theory] that take advantage of modern hardware.
We compare CPSJoin to the exact set similarity join algorithms in the comprehensive empirical evaluation of Mann et al. [Mann2016], using the same data sets, and to other approximate set similarity join methods suggested in the literature. We find that CPSJoin outperforms other approximate methods, and that it scales better than exact methods when the sets are relatively large (100 tokens or more) and the similarity threshold is low (e.g. Jaccard similarity 0.5), where we see speedups of more than an order of magnitude at 90% recall. Exact methods are faster at high similarity thresholds, when the average set size is small, and when sets have many rare tokens, whereas approximate methods are faster at low similarity thresholds and when sets are large; this finding is consistent with theory and is further corroborated by experiments on synthetic datasets.
I-B Related Work
For space reasons we present just a sample of the most related previous work, and refer to the book of Augsten and Böhlen [augsten2013similarity] for a survey of algorithms for exact similarity join in relational databases, covering set similarity joins as well as joins based on string similarity.
Exact Similarity Join
Early work on similarity join focused on the important special case of detecting near-duplicates with similarity close to 1, see e.g. [broder2000identifying, sarawagi2004efficient]. A sequence of results starting with the seminal paper of Bayardo et al. [Bayardo_WWW07] studied the range of thresholds that could be handled. Recently, Mann et al. [Mann2016] conducted a comprehensive study of 7 state-of-the-art algorithms for exact set similarity join under Jaccard similarity thresholds. These algorithms all use the idea of prefix filtering [Chaudhuri_ICDE06], which generates a sequence of candidate pairs of sets that includes all pairs with similarity above the threshold. The methods differ in how much additional filtering is carried out. For example, [xiao2011efficient] applies additional length and suffix filters to prune the candidate pairs.
Prefix filtering uses an inverted index that for each element stores a list of the sets in the collection containing that element. Given a set x, assume that we wish to find all sets y such that J(x, y) ≥ λ. A valid result set y must be contained in at least one of the inverted lists associated with a suitably large subset of the elements of x, or we would have J(x, y) < λ. In particular, to speed up the search, prefix filtering looks at the elements of x that have the shortest inverted lists.
The main finding by Mann et al. is that while more advanced filtering techniques do yield speedups on some data sets, an optimized version of the basic prefix filtering method (referred to as “ALL”) is always competitive within a factor 2.16, and most often the fastest of the algorithms. For this reason we will be comparing our results against ALL.
Locality-sensitive hashing (LSH) is a theoretically well-founded randomized method for generating candidate pairs [Gionis99]. A family of locality-sensitive hash functions is a distribution over functions with the property that similar points (or sets in our case) are more likely to have the same function value. We know of only a few papers using LSH techniques to solve similarity join. Cohen et al. [cohen2001finding] used LSH techniques for set similarity join in a knowledge discovery context before the advent of prefix filtering. They sketch a way of choosing parameters suitable for a given data set, but we are not aware of existing implementations of this approach. Chakrabarti et al. [chakrabarti2015bayesian] improved plain LSH with an adaptive similarity estimation technique, BayesLSH, that reduces the cost of checking candidate pairs and typically improves upon an implementation of the basic prefix filtering method by a significant factor. Our experiments include a comparison against both methods [cohen2001finding, chakrabarti2015bayesian]. We refer to the survey paper [pagh2015large] for an overview of newer theoretical developments on LSH-based similarity joins, but point out that these developments have not matured sufficiently to yield practical improvements.
Similar to BayesLSH [chakrabarti2015bayesian] we make use of algorithms for similarity estimation, but in contrast to BayesLSH we use algorithms that make use of bit-level parallelism. This approach works when there exists a way of picking a random hash function h such that
Pr[h(x) = h(y)] = sim(x, y)   (1)
for every choice of sets x and y. Broder et al. [Broder_NETWORK97] presented such a hash function for Jaccard similarity, now known as "minhash" or "minwise hashing". In the context of distance estimation, 1-bit minwise hashing of Li and König [li2011theory] maps minhash values to a compact sketch, often using just 1 or 2 machine words. Still, this is sufficient information to be able to estimate the Jaccard similarity of two sets x and y based on the Hamming distance of their sketches.
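To illustrate property (1) for minhash: the minimum hash values of two sets collide with probability equal to their Jaccard similarity, so the fraction of collisions over independent hash functions is an unbiased estimator of J(x, y). The sketch below uses a splitmix64-style mixer as a stand-in for a random hash function; it is an illustrative example, not the paper's implementation.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a random hash function: a strong 64-bit mixer keyed by (seed, token).
uint64_t splitmix64(uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Minhash of a token set under hash function number `seed`.
uint64_t minhash(const std::vector<uint32_t>& x, uint64_t seed) {
    uint64_t best = ~0ULL;
    for (uint32_t t : x) {
        uint64_t v = splitmix64(seed * 0x100000001b3ULL + t);
        if (v < best) best = v;
    }
    return best;
}

// Estimate Jaccard similarity as the fraction of k independent minhash collisions.
double estimateJaccard(const std::vector<uint32_t>& x,
                       const std::vector<uint32_t>& y, int k) {
    int eq = 0;
    for (int i = 1; i <= k; ++i)
        if (minhash(x, i) == minhash(y, i)) ++eq;
    return static_cast<double>(eq) / k;
}
```

For two sets with Jaccard similarity 1/3, the estimate concentrates around 1/3 as k grows, with standard deviation on the order of 1/√k.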
Several recent theoretical advances in high-dimensional indexing [andoni2017optimal, christiani2017framework, christiani2017set] have used an approach that can be seen as a generalization of LSH. We refer to this approach as locality-sensitive mappings (also known as locality-sensitive filters in certain settings). The idea is to construct a function F, mapping a set x into a set F(x) of machine words, such that:
If sim(x, y) ≥ λ then F(x) ∩ F(y) is nonempty with some fixed probability φ > 0.
If sim(x, y) < λ, then the expected size of F(x) ∩ F(y) is "small".
Here the exact meaning of "small" depends on the gap between sim(x, y) and λ, but in a nutshell, if it is the case that almost all pairs have similarity significantly below λ then we can expect F(x) ∩ F(y) = ∅ for almost all pairs. Performing the similarity join amounts to identifying all candidate pairs x, y for which F(x) ∩ F(y) ≠ ∅ (for example by creating an inverted index), and computing the similarity of each candidate pair. To our knowledge these indexing methods have not been tried out in practice, probably because they are rather complicated. An exception is the recent paper [christiani2017set], which is relatively simple, and indeed our join algorithm is inspired by the index described in that paper.
The CPSJoin algorithm solves the (λ, φ)-similarity join problem with a probabilistic guarantee on recall, formalized as follows:
Definition 1. An algorithm solves the (λ, φ)-similarity join problem with threshold λ ∈ (0, 1) and recall probability φ ∈ (0, 1] if for every (x, y) ∈ S ⋈_λ R the output L ⊆ S ⋈_λ R of the algorithm satisfies Pr[(x, y) ∈ L] ≥ φ.
It is important to note that the probability is over the random choices made by the algorithm, and not over a random choice of (x, y). This means that for any (x, y) ∈ S ⋈_λ R the probability that the pair is not reported in r independent repetitions of the algorithm is bounded by (1 − φ)^r, so a modest number of repetitions suffices to push the recall arbitrarily close to 100%.
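The amplification argument above is just the complement rule; the following hypothetical helper (not part of the algorithm) makes it concrete:

```cpp
#include <cmath>

// Recall after r independent repetitions of an algorithm whose per-run
// probability of reporting a given joined pair is phi: the pair is missed
// by all runs only with probability (1 - phi)^r.
double recallAfter(double phi, int r) {
    return 1.0 - std::pow(1.0 - phi, r);
}
```

For instance, a per-run recall of 0.5 becomes 0.75 after two runs and exceeds 0.99 after seven.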
II-A Similarity Measures
Our algorithm can be used with a broad range of similarity measures through randomized embeddings. This allows it to be used with, for example, Jaccard and cosine similarity thresholds.
Embeddings map data from one space to another while approximately preserving distances, with accuracy that can be tuned. In our case we are interested in embeddings that map data to sets of tokens. We can transform any so-called LSHable similarity measure sim, where we can choose h to make (1) hold, into a set similarity measure by the following randomized embedding: For a parameter t, pick hash functions h_1, …, h_t independently from a family satisfying (1). The embedding of x is the following set of size t:
f(x) = {(i, h_i(x)) | 1 ≤ i ≤ t}.
It follows from (1) that the expected size of the intersection f(x) ∩ f(y) is t · sim(x, y). Furthermore, it follows from standard concentration inequalities that the size of the intersection will be close to the expectation with high probability. For our experiments with Jaccard similarity thresholds, we found that a moderate number of hash functions gave sufficient precision at high recall.
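A minimal sketch of this embedding, with the hash family abstracted as a caller-supplied callable (the names here are illustrative assumptions, not the paper's code):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Randomized embedding f(x) = {(i, h_i(x)) : 1 <= i <= t}.  Pairing each
// hash value with its index i keeps values produced by different hash
// functions distinct, so E[|f(x) ∩ f(y)|] = t * sim(x, y) by property (1).
template <typename X, typename HashFamily>
std::vector<std::pair<int, uint64_t>> embed(const X& x, int t, HashFamily h) {
    std::vector<std::pair<int, uint64_t>> f;
    f.reserve(t);
    for (int i = 1; i <= t; ++i)
        f.emplace_back(i, h(i, x));  // h(i, x) plays the role of h_i(x)
    return f;
}
```

Counting equal pairs between embed(x, t, h) and embed(y, t, h) then estimates t · sim(x, y).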
In summary we can perform the similarity join S ⋈_λ R for any LSHable similarity measure by creating two corresponding relations f(S) = {f(x) | x ∈ S} and f(R) = {f(y) | y ∈ R}, and computing f(S) ⋈_λ f(R) with respect to the similarity measure
B(f(x), f(y)) = |f(x) ∩ f(y)| / t.   (2)
This measure is the special case of Braun-Blanquet similarity BB(x, y) = |x ∩ y| / max(|x|, |y|) where the sets are known to have size t [choi2010survey]. Our implementation will take advantage of the set size being fixed, though it is easy to extend it to general Braun-Blanquet similarity.
The class of LSHable similarity measures is large, as discussed in [chierichetti2015lsh]. If approximation errors are tolerable, even edit distance can be handled by our algorithm [chakraborty2016streaming, zhang2017embedjoin].
We are interested in collections of sets, where each element x is a set with elements from some universe [d] = {1, …, d}. To avoid confusion we sometimes use "record" for x and "token" for the elements of x. Throughout this paper we will think of a record both as a set of tokens from [d], as well as a binary vector from {0, 1}^d, where x_j = 1 if and only if token j is contained in the record. It is clear that these representations are equivalent: the set {1, 3}, for instance, is equivalent to the vector whose first and third coordinates are 1 and whose remaining coordinates are 0.
III Overview of approach
Our high-level approach is recursive and works as follows. To compute S ⋈_λ S we consider each x ∈ S and either:
compare x to each record in S (referred to as "brute forcing" x), or
create several subproblems S_1, …, S_k ⊆ S with x ∈ S_i, and solve them recursively.
The approach of [christiani2017set] corresponds to choosing option 2 until reaching a certain level of the recursion, where we finish the recursion by choosing option 1. This makes sense for certain worst-case data sets, but we propose an improved parameter-free method that is better at adapting to the given data distribution. In our method the decision on which option to choose depends on the size of S and the average similarity of x to the records of S. We choose option 1 if S has size below some (constant) threshold, or if the average Braun-Blanquet similarity of x and S, (1/|S|) Σ_{y ∈ S} BB(x, y), is close to the threshold λ. In the former case it is cheap to finish the recursion. In the latter case many records y will have BB(x, y) larger than or close to λ, so we do not expect to be able to produce the output pairs involving x in time sublinear in |S|.
If neither of these pruning conditions apply we choose option 2 and include x in recursive subproblems as described below. But first we note that the decision of which option to use can be made efficiently for each x, since the average similarity of x to the records of S can be computed from token frequencies in time O(|x|). Pseudocode for a self-join version of CPSJoin is provided in Algorithms 1 and 2.
We would like to ensure that for each pair (x, y) with BB(x, y) ≥ λ the pair is computed in one of the recursive subproblems, i.e., that x, y ∈ S_i for some i. In particular, we want the expected number of subproblems containing x and y to be at least 1, i.e.,
E[|{i | x, y ∈ S_i}|] ≥ 1.   (3)
To achieve (3) for every pair with similarity at least λ we proceed as follows: for each token j ∈ [d] we recurse with probability 1/(λt) on the subproblem with sets
S_j = {x ∈ S | j ∈ x},
where t denotes the size of records in S. It is not hard to check that (3) is satisfied for every pair x, y with BB(x, y) ≥ λ, since the expected number of shared subproblems is |x ∩ y|/(λt) = BB(x, y)/λ ≥ 1. Of course, expecting one subproblem to contain x and y does not directly imply a good probability that x and y are contained in at least one subproblem. But it turns out that we can use results from the theory of branching processes to show such a bound; details are provided in section IV.
IV Chosen Path Similarity Join
The CPSJoin algorithm solves the (λ, φ)-set similarity join problem (Definition 1) for every choice of threshold λ ∈ (0, 1), with a guarantee on the recall probability φ that we will lower bound in the analysis.
To simplify the exposition we focus on a self-join version where we are given a set S of subsets of [d] and we wish to report L ⊆ S ⋈_λ S. Handling a general join follows the overview in section III and requires no new ideas: essentially, consider a self-join on S ∪ R but make sure to consider only pairs in S × R for output. We also make the simplifying assumption that all sets in S have a fixed size t. As argued in section II-A, the general case can be reduced to this one by embedding.
The CPSJoin algorithm (see Algorithm 1 for pseudocode) works by recursively splitting the data set on tokens of [d] that are selected according to a random process, forming a recursion tree with S at the root and subsets of S that are non-increasing in size as we get further down the tree. The randomized splitting has the property that the probability of a pair of sets x, y being in a random subproblem is increasing as a function of |x ∩ y|.
Before each recursive splitting step we run the BruteForce subprocedure (see Algorithm 2 for pseudocode) that identifies subproblems that are best solved by brute force. It has two parts:
1. If S is below some constant size, controlled by the parameter limit, we report S ⋈_λ S exactly using a simple loop with O(|S|²) distance computations (BruteForcePairs) and exit the recursion. In our experiments we have fixed limit to a small constant, with the precise choice seemingly not having a large effect, as shown experimentally in Section VI-B.
2. If S is larger than limit the second part activates: for every x ∈ S we check whether the expected number of distance computations involving x is going to decrease by continuing the recursion. If this is not the case, we immediately compare x against every point in S (BruteForcePoint), reporting close pairs, and proceed by removing x from S. The BruteForce procedure is then run again on the reduced set.
This procedure where we choose to handle some points by brute force crucially separates our algorithm from many other approximate similarity join methods in the literature that typically are LSH-based [PaghSIMJOIN2015, cohen2001finding]. By efficiently being able to remove points at the “right” time, before they generate too many expensive comparisons further down the tree, we are able to beat the performance of other approximate similarity join techniques in both theory and practice. Another benefit of this approach is that it reduces the number of parameters compared to the usual LSH setting where the depth of the tree has to be selected by the user.
IV-B Comparison to Chosen Path
The CPSJoin algorithm is inspired by the Chosen Path algorithm [christiani2017set] for the approximate near neighbor problem and uses the same underlying random splitting tree that we will refer to as the Chosen Path Tree. In the approximate near neighbor problem, the task is to construct a data structure that takes a query point and correctly reports an approximate near neighbor, if such a point exists in the data set. Using the Chosen Path data structure directly to solve the (λ, φ)-set similarity join problem has several drawbacks that we avoid in the CPSJoin algorithm. First, the Chosen Path data structure is parameterized in a non-adaptive way to provide guarantees for worst-case data, vastly increasing the amount of work done compared to the optimal parameterization when data is not worst-case. Our recursion rule avoids this and instead continuously adapts to the distribution of distances as we traverse down the tree. Secondly, the data structure uses super-linear space, storing a Chosen Path Tree for every data point. The CPSJoin algorithm, instead of storing the whole tree, essentially performs a depth-first traversal, using only near-linear space in n in addition to the space required to store the output. Finally, the Chosen Path data structure only has to report a single point that is approximately similar to a query point, and may report points with similarity below the threshold λ. To solve the approximate similarity join problem the CPSJoin algorithm has to satisfy reporting guarantees for every pair of points in the exact join.
The Chosen Path Tree for a set x ⊆ [d] is defined by a random process: at each node, starting from the root, we sample a random hash function h: [d] → [0, 1] and construct a child for every token j ∈ x such that h(j) < 1/(λt). Nodes at depth k in the tree are identified by their path p = (j_1, …, j_k). Formally, the set of paths at depth k in the Chosen Path Tree for x is given by
F_k(x) = {p ∘ j | p ∈ F_{k−1}(x), j ∈ x, h_p(j) < 1/(λt)},
where ∘ denotes vector concatenation and F_0(x) = {()} contains only the empty path. The subset of the data set that survives to a node with path p is given by
S_p = {x ∈ S | p ∈ F_{|p|}(x)}.
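One level of this random splitting can be sketched as follows (an illustrative simulation, assuming sets are stored as token vectors; not the optimized implementation described later):

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

using TokenSet = std::vector<uint32_t>;

// One splitting step: sample h(j) uniform in [0, 1) once per token (shared
// across all sets), keep token j iff h(j) < 1/(lambda * t), and route every
// set containing a kept token j into the subproblem for child j.
std::vector<std::vector<size_t>> splitStep(const std::vector<TokenSet>& S,
                                           double lambda, double t,
                                           std::mt19937_64& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::unordered_map<uint32_t, double> h;
    std::unordered_map<uint32_t, std::vector<size_t>> children;
    for (size_t i = 0; i < S.size(); ++i)
        for (uint32_t j : S[i]) {
            auto it = h.find(j);
            if (it == h.end()) it = h.emplace(j, unif(rng)).first;
            if (it->second < 1.0 / (lambda * t))
                children[j].push_back(i);  // set i survives to the child for j
        }
    std::vector<std::vector<size_t>> out;
    for (auto& kv : children) out.push_back(std::move(kv.second));
    return out;
}
```

Each pair x, y then survives together into |x ∩ y|/(λt) child nodes in expectation, matching equation (3).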
The random process underlying the Chosen Path Tree belongs to the well studied class of Galton-Watson branching processes [harris2002theory]. Originally these were devised to answer questions about the growth and decline of family names in a model of population growth assuming i.i.d. offspring for every member of the population across generations [watson1875]. In order to make statements about the properties of the CPSJoin algorithm we study in turn the branching processes of the Chosen Path Tree associated with a point x, a pair of points x, y, and a set of points S. Note that we use the same random hash functions for different points in S.
IV-C1 Brute Force
The BruteForce subprocedure described by Algorithm 2 takes two global parameters: limit and ε. The parameter limit controls the minimum size of S before we abandon the CPSJoin recursion for a simple exact similarity join by brute force pairwise distance computations. The second parameter, ε, controls the sensitivity of the BruteForce step to the expected number of comparisons that a point x ∈ S will generate if allowed to continue in the branching process. The larger ε, the more aggressively we resort to the brute force procedure. In practice we typically think of ε as a small constant, but for some of our theoretical results we will need a sub-constant setting of ε to show certain running time guarantees. The BruteForce step removes a point x from the Chosen Path branching process, instead opting to compare it against every other point y ∈ S, if it satisfies the condition
(1/|S ∖ {x}|) Σ_{y ∈ S ∖ {x}} BB(x, y) ≥ (1 − ε)λ.   (IV-C1)
In the pseudocode of Algorithm 2 we let count denote a hash table that keeps track of the number of times each token appears in the sets of S. This allows us to evaluate the condition in equation (IV-C1) for an element x in time O(|x|) by rewriting it as
(1/t) Σ_{j ∈ x} (count[j] − 1) ≥ (1 − ε)λ |S ∖ {x}|,
since Σ_{y ∈ S ∖ {x}} BB(x, y) = (1/t) Σ_{j ∈ x} (count[j] − 1).
We claim that this condition minimizes the expected number of comparisons performed by the algorithm: Consider a node in the Chosen Path Tree associated with a set of points S while running the CPSJoin algorithm. For a point x ∈ S, we can either remove it from S immediately at a cost of |S ∖ {x}| comparisons, or we can choose to let x continue in the branching process (possibly into several nodes) and remove it later. The expected number of comparisons if we let it continue k levels before removing it from every node that it is contained in, is given by
Σ_{y ∈ S ∖ {x}} (BB(x, y)/λ)^k.
This expression is convex and increasing in the similarity BB(x, y) between x and the other points y, allowing us to state the following observation:
Observation 2 (Recursion).
Let ε ≥ 0 and consider a set S containing a point x ∈ S such that x satisfies the condition in equation (IV-C1). Then the expected number of comparisons involving x if we continue branching exceeds (1 − ε)|S ∖ {x}| at every depth k ≥ 1. If x does not satisfy the condition, the opposite holds.
IV-C2 Tree Depth
We proceed by bounding the maximal depth of the set of paths in the Chosen Path Tree that are explored by the CPSJoin algorithm. Having this information will allow us to bound the space usage of the algorithm and will also form part of the argument for the correctness guarantee. Assume that the parameter limit in the BruteForce step is set to some constant value. Consider a point x ∈ S and let S' = {y ∈ S | BB(x, y) < (1 − ε)λ} be the subset of points in S that are not too similar to x. For every y ∈ S' the expected number of vertices in the Chosen Path Tree at depth k that contain both x and y is upper bounded by (BB(x, y)/λ)^k < (1 − ε)^k.
Since (1 − ε)^k decreases geometrically in k, we use Markov's inequality to show the following bound:
Let x, y satisfy BB(x, y) < (1 − ε)λ; then the probability that there exists a vertex at depth k in the Chosen Path Tree that contains both x and y is at most (1 − ε)^k.
If x does not share any paths with points y whose similarity falls below the threshold for brute forcing, then the only points that remain are ones that will cause x to be brute forced.
With high probability the maximal depth of paths explored by the CPSJoin algorithm is O(log n).
Let x and y be two sets of equal size t such that BB(x, y) ≥ λ. We are interested in lower bounding the probability that there exists a path of length k in the Chosen Path Tree that has been chosen by both x and y, i.e. Pr[F_k(x) ∩ F_k(y) ≠ ∅]. Agresti [agresti1974] showed an upper bound on the probability that a branching process becomes extinct after at most k steps. We use it to show the following lower bound on the probability of a close pair of points colliding at depth k in the Chosen Path Tree.
Lemma 5 (Agresti [agresti1974]).
If the expected number of offspring of the branching process is at least one, then for every k ≥ 1 the probability that the process survives for k generations is Ω(1/k).
The bound on the depth of the Chosen Path Tree for S explored by the CPSJoin algorithm in Lemma 4 then implies a lower bound on the recall probability φ.
Let 0 < λ < 1 be constant. Then for every set S of n points the CPSJoin algorithm solves the set similarity join problem with recall probability φ = Ω(1/log n).
This analysis is very conservative: if either or is removed by the BruteForce step prior to reaching the maximum depth then it only increases the probability of collision. We note that similar guarantees can be obtained when using fast pseudorandom hash functions as shown in the paper introducing the Chosen Path algorithm [christiani2017set].
IV-C4 Space Usage
We can obtain a trivial bound on the space usage of the CPSJoin algorithm by combining Lemma 4 with the observation that every call to CPSJoin on the stack uses additional space at most O(n). The result is stated in terms of working space: the total space usage when not accounting for the space required to store the data set itself (our algorithms use references to data points and only read the data when performing comparisons), and disregarding the space used to write down the list of results.
With high probability the working space of the CPSJoin algorithm is at most O(n log n).
We conjecture that the expected working space is O(n), due to the size of the subproblems being geometrically decreasing in expectation as we proceed down the Chosen Path Tree.
IV-C5 Running Time
We will bound the running time of a solution to the general set similarity self-join problem that uses several calls to the CPSJoin algorithm in order to piece together a list of results L ⊆ S ⋈_λ S. In most of the previous related work, inspired by Locality-Sensitive Hashing, the fine-grainedness of the randomized partition of space, here represented by the Chosen Path Tree in the CPSJoin algorithm, has been controlled by a single global parameter [Gionis99, PaghSIMJOIN2015]. In the Chosen Path setting this rule would imply that we run the splitting step without performing any brute force comparison until reaching a fixed depth k, where we proceed by comparing x against every other point in nodes containing x, reporting close pairs. In recent work by Ahle et al. [ahle2017] it was shown how to obtain additional performance improvements by setting an individual depth k_x for every point x ∈ S. We refer to these stopping strategies as global and individual, respectively. Together with our recursion strategy, this gives rise to the following stopping criteria for when to compare a point x against everything else contained in a node:
Global: Fix a single depth k shared by every x ∈ S.
Individual: For every x ∈ S fix a depth k_x.
Adaptive: Remove x when the expected number of comparisons involving x is non-decreasing in the tree depth.
Let T denote the running time of our similarity join algorithm. We aim to show the following relation between the expected running times under the different stopping criteria when applied to the Chosen Path Tree:
E[T_adaptive] ≤ O(E[T_individual]) ≤ O(E[T_global]).
First consider the global strategy. We set the depth k to balance the contribution to the running time from the expected number of vertices containing a point and the expected number of comparisons between pairs of points at depth k; the expected running time for the global strategy is then given by the sum of these two balanced terms.
The global strategy is a special case of the individual strategy, and it must therefore hold that E[T_individual] ≤ O(E[T_global]) when the individual depths are chosen optimally; the expected running time for the individual strategy is upper bounded by the sum over x ∈ S of the corresponding per-point costs.
We will now argue that the expected running time of the CPSJoin algorithm under the adaptive stopping criterion is no more than a constant factor greater than under the individual strategy, for a suitable setting of the global parameters limit and ε of the BruteForce subroutine.
Let x ∈ S and consider a path p where x is removed from S_p by the BruteForce step. Let k_adaptive denote the depth of the node (the length of p) at which x is removed. Compared to the individual strategy that removes x at depth k_x, we are in one of three cases, also displayed in Figure 1.
The point x is removed from S_p at depth k_adaptive = k_x.
The point x is removed from S_p at depth k_adaptive < k_x.
The point x is removed from S_p at depth k_adaptive > k_x.
The underlying random process behind the Chosen Path Tree is not affected by our choice of termination strategy. In the first case we therefore have that the expected running time is upper bounded by the same (conservative) expression as the one used by the individual strategy. In the second case we remove x earlier than we would have under the individual strategy, i.e. k_adaptive < k_x; note that for every x ∈ S the optimal k_x is bounded, since for larger values of k the expected number of nodes containing x exceeds the cost of removing x outright. Let S_p denote the set of points in the node where x was removed by the BruteForce subprocedure. There are two rules that could have triggered the removal of x: either |S_p| ≤ limit, or the condition in equation (IV-C1) was satisfied. In the first case, the expected cost of following the individual strategy would have been at least a constant, coming from the children containing x in the next step; this is no more than a constant factor smaller than the cost of the adaptive strategy, which is at most limit. In the second case, when the condition in equation (IV-C1) is activated, the expected number of comparisons involving x resulting from S_p if we had continued under the individual strategy is at least
(1 − ε)|S_p ∖ {x}|,
which is no better than what we get with the adaptive strategy. In the third case, where we terminate at depth k_adaptive > k_x, if we retrace the path to depth k_x we know that x was not removed in that node, implying that the expected number of comparisons when continuing the branching process on x is decreasing compared to removing x at depth k_x. We have shown that the expected running time of the adaptive strategy is no greater than a constant times the expected running time of the individual strategy.
We are now ready to state our main theoretical contribution, stated below as Theorem 10. The theorem combines the above argument comparing the adaptive strategy against the individual strategy with Lemma 4 and Lemma 6, and uses independent runs of the CPSJoin algorithm to solve the set similarity join problem for every choice of constant parameters λ, φ.
For every LSHable similarity measure and every choice of constant threshold λ ∈ (0, 1) and probability of recall φ ∈ (0, 1), we can solve the (λ, φ)-set similarity join problem on every set of n points using near-linear working space and with an expected running time within a constant factor of that of the individually-optimal stopping strategy.
We implement an optimized version of the CPSJoin algorithm for solving the Jaccard similarity self-join problem. In our experiments (described in Section VI) we compare the CPSJoin algorithm against the approximate methods of MinHash LSH [Gionis99, Broder_NETWORK97] and BayesLSH [chakrabarti2015bayesian], as well as the AllPairs [Bayardo_WWW07] exact similarity join algorithm. The code for our experiments is written in C++ and uses the benchmarking framework and data sets of the recent experimental survey on exact similarity join algorithms by Mann et al. [Mann2016]. For our implementation we assume that each set is represented as a list of 32-bit unsigned integers. We proceed by describing the details of each implementation in turn.
V-A Chosen Path Similarity Join
Our implementation follows the description of the CPSJoin algorithm, but makes use of a few heuristics, primarily sampling and sketching, in order to speed things up. The parameter settings are discussed and investigated experimentally in section VI-B.
Before running the algorithm we use the embedding described in section II-A. Specifically, t independent MinHash functions are used to map each set x ∈ S to a list of hash values h_1(x), …, h_t(x). Each MinHash function is implemented using Zobrist hashing [zobrist1970new] from 32 bits to 64 bits with 8-bit characters. We sample a MinHash function h by sampling a random Zobrist hash function z and letting h(x) = min_{j ∈ x} z(j). Zobrist hashing (also known as simple tabulation hashing) has been shown theoretically to have strong MinHash properties and is very fast in practice [Patrascu2012, Thorup2017]. The setting of t used in our experiments is discussed later.
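A minimal sketch of Zobrist (simple tabulation) hashing and the derived MinHash function follows; the table layout matches the 8-bit character scheme described above, while the seeding is an illustrative choice rather than the paper's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Zobrist / simple tabulation hashing from 32-bit keys to 64-bit values:
// four tables of 256 random words, one per 8-bit character of the key,
// combined with XOR.
struct Zobrist {
    uint64_t table[4][256];
    explicit Zobrist(uint64_t seed) {
        std::mt19937_64 rng(seed);
        for (auto& t : table)
            for (auto& v : t) v = rng();
    }
    uint64_t operator()(uint32_t key) const {
        return table[0][key & 0xff] ^ table[1][(key >> 8) & 0xff] ^
               table[2][(key >> 16) & 0xff] ^ table[3][(key >> 24) & 0xff];
    }
};

// MinHash derived from one Zobrist function z: h(x) = min over j in x of z(j).
uint64_t minhash(const Zobrist& z, const std::vector<uint32_t>& x) {
    uint64_t best = ~0ULL;
    for (uint32_t j : x) best = std::min(best, z(j));
    return best;
}
```

Evaluating a key costs four table lookups and three XORs, which is what makes tabulation-based MinHash fast in practice.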
During preprocessing we also prepare sketches using the 1-bit minwise hashing scheme of Li and König [li2011theory]. Let ℓ denote the length in 64-bit words of a sketch s(x) of a set x ∈ S. We construct sketches for a data set by independently sampling 64ℓ MinHash functions h_i and 64ℓ Zobrist hash functions b_i that map from 32 bits to 1 bit. The i-th bit of the sketch s(x) is then given by b_i(h_i(x)). In the experiments we set ℓ to a small constant.
V-A2 Similarity Estimation Using Sketches
We use 1-bit minwise hashing sketches for fast similarity estimation in the BruteForcePairs and BruteForcePoint subroutines of the BruteForce step of the CPSJoin algorithm. Given two sketches, s(x) and s(y), we compute the number of bits in which they differ by going through the sketches word for word, computing the popcount of their XOR using the gcc builtin _mm_popcnt_u64 that translates into a single instruction on modern hardware. Let Ĵ(x, y) denote the estimated similarity of a pair of sets x, y. If Ĵ(x, y) is below a sketch threshold slightly below λ, we exclude the pair from further consideration. If the estimated similarity is above the sketch threshold we compute the exact similarity J(x, y) and report the pair if J(x, y) ≥ λ.
The speedup from using sketches comes at the cost of introducing false negatives: a pair of sets x, y with J(x, y) ≥ λ may have an estimated similarity below the sketch threshold, causing us to miss it. We let δ denote a parameter for controlling the false negative probability of our sketches and set the sketch threshold such that for sets with J(x, y) ≥ λ the probability that the estimate falls below the sketch threshold is at most δ. In our experiments we set the sketch false negative probability to a small constant.
In the recursive step of the CPSJoin algorithm (Algorithm 1) the set S is split into buckets using the following heuristic: instead of sampling a random hash function and evaluating it on each element j ∈ x, we sample an expected 1/λ elements from x and split according to the corresponding MinHash values from the preprocessing step. This saves the linear overhead in the size of our sets, reducing the time spent placing each set into buckets to O(1). Internally, a collection of sets is represented as a C++ std::vector<uint32_t> of set ids.
Having reduced the overhead for each set to O(1) in the splitting step, we wish to do the same for the BruteForce step (described in Algorithm 2), at least in the case where we do not call the BruteForcePairs or BruteForcePoint subroutines. The main problem is that we spend time O(|x|) for each set x when constructing the count hash map and estimating the average similarity of x to sets in S. To get around this we construct a 1-bit minwise hashing sketch for the collection S itself, using sampling and our precomputed 1-bit minwise hashing sketches: we randomly sample elements y of S and set the i-th bit of the collection sketch to be the i-th bit of the sketch of the i-th sample. This allows us to estimate the average similarity of a set x to the sets in S in time O(1) using word-level parallelism. A set x is removed from S (and brute forced) if its estimated average similarity exceeds the removal threshold of the BruteForce step. The length of this sampled sketch is a parameter fixed in our experiments. To further speed up the running time we only call the BruteForce subroutine once for each call to CPSJoin, calling BruteForcePoint on all points that pass the check rather than re-running the check each time a point is removed. Pairs of sets that pass the sketching check are verified using the same verification procedure as the AllPairs implementation by Mann et al. [Mann2016]. Duplicates are removed by sorting and performing a single linear scan.
In theory, any constant desired recall can be guaranteed by a sufficiently large number of independent repetitions of the CPSJoin algorithm. In practice this number of repetitions is prohibitively large, and we therefore fix the number of independent repetitions used in our experiments at ten. With this setting we were able to achieve more than 90% recall across all datasets and similarity thresholds.
V-B MinHash LSH
We implement a locality-sensitive hashing similarity join using MinHash according to the pseudocode in Algorithm 3. A single run of the MinHash algorithm can be divided into two steps: first we split the sets into buckets according to the hash values of k concatenated MinHash functions; next we iterate over all non-empty buckets and run BruteForcePairs to report all pairs of points with similarity above the threshold λ. The BruteForcePairs subroutine is shared between the MinHash and CPSJoin implementations. MinHash therefore uses 1-bit minwise sketches for similarity estimation in the same way as the implementation of the CPSJoin algorithm described above.
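The bucketing step can be sketched as follows (a minimal illustration with our own helper names, assuming k MinHash values per set are precomputed; the key is formed by hash-combining the k values, here with FNV-1a):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Combine the first k MinHash values of a set into one 64-bit bucket
// key using FNV-1a hash combining.
uint64_t bucket_key(const std::vector<uint32_t>& mh, std::size_t k) {
    uint64_t key = 1469598103934665603ULL;  // FNV-1a offset basis
    for (std::size_t i = 0; i < k; ++i) {
        key ^= mh[i];
        key *= 1099511628211ULL;            // FNV-1a prime
    }
    return key;
}

// Group set ids by their concatenated-MinHash key; candidate pairs are
// only generated within buckets of size at least two.
std::unordered_map<uint64_t, std::vector<uint32_t>> minhash_buckets(
    const std::vector<std::vector<uint32_t>>& minhash, std::size_t k) {
    std::unordered_map<uint64_t, std::vector<uint32_t>> buckets;
    for (std::size_t id = 0; id < minhash.size(); ++id)
        buckets[bucket_key(minhash[id], k)].push_back(
            static_cast<uint32_t>(id));
    return buckets;
}
```

Two sets collide in a repetition exactly when all k sampled MinHash values agree, which for Jaccard similarity J happens with probability J^k.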
The parameter k can be set for each dataset and similarity threshold to minimize the combined cost of lookups and similarity estimations performed by the algorithm. This approach was mentioned by Cohen et al. [cohen2001finding], but we were unable to find an existing implementation. In practice we set k to the value that results in the minimum estimated running time: we run the first part (the splitting step) of the algorithm for a small range of values of k and estimate the running time from the number of buckets and their sizes. Once k is fixed we know that each repetition of the algorithm has probability at least λ^k of reporting a pair (x, y) with J(x, y) ≥ λ. For a desired recall R we can therefore set the number of repetitions to ⌈ln(1 − R) / ln(1 − λ^k)⌉. In our experiments we report the actual number of repetitions required to obtain a desired recall rather than using the setting required for worst-case guarantees.
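One plausible form of this tuning (our own simplified cost model: one unit per bucket lookup plus one per candidate pair within a bucket; the implementation's actual estimate may differ) is to score each candidate k and multiply by the repetitions it requires:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Estimated cost of one repetition given the bucket sizes produced by
// the splitting step: one lookup per bucket plus one similarity
// estimation per candidate pair inside each bucket.
double repetition_cost(const std::vector<std::size_t>& bucket_sizes) {
    double cost = static_cast<double>(bucket_sizes.size());  // lookups
    for (std::size_t s : bucket_sizes)
        cost += 0.5 * static_cast<double>(s) * static_cast<double>(s - 1);
    return cost;
}

// Repetitions needed so that a pair colliding with probability
// p = lambda^k per repetition is reported with probability >= recall.
std::size_t repetitions(double lambda, std::size_t k, double recall) {
    const double p = std::pow(lambda, static_cast<double>(k));
    return static_cast<std::size_t>(
        std::ceil(std::log(1.0 - recall) / std::log(1.0 - p)));
}
```

Larger k shrinks the buckets (lower per-repetition cost) but also shrinks the collision probability λ^k (more repetitions), so the total cost repetitions(λ, k, R) × repetition_cost(...) is minimized at some intermediate k.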
To compare our approximate methods against a state-of-the-art exact similarity join we use Bayardo et al.'s AllPairs algorithm [Bayardo_WWW07] as recently implemented in the set similarity join study by Mann et al. [Mann2016]. That study compares implementations of several different exact similarity join methods and finds that the simple AllPairs algorithm is most often the fastest choice. Furthermore, for Jaccard similarity, the AllPairs algorithm was at most a small constant factor slower than the best of six different competing algorithms across all the data sets and similarity thresholds used, and for most runs AllPairs is close to the best exact algorithm (see Table 7 in Mann et al. [Mann2016]). Since our experiments run in the same framework, on the same datasets, and with the same thresholds as Mann et al.'s study, we consider their AllPairs implementation a good representative of exact similarity join methods for Jaccard similarity.
For a comparison against previous experimental work on approximate similarity joins we use an implementation of BayesLSH in C as provided by the BayesLSH authors [chakrabarti2015bayesian, bayeslsh]. The BayesLSH package features a choice between AllPairs and LSH as the candidate generation method. For the verification step there is a choice between BayesLSH and BayesLSH-lite. Both verification methods use sketching to estimate similarities between candidate pairs. The difference between the two is that BayesLSH uses sketching to estimate the similarity of pairs that pass the sketching check, whereas BayesLSH-lite uses an exact similarity computation if a pair passes the sketching check. Since the approximate methods in our CPSJoin and MinHash implementations correspond to the approach of BayesLSH-lite, we restrict our experiments to this choice of verification algorithm. In our experiments we will use BayesLSH to represent the faster of the two candidate generation methods, combined with BayesLSH-lite for the verification step.
We run experiments using the implementations of CPSJoin, MinHash, BayesLSH, and AllPairs described in the previous section. In the experiments we perform self-joins under Jaccard similarity for a range of similarity thresholds. We are primarily interested in measuring the join time of the algorithms, but we also look at the number of candidate pairs considered by the algorithms during the join as a measure of performance. Note that the preprocessing step of the approximate methods only has to be performed once for each set and similarity measure, and can be reused across different similarity joins; we therefore do not count it towards our reported join times. In practice the preprocessing time is at most a few minutes for the largest data sets.
VI-1 Data Sets
Performance is measured across the real-world data sets and the synthetic data sets described in Table I. All datasets except the TOKENS datasets were provided by the authors of [Mann2016], where descriptions and sources for each data set can also be found. Note that we have excluded a synthetic ZIPF dataset used in the study by Mann et al. [Mann2016] because it produces no results for our similarity thresholds of interest. Experiments are run on versions of the datasets where duplicate records are removed and any records containing only a single token are ignored.
[Table I: for each dataset, the number of sets, the average set size, and the average number of sets each token appears in.]
In addition to the datasets from the study of Mann et al. we add three synthetic datasets, TOKENS10K, TOKENS15K, and TOKENS20K, designed to showcase the robustness of the approximate methods. These datasets have relatively few unique tokens, but each token appears in many sets. Each of the TOKENS datasets was generated from a small universe of tokens, and each token is contained in 10,000, 15,000, or 20,000 different sets, respectively, as denoted by the name. The sets in the TOKENS datasets were generated by sampling a random subset of the universe of tokens, rejecting tokens that had already been used in the maximum number of sets (10,000 for TOKENS10K). To sample pairs of sets with a given expected Jaccard similarity, the size of the sampled sets is chosen accordingly. Each of the TOKENS datasets has groups of random sets planted with high expected Jaccard similarity, ensuring an increasing number of results as the similarity threshold decreases; the remaining pairs of sets have low expected Jaccard similarity. We believe that the TOKENS datasets give a good indication of the performance on real-world data with the property that most tokens appear in a large number of sets.
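The relation between set size and expected similarity can be derived as follows: for two random size-s subsets of a universe of u tokens, the expected overlap is s²/u, so the expected Jaccard similarity is roughly (s²/u) / (2s − s²/u) = s / (2u − s), and inverting gives s = 2uj / (1 + j) for a target similarity j. This back-of-the-envelope calculation (ours, not necessarily the exact generation procedure used for TOKENS) can be checked numerically:

```cpp
// Set size s so that two random size-s subsets of a u-token universe
// have expected Jaccard similarity approximately j: s = 2uj / (1 + j).
double set_size_for_similarity(double u, double j) {
    return 2.0 * u * j / (1.0 + j);
}

// Inverse check: expected Jaccard similarity of two random size-s
// subsets of a u-token universe, j ≈ s / (2u − s).
double expected_jaccard(double u, double s) {
    return s / (2.0 * u - s);
}
```

For example, j = 1 forces s = u (the two subsets must be the whole universe), and smaller targets j yield proportionally smaller sets.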
In our experiments we aim for a recall of at least 90% for the approximate methods. In order to achieve this for the CPSJoin and MinHash algorithms we perform a number of repetitions after the preprocessing step, stopping when the desired recall has been achieved. This is done by measuring recall against the result of AllPairs and stopping upon reaching 90%. In practice this approach is not feasible, as the size of the true result set is not known; however, it can be efficiently estimated using sampling if it is not too small. Another approach is to perform the number of repetitions required to obtain the theoretical guarantees on recall, as described for CPSJoin in Section IV-C and for MinHash in Section V-B. Unfortunately, with our current analysis of the CPSJoin algorithm the number of repetitions required to theoretically guarantee 90% recall far exceeds the number required in practice, as observed in our experiments where ten independent repetitions always suffice. For BayesLSH using LSH as the candidate generation method, the recall we observe in our experiments falls somewhat short of the recall probability stated for the default parameter setting.
All experiments were run on an Intel Xeon E5-2690v4 CPU at 2.60GHz with 35 MB of L3 cache, 256 kB of L2 cache per core, and 32 kB of L1 data cache per core, with sufficient RAM to hold the datasets in memory. Since a single experiment is always confined to a single CPU core, we ran several experiments in parallel [Tange2011a] to better utilize our hardware.
VI-A1 Join Time
Table II shows the average join time in seconds over five independent runs, when approximate methods are required to have at least 90% recall. We have omitted timings for BayesLSH since it was always slower than all other methods, and in most cases it timed out after 20 minutes when using LSH as the candidate generation method. The join time for MinHash is always greater than the corresponding join time for CPSJoin except in a single setting: the dataset KOSARAK at one threshold. Since CPSJoin is typically faster than MinHash, we can restrict our attention to comparing AllPairs and CPSJoin, where the picture becomes more interesting.
Figure 2 shows the join time speedup that CPSJoin achieves over AllPairs. We achieve speedups on most of the datasets, with speedups of more than an order of magnitude at low similarity thresholds. For a number of the datasets the CPSJoin algorithm is slower than AllPairs for the thresholds considered here. Comparing with Table I, it seems that CPSJoin generally performs well on data sets where tokens are contained in a large number of sets on average (NETFLIX, UNIFORM, DBLP), but is beaten by AllPairs on datasets that have many "rare" tokens (SPOTIFY, FLICKR, AOL). This is logical, because rare tokens are exploited by the sorted prefix-filtering in AllPairs; without rare tokens, AllPairs must read long inverted indexes. This is a common theme among all the current state-of-the-art exact methods examined in [Mann2016], including PPJoin. CPSJoin is robust in the sense that it does not depend on the presence of rare tokens in the data to perform well. This difference is showcased by the synthetic TOKENS data sets.
The poor performance of BayesLSH compared to the other algorithms (BayesLSH was always slower) can most likely be traced to differences in the implementation of the candidate generation methods of BayesLSH. The BayesLSH implementation uses an older implementation of AllPairs than the one by Mann et al. [Mann2016], which was shown to yield performance improvements through a more efficient verification procedure. The LSH candidate generation method used by BayesLSH corresponds to the MinHash splitting step, but with k (the number of hash functions) fixed to one. Our technique for choosing k in the MinHash algorithm, aimed at minimizing the total join time, typically selects larger values of k in the experiments. It is therefore likely that BayesLSH could be competitive with the other techniques if combined with other candidate generation procedures. Further experiments comparing the performance of BayesLSH sketching to 1-bit minwise sketching for different parameter settings and similarity thresholds would also be instructive.
VI-A3 TOKENS datasets
The TOKENS datasets clearly favor the approximate join algorithms: CPSJoin is two to three orders of magnitude faster than AllPairs. By increasing the number of times each token appears in a set we can make the speedup of CPSJoin over AllPairs arbitrarily large, as shown by the progression from TOKENS10K to TOKENS20K. The AllPairs algorithm generates candidates by searching through the lists of sets that contain a particular token, starting with rare tokens. Since every token appears in a large number of sets, every list will be long.
Interestingly, the speedup of CPSJoin is even greater at higher similarity thresholds. We believe that this is due to an increase in the gap between the similarity of the sets to be reported and the remaining sets, which have low average Jaccard similarity. This is in line with our theoretical analysis of CPSJoin and most theoretical work on approximate similarity search, where the running time guarantees usually depend on the approximation factor.
VI-A4 Candidates and Verification
Table IV compares the number of pre-candidates, candidates, and results generated by the AllPairs and CPSJoin algorithms, where the desired recall for CPSJoin is set to be greater than 90%. For AllPairs, the number of pre-candidates denotes all pairs (x, y) investigated by the algorithm that pass checks on their sizes, so that it is possible that J(x, y) ≥ λ. The number of candidates is simply the number of unique pre-candidates, as duplicate pairs are removed explicitly by the AllPairs algorithm.
For CPSJoin we define the number of pre-candidates to be all pairs considered by the BruteForcePairs and BruteForcePoint subroutines of Algorithm 2. The candidates are the pre-candidate pairs that pass size checks (similar to AllPairs) and the 1-bit minwise sketching check as described in Section V-A. Note that for CPSJoin the candidates may still contain duplicates, as this is inherent to the approximate method for candidate generation; removing duplicates through the use of a hash table would drastically increase the space usage of CPSJoin. For both AllPairs and CPSJoin the number of candidates denotes the number of pairs that are passed to the exact similarity verification step of the AllPairs implementation of Mann et al. [Mann2016].
Table IV shows that for AllPairs there is not a great difference between the number of pre-candidates and the number of candidates, while for CPSJoin the number of candidates is usually reduced by one or two orders of magnitude on datasets where CPSJoin performs well. For datasets where CPSJoin performs poorly, such as AOL, FLICKR, and KOSARAK, there is less of a decrease when going from pre-candidates to candidates. It would appear that this is due to many duplicate pairs from the candidate generation step, not a failure of the sketching technique.
To investigate how parameter settings affect the performance of the CPSJoin algorithm, we ran experiments varying the brute force limit, the brute force aggressiveness, and the sketch length in words. Table III shows the parameter settings used during these experiments and the final settings used for our join time experiments.
[Table III: CPSJoin parameters — the brute force limit, the sketch length in words, the size of the MinHash set, the brute force aggressiveness, and the sketch false negative probability.]
Figure 3 shows the CPSJoin join time for different settings of the parameters. By varying one parameter at a time we are obviously ignoring possible interactions between the parameters, but the stability of the join times leads us to believe that these interactions have limited effect.
Figure 3 (a) shows the effect of the brute force limit on the join time. Lowering the limit causes the join time to increase, due to a combination of more time spent splitting sets into buckets and a lower probability of recall from splitting at a deeper level. The join time is relatively stable above a moderate setting of the limit.
Figure 3 (b) shows the effect of brute force aggressiveness on the join time. As we increase the aggressiveness, sets that are close to the other elements in their buckets are more likely to be removed by brute force comparison against all other points. The tradeoff here is between the loss of recall probability from letting a point continue in the Chosen Path branching process versus the cost of brute-forcing the point. The join time generally increases with the aggressiveness, but a small positive setting turns out to be slightly better than disabling the heuristic for almost all data sets.
Figure 3 (c) shows the effect of the sketch length on the join time. There is a trade-off between the sketch similarity estimation time and the precision of the estimate, which determines the number of false positives. At low similarity thresholds, using only a single word negatively impacts the performance on most datasets compared to using two or more words. The cost of using longer sketches seems negligible, as it is only a few extra instructions per similarity estimation, so we opted for multi-word sketches.
We provided experimental and theoretical results on a new randomized set similarity join algorithm, CPSJoin, and compared it experimentally to state-of-the-art exact and approximate set similarity join algorithms. CPSJoin is typically several times faster than previous approximate methods. Compared to exact methods it obtains speedups of more than an order of magnitude on real-world datasets, while keeping the recall above 90%. Among the datasets used in these experiments we note that NETFLIX and FLICKR represent two archetypes: on average, a token in the NETFLIX dataset appears in a large number of sets, while a token in the FLICKR dataset appears in only a few sets. Our experiments indicate that CPSJoin brings large speedups to NETFLIX-type datasets, while it is hard to improve upon the performance of AllPairs on the FLICKR type.
A direction for future work could be to tighten and simplify the theoretical analysis. We conjecture that the running time of the algorithm can be bounded by a simpler function of the sum of similarities between pairs of points in the dataset. We also note that recursive methods such as ours lend themselves well to parallel and distributed implementations, since most of the computation happens in independent recursive calls. Investigating this further is an interesting possibility.
The authors would like to thank Willi Mann for making the source code and data sets of [Mann2016] available, Aniket Chakrabarti for information about the implementation of BayesLSH, and the anonymous reviewers for useful suggestions.