Similarity search is a widely used primitive in data mining applications, and all-pairs similarity in particular is a common data mining operation [1, 7, 21, 34]. Motivated by recommender systems and social networks, we design algorithms for computing all-pairs set similarity (a.k.a., a set similarity join). In particular, we consider the similarity of nodes in terms of a bipartite graph. We wish to determine similar pairs of nodes from one side of the graph. For each node v on the right, we consider its neighborhood N(v) on the left. Equivalently, we can think of v as the set N(v) of its neighbors in the graph. Using this representation, many graph-based similarity problems can be formulated as finding pairs of nodes with significantly overlapping neighborhoods. We focus on the cosine similarity between pairs of nodes u and v represented as high-dimensional binary vectors.
Although set similarity search has received a lot of attention in the literature, there are three aspects of modern systems that have not been adequately addressed yet. Concretely, we aim to develop algorithms that come with provable guarantees and that handle the following three criteria:
Distributed and Scalable. The algorithm should work well in a distributed environment like MapReduce, and should scale to large graphs using a large number of processors.
Low Similarity. The algorithm should output most pairs of sets with relatively low normalized set similarity, such as a setting with cosine similarity values well below one.
Extreme Skew. The algorithm should provably work well even when the dimensions (degrees on the left) are highly irregular and skewed.
The motivation for these criteria comes from recommender systems and social networks. For the first criterion, we consider graphs with a large number of vertices. For the second, we wish to find pairs of nodes that are semantically similar without having a large cosine value. This situation is common in collaborative filtering and user similarity, where two users may be alike even though they overlap on a small number of items (e.g., songs, movies, or citations). Figure 1 depicts the close-pair histogram of a real graph, where most similar pairs have low cosine similarity. For the third criterion, skewness has recently received attention as an important property [5, 23, 36], and it can be thought of as power-law type behavior for degrees on the left. In contrast, most other prior work assumes that the graph has uniformly small degrees on the left [22, 27, 28]. This smoothness assumption is reasonable in settings where the graph is curated by manual actions (e.g., the Twitter follow graph). However, it is too restrictive in other settings, such as a graph of documents and entities, where entities can legitimately have high degrees, and throwing away these entities may remove a substantial source of information. Another illustration of this phenomenon can be observed even on human-curated graphs, e.g., the Twitter follow graph, where computing similarities among consumers (instead of producers, as in prior work) runs into a similar issue.
Previous work fails to handle all three of the above criteria. When finding low-similarity items, standard techniques like Locality-Sensitive Hashing [16, 29, 36] are no longer effective (because the number of hashing iterations is too large). Recently, there have been several proposals for addressing this, and the closest one to ours is a wedge-sampling algorithm. However, that approach has one severe shortcoming: it requires that each dimension has a relatively low frequency (i.e., the bipartite graph has small left degrees).
In this work, we address this gap by presenting a new distributed algorithm LSF-Join for approximate all-pairs similarity that can scale to large graphs with high skewness. As a main contribution, we provide theoretical guarantees on our algorithm, showing that it achieves very high accuracy. We also provide guarantees on the communication, work, and maximum load in a distributed environment with a very large number of processors.
Our approach uses Locality Sensitive Filtering (LSF), a variant of the ideas used for Locality Sensitive Hashing (LSH). The main difference between LSF and LSH is that LSF constructs a single group of surviving elements based on a hash function (for each iteration). In contrast, LSH constructs a whole hash table, each time, for a large number of iterations. While the hashing and sampling ideas are similar, the benefit of LSF is in its computation and communication costs. Specifically, our LSF scheme will have the property that if an element survives in k out of R total hash functions, then the computation scales with k and not R. For low-similarity elements, k is usually substantially smaller than R, resulting in a lower overall cost (for example, k will be sublinear, while R is linear, in the input size). We also provide an efficient way to execute this filtering step on a per-node basis.
Our LSF procedure can also be viewed as a pre-processing step before applying any all-pairs similarity algorithm (even one needing a smaller problem size and a graph without skew). The reason is that the survival procedure outputs a number of smaller subsets of the original dataset, each with a different, smaller set of dimensions, along with a guarantee that no dimension has a high degree. The procedure also ensures that similar pairs are preserved with high probability. Then, after performing this filtering, we may use other steps to improve the computation time. For example, applying a hashing technique may reduce the effective dimensionality without affecting the similarity structure.
The input consists of a bipartite graph with a set U of d vertices on the left and a set V of n vertices on the right. We denote the graph as G = (U, V, E), and we refer to U as the set of dimensions and to V as the set of nodes. Given a parameter τ ∈ (0, 1), we want to output all similar pairs of nodes u, v from V such that cos(u, v) = |N(u) ∩ N(v)| / √(|N(u)| · |N(v)|) ≥ τ.
This problem also encapsulates other objectives, such as finding the top-k results per node. Note that we could equivalently identify each node v with the set of its neighbors N(v), and hence, this problem is the same as the set similarity join problem with input {N(v) : v ∈ V} and threshold τ for cosine similarity. We describe our algorithm in a MapReduce-like framework, and we analyze it in the massively parallel computation model [8, 19], which captures the theoretical properties of MapReduce-inspired models (e.g., [26, 18]).
We have q processors in a shared-nothing distributed environment. The input data starts arbitrarily partitioned among the processors. Associated to each node v on the right is a vector in {0, 1}^d, which is an indicator vector for the neighbors of v on the left. We would like to achieve the twin properties of load-balanced servers and low communication cost.
The main contribution of our work is a new randomized, distributed algorithm, LSF-Join, which provably finds almost all pairs of sets with cosine similarity above a given threshold τ. Our algorithm will satisfy all three of the criteria mentioned above (scalability, low similarity, and skewness). A key component of LSF-Join is a new randomized LSF scheme, which we call the survival procedure. The goal of this procedure is to find subsets of the dataset that are likely to contain similar pairs. In other words, it acts as a filtering step. Our LSF procedure comes with many favorable empirical and theoretical properties. First, we can execute it in nearly-linear time, which allows it to scale to very large datasets. Second, we exhibit an efficient way to implement it in a distributed setting with a large number of processors, using only a single round of communication for the whole LSF-Join algorithm. Third, the survival procedure leads to sub-quadratic local work, even when the dimensions are highly skewed and the similarity threshold is relatively low. To achieve these properties, we demonstrate how to implement the filtering using efficient, pairwise independent hash functions, and we show that even in this setting, the algorithm has good provable guarantees on the accuracy and running time. We also present a number of theoretical optimizations that better illuminate the behavior of the algorithm on datasets with different structural properties. Finally, we empirically validate our results by testing LSF-Join on multiple graphs.
Many filtering-based similarity join algorithms provide exact algorithms and rely on heuristics to improve the running time [4, 6, 15, 22, 30, 32, 33, 34]. We primarily review prior work that is relevant to our setting and provides theoretical guarantees.
One related work uses LSF for set similarity search and join on skewed data. Their data-dependent method leads to a sequential algorithm based on the frequency of dimensions, improving on a prior LSF-based algorithm. Unfortunately, it seems impossible to adapt their method to the one-round distributed setting. Another relevant result is a wedge-sampling approach, which gives a distributed algorithm for low-similarity joins on large graphs. However, that algorithm assumes that the dataset is not skewed.
In the massively-parallel computation model [8, 9], multi-round algorithms have been developed that build off of LSH for approximate similarity joins, achieving output-optimal guarantees on the maximum load [17, 24]. However, it can be prohibitively expensive to use multiple rounds in modern shared-nothing clusters with a huge number of processors. In particular, the previous work achieves good guarantees only when the number of nodes n is polynomially larger than the number of processors q. We focus on one-round algorithms, and we allow the possibility that q is large relative to n, which may be common in very large computing environments. Algorithms using LSH work well when the similarity threshold is large enough, such as cosine similarity close to one. However, for smaller thresholds, LSH-based distributed algorithms require too much computation and/or communication due to the large number of repetitions [11, 31, 27, 35]. Prior work has also studied finding extremely close pairs [2, 1, 10] or finding pairs of sets with constant-size intersection. These results do not apply to our setting because we aim to find pairs of large-cardinality sets with low cosine similarity, and we allow the intersection size to be large in magnitude.
2 The LSF-Join Algorithm
We start with a high-level overview of our set similarity join algorithm, LSF-Join, which is based on a novel and effective LSF scheme. Let G = (U, V, E) be the input graph with d dimensions on the left and n nodes on the right. For convenience, we refer to the vertices and their indices interchangeably, where we use [m] to denote the set {1, 2, …, m}.
The LSF-Join algorithm uses R independent repetitions of our filtering scheme (where R will be set later to achieve the best tradeoff). In the r-th repetition we create a set S_r of survivors from the set V of vertices on the right. We will define the LSF procedure shortly, which determines the subsets S_r in a data-independent fashion. During the communication phase, the survival sets will be distributed in their entirety across the processors. In particular, if there are q processors, then each processor will handle roughly R/q different repetitions. During the local computation, the processors will locally compute all similar pairs in S_r for r ∈ [R] and output these pairs in aggregate (in a distributed fashion). As part of the theoretical analysis, we show that the size of each S_r is concentrated around its mean, and therefore, our algorithm has balanced load across the processors. To achieve high recall of similar pairs, we will need to execute the LSF-Join algorithm O(log n) times independently, so that the failure probability will be polynomially small. Fortunately, this only increases the communication and computation by an O(log n) factor. We execute the iterations in parallel, and LSF-Join requires only one round of communication.
2.1 Constructing the Survival Sets
We now describe our LSF scheme, which boils down to describing how to construct the survival sets. We have two main parameters of interest: p denotes the survival probability of a single dimension (on the left), and R denotes the number of repetitions. The simplest way to describe our LSF survival procedure goes via uniform random sampling. We refer to this straightforward scheme as the Naive-Filter method, and we describe it first. Then, we explain how to improve this method by using a pairwise independent filtering scheme, which will be much more efficient in practice. We refer to the improved LSF scheme as the Fast-Filter method. Later, we also show that Fast-Filter enjoys many of the same theoretical guarantees of Naive-Filter, with much lower computational cost.
For the naive version of our filtering scheme, consider a repetition number r ∈ [R]. We choose a uniformly random set T_r of vertices on the left by choosing each node i ∈ U to be in T_r with probability p independently. Then, we filter vertices v on the right depending on whether their neighborhood is completely contained in T_r or not (that is, whether N(v) ⊆ T_r or not). The r-th survival set S_r will be the set of vertices v such that N(v) ⊆ T_r. We repeat this process independently for each r ∈ [R], to derive filtered sets of vertices S_1, …, S_R. Notice that for each v, the probability that v survives in S_r is exactly p^{d_v}, where d_v is the number of neighbors of v on the left.
The intuition behind using this filtering method for set similarity search is that similar pairs are relatively likely to survive in the same set. Indeed, the chance that both u and v survive in S_r is equal to p^{|N(u) ∪ N(v)|}. When the cosine similarity is large, we must have that |N(u) ∩ N(v)| is large and also that |N(u) ∪ N(v)| is much smaller than d_u + d_v. In other words, u and v are more likely to survive together if they are similar, and less likely if they are very different. For example, consider the case where every degree equals a common value d̄. Then, pairs with cosine similarity at least τ will survive together with probability at least p^{(2−τ)d̄}. At the other extreme, disjoint pairs only survive together with probability p^{2d̄}.
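To make the sampling step concrete, here is a minimal Python sketch of Naive-Filter; the function and variable names are our own illustration rather than the paper's pseudo-code. Here p is the per-dimension survival probability and R is the number of repetitions.

```python
import random

def naive_filter(neighbors, d, p, R, seed=0):
    """Naive-Filter: R independent repetitions of left-side sampling.

    neighbors: dict mapping a right node to its set of left-dimension
    ids in [0, d). Returns survival sets S[0..R-1], where S[r] holds the
    nodes v whose whole neighborhood N(v) was sampled in repetition r.
    """
    rng = random.Random(seed)
    S = []
    for _ in range(R):
        # Sample the left side: each dimension survives with probability p.
        T = {i for i in range(d) if rng.random() < p}
        # A right node survives iff N(v) is fully contained in T.
        S.append({v for v, N in neighbors.items() if N <= T})
    return S
```

A node of degree d_v survives a given repetition with probability p^{d_v}, so a similar pair (with a small neighborhood union) co-survives far more often than a dissimilar one.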
The main drawback of the Naive-Filter method is that it takes too much time to determine all indices r such that v ∈ S_r. Consider the set of v's neighbors N(v). We need to determine whether N(v) ⊆ T_r for every r ∈ [R]. Hence, it requires at least Ω(R) work to compute the indices where v survives, that is, the set S(v) = {r : v ∈ S_r}. We will need to set R to be very large (growing with the input size), and hence, the work of Naive-Filter is linear in R or worse for each node v. To improve upon this, our Fast-Filter method will have work proportional to |S(v)|, and we show that this is often considerably smaller than R.
2.2 The Fast-Filter Method
The key idea behind our fast filtering method is to develop a pairwise independent filtering scheme that approximates the uniform sampling of the survival sets. We then devise a way to efficiently compute the survival sets on a per-node basis, by using fast matrix operations. More precisely, for each node v on the right, Fast-Filter will determine the indices of the survival sets in which v survives (that is, the set S(v) = {r : v ∈ S_r}). We develop a way to compute S(v) independently for each vertex v by using Gaussian elimination on binary matrices. The Fast-Filter method only requires a small amount of shared randomness between the processors.
To describe the Fast-Filter method, it will be convenient to assume that t = log₂(1/p) and log₂ R are both integers. We now explain the pairwise independent filtering scheme. For each node i on the left, we sample a random binary matrix A_i with t rows and log R columns, and a t-length bit-string b_i. We identify each of the R repetitions with binary vectors in the (log R)-dimensional vector space over F_2, the finite field with two elements. In other words, we use the binary representation of r to associate r with a length-(log R) bit-string, and we perform matrix and vector operations modulo two. We abuse notation and use r for both the integer and the bit-string, where context will distinguish the two.
To determine whether a node v survives in S_r, we perform the following operation. We first stack the matrices A_i on top of each other for each of v's neighbors i ∈ N(v). This forms a matrix A_v with t·d_v rows and log R columns. We also stack the vectors b_i on top of each other, forming a length-(t·d_v) bit-string b_v. Finally, we define S_r by setting v ∈ S_r if and only if A_v r = b_v (equivalently, A_v r + b_v = 0, where 0 denotes the all-zeros vector). We say that v survives the r-th repetition if v ∈ S_r. Then S(v) = {r : v ∈ S_r} is the set of indices in which v survives.
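The survival test can be sketched in a few lines of Python (our own illustration with hypothetical helper names, not the paper's implementation): each dimension i carries a random t × log₂(R) binary matrix A_i and a t-bit vector b_i, and a node survives repetition r exactly when every stacked row equation holds modulo two.

```python
import random

def bits(x, width):
    """Binary representation of x, least-significant bit first."""
    return [(x >> j) & 1 for j in range(width)]

def sample_dimension_hash(dims, t, logR, seed=0):
    """Per-dimension randomness: a t x logR binary matrix A_i and a
    t-bit vector b_i for every dimension i in `dims`."""
    rng = random.Random(seed)
    return {
        i: ([[rng.randrange(2) for _ in range(logR)] for _ in range(t)],
            [rng.randrange(2) for _ in range(t)])
        for i in dims
    }

def survives(neigh, r, logR, dim_hash):
    """A node with neighborhood `neigh` survives repetition r iff
    every row equation A_i r = b_i (mod 2) holds for all i in `neigh`,
    i.e., the stacked system A_v r = b_v is satisfied."""
    rv = bits(r, logR)
    for i in neigh:
        A, b = dim_hash[i]
        for row, bi in zip(A, b):
            if sum(a * x for a, x in zip(row, rv)) % 2 != bi:
                return False
    return True
```

Since r ↦ A r + b is a pairwise independent hash family, a node of degree d_v survives each repetition with probability 2^{-t·d_v}, and survival events for two distinct repetitions are independent.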
In a one-round distributed setting, the processors can effectively pre-compute the submatrices A_i and the subvectors b_i using a shared seed. In particular, these may be computed on the fly, as opposed to stored up front, by using a shared random seed and an efficient hash function to compute the elements of A_i and b_i only when processing a node v such that i ∈ N(v). By doing so, the processors will use the same values of A_i and b_i as one another, leading to consistent survival sets, without incurring any extra rounds of communication.
To gain intuition about this filtering procedure, let d_v denote the number of v's neighbors. Node v will survive in S_r if r satisfies A_v r = b_v. This consists of t·d_v linear equations that r must satisfy. As the matrix A_v and the vector b_v are chosen uniformly at random, it is easy to check that v survives in S_r with probability 2^{−t·d_v}, and hence, the expected sizes satisfy E[|S_r|] = Σ_{v ∈ V} 2^{−t·d_v}, over a random A_v and b_v.
Theoretically, the main appeal of Fast-Filter is that it is pairwise independent in the following sense. For any two distinct repetitions r and r′, the bit-strings for r and r′ differ in at least one bit. Therefore, the event that A_v r = b_v is satisfied is independent of the event that A_v r′ = b_v, over the random choice of A_v and b_v. While this is only true for pairs of repetitions, this level of independence will suffice for our theoretical analysis. Furthermore, we show that we can determine the survival sets containing v in time proportional to the number of such sets, which is often much less than the total number R of possible sets.
We now explain how to efficiently compute the survival sets on a per-node basis. For a fixed node v, the Fast-Filter method determines the repetitions that v survives in, or in other words, the set S(v) = {r : v ∈ S_r}. This is equivalent to finding all length-(log R) bit-strings r that are solutions to A_v r = b_v. The processor can form A_v and b_v in O(t·d_v·log R) time (packing bits into words), assuming the unit-cost RAM model on words of Θ(log n) bits. Then, we can use Gaussian elimination over bit-strings to very quickly find all r that satisfy A_v r = b_v. To understand the complexity of this, first note that A_v has log R columns. Moreover, after removing linearly dependent rows, we may assume that A_v has at most log R rows, as otherwise there exists no solution. Therefore, Gaussian elimination takes O(t·d_v·log² R) time to write A_v in upper triangular form (and correspondingly rewrite b_v) so that all solutions to A_v r = b_v can be enumerated in time proportional to the number of solutions to this equation. The expected total work is O(Σ_{v ∈ V} (t·d_v·log² R + R·2^{−t·d_v})).
This can be parallelized for each node independently.
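The enumeration step can be illustrated with a self-contained sketch of Gaussian elimination over F_2, with each row packed into a Python integer (our own illustration; the paper's implementation may differ). Given the rows of A_v and the bits of b_v, it returns every bit-string r solving A_v r = b_v.

```python
def solve_gf2(rows, ncols):
    """Solve A x = b over GF(2). `rows` is a list of (mask, rhs) pairs,
    where bit j of `mask` is A[row][j] and rhs is the matching bit of b.
    Returns all solutions as integers (bit j of a solution is x_j),
    or [] if the system is inconsistent."""
    pivots = {}  # pivot column -> (mask, rhs), kept fully reduced
    for mask, rhs in rows:
        # Reduce the incoming row by the existing pivot rows.
        for col, (pmask, prhs) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:
                return []   # the row reduced to 0 = 1: no solution
            continue        # redundant row
        col = mask.bit_length() - 1
        # Eliminate the new pivot column from the existing pivot rows.
        for c, (pmask, prhs) in list(pivots.items()):
            if (pmask >> col) & 1:
                pivots[c] = (pmask ^ mask, prhs ^ rhs)
        pivots[col] = (mask, rhs)
    free = [c for c in range(ncols) if c not in pivots]
    sols = []
    for assign in range(1 << len(free)):
        x = 0
        for idx, c in enumerate(free):
            if (assign >> idx) & 1:
                x |= 1 << c
        # Each reduced pivot row fixes its pivot bit from the free bits.
        for col, (pmask, prhs) in pivots.items():
            par = bin(pmask & x).count("1") & 1
            if par ^ prhs:
                x |= 1 << col
        sols.append(x)
    return sols
```

The enumeration loop costs time proportional to the number of solutions, matching the claim that the work scales with |S(v)| rather than R.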
We prove guarantees about Fast-Filter in Theorem 2. The pseudo-code for Fast-Filter appears as Algorithm 1. The main difference between the two filtering methods is how the random survival sets are chosen. For the sake of this discussion, we set p = 1/2 (that is, t = 1), which is reasonable in practice, and we continue to let d_v denote the degree of v. In the Fast-Filter method, we use a random linear map over F_2 with enough independent randomness to decide, for each repetition, whether or not a node survives. By using Gaussian elimination, we are able to compute S(v) in time proportional to |S(v)|. In particular, the amount of work for v is O(R·2^{−d_v}) in expectation, because E[|S(v)|] = R·2^{−d_v} when t = 1.
The pseudo-code for LSF-Join appears as Algorithm 2. We assume that the vertices start partitioned arbitrarily across the q processors. For each vertex v in parallel, we use Fast-Filter to determine the indices S(v) of the sets in which v survives. As detailed above, we can do so consistently by using a shared random seed for Fast-Filter. During the communication phase, we randomly distribute the sets S_1, …, S_R across processors, so that each processor handles R/q sets in expectation. Then, during local computation, we compare all pairs in S_r for each r in parallel. We use O(log n) independent iterations of the algorithm in parallel to find all close pairs with high probability (e.g., recall close to one). Finally, we output all pairs with cosine similarity at least τ in a distributed fashion.
One way of processing each set S_r is to compare all pairs in this set. Specifically, for all pairs of nodes u, v ∈ S_r, explicitly compute cos(u, v) and check if it is at least τ. One can assume the lists N(u) and N(v) are sorted arrays of integers in [d]. Thus, one can compute |N(u) ∩ N(v)| by merging these sorted lists in O(d_u + d_v) time, assuming words of length O(log d) bits can be manipulated in constant time in the unit-cost RAM model.
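The local comparison can be sketched as follows (our illustration): a linear-time merge of the two sorted adjacency lists yields the intersection size, from which the cosine similarity follows.

```python
import math

def intersection_size(a, b):
    """Size of the intersection of two sorted integer lists, via a merge."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def cosine(a, b):
    """Cosine similarity of the indicator vectors of two sorted sets."""
    if not a or not b:
        return 0.0
    return intersection_size(a, b) / math.sqrt(len(a) * len(b))

def close_pairs(survivors, neighbors, tau):
    """Brute-force all-pairs check within one survival set S_r."""
    out = []
    nodes = sorted(survivors)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            u, v = nodes[i], nodes[j]
            if cosine(neighbors[u], neighbors[v]) >= tau:
                out.append((u, v))
    return out
```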
Letting d_max be the maximum of d_v over v ∈ S_r, the time to locally compare all pairs in the set S_r is O(|S_r|²·d_max). We can also bound the average amount of work across processors to handle all sets S_1, …, S_R. This can be bounded by O((1/q)·Σ_{r ∈ [R]} |S_r|²·d_max).
We call this the brute-force all-pairs algorithm.
2.2.1 Setting the Parameters
Let d̄ denote the average degree on the right in the input graph. Ideally, these parameters should satisfy R·p^{(2−τ)·d̄} = 2, (1)
or in other words, R = 2·p^{−(2−τ)·d̄}, where 2 could be replaced with a larger constant for improved recall. If it is possible to approximately satisfy (1) with t = log₂(1/p) being an integer, then running independent iterations of the algorithm with these parameters will work very well. However, for a large average degree d̄, the ideal survival probability p may exceed 1/2. To approximate p, we can subsample the matrices and vectors to increase the effective collision probability. More precisely, consider p > 1/2. If we wish a node v to survive in a repetition with probability p^{d_v}, then we can solve for t′ in the equality 2^{−t′} = p^{d_v}, and we subsample the rows in A_v and b_v down to ⌈t′⌉ rows. This effectively constructs survival sets as in Naive-Filter with probability p of each neighbor surviving. In the theoretical results, we will assume that p and R satisfy (1). In the experiments, we either set p to be 1/2, or we use the matrix subsampling approach; we also vary the number of independent iterations to improve recall (where we use M to denote the number of iterations).
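As one possible reading of condition (1), the parameter choice can be sketched numerically. This hypothetical helper is our own illustration: it assumes the form R · p^{(2−τ)·d̄} = 2 with p = 2^{−t}, which may differ from the exact constants in the paper.

```python
import math

def choose_parameters(tau, dbar, t=1):
    """Pick the number of repetitions R so that R * p^{(2 - tau) * dbar} = 2,
    where p = 2^{-t} is the per-dimension survival probability.
    Returns (p, R)."""
    p = 2.0 ** (-t)
    R = 2.0 * p ** (-(2.0 - tau) * dbar)
    return p, math.ceil(R)
```

Note how strongly R depends on the degree: with t = 1, each unit of (2 − τ)·d̄ doubles the required number of repetitions, which is why the subsampling trick (effectively allowing p > 1/2) matters for large d̄.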
3 Theoretical Guarantees
We assume that the graph is right-regular, with all nodes in V having degree d̄, for simplicity. In practice, we can repeat the algorithm for different small ranges of d_v. First, notice that E[|S(v)|] = R·2^{−t·d̄} for every v ∈ V. (2)
Now consider two nodes u, v ∈ V. Then both u and v are in S_r if and only if the following event occurs. Let A_{u,v} be the matrix obtained by stacking A_u on top of A_v, and b_{u,v} be the vector obtained by stacking b_u on top of b_v. Note that for each i ∈ N(u) ∩ N(v), the rows of A_i occur twice in A_{u,v} and the coordinates of b_i occur twice in b_{u,v}. Thus, it suffices to retain only one copy of A_i and b_i for each i, and by doing so we reduce the number of rows of A_{u,v} and entries of b_{u,v} to at most t·|N(u) ∪ N(v)|. Consequently, Pr[u ∈ S_r and v ∈ S_r] = 2^{−t·|N(u) ∪ N(v)|}.
We first justify the setting of R in (1).
Let u, v ∈ V be such that cos(u, v) ≥ τ. The expected number of repetitions r for which both u ∈ S_r and v ∈ S_r is at least 2.
The expected load per processor is (R/q)·n·2^{−t·d̄}, and the expected total communication is R·n·2^{−t·d̄}.
There are R repetitions, each concerning one survival set. Each node survives in S_r with probability 2^{−t·d̄}, so the expected size of S_r is n·2^{−t·d̄}. Each processor handles R/q repetitions, leading to (R/q)·n·2^{−t·d̄} expected load. The total communication is Σ_{r ∈ [R]} |S_r|, which has expectation R·n·2^{−t·d̄}. ∎
Using brute-force all-pairs locally, the expected work per machine is O((R/q)·(n·2^{−t·d̄})²·d̄).
Each repetition has expected size n·2^{−t·d̄}, leading to work O((n·2^{−t·d̄})²·d̄). Each processor handles R/q repetitions, implying O((R/q)·(n·2^{−t·d̄})²·d̄) work per processor in expectation. ∎
Combining the lemmas and plugging in R = 2·2^{t(2−τ)d̄} gives us the following.
Setting R = 2·2^{t(2−τ)d̄}, the survival procedure has total communication O(n·2^{t(1−τ)d̄})
and local work O((n²·d̄/q)·2^{−t·τ·d̄}) per processor.
As an example, we compare to hash-join, which has total communication O(n·√q) and local work O(n²·d̄/q). We set t so that 2^{t(1−τ)d̄} = √q, and by Theorem 1, the expected total communication is O(n·√q). The local work per processor is then O((n²·d̄/q)·q^{−τ/(2(1−τ))}). Since τ > 0, the work is polynomially smaller than that of hash-join while using the same amount of total communication. As we will see in the theorem below, it is crucial that we use the family of pairwise independent hash functions above for generating our randomness.
The expected total time for the nodes in V to generate the sets S(v) is O(Σ_{v ∈ V} (t·d_v·log² R + R·2^{−t·d_v})),
and the expected total time and communication needed to send the neighborhoods N(v), for each v ∈ S_r and each r ∈ [R], is O(R·n·2^{−t·d̄}·d̄).
Each node v needs to figure out the repetitions that it survives in. It can form A_v and b_v in O(t·d_v·log R) time, assuming the unit-cost RAM model on words of Θ(log n) bits. Node v then needs to figure out which r satisfy A_v r = b_v. To do so, it can just solve this equation using Gaussian elimination. Note that A_v has at most t·d_v rows and log R columns. Therefore, Gaussian elimination takes at most O(t·d_v·log² R) time to write A_v in upper triangular form (and correspondingly rewrite b_v) so that all solutions to the equation can be enumerated in time proportional to the number of solutions. Thus, the expected time per node is O(t·d_v·log² R + R·2^{−t·d̄}), where we have used (2) to bound the expected number of repetitions that v survives in by R·2^{−t·d̄}. Thus, the total expected time to form all of the S_r, for r ∈ [R], is the sum of this bound over v ∈ V. Note that R·n·2^{−t·d̄}·d̄ is the total expected amount of communication, since each surviving copy of a node sends its d̄-size neighborhood. ∎
While correct in expectation, since the randomness used across the R repetitions is not independent (namely, we use the same matrices A_i and vectors b_i across all repetitions), it is important to show that the variance of the number of repetitions r for which both u ∈ S_r and v ∈ S_r is small. This enables one to show that the probability that there is at least one repetition for which both u and v survive is a large enough constant, which can be amplified to any larger constant by independently repeating a constant number of times.
Let u, v ∈ V be such that cos(u, v) ≥ τ. With probability at least 1/2, there is a repetition r with both u ∈ S_r and v ∈ S_r.
Let Z_r be an indicator random variable which is 1 if u and v survive the r-th repetition, and 0 otherwise. Let Z = Z_1 + … + Z_R be the number of repetitions for which both u and v survive. By Lemma 1, E[Z] ≥ 2. It is well-known that the hash function family h_{A,b}(r) = A·r + b, where A and b range over all possible binary matrices and vectors, respectively, is a pairwise independent family. It follows that Z_1, …, Z_R are pairwise independent random variables, and consequently Var[Z] = Σ_r Var[Z_r] ≤ E[Z]. As E[Z] ≥ 2, we have Var[Z] ≤ E[Z] ≤ E[Z]²/2. By Chebyshev's inequality, Pr[Z = 0] ≤ Var[Z]/E[Z]² ≤ 1/2. ∎
Efficiently Amplifying Recall.
At this point, we have shown that one iteration of LSF-Join will find a constant fraction of close pairs. To amplify the recall, we run M copies of LSF-Join in parallel. We emphasize that this is a more efficient way to achieve a high probability result than simply increasing the number of repetitions in a single LSF-Join execution. Intuitively, this is because the repetitions are only guaranteed to be pairwise independent. Theoretically, M independent copies lead to a failure probability of 2^{−Ω(M)} by a Chernoff bound. But, if we only increased the number of repetitions, then by Chebyshev's inequality, we would need to use Ω(R/δ) repetitions for the same success probability 1 − δ. The latter requires a 1/δ factor more communication/computation, while the former is only a log(1/δ) factor. Setting M = O(log n) leads to a failure probability of 1/poly(n) after taking a union bound over the O(n²) possible pairs.
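The two amplification strategies can be compared with a short calculation, sketched here under the assumptions stated above: each run finds a fixed close pair with probability at least 1/2, and repetitions within a single run are only pairwise independent.

```latex
% (a) M independent copies of LSF-Join (Chernoff-style amplification):
\Pr[\text{pair missed by all } M \text{ copies}] \le (1/2)^{M} = 2^{-\Omega(M)},
\qquad M = O(\log n) \;\Rightarrow\; \text{failure probability } 1/\mathrm{poly}(n).

% (b) Enlarging one run from R to kR pairwise independent repetitions
%     (only Chebyshev applies to the survival count Z):
\Pr[Z = 0] \;\le\; \frac{\mathrm{Var}[Z]}{\mathbb{E}[Z]^2}
           \;\le\; \frac{1}{\mathbb{E}[Z]} \;=\; O(1/k),
\qquad \text{so failure probability } \delta \text{ requires } k = \Theta(1/\delta).
```

Hence option (a) pays a log(1/δ) factor where option (b) pays 1/δ, which is the gap claimed above.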
4 Extensions
In this section, we present several extensions of the LSF-Join algorithm and analysis, such as considering the number of close pairs, using hashing to reduce dimensionality, combining LSF-Join with hash-join, and lowering the communication cost when the similarity graph is a matching.
4.1 Processing Time as a Function of the Profile
While Theorem 1 gives us a worst-case tradeoff between computation and communication, we can better understand this tradeoff by parameterizing the total amount of work of the servers by a data-dependent quantity called the profile, introduced below, which may give a better overall running time in certain cases.
Supposing that R > q, the processors receive multiple sets to process. We choose a random hash function h : [R] → [q] so that processor j receives all sets S_r for which h(r) = j. Each processor then handles R/q sets in expectation.
The processor handling the set S_r receives S_r together with the neighborhood N(v) for each v ∈ S_r, and is responsible for outputting all pairs u, v ∈ S_r for which cos(u, v) ≥ τ.
To bound the total amount of computation, we introduce a data-dependent quantity P. Note that the sets S_r are identically distributed, so we can fix a particular r. We define the profile P of a dataset as the expected number of pairs that survive together in a single repetition: P = Σ_{u ≠ v} 2^{−t·|N(u) ∪ N(v)|}.
We are interested in bounding the overall time for all processors to process the sets S_1, …, S_R.
The total work to process the sets S_1, …, S_R, assuming that we use the brute-force all-pairs algorithm, is O(R·P·d_max) in expectation. The average work per processor is O((R/q)·P·d_max).
After receiving the sets S_r, the total time for all processors to execute their brute-force all-pairs algorithm is O(Σ_r Σ_{u, v ∈ S_r} (d_u + d_v)), which allows for outputting the similar pairs; each surviving pair costs O(d_max), and the expected number of surviving pairs per repetition is P. The theorem follows. ∎
4.2 Hashing to Speed Up Processing
Recall that the processor responsible for finding all similar pairs in S_r receives the set of neighbors N(v) of each node v ∈ S_r. In the case when the neighborhoods are all of comparable size, say size d̄, we can think of each node v as a vector x_v ∈ {0, 1}^d with exactly d̄ ones in it; here x_v is the characteristic vector of the neighbors of v. We can first hash the vector down to k dimensions, for a parameter k. To do this, we use the CountMin map h : [d] → [k], which can be viewed as a random matrix H with a single non-zero per column, and this non-zero is chosen uniformly at random and independently for each of the d columns of H. We replace x_v with H·x_v. If an entry of H·x_v is larger than 1, we replace it with 1, and let the resulting vector be denoted y_v, which is in {0, 1}^k. Note that we can compute all of the y_v for a given repetition in O(Σ_{v ∈ S_r} d_v) time, assuming arithmetic operations on O(log d)-bit words can be performed in constant time.
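This hashing step can be sketched in Python (our illustration; the helper name `countmin_compress` is hypothetical): one random bucket per column, followed by clipping entries to 1.

```python
import random

def countmin_compress(neigh, d, k, seed=0):
    """Hash a set of dimensions in [0, d) down to a 0/1 vector of length k.

    Each dimension is assigned one uniformly random bucket (equivalent to
    a random matrix with a single nonzero per column); entries larger
    than 1 are clipped to 1. The seed must be shared so that all nodes
    use the same bucket assignment.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(k) for _ in range(d)]
    y = [0] * k
    for i in neigh:
        y[bucket[i]] = 1   # clipping: a bucket stays 1 even on collisions
    return y

def inner(y1, y2):
    """Inner product of two 0/1 vectors of equal length."""
    return sum(a * b for a, b in zip(y1, y2))
```

Collisions between distinct dimensions are the only source of error, and with neighborhoods of size d̄ the expected number of colliding dimensions is on the order of d̄²/k, matching the lemma below.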
While y_u and y_v approximately preserve inner products, for two nodes u, v it could be that ⟨y_u, y_v⟩ ≠ ⟨x_u, x_v⟩. We now quantify this.
For any two nodes u, v, it holds that with probability at least 9/10, |⟨y_u, y_v⟩ − ⟨x_u, x_v⟩| ≤ 10·d̄²/k.
Note that each i ∈ N(u) ∩ N(v) is hashed to a bucket h(i), which will be a coordinate that is set to 1 in both y_u and y_v. Also, the probability that an i ∈ N(u) hashes to a bucket containing a j ∈ N(v) with j ≠ i is at most d̄/k, and the expected number of i with this property is at most d̄²/k; such collisions are the only source of error. By a Markov bound, the number of such i is at most 10·d̄²/k with probability at least 9/10, as desired. ∎
By the previous lemma, we can replace the original d-dimensional vectors with the potentially much smaller k-dimensional vectors at a small price in accuracy and success probability.
4.3 Combining LSF-Join with Hash-Join
The LSH-based approach of Hu et al. suggests (in our framework) an alternate strategy of sub-partitioning the survival sets, using a hash-join to distribute the brute-force all-pairs algorithm. Here we analyze this combined approach and plot the tradeoffs. Perhaps surprisingly, we show that this strategy does not provide any benefit in the communication vs. computation tradeoff.
The combined strategy, using q processors, starts by using R^γ repetitions for a parameter γ ∈ [0, 1], and this is followed by a hash-join on each survival set. More precisely, we first construct sets S_1, …, S_{R^γ} using the Fast-Filter survival procedure. Then, for each set S_r, we will process all pairs in S_r using
q/R^γ machines. This can be implemented in one round, because all we need to do is estimate the size of each set S_r approximately, that is, |S_r| ≈ n·2^{−t·d̄} in expectation. Then, we can implement the hash-join in a distributed fashion.
We first review the guarantees of the standard hash-join.
For m vectors and s machines, a hash-join has expected total communication O(m·√s) and expected work O(m²·d̄/s) per machine.
We use this bound to compute the communication and work, when using a hash-join to process each survival set.
The combined approach has expected total communication O(n·2^{−t·d̄}·√q·R^{γ/2}) and expected work O((n·2^{−t·d̄})²·d̄·R^γ/q) per processor.
Each survival set has size n·2^{−t·d̄} in expectation. We use Lemma 7 to analyze the hash-join for each of the R^γ groups of q/R^γ processors. Each group handles n·2^{−t·d̄} inputs, and therefore the communication of the group is O(n·2^{−t·d̄}·√(q/R^γ)). Multiplying by the R^γ groups gives total communication O(n·2^{−t·d̄}·√q·R^{γ/2}),
which is the claimed communication bound. For the per-processor work, applying Lemma 7 with m = n·2^{−t·d̄} and s = q/R^γ gives the claimed bound O((n·2^{−t·d̄})²·d̄·R^γ/q). ∎
Figure 2 demonstrates that the combination approach is never better than the original LSF-Join approach. For a comparison, we consider q processors, and the number of repetitions will be R^γ for γ ∈ [0, 1]. When γ = 1, the survival procedure has expected total communication O(R·n·2^{−t·d̄}), and it has expected work O((R/q)·(n·2^{−t·d̄})²·d̄) per processor. When γ < 1, the combined approach has expected total communication O(n·2^{−t·d̄}·√q·R^{γ/2}), and it has expected work O((n·2^{−t·d̄})²·d̄·R^γ/q) per processor. Notice that γ = 1 corresponds to standard LSF-Join, and γ = 0 corresponds to using a hash-join on the whole dataset.
4.4 When the Similarity Graph is a Matching
Recall that to recover all close pairs with high probability, we need to iterate the LSF-Join algorithm O(log n) times, because each iteration finds a constant fraction of close pairs. We exhibit an improvement using multiple communication steps when the similar pairs are structured. An important application of all-pairs similarity is constructing the similarity graph. In our setting, the similarity graph connects all pairs u, v such that their cosine similarity is at least τ. The structure that we consider is when the similarity graph happens to be a matching, containing exactly n/2 disjoint pairs with similarity at least τ.
The key idea is that each iteration decreases the number of input nodes by a constant fraction. We will remove these nodes (or at least one endpoint from each close pair) from consideration, and then repeat the procedure using the remaining nodes. We observe that this method can also be extended to near-matchings (e.g., small disjoint cliques). Similarly, our result is not specific to LSF-Join, and the technique would work for any LSF similarity join method.
We state our result using the -th iterated log function , where , and for . Then, we show:
Using communication steps, we can find all but a negligible fraction of close pairs when the similarity graph is a matching. The total communication and computation is times the cost of one execution of LSF-Join.
For , we simply run LSF-Join times independently in a single communication step, where each time finds a constant fraction of close pairs. For communication steps, we will use rounds of LSF-Join, and we will remove all found pairs between subsequent rounds (each round will take two communication steps, except for the last, which takes one).
In the first round, we run LSF-Join times. Then, the expected number of pairs that are not found will be , where . In the next round, with rounds remaining, we will only consider the remaining pairs, and we will iterate LSF-Join times. We repeat this process until no more rounds remain, and output the close pairs from all rounds.
We can implement each round of the above algorithm using at most two communication steps. We do so by marking the found pairs between rounds using a single extra communication step. More formally, the input pairs start partitioned across processors. We denote the input partition as . After finding some fraction of close pairs, processor must be notified of which nodes in are no longer active. Whenever processor finds a close pair , it sends the index of to processor such that (and similarly for ), where is known to processor because processor must have sent to processor in LSF-Join. We reduce the total input set from to , where denotes the remaining nodes after removing the found pairs.
To analyze this procedure, notice that the dominant contribution to the total communication and computation is the first round. This is because the subsequent rounds have a geometrically decreasing number of input nodes. The first round uses iterations of LSF-Join, which shows that overall communication and computation is times the cost of one iteration. ∎
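For intuition about how quickly the bound shrinks, the iterated logarithm in the theorem can be evaluated with a small helper (base-2 logarithms are an assumption here, chosen only for illustration):

```python
import math

def iterated_log(x, i):
    """log^{(i)}(x): apply the (base-2, as an assumption) logarithm i times."""
    for _ in range(i):
        x = math.log2(x)
    return x

# Even two applications collapse a large input:
print(iterated_log(2**16, 2))  # 4.0, since log2(log2(2^16)) = log2(16) = 4
```

Already for modest i, the value becomes a small constant, which is what makes the multi-round schedule cheap after the first round.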
4.5 Hashing to Improve Recall
Hashing is not only helpful for reducing the description size of the neighborhood sets, as described in Section 4.2; it can also be used to increase the number of similar pairs surviving a repetition, and thus the recall. Before, a node pair survives a repetition with probability . Hashing can, however, make smaller due to collisions. Suppose we hash the characteristic vector of the neighborhood of a node down to dimensions for some parameter , obtaining the vector , as in Section 4.2. We could, for example, set as in Section 4.2.
Thinking of and as characteristic vectors of sets, and letting , we have
Let be an indicator random variable for the event that the -th bin is non-empty when throwing balls into bins. If the bin is empty, then let . Then , and so , where is the total number of non-empty bins. ∎
By Lemma 8, the expected size of the union of the neighborhoods drops after hashing. This is useful, as the survival probability of the node pair in a repetition after hashing is now , which by the previous lemma is larger than before since and this inequality is strict in expectation. Note, however, that the communication and work per machine increase in expectation, but this tradeoff may be beneficial.
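The balls-into-bins calculation behind Lemma 8 is easy to check numerically. The sketch below (our own illustration, not code from the paper) evaluates the closed form for the expected number of non-empty bins obtained by linearity of expectation, and compares it against a Monte Carlo simulation:

```python
import random

def expected_nonempty_bins(m, b):
    """Expected number of non-empty bins when m balls land uniformly in b bins.
    Each bin is non-empty with probability 1 - (1 - 1/b)^m; sum over bins."""
    return b * (1.0 - (1.0 - 1.0 / b) ** m)

def simulated_nonempty_bins(m, b, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    return sum(len({rng.randrange(b) for _ in range(m)})
               for _ in range(trials)) / trials

m, b = 50, 64
print(expected_nonempty_bins(m, b))  # ≈ 34.9, strictly smaller than m = 50
```

The expected count is strictly below the number of balls whenever collisions are possible, which is exactly the effect that shrinks the union of the hashed neighborhoods and raises the survival probability.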
5 Experimental Results
In this section, we complement the theoretical analysis presented earlier with experiments that measure the recall and efficiency of LSF-Join on three real-world graphs from the SNAP repository: WikiVote, PhysicsCitation, and Epinions. In accordance with our motivation, we also run LSF-Join on an extremely skewed synthetic graph, on which the WHIMP algorithm fails.
| Dataset | Right nodes | Edges | LSF-Join communication (, ) | WHIMP communication | LSF-Join recall | WHIMP recall |
| --- | --- | --- | --- | --- | --- | --- |
| WikiVote | 7K | 104K | 710MB (, ) | 60MB | 100% | 100% |
| Citation | 34K | 421K | 410MB (, ) | 50MB | 100% | 100% |
| Epinions | 60K | 500K | 6GB (, ) | 60MB | 100% | 100% |
| Synthetic | 10M | 200M | 160GB (, ) | Failed | 90% | — |
* The communication cost of LSF-Join depends on the number of survivors, which we note along with the value of .
WHIMP communication cost is dominated by shuffling SimHash sketches. We use 8K bits for SimHash, as suggested in .
We compare LSF-Join against the state-of-the-art WHIMP algorithm from , and hence our setup is close to the one for WHIMP. In this vein, we transform our graphs into bipartite graphs, either by orienting edges from left to right (for directed graphs), or by duplicating nodes on either side (for undirected ones). This is in accordance with the setup of the left side denoting sets and the right side denoting nodes that is described in the introduction. Also, we pre-filter each bipartite graph to have a narrow degree range on the right (the left degrees can still be ) to minimize variance in cosine similarity values due to degree mismatch. This makes the experiments cleaner, and the algorithm itself can run over all degrees in a doubling manner. We use sparse matrix multiplication for computing all-pairs similarity after computing the survivor sets for each bucket , as it is quite fast in practice and consumes memory on each server. Finally, even though we computed a theoretically optimal value of earlier, in practice, a smaller choice of often suffices in combination with repeating the Fast-Filter method for independent iterations.
For each of the graphs, we run LSF-Join on the graph on a distributed MapReduce platform internal to Google, and compare the output similar pairs against a ground truth set generated from a sample of the data. The ground truth set is generated by doing an exact all-pairs computation for a small subset of nodes chosen at random. Using this ground truth, we can measure the efficacy of the algorithm, and the measure we focus on for the evaluation is the recall of similar pairs. (The precision depends on the method used to compute all-pairs similarity in a bucket; since we use sparse matrix multiplication, for us it is 100%.) Specifically, let the set of true similar pairs in the ground truth with similarity at least be denoted by . Furthermore, let the set of similar pairs on the same set of nodes that are returned by the algorithm be . Then, the recall . For a fixed value of , we can measure the change in recall as the number of independent iterations varies (with fixed and ). We run our experiments at a value of that achieves high recall (a strategy that carries across datasets), and the results are summarized in Table 1 for ease of comparison. There is a synthetic dataset included in the table, which is described later. The communication cost for LSF-Join depends on the number of survivors, which in turn depends on the choice of . We do ignore a subtlety here: the communication cost will often be much less than the number of survivors, since multiple independent repetitions produce many copies of the same node, and we need only send one of those copies to a processor.
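The recall measure used in this evaluation is simple to compute. A minimal sketch (with hypothetical variable names; pairs are treated as unordered, so orientation does not matter):

```python
def recall(ground_truth_pairs, returned_pairs):
    """Fraction of ground-truth close pairs (restricted to the sampled
    nodes) that the algorithm returned; pairs are unordered."""
    truth = {frozenset(p) for p in ground_truth_pairs}
    found = {frozenset(p) for p in returned_pairs}
    return len(truth & found) / len(truth)

truth = [(1, 2), (3, 4), (5, 6)]
returned = [(2, 1), (3, 4), (7, 8)]  # (2, 1) matches (1, 2); (7, 8) is outside the sample
print(recall(truth, returned))  # ≈ 0.667, i.e. 2 of 3 ground-truth pairs recovered
```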
We reiterate that our experimental comparison is only against the WHIMP algorithm, as the WHIMP paper demonstrated that commonly used LSH-based techniques are provably worse. Since WHIMP is only applicable when there are no high-degree left nodes, we chose three public graphs for which this assumption holds, so that a comparison is possible. Since the WHIMP algorithm has output-optimal communication complexity, we expect WHIMP to have lower communication cost than LSF-Join, as WHIMP's communication cost is dominated by the number of edges in the graph. This is indeed the case in Table 1. However, LSF-Join trades off higher communication cost for the benefit of load balancing across individual servers. WHIMP does not do any load balancing in the worst case, which can render it inapplicable for a broad class of graphs, as we shall see in the next section. Indeed, the WHIMP job failed on our synthetic graph.
5.1 Synthetic Graph With Extreme Skew
To illustrate a case that WHIMP fails to address, we present results on a synthetic graph that contains the core element of skewness that we set out to address in this work. We anticipate that the same results will hold for several real-world settings, but a synthetic graph suffices for comparison with WHIMP. Indeed, the motivation for this randomly generated synthetic graph comes from user behavior: even though users consume almost the same amount of content (say, videos) online, the content being consumed sees a power-law distribution (e.g., some videos are vastly more popular than others). A simplified setting of the same phenomenon can be captured in the following random bipartite graph construction: we build a bipartite graph , where each right node has degree . Each right node chooses to connect to left nodes as follows: first pick nodes at random (without replacement) from a small set of hot nodes , and pick nodes at random (again, without replacement) from the rest of . If and , this results in right nodes having pairwise cosine similarity that scales with , while the hot dimensions have degree for constant . In this setting, we expect wedge sampling-based methods to fail, since the hot dimensions have large neighborhoods.
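The construction above is easy to reproduce. The sketch below is our own illustration with illustrative parameter values (not necessarily those used in our experiments): right nodes all have the same degree, while the few hot left nodes accumulate enormous degree.

```python
import random

def synthetic_skewed_bipartite(n_right, n_left, n_hot, d_hot, d_rest, seed=0):
    """Each right node connects to d_hot left nodes sampled without
    replacement from a small 'hot' set, plus d_rest nodes sampled
    without replacement from the remaining left nodes."""
    rng = random.Random(seed)
    hot = list(range(n_hot))             # hot left nodes: ids 0 .. n_hot-1
    rest = list(range(n_hot, n_left))    # all other left nodes
    graph = {}
    for v in range(n_right):
        graph[v] = rng.sample(hot, d_hot) + rng.sample(rest, d_rest)
    return graph

g = synthetic_skewed_bipartite(n_right=1000, n_left=5000, n_hot=10,
                               d_hot=5, d_rest=15)
# Every right node has degree 20, but each hot left node is expected to
# appear in about n_right * d_hot / n_hot = 500 neighborhoods.
```

The uniform right degrees and extreme left skew are exactly the regime the three criteria in the introduction target.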
We constructed such a synthetic random bipartite graph with the following parameters: , , and . Then, we repeated the same experiment as the one described above for the real-world graphs. This time, we noted that WHIMP failed, as the maximum degree for left nodes was around . We were able to run our procedure, though, and the recall and the communication cost of the Fast-Filter procedure are shown in Table 1. The recall of the Fast-Filter procedure is shown in Figure (a), and the number of survivors in Figure (b). Note that, as before, we are able to achieve high recall even on this graph with a heavily skewed degree distribution, with reasonable communication cost.
6 Conclusion
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity search. The key idea of the algorithm is the use of a novel LSF scheme. We exhibit an efficient version of this scheme that runs in nearly linear time, utilizing pairwise independent hash functions. We show that LSF-Join effectively finds low similarity pairs in high-dimensional datasets with extreme skew. Theoretically, we provide guarantees on the accuracy, communication, and work of LSF-Join. Our algorithm improves over hash-join and LSH-based methods. Experimentally, we show that LSF-Join achieves high accuracy on real and synthetic graphs, even for a low similarity threshold. Moreover, our algorithm succeeds for a graph with extreme skew, whereas prior approaches fail.
Part of this work was done while D. Woodruff was visiting Google Mountain View. D. Woodruff also acknowledges support from the National Science Foundation Grant No. CCF-1815840.
-  (2012) Fuzzy Joins using MapReduce. In ICDE, Cited by: §1, §1.
-  (2014) Anchor-Points Algorithms for Hamming and Edit Distances Using MapReduce. In ICDT, Cited by: §1.
-  (2016) On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 151–164. Cited by: §1.
-  (2013) Optimizing parallel algorithms for all pairs similarity search. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 203–212. Cited by: §1.
-  (2013) Similarity joins in relational database systems. Synthesis Lectures on Data Management 5 (5), pp. 1–124. Cited by: §1.
-  (2010) Document similarity self-join with mapreduce. In 2010 IEEE International Conference on data mining, pp. 731–736. Cited by: §1.
-  (2007) Scaling up All Pairs Similarity Search. In WWW, Cited by: §1.
-  (2013) Communication steps for parallel query processing. In PODS, pp. 273–284. Cited by: §1, §1.
-  (2014) Skew in Parallel Query Processing. In PODS, Cited by: §1.
-  (2017) Massively-parallel similarity join, edge-isoperimetry, and distance correlations on the hypercube. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 289–306. Cited by: §1.
-  (2018) Scalable and robust set similarity join. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1240–1243. Cited by: §1.
-  (2017) A Framework for Similarity Search with Space-Time Tradeoffs Using Locality-Sensitive Filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 31–46. Cited by: §1, §1.
-  (2005) An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55 (1), pp. 58–75. Cited by: §4.2.
-  (2018) Overlap set similarity joins with theoretical guarantees. In Proceedings of the 2018 International Conference on Management of Data, pp. 905–920. Cited by: §1.
-  (2018) Set similarity joins on mapreduce: an experimental survey. Proceedings of the VLDB Endowment 11 (10), pp. 1110–1122. Cited by: §1.
-  Approximate nearest neighbor: towards removing the curse of dimensionality. Theory of Computing 8 (1), pp. 321–350. Cited by: §1.
-  (2019-04) Output-Optimal Massively Parallel Algorithms for Similarity Joins. ACM Trans. Database Syst. 44 (2), pp. 6:1–6:36. Cited by: §1, §4.3.
-  (2010) A model of computation for mapreduce. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 938–948. Cited by: §1.
-  (2018) Algorithmic aspects of parallel data processing. Foundations and Trends® in Databases 8 (4), pp. 239–370. Cited by: §1.
-  (2014-06) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: §5.
-  (2014) Mining of Massive Datasets. Cambridge University Press. Cited by: §1.
-  (2016) An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment 9 (9), pp. 636–647. Cited by: §1, §1.
-  (2018) Set similarity search for skewed data. In Proc. of the 37th Symp. on Principles of Database Systems (PODS), pp. 63–74. Cited by: §1, §1.
-  (2018) Adaptive mapreduce similarity joins. In Proc. 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, pp. 4. Cited by: §1.
-  (2019) Hardness of bichromatic closest pair with jaccard similarity. In 27th Annual European Symposium on Algorithms (ESA 2019), Cited by: §1.
-  (2013-02) Upper and Lower Bounds on the Cost of a Map-reduce Computation. Proc. VLDB Endow. 6 (4), pp. 277–288. Cited by: §1.
-  (2017) When hashes met wedges: a distributed algorithm for finding high similarity vectors. In Proceedings of the 26th International Conference on World Wide Web (WWW), pp. 431–440. Cited by: §1, §1, §1, §1, §5, Table 1.
-  (2016) An experimental survey of mapreduce-based similarity joins. In Similarity Search and Applications: 9th International Conference, SISAP 2016, Tokyo, Japan, October 24-26, 2016, Proceedings, L. Amsaleg, M. E. Houle, and E. Schubert (Eds.), Cham, pp. 181–195. Cited by: §1.
-  (2013) Streaming Similarity Search Over One Billion Tweets Using Parallel Locality-Sensitive Hashing. PVLDB 6 (14), pp. 1930–1941. Cited by: §1.
-  (2010) Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pp. 495–506. Cited by: §1.
-  (2013) Locality Sensitive Hashing Revisited: Filling the Gap Between Theory and Algorithm Analysis. In CIKM, New York, NY, USA, pp. 1969–1978. Cited by: §1.
-  (2012) Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 85–96. Cited by: §1.
-  (2017) Leveraging set relations in exact set similarity join. Proceedings of the VLDB Endowment 10 (9), pp. 925–936. Cited by: §1.
-  (2011) Efficient Similarity Joins for Near-duplicate Detection. ACM Transactions on Database Systems 36 (3), pp. 15. Cited by: §1, §1.
-  (2016) A generic method for accelerating lsh-based similarity join processing. IEEE Transactions on Knowledge and Data Engineering 29 (4), pp. 712–726. Cited by: §1.
-  (2016) LSH ensemble: internet-scale domain search. Proceedings of the VLDB Endowment 9 (12), pp. 1185–1196. Cited by: §1, §1.