1 Introduction
Similarity search is a widely used primitive in data mining applications, and all-pairs similarity in particular is a common data mining operation [1, 7, 21, 34]. Motivated by recommender systems and social networks, we design algorithms for computing all-pairs set similarity (a.k.a. a set similarity join). In particular, we consider the similarity of nodes in terms of a bipartite graph. We wish to determine similar pairs of nodes from one side of the graph. For each node v on the right, we consider its neighborhood N(v) on the left. Equivalently, we can think of v as the set N(v) of its neighbors in the graph. Using this representation, many graph-based similarity problems can be formulated as finding pairs of nodes with significantly overlapping neighborhoods. We focus on the cosine similarity between pairs of nodes u and v, represented as high-dimensional binary vectors.
Although set similarity search has received a lot of attention in the literature, there are three aspects of modern systems that have not been adequately addressed yet. Concretely, we aim to develop algorithms that come with provable guarantees and that handle the following three criteria:

Distributed and Scalable. The algorithm should work well in a distributed environment like MapReduce, and should scale to large graphs using a large number of processors.

Low Similarity. The algorithm should output most pairs of sets even when the normalized set similarity is relatively low, such as a setting where the cosine similarity takes small, sub-constant values.

Extreme Skew. The algorithm should provably work well even when the dimensions (degrees on the left) are highly irregular and skewed.
The motivation for these criteria comes from recommender systems and social networks. For the first criterion, we consider graphs with a large number of vertices. For the second, we wish to find pairs of nodes that are semantically similar without having a large cosine value. This situation is common in collaborative filtering and user similarity [27], where two users may be alike even though they overlap on a small number of items (e.g., songs, movies, or citations). Figure 1 depicts the close-pair histogram of a real graph, where most similar pairs have low cosine similarity. For the third criterion, skewness has come to recent attention as an important property [5, 23, 36], and it can be thought of as power-law type behavior for degrees on the left. In contrast, most other prior work assumes that the graph has uniformly small degrees on the left [22, 27, 28]. This smoothness assumption is reasonable in settings where the graph is curated by manual actions (e.g., the Twitter follow graph). However, it is too restrictive in some settings, such as a graph of documents and entities, where entities can legitimately have high degrees, and throwing away these entities may remove a substantial source of information. Another illustration of this phenomenon can be observed even on human-curated graphs, e.g., the Twitter follow graph, where computing similarities among consumers (instead of producers, as in [27]) runs into a similar issue.
Previous work fails to handle all three of the above criteria. When finding low-similarity items (i.e., pairs with small cosine similarity), standard techniques like Locality-Sensitive Hashing [16, 29, 36] are no longer effective (because the number of hashing iterations is too large). Recently, there have been several proposals for addressing this, and the closest one to ours is the wedge-sampling algorithm from [27]. However, the approach in [27] has one severe shortcoming: it requires that each dimension has a relatively low frequency (i.e., the bipartite graph has small left degrees).
In this work, we address this gap by presenting a new distributed algorithm, LSF-Join, for approximate all-pairs similarity that can scale to large graphs with high skewness. As a main contribution, we provide theoretical guarantees on our algorithm, showing that it achieves very high accuracy. We also provide guarantees on the communication, work, and maximum load in a distributed environment with a very large number of processors.
Our approach uses Locality Sensitive Filtering (LSF) [12]. This is a variant of the ideas used for Locality Sensitive Hashing (LSH). The main difference between LSF and LSH is that LSF constructs a single group of surviving elements based on a hash function (for each iteration). In contrast, LSH constructs a whole hash table, each time, for a large number of iterations. While the hashing and sampling ideas are similar, the benefit of LSF is in its computation and communication costs. Specifically, our LSF scheme will have the property that if an element survives in s out of k total hash functions, then the computation scales with s and not k. For low-similarity elements, s is usually substantially smaller than k, resulting in a lower overall cost (for example, s may be sublinear while k is linear in the input size). We also provide an efficient way to execute this filtering step on a per-node basis.
Our LSF procedure can also be viewed as a preprocessing step before applying any all-pairs similarity algorithm (even one needing a smaller problem size and a graph without skew). The reason is that the survival procedure outputs a number of smaller subsets of the original dataset, each with a different, smaller set of dimensions, along with a guarantee that no dimension has a high degree. The procedure also ensures that similar pairs are preserved with high probability. Then, after performing this filtering, we may use other steps to improve the computation time. For example, applying a hashing technique may reduce the effective dimensionality without affecting the similarity structure.
Problem Setup
The input consists of a bipartite graph with a set U of vertices on the left and a set V of n vertices on the right. We denote this graph as G = (U, V, E), and we refer to U as the set of dimensions and to V as the set of nodes. Given a parameter τ ∈ (0, 1), we want to output all similar pairs of nodes u, v ∈ V such that

cos(u, v) = |N(u) ∩ N(v)| / √(|N(u)| · |N(v)|) ≥ τ.

This problem also encapsulates other objectives, such as finding the top results per node. Note that we could equivalently identify each node v with the set of its neighbors N(v), and hence, this problem is the same as the set similarity join problem with input {N(v) : v ∈ V} and threshold τ for cosine similarity. We describe our algorithm in a MapReduce-like framework, and we analyze it in the massively parallel computation model [8, 19], which captures the theoretical properties of MapReduce-inspired models (e.g., [26, 18]).
We have m processors in a shared-nothing distributed environment. The input data starts arbitrarily partitioned among the processors. Associated to each node v on the right is a vector x_v ∈ {0,1}^{|U|}, which is the indicator vector of the neighbors of v on the left. We would like to achieve the twin properties of load-balanced servers and low communication cost.
Our Contributions
The main contribution of our work is a new randomized, distributed algorithm, LSF-Join, which provably finds almost all pairs of sets with cosine similarity above a given threshold τ. Our algorithm satisfies all three of the criteria mentioned above (scalability, low similarity, and skewness). A key component of LSF-Join is a new randomized LSF scheme, which we call the survival procedure. The goal of this procedure is to find subsets of the dataset that are likely to contain similar pairs. In other words, it acts as a filtering step. Our LSF procedure comes with many favorable empirical and theoretical properties. First, we can execute it in nearly-linear time, which allows it to scale to very large datasets. Second, we exhibit an efficient way to implement it in a distributed setting with a large number of processors, using only a single round of communication for the whole LSF-Join algorithm. Third, the survival procedure leads to subquadratic local work, even when the dimensions are highly skewed and the similarity threshold is relatively low. To achieve these properties, we demonstrate how to implement the filtering using efficient, pairwise independent hash functions, and we show that even in this setting, the algorithm has good provable guarantees on accuracy and running time. We also present a number of theoretical optimizations that better illuminate the behavior of the algorithm on datasets with different structural properties. Finally, we empirically validate our results by testing LSF-Join on multiple graphs.
Related Work
Many filtering-based similarity join algorithms provide exact algorithms and rely on heuristics to improve the running time [4, 6, 15, 22, 30, 32, 33, 34]. We primarily review prior work that is relevant to our setting and provides theoretical guarantees. One related work uses LSF for set similarity search and join on skewed data [23]. Their data-dependent method leads to a sequential algorithm based on the frequency of dimensions, improving a prior LSF-based algorithm [12]. Unfortunately, it seems impossible to adapt their method to the one-round distributed setting. Another relevant result is the wedge-sampling approach in [27]. They provide a distributed algorithm for low-similarity joins on large graphs. However, their algorithm assumes that the dataset is not skewed.
In the massively parallel computation model [8, 9], multi-round algorithms have been developed that build off of LSH for approximate similarity joins, achieving output-optimal guarantees on the maximum load [17, 24]. However, it can be prohibitively expensive to use multiple rounds in modern shared-nothing clusters with a huge number of processors. In particular, the previous work achieves good guarantees only when the number of nodes is sufficiently large relative to the number of processors. We focus on one-round algorithms, and we allow the number of processors to be very large, which may be common in very large computing environments. Algorithms using LSH work well when the similarity threshold τ is large enough, such as a constant. However, for smaller τ, LSH-based distributed algorithms require too much computation and/or communication due to the large number of repetitions [11, 31, 27, 35]. Prior work has also studied finding extremely close pairs [2, 1, 10] or finding pairs of sets with constant-size intersection [14]. These results do not apply to our setting because we aim to find pairs of large-cardinality sets with low cosine similarity, and we allow the intersection size to be large in magnitude.
2 The LSF-Join Algorithm
We start with a high-level overview of our set similarity join algorithm, LSF-Join, which is based on a novel and effective LSF scheme. Let G = (U, V, E) be the input graph with the dimensions U on the left and the n nodes V on the right. For convenience, we refer to the vertices and their indices interchangeably, where we use [n] to denote the set {1, 2, …, n}.
The LSF-Join algorithm uses k independent repetitions of our filtering scheme (where the setting of k achieves the best tradeoff). In the ith repetition we create a set S_i of survivors from the set V of vertices on the right. We will define the LSF procedure shortly; it determines the subsets S_i in a data-independent fashion. During the communication phase, the survival sets will be distributed in their entirety across the processors. In particular, if there are m processors, then each processor will handle roughly k/m different repetitions. During the local computation, the processors will locally compute all similar pairs in S_i for each i ∈ [k] and output these pairs in aggregate (in a distributed fashion). As part of the theoretical analysis, we show that the size of each S_i is concentrated around its mean, and therefore, our algorithm has balanced load across the processors. To achieve high recall of similar pairs, we will need to execute the LSF-Join algorithm O(log n) times independently, so that the failure probability will be polynomially small. Fortunately, this only increases the communication and computation by an O(log n) factor. We execute the iterations in parallel, and LSF-Join requires only one round of communication.
2.1 Constructing the Survival Sets
We now describe our LSF scheme, which boils down to describing how to construct the survival sets. We have two main parameters of interest: p denotes the survival probability of a single dimension (on the left), and k denotes the number of repetitions. The simplest way to describe our LSF survival procedure goes via uniform random sampling. We refer to this straightforward scheme as the NaiveFilter method, and we describe it first. Then, we explain how to improve this method by using a pairwise independent filtering scheme, which will be much more efficient in practice. We refer to the improved LSF scheme as the FastFilter method. Later, we also show that FastFilter enjoys many of the same theoretical guarantees as NaiveFilter, with much lower computational cost.
NaiveFilter.
For the naive version of our filtering scheme, consider a repetition number i ∈ [k]. We choose a uniformly random set T_i of vertices on the left by including each node in T_i with probability p, independently. Then, we filter vertices v on the right depending on whether their neighborhood is completely contained in T_i or not (that is, whether N(v) ⊆ T_i or not). The ith survival set S_i will be the set of vertices v such that N(v) ⊆ T_i. We repeat this process independently for each i ∈ [k], to derive k filtered sets of vertices S_1, …, S_k. Notice that for each v, the probability that v survives in S_i is exactly p^{d_v}, where d_v = |N(v)| is the number of neighbors of v on the left.
The intuition behind using this filtering method for set similarity search is that similar pairs are relatively likely to survive in the same set. Indeed, the chance that both u and v survive in S_i is equal to p^{|N(u) ∪ N(v)|}. When the cosine similarity is large, |N(u) ∩ N(v)| must be large, and consequently |N(u) ∪ N(v)| is much smaller than d_u + d_v. In other words, u and v are more likely to survive together if they are similar, and less likely if they are very different. For example, consider the case where both nodes have degree d. Then, pairs with cosine similarity at least τ will survive together with probability at least p^{(2−τ)d}. At the other extreme, disjoint pairs only survive together with probability p^{2d}.
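As a concrete illustration (with numbers of our own choosing, purely for intuition): a pair of degree-d nodes with cosine similarity τ shares τd neighbors, so the union of their neighborhoods has (2 − τ)d dimensions.

```python
p, d, tau = 0.5, 4, 0.5

# A pair of degree-d nodes with cosine similarity tau shares tau*d neighbors,
# so the union of their neighborhoods has (2 - tau) * d dimensions.
p_similar = p ** ((2 - tau) * d)   # both survive together: p^{(2-tau)d}
p_disjoint = p ** (2 * d)          # a disjoint pair: p^{2d}

ratio = p_similar / p_disjoint     # similar pairs collide 4x more often here
```

With these numbers, similar pairs survive together with probability 1/64, while disjoint pairs do so with probability 1/256; the gap widens as d grows.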
The main drawback of the NaiveFilter method is that it takes too much time to determine all indices i such that v ∈ S_i. Consider the set of v's neighbors N(v). We need to determine whether N(v) ⊆ T_i for every i ∈ [k]. Hence, it requires at least Ω(k) work to compute the indices where v survives, that is, the set {i : v ∈ S_i}. We will need to set k to be quite large, and hence, the work of NaiveFilter is linear in the number of repetitions or worse for each node v. To improve upon this, our FastFilter method will have work proportional to |{i : v ∈ S_i}|, and we show that this is often considerably smaller than k.
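The NaiveFilter scheme can be sketched in a few lines of Python (the function name and dictionary-based graph representation are ours, purely for illustration); note the Θ(k) work per node that motivates the faster method:

```python
import random

def naive_filter(neighbors, left_size, p, k, seed=0):
    """NaiveFilter sketch: for each repetition i, sample a random subset T_i of
    the left vertices (each included independently with probability p); a right
    node v survives in S_i iff its entire neighborhood N(v) lies inside T_i.
    `neighbors` maps each right node to its set of left neighbors."""
    rng = random.Random(seed)
    survival_sets = []
    for _ in range(k):
        T = {u for u in range(left_size) if rng.random() < p}
        S = {v for v, N in neighbors.items() if N <= T}
        survival_sets.append(S)
    return survival_sets
```

Every node is tested against every repetition, which is exactly the cost the FastFilter method described next avoids.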
2.2 The FastFilter Method
The key idea behind our fast filtering method is to develop a pairwise independent filtering scheme that approximates the uniform sampling of the survival sets. We then devise a way to efficiently compute the survival sets on a per-node basis, by using fast matrix operations. More precisely, for each node v on the right, FastFilter will determine the indices i of the survival sets in which v survives (that is, for which v ∈ S_i). We develop a way to compute these indices independently for each vertex by using Gaussian elimination on binary matrices. The FastFilter method only requires a small amount of shared randomness between the processors.
To describe the FastFilter method, it will be convenient to assume that log k and log(1/p) are both integers (logarithms are base two). We now explain the pairwise independent filtering scheme. For each node u on the left, we sample a random binary matrix A_u with log(1/p) rows and log k columns, and a random bitstring b_u of length log(1/p). We identify each of the k repetitions with a binary vector in the (log k)-dimensional vector space over GF(2), the finite field with two elements. In other words, we use the binary representation of i to associate i with a bitstring of length log k, and we perform matrix and vector operations modulo two. We abuse notation and use i for both the integer and the bitstring, where context will distinguish the two.
To determine whether a node v survives in S_i, we perform the following operation. We first stack the matrices A_u on top of each other, for each of v's neighbors u ∈ N(v). This forms a matrix A_v with d_v · log(1/p) rows and log k columns. We also stack the vectors b_u on top of each other, forming a bitstring b_v of length d_v · log(1/p). Finally, we define S_i by setting v ∈ S_i if and only if A_v · i + b_v = 0, where 0 denotes the all-zeros vector; equivalently, A_v · i = b_v over GF(2). We say that v survives the ith repetition if A_v · i = b_v. Then {i : A_v · i = b_v} is the set of indices of the repetitions in which v survives.
In a one-round distributed setting, the processors can effectively precompute the submatrices A_u and the subvectors b_u using a shared seed. In particular, these may be computed on the fly, as opposed to stored up front, by using a shared random seed and an efficient hash function to compute the elements of A_u and b_u only when processing a node v such that u ∈ N(v). By doing so, the processors will use the same values of A_u and b_u as one another, leading to consistent survival sets, without incurring any extra rounds of communication.
To gain intuition about this filtering procedure, let d_v denote the number of v's neighbors. Node v will survive in S_i if i satisfies A_v · i = b_v. This consists of d_v · log(1/p) linear equations that i must satisfy. As the matrix A_v and the vector b_v are chosen uniformly at random, it is easy to check that v survives in S_i with probability 2^{−d_v · log(1/p)} = p^{d_v}, and hence, the expected sizes satisfy E[|S_i|] = Σ_{v ∈ V} p^{d_v}, over a random A_v and b_v.
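This survival probability is easy to sanity-check with a toy simulation (our own experiment, not from the paper): each of the d_v · log(1/p) random GF(2) equations is satisfied by a fixed repetition label with probability 1/2, so a node survives with probability 2^{−d_v · log(1/p)} = p^{d_v}.

```python
import random

def survival_fraction(num_rows, t, trials=4000, seed=1):
    """Fraction of trials in which a fixed t-bit repetition label satisfies
    num_rows freshly sampled random GF(2) equations a . i = b (mod 2).
    Expected fraction: 2 ** -num_rows."""
    rng = random.Random(seed)
    label = [1] + [0] * (t - 1)  # an arbitrary fixed repetition label
    hits = 0
    for _ in range(trials):
        if all(
            sum(x & y for x, y in zip([rng.randrange(2) for _ in range(t)], label)) % 2
            == rng.randrange(2)
            for _ in range(num_rows)
        ):
            hits += 1
    return hits / trials
```

With two rows (i.e., p^{d_v} = 1/4), the empirical fraction concentrates near 0.25.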
Theoretically, the main appeal of FastFilter is that it is pairwise independent in the following sense. For any two distinct repetitions i and i′, the bitstrings for i and i′ differ in at least one bit. Therefore, the event that A_v · i = b_v is satisfied is independent of the event that A_v · i′ = b_v, over the random choice of A_v and b_v. While this is only true for pairs of repetitions, this level of independence will suffice for our theoretical analysis. Furthermore, we show that we can determine the survival sets containing v in time proportional to the number of such sets, which is often much less than the total number k of possible sets.
We now explain how to efficiently compute the survival sets on a per-node basis. For a fixed node v, the FastFilter method determines the repetitions that v survives in, or in other words, the set {i : v ∈ S_i}. This is equivalent to finding all bitstrings i of length log k that are solutions to A_v · i = b_v. The processor can form A_v and b_v in O(d_v · log(1/p)) time, assuming the unit-cost RAM model on words of Θ(log n) bits. Then, we can use Gaussian elimination over bitstrings to very quickly find all i that satisfy A_v · i = b_v. To understand the complexity of this, first note that A_v has log k columns. Moreover, without loss of generality, at most log k rows matter after reduction, as additional rows are either redundant or certify that no solution exists. Therefore, Gaussian elimination takes O(d_v · log(1/p) · log k) time to write A_v in upper triangular form (and to correspondingly rewrite b_v) so that all solutions to A_v · i = b_v can be enumerated in time proportional to the number of solutions to this equation. The expected total work is therefore O(Σ_{v ∈ V} (d_v · log(1/p) · log k + k · p^{d_v})).
This can be parallelized for each node independently.
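A minimal Python sketch of this step (our own illustrative implementation, using bit lists rather than the packed machine words an optimized version would use): reduce the augmented system [A_v | b_v] over GF(2), then enumerate all solutions via the free columns.

```python
import itertools

def solve_gf2(A, b):
    """Enumerate all x in {0,1}^t with A x = b over GF(2), via Gaussian
    elimination. A is a list of bit rows; b is a list of bits."""
    t = len(A[0]) if A else 0
    rows = [list(r) + [bb] for r, bb in zip(A, b)]  # augmented matrix [A | b]
    pivots, r = [], 0
    for c in range(t):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):       # full reduction: clear column c
            if i != r and rows[i][c]:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    for i in range(r, len(rows)):        # zero row with nonzero rhs: no solution
        if rows[i][t]:
            return []
    free = [c for c in range(t) if c not in pivots]
    solutions = []
    for assignment in itertools.product([0, 1], repeat=len(free)):
        x = [0] * t
        for c, v in zip(free, assignment):
            x[c] = v
        for i, c in enumerate(pivots):   # back-substitute pivot variables
            x[c] = rows[i][t] ^ (sum(rows[i][j] & x[j] for j in free) % 2)
        solutions.append(x)
    return solutions
```

The enumeration over the free columns is exactly why the work is proportional to the number of surviving repetitions rather than to k.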
We prove guarantees about FastFilter in Theorem 2. The pseudocode for FastFilter appears as Algorithm 1. The main difference between the two filtering methods is how the random survival sets are chosen. For the sake of this discussion, we set p = 1/2, which is reasonable in practice, and we continue to let d_v denote the degree of v. In the FastFilter method, we use a random linear map over GF(2) with enough independent randomness to decide, for each repetition, whether or not a node survives. By using Gaussian elimination, we are able to compute {i : v ∈ S_i} in time proportional to its size. In particular, the expected amount of work for v is proportional to d_v plus the expected number k · p^{d_v} of repetitions that v survives in.
The pseudocode for LSF-Join appears as Algorithm 2. We assume that the vertices start partitioned arbitrarily across the m processors. For each vertex v in parallel, we use FastFilter to determine the indices of the sets S_i in which v survives. As detailed above, we can do so consistently by using a shared random seed for FastFilter. During the communication phase, we randomly distribute the sets S_1, …, S_k across the processors, so that each processor handles k/m sets in expectation. Then, during local computation, we compare all pairs in S_i for each i in parallel. We use O(log n) independent iterations of the algorithm in parallel to find all close pairs with high probability (e.g., recall close to one). Finally, we output all pairs with cosine similarity at least τ in a distributed fashion.
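The communication and local phases can be sketched as follows (a single-process simulation of ours; in a real deployment each survival set would be shipped to the processor that a shared random hash assigns it):

```python
import itertools
import math

def distribute_and_compare(survival_sets, neighbors, m, tau):
    """Simulate the one-round plan: survival set S_i goes to processor i % m
    (a stand-in for a random hash of the repetition index); each processor then
    brute-force compares all pairs within each of its sets and keeps the pairs
    with cosine similarity at least tau."""
    outputs = [set() for _ in range(m)]
    for i, S in enumerate(survival_sets):
        proc = i % m
        for a, b in itertools.combinations(sorted(S), 2):
            Na, Nb = neighbors[a], neighbors[b]
            if Na and Nb and len(Na & Nb) / math.sqrt(len(Na) * len(Nb)) >= tau:
                outputs[proc].add((a, b))
    return outputs
```

Because the assignment of sets to processors depends only on the repetition index, no coordination beyond the shared seed is needed.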
One way of processing each set S_i is to compare all pairs in this set. Specifically, for all pairs of nodes u, v ∈ S_i, explicitly compute cos(u, v) and check whether it is at least τ. One can assume the lists N(u) and N(v) are sorted arrays of integers. Thus, one can compute |N(u) ∩ N(v)| by merging these sorted lists in O(d_u + d_v) time, assuming words of length O(log n) bits can be manipulated in constant time in the unit-cost RAM model.
Letting d_max be the maximum degree d_v over v ∈ S_i, the time to locally compare all pairs in the set S_i is O(|S_i|² · d_max). We can also bound the average amount of work across processors to handle all of the sets S_1, …, S_k; we do so in Lemma 3.
We call this the brute-force all-pairs algorithm.
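The merge-based intersection computation can be sketched as follows (our own code; both lists are assumed sorted in ascending order):

```python
import math

def cosine_from_sorted(a, b):
    """Cosine similarity |a ∩ b| / sqrt(|a| |b|) of two sorted integer
    lists, computed by a single linear merge in O(|a| + |b|) time."""
    i = j = intersection = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            intersection += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return intersection / math.sqrt(len(a) * len(b))
```

This is the comparison primitive the brute-force all-pairs algorithm invokes once per pair in a survival set.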
2.2.1 Setting the Parameters
Let d̄ denote the average degree on the right in the input graph. Ideally, these parameters should satisfy

k · p^{(2−τ)·d̄} = 2, (1)

or in other words, k = 2 · p^{−(2−τ)·d̄}, where 2 could be replaced with a larger constant for improved recall. If it is possible to approximately satisfy (1) with log(1/p) being an integer, then running O(log n) independent iterations of the algorithm with these parameters will work very well. For example, this is the case when p is a negative constant power of two. However, for a large average degree d̄, the required p may exceed 1/2. To approximate such a p, we can subsample the matrices and vectors to increase the effective collision probability. More precisely, consider p > 1/2. If we wish a node to survive in a repetition with probability p^{d̄}, then we can solve for the number of rows r in the equality 2^{−r} = p^{d̄}, and we subsample the rows in A_v and b_v down to r. This effectively constructs survival sets as in NaiveFilter with probability p of each neighbor surviving. In the theoretical results, we will assume that p and k satisfy (1). In the experiments, we either set p to be 1/2, or we use the matrix subsampling approach; we also vary the number c of independent iterations to improve recall.
3 Theoretical Guarantees
For simplicity, we assume the graph is right-regular, with all nodes in V having degree d. In practice, we can repeat the algorithm for different small ranges of degrees. First, notice that

Pr[v ∈ S_i] = p^d. (2)
Now consider two nodes u, v ∈ V. Then both u and v are in S_i if and only if the following event occurs. Let A be the matrix obtained by stacking A_u on top of A_v, and b be the vector obtained by stacking b_u on top of b_v. Note that for each w ∈ N(u) ∩ N(v), the rows of A_w occur twice in A and the coordinates of b_w occur twice in b. Thus, it suffices to retain only one copy of A_w and b_w for each such w, and by doing so we reduce the number of rows of A and entries of b to at most |N(u) ∪ N(v)| · log(1/p). Consequently,

Pr[u ∈ S_i and v ∈ S_i] = p^{|N(u) ∪ N(v)|}. (3)

Notice that at one extreme, if N(u) and N(v) are disjoint, then (3) evaluates to p^{2d}. At the other extreme, if N(u) = N(v), then |N(u) ∪ N(v)| = d, and (3) evaluates to p^d.
The discrepancy between (2) and (3) is exactly what we exploit in our LSF scheme; namely, we use the fact that similar pairs are more likely to survive together in a repetition than dissimilar pairs.
We first justify the setting of k in (1).
Lemma 1.
Let u, v ∈ V be such that cos(u, v) ≥ τ. The expected number of repetitions i for which both u ∈ S_i and v ∈ S_i is at least 2.
Proof.
Since cos(u, v) ≥ τ and both nodes have degree d, we have |N(u) ∩ N(v)| ≥ τd, and hence |N(u) ∪ N(v)| = 2d − |N(u) ∩ N(v)| ≤ (2 − τ)d. By (3) and linearity of expectation, the expected number of repetitions in which both survive is k · p^{|N(u) ∪ N(v)|} ≥ k · p^{(2−τ)d} = 2, by the setting of k in (1). ∎
Lemma 2.
The expected load per processor is (k/m) · n·p^d, and the expected total communication is k · n·p^d.
Proof.
There are k repetitions, each concerning one survival set. Each node survives in S_i with probability p^d. The expected size of S_i is therefore n·p^d. Each processor handles k/m repetitions, leading to (k/m) · n·p^d expected load. The total communication is Σ_{i=1}^{k} |S_i|, which has expectation k · n·p^d. ∎
Lemma 3.
Using brute-force all-pairs locally, the expected work per machine is O((k/m) · (n·p^d)² · d).
Proof.
Each repetition has expected size n·p^d, leading to expected work O((n·p^d)² · d) for the pairwise comparisons in one set (up to lower-order terms from correlations between survival events). Each processor handles k/m repetitions, implying O((k/m) · (n·p^d)² · d) work per processor in expectation. ∎
Combining the lemmas and plugging in the value of k from (1) gives us the following.
Theorem 1.
Setting k as in (1), the survival procedure has total communication

O(n · p^{−(1−τ)d})

and local work

O((n² · d / m) · p^{τd})

in expectation.
As an example, we compare to hash-join, which compares all pairs of nodes and has local work Θ(n² · d / m). By Theorem 1, our expected local work carries an extra factor of p^{τd} < 1, and hence the work is sublinear in the all-pairs cost, improving over hash-join while using a comparable amount of total communication. As we will see in the theorem below, it is crucial that we use the family of pairwise independent hash functions above for generating our randomness.
Theorem 2.
The expected total time the nodes in V need to generate the sets S_1, …, S_k is

O(n · d · log(1/p) · log k + k · n · p^d),

and the expected total time and communication that the nodes in V need to send the sets S_i, for each i ∈ [k], is

O(k · n · p^d).
Proof.
Each node v needs to figure out the repetitions that it survives in. It can form A_v and b_v in O(d · log(1/p)) time, assuming the unit-cost RAM model on words of Θ(log n) bits. It then needs to figure out which i satisfy A_v · i = b_v. To do so, it can just solve this equation using Gaussian elimination. Note that A_v has d · log(1/p) rows and log k columns, and without loss of generality at most log k rows matter, as the remaining rows are either redundant or certify that no solution exists. Therefore, Gaussian elimination takes at most O(d · log(1/p) · log k) time to write A_v in upper triangular form (and to correspondingly rewrite b_v) so that all solutions to the equation can be enumerated in time proportional to the number of solutions. Thus, the expected time per node is O(d · log(1/p) · log k + k·p^d), where we have used (2) to bound the expected number of repetitions that v survives in by k·p^d. Thus, the total expected time to form all of the S_i, for i ∈ [k], is O(n · d · log(1/p) · log k + k·n·p^d). Note that O(k·n·p^d) is also the total expected amount of communication. ∎
While correct in expectation, the randomness used across the k repetitions is not independent; namely, we use the same matrices A_u and vectors b_u for each node u ∈ U. It is therefore important to show that the variance of the number of repetitions i for which both u ∈ S_i and v ∈ S_i is small. This enables one to show that the probability there is at least one repetition for which both u and v survive is a large enough constant, which can be amplified to any larger constant by independently repeating a constant number of times.
Lemma 4.
Let u, v ∈ V be such that cos(u, v) ≥ τ. With probability at least 1/2, there is a repetition i with both u ∈ S_i and v ∈ S_i.
Proof.
Let Z_i be an indicator random variable which is 1 if u and v both survive the ith repetition, and is 0 otherwise. Let Z = Σ_{i=1}^{k} Z_i be the number of repetitions for which both u and v survive. By Lemma 1, E[Z] ≥ 2. It is well-known that the hash function family h_{A,b}(i) = A·i + b, where A and b range over all possible binary matrices and vectors, respectively, is a pairwise independent family. It follows that Z_1, …, Z_k are pairwise independent random variables, and consequently Var[Z] = Σ_i Var[Z_i] ≤ Σ_i E[Z_i] = E[Z]. By Chebyshev's inequality, Pr[Z = 0] ≤ Var[Z] / E[Z]² ≤ 1 / E[Z] ≤ 1/2. ∎
Efficiently Amplifying Recall.
At this point, we have shown that one iteration of LSF-Join will find a constant fraction of the close pairs. To amplify the recall, we run c copies of LSF-Join in parallel. We emphasize that this is a more efficient way to achieve a high-probability result than simply increasing the number of repetitions in a single LSF-Join execution. Intuitively, this is because the repetitions are only guaranteed to be pairwise independent. Theoretically, c independent copies lead to a failure probability of 2^{−Θ(c)} by a Chernoff bound. But, if we only increased the number of repetitions, then by Chebyshev's inequality, we would need 2^{Θ(c)} · k repetitions for the same success probability. The latter requires a 2^{Θ(c)} factor more communication/computation, while the former is only a factor of c. Setting c = Θ(log n) leads to a failure probability of 1/poly(n) after taking a union bound over the O(n²) possible pairs.
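The gap between the two amplification strategies can be seen with a quick calculation (illustrative numbers of ours): independent copies drive the failure probability down geometrically, while the Chebyshev route forces the expected collision count, and hence the repetition count, up to the reciprocal of the target failure probability.

```python
import math

def copies_needed(per_copy_success, target_failure):
    """Number of independent copies c such that (1 - per_copy_success)^c
    is at most target_failure (Chernoff-style amplification)."""
    return math.ceil(math.log(target_failure) / math.log(1.0 - per_copy_success))

def repetition_blowup(target_failure):
    """Multiplicative blowup in repetitions for the same failure probability
    via Chebyshev: failure <= 1/E[Z] forces E[Z] >= 1/target_failure."""
    return math.ceil(1.0 / target_failure)
```

For a per-copy success probability of 1/2 and a target failure probability of 0.001, roughly ten copies suffice, whereas the single-execution route needs a thousandfold increase in repetitions.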
4 Optimizations
In this section, we present several extensions of the LSF-Join algorithm and analysis, such as considering the number of close pairs, using hashing to reduce dimensionality, combining LSF-Join with hash-join, and lowering the communication cost when the similarity graph is a matching.
4.1 Processing Time as a Function of the Profile
While Theorem 1 gives us a worst-case tradeoff between computation and communication, we can better understand this tradeoff by parameterizing the total amount of work of the servers by a data-dependent quantity, the profile of the dataset, introduced below, which may give a better overall running time in certain cases.
Supposing that k > m, the processors receive multiple sets to process. We choose a random hash function f : [k] → [m] so that processor j receives all sets S_i for which f(i) = j. When k ≥ m, each processor handles O(k/m) sets in expectation.
The processor handling the set S_i receives S_i together with the neighborhood N(v) for each v ∈ S_i, and is responsible for outputting all pairs u, v ∈ S_i for which cos(u, v) ≥ τ.
To bound the total amount of computation, we introduce a data-dependent quantity. Note that the sets S_i are identically distributed, so we can fix a particular i. We define the profile of the dataset in terms of the size distribution of a single survival set S_i.
Lemma 5.
and .
Proof.
We are interested in bounding the overall time for all nodes in V to process the sets S_1, …, S_k.
Theorem 3.
The total work of the nodes in V to process the sets S_1, …, S_k, assuming that we use the brute-force all-pairs algorithm, is O(Σ_i |S_i|² · d) in expectation. The average work per processor is a 1/m fraction of this total.
Proof.
After receiving the sets S_i, the total time for all processors to execute their brute-force all-pairs algorithms is O(Σ_i |S_i|² · d), which allows for outputting the similar pairs. The theorem follows. ∎
4.2 Hashing to Speed Up Processing
Recall that the processor responsible for finding all similar pairs in S_i receives the set of neighbors N(v) of each node v ∈ S_i. In the case when the neighborhoods are all of comparable size, say size d, we can think of x_v as a vector with exactly d ones in it; here x_v is the characteristic vector of the neighbors of v. We can first hash the vector down to t dimensions, for a parameter t. To do this, we use the CountMin map [13], which can be viewed as a random matrix C with a single nonzero per column, where this nonzero is chosen uniformly at random and independently for each of the columns of C. We replace x_v with C·x_v. If an entry of C·x_v is larger than 1, we replace it with 1, and we let the resulting vector be denoted y_v, which is in {0,1}^t. Note that we can compute all of the y_v for a given repetition in time linear in the total size of the neighborhoods, assuming arithmetic operations on O(log n)-bit words can be performed in constant time. While each shared neighbor of two nodes u, v ∈ S_i contributes to ⟨y_u, y_v⟩, hash collisions can distort the inner product, so it could be that ⟨y_u, y_v⟩ ≠ ⟨x_u, x_v⟩. We now quantify this.
Lemma 6.
For any two nodes u, v ∈ S_i, with probability at least 9/10, the inner product ⟨y_u, y_v⟩ differs from ⟨x_u, x_v⟩ by at most O(d²/t).
Proof.
Note that each node w ∈ N(u) ∩ N(v) is hashed by CountMin to a bucket, which will be a coordinate that is set to 1 in both y_u and y_v. Also, the probability that a given w hashes to a bucket containing some other neighbor w′ ≠ w is at most d/t, and the expected number of w with this property is at most d²/t. By a Markov bound, the number of such w is O(d²/t) with probability at least 9/10, as desired. ∎
By the previous lemma, we can replace the original high-dimensional vectors with the potentially much smaller t-dimensional vectors at a small price in accuracy and success probability.
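A sketch of this CountMin-style compression (our own minimal version; the shared bucket assignment plays the role of the random matrix with one nonzero per column):

```python
import random

def countmin_compress(neighbor_sets, left_size, t, seed=0):
    """Map each 0/1 neighborhood vector down to t coordinates: every left
    vertex hashes to one of t buckets (the single nonzero in its column of
    the implicit matrix), and the resulting bucket counts are capped at 1.
    The same bucket assignment is shared across all nodes, so shared
    neighbors land in shared coordinates."""
    rng = random.Random(seed)
    bucket = [rng.randrange(t) for _ in range(left_size)]
    compressed = {}
    for v, N in neighbor_sets.items():
        y = [0] * t
        for u in N:
            y[bucket[u]] = 1  # cap entries at 1
        compressed[v] = y
    return compressed
```

Because the bucket assignment is shared, two nodes with a common neighbor are guaranteed to share at least one coordinate in their compressed vectors.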
4.3 Combining LSF-Join with Hash-Join
The LSH-based approach of Hu et al. [17] suggests (in our framework) an alternate strategy of sub-partitioning the survival sets, using a hash-join to distribute the brute-force all-pairs algorithm. Here we analyze this combined approach and plot the tradeoffs. Perhaps surprisingly, we show that this strategy does not provide any benefit in the communication vs. computation tradeoff.
The combined strategy, using m processors, starts with a reduced number of repetitions, governed by a parameter, and this is followed by a hash-join on each survival set. More precisely, we first construct the sets S_i using the FastFilter survival procedure. Then, for each set S_i, we process all pairs in S_i using a group of machines. This can be implemented in one round, because all we need to do is estimate the size of each set S_i approximately, up to a constant factor. Then, we can implement the hash-join in a distributed fashion. We first review the guarantees of the standard hash-join.
Lemma 7.
For n vectors and m machines, a hash-join has expected total communication O(n·√m) and expected work O(n²/m) per machine.
We use this bound to compute the communication and work when using a hash-join to process each survival set.
Theorem 4.
The combined approach has expected total communication and expected work per processor.
Proof.
When , we have that , and hence, we have in expectation. We use Lemma 7 to analyze the hashjoin for each of the groups of processors. Each group handles inputs, and therefore the communication of the group is , which is . Multiplying by , the exponent of becomes
which gives the claimed communication bound. For the per processor work, we have that this is the claimed bound:
∎
Figure 2 demonstrates that the combined approach is never better than the original LSF-Join approach. For the comparison, we consider a fixed number of processors and vary the parameter of the combined strategy, interpolating between two extremes. Notice that one extreme corresponds to standard LSF-Join, and the other corresponds to using a hash-join on the whole dataset.
4.4 When the Similarity Graph is a Matching
Recall that to recover all close pairs with high probability, we need to iterate the LSF-Join algorithm times, because each iteration finds a constant fraction of the close pairs. We exhibit an improvement using multiple communication steps when the similar pairs are structured. An important application of all-pairs similarity is constructing the similarity graph. In our setting, the similarity graph connects all pairs whose cosine similarity is at least . The structure we consider is when the similarity graph happens to be a matching, containing exactly disjoint pairs with similarity at least .
The key idea is that each iteration decreases the number of input nodes by a constant fraction. We remove these nodes (or at least one endpoint from each close pair) from consideration, and then repeat the procedure on the remaining nodes. We observe that this method also extends to near-matchings (e.g., small disjoint cliques). Likewise, our result is not specific to LSF-Join; the technique works for any LSF-based similarity join method.
We state our result using the th iterated log function , where , and for . Then, we show:
Theorem 5.
Using communication steps, we can find all but a negligible fraction of close pairs when the similarity graph is a matching. The total communication and computation is times the cost of one execution of LSF-Join.
Proof.
For , we simply run LSF-Join times independently in a single communication step, where each run finds a constant fraction of the close pairs. For communication steps, we use rounds of LSF-Join, removing all found pairs between subsequent rounds (each round takes two communication steps, except for the last, which takes one).
In the first round, we run LSF-Join times. Then, the expected number of pairs that are not found is , where . In the next round, with rounds remaining, we consider only the remaining pairs, and we iterate LSF-Join times. We repeat this process until no rounds remain, and output the close pairs from all rounds.
We can implement each round of the above algorithm using at most two communication steps. We do so by marking the found pairs between rounds using a single extra communication step. More formally, the input pairs start out partitioned across processors; we denote the input partition as . After finding some fraction of close pairs, processor must be notified of which nodes in are no longer active. Whenever processor finds a close pair , it sends the index of to the processor such that (and similarly for ), where is known to processor because processor must have sent to processor during LSF-Join. We reduce the total input set from to , where denotes the remaining nodes after removing the found pairs.
To analyze this procedure, notice that the dominant contribution to the total communication and computation comes from the first round, because the subsequent rounds have a geometrically decreasing number of input nodes. The first round uses iterations of LSF-Join, which shows that the overall communication and computation is times the cost of one iteration. ∎
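The iterated-log function and the geometric decrease driving this proof can be made concrete with a short sketch (ours, not the paper's code; the base-2 logarithm, the constant `p_find`, and the iteration schedule are illustrative assumptions):

```python
import math
import random

def iterated_log(n, k):
    """The k-th iterated logarithm used in Theorem 5:
    log^(1) n = log n, and log^(k) n = log(log^(k-1) n) for k > 1
    (base 2 assumed here)."""
    x = float(n)
    for _ in range(k):
        x = math.log2(x)
    return x

def simulate_rounds(num_pairs, iters_per_round, p_find=0.5, seed=1):
    """Toy model of the multi-round strategy on a matching: round j runs
    iters_per_round[j] LSF-Join-style iterations, each finding every
    still-unfound close pair independently with probability p_find, and
    found pairs are removed before the next round. Returns the number
    of unfound pairs after each round."""
    rng = random.Random(seed)
    remaining, history = num_pairs, []
    for t in iters_per_round:
        miss = (1.0 - p_find) ** t  # pair survives all t iterations
        remaining = sum(rng.random() < miss for _ in range(remaining))
        history.append(remaining)
    return history
```

With `p_find = 1/2`, running t iterations leaves each pair unfound with probability 2^(-t), so the first round dominates the cost and the later rounds only handle a geometrically shrinking residue of pairs.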
4.5 Hashing to Improve Recall
Hashing is helpful not only for reducing the description size of the neighborhood sets, as described in Section 4.2; it can also be used to increase the number of similar pairs surviving a repetition, and thus the recall. Previously, a node pair survived a repetition with probability . Hashing, however, can make smaller due to collisions. Suppose we hash the characteristic vector of the neighborhood of a node down to dimensions for some parameter , obtaining the vector , as in Section 4.2. We could, for example, set as before.
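A minimal sketch of this hashing step (ours; the bucket hash and salt are illustrative, and Section 4.2's construction may differ in details): the support of the hashed characteristic vector is just the image of the neighborhood under a bucket hash, so a union of neighborhoods can only shrink.

```python
def hash_down(neighbors, n, salt=0):
    """Support of the n-dimensional hashed characteristic vector:
    each neighbor id is mapped to one of n buckets, so the hashed
    set has size at most min(|neighbors|, n)."""
    return {hash((salt, v)) % n for v in neighbors}
```

Because both nodes use the same hash function, the union of the hashed neighborhoods equals the hash of the union of the original neighborhoods, which is the quantity bounded in Lemma 8.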
Lemma 8.
Thinking of and as characteristic vectors of sets, and letting , we have
Proof.
Let be an indicator random variable for the event that the th bin is nonempty when throwing balls into bins. If the bin is empty, then let . Then , and so , where is the total number of nonempty bins. ∎
By Lemma 8, the expected size of the union of the neighborhoods drops after hashing. This is useful: the survival probability of the node pair in a repetition after hashing is now , which by the lemma is larger than before, since , and this inequality is strict in expectation. Note, however, that the communication and work per machine increase in expectation; still, this trade-off may be beneficial.
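The balls-into-bins expectation in Lemma 8 can be checked numerically (a sketch; the closed form n·(1 − (1 − 1/n)^m) follows by linearity of expectation over the bins, exactly as in the proof above):

```python
import random

def expected_nonempty_bins(m, n):
    """E[# nonempty bins] when m balls land u.a.r. in n bins:
    each bin is empty with probability (1 - 1/n)^m, so by linearity
    the expectation is n * (1 - (1 - 1/n)^m)."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

def simulated_union_size(m, n, trials=20_000, seed=0):
    """Monte Carlo estimate of the hashed-union size when the original
    union has m elements and there are n hash buckets."""
    rng = random.Random(seed)
    total = sum(len({rng.randrange(n) for _ in range(m)}) for _ in range(trials))
    return total / trials
```

Since the expectation is strictly below m whenever collisions can occur, the exponent governing the survival probability shrinks, which is exactly the recall gain described above.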
5 Experimental Results
In this section, we complement the earlier theoretical analysis with experiments that measure the recall and efficiency of LSF-Join on three real-world graphs from the SNAP repository [20]: WikiVote, Physics Citation, and Epinions. In accordance with our motivation, we also run LSF-Join on an extremely skewed synthetic graph, on which the WHIMP algorithm fails.
Dataset     |V|    |E|     LSF-Join Comm.*   WHIMP Comm.†   LSF-Join Recall   WHIMP Recall
WikiVote    7K     104K    710MB (, )        60MB           100%              100%
Citation    34K    421K    410MB (, )        50MB           100%              100%
Epinions    60K    500K    6GB (, )          60MB           100%              100%
Synthetic   10M    200M    160GB (, )        Failed         90%               —
* The communication cost of LSF-Join depends on the number of survivors, which we note along with the value of .
† WHIMP's communication cost is dominated by shuffling SimHash sketches. We use 8K bits for SimHash, as suggested in [27].
Experimental Setup
We compare LSF-Join against the state-of-the-art WHIMP algorithm from [27], and hence our setup is close to the one for WHIMP. In this vein, we transform our graphs into bipartite graphs, either by orienting edges from left to right (for directed graphs) or by duplicating nodes on either side (for undirected ones). This matches the setup described in the introduction, with the left side denoting sets and the right side denoting nodes. Also, we prefilter each bipartite graph to have a narrow degree range on the right (the left degrees can still be ) to minimize variance in cosine similarity values due to degree mismatch. This makes the experiments cleaner, and the algorithm itself can run over all degrees in a doubling manner. We use sparse matrix multiplication to compute all-pairs similarity after computing the survivor sets for each bucket , as it is quite fast in practice and consumes memory on each server. Finally, even though we computed a theoretically optimal value of earlier, in practice a smaller choice of often suffices in combination with repeating the FastFilter method for independent iterations.
For each of the graphs, we run LSF-Join on a distributed MapReduce platform internal to Google, and compare the output similar pairs against a ground truth set generated from a sample of the data. The ground truth set is generated by an exact all-pairs computation for a small subset of nodes chosen at random. Using this ground truth, we can measure the efficacy of the algorithm; the measure we focus on is the recall of similar pairs (the precision depends on the method used to compute all-pairs similarity within a bucket, and since we use sparse matrix multiplication, for us it is 100%). Specifically, let denote the set of true similar pairs in the ground truth with similarity at least . Furthermore, let be the set of similar pairs on the same set of nodes that are returned by the algorithm. Then, the recall is . For a fixed value of , we can measure the change in recall as the number of independent iterations varies (with fixed and ). We run our experiments at a value of that achieves high recall (a strategy that carries across datasets), and the results are summarized in Table 1 for ease of comparison; the synthetic dataset included in the table is described later. The communication cost for LSF-Join depends on the number of survivors, which in turn depends on the choice of . We ignore a subtlety here: the communication cost will often be much less than the number of survivors, since multiple independent repetitions produce many copies of the same node, and we need only send one of those copies to a processor.
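The recall measure used here is straightforward to compute against the sampled ground truth (a sketch; pair orientation is normalized so that (u, v) and (v, u) count as one pair):

```python
def recall(found_pairs, ground_truth_pairs):
    """Recall = |found ∩ truth| / |truth| over the sampled
    ground-truth node set, with unordered pairs deduplicated."""
    norm = lambda p: (min(p), max(p))
    truth = {norm(p) for p in ground_truth_pairs}
    found = {norm(p) for p in found_pairs}
    return len(found & truth) / len(truth) if truth else 1.0
```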
We reiterate that our experimental comparison is only against the WHIMP algorithm, as the WHIMP paper demonstrated that commonly used LSH-based techniques are provably worse. Since WHIMP is only applicable when there are no high-degree left nodes, we chose three public graphs for which this assumption holds, so that a comparison is possible. Since the WHIMP algorithm has output-optimal communication complexity, we expect WHIMP to have lower communication cost than LSF-Join, as WHIMP's communication cost is dominated by the number of edges in the graph. This is indeed the case, as seen in Table 1. However, LSF-Join trades off higher communication cost for the benefit of load balancing across individual servers. WHIMP does no load balancing in the worst case, which can render it inapplicable for a broad class of graphs, as we shall see in the next section. Indeed, the WHIMP job failed on our synthetic graph.
5.1 Synthetic Graph With Extreme Skew
To illustrate a case that WHIMP fails to address, we present results on a synthetic graph that contains the core element of skewness that we set out to address in this work. We anticipate that the same results hold in several real-world settings, but a synthetic graph suffices for a comparison with WHIMP. Indeed, the motivation for this randomly generated synthetic graph comes from user behavior: even though users consume almost the same amount of content (say, videos) online, the content being consumed sees a power-law distribution (e.g., some videos are vastly more popular than others). A simplified version of the same phenomenon can be captured in the following random bipartite graph construction: we build an bipartite graph , where each right node has degree . Each right node chooses its neighbors on the left as follows: first pick nodes at random (without replacement) from a small set of hot nodes , and then pick nodes at random (again, without replacement) from the rest of . If and , this results in right nodes having pairwise cosine similarity that scales with , while the hot dimensions have degree for constant . In this setting, we expect wedge-sampling-based methods to fail, since the hot dimensions have large neighborhoods.
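The construction above can be sketched as follows (parameter names are ours, not the paper's notation, and the similarity computation itself is omitted): each right node draws a fixed number of neighbors from a small hot set and the rest from the remaining left nodes, producing extreme left-degree skew.

```python
import random

def skewed_bipartite(num_left, num_right, degree, num_hot, hot_per_node, seed=0):
    """Random bipartite graph in the style of Section 5.1: every right
    node has the same fixed degree; it picks hot_per_node neighbors
    without replacement from a small 'hot' set of left nodes, and the
    remaining neighbors uniformly from the other left nodes. Hot left
    nodes end up with degree about num_right * hot_per_node / num_hot,
    far above the cold nodes' degree (extreme skew)."""
    rng = random.Random(seed)
    hot = list(range(num_hot))
    cold = list(range(num_hot, num_left))
    adj = []
    for _ in range(num_right):
        nbrs = rng.sample(hot, hot_per_node) + rng.sample(cold, degree - hot_per_node)
        adj.append(sorted(nbrs))
    return adj
```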
We constructed such a synthetic random bipartite graph with the following parameters: , , and . Then, we repeated the same experiment as the one described above for the real-world graphs. This time, WHIMP failed, as the maximum degree of the left nodes was around . We were able to run our procedure, however; the recall and the communication cost of the FastFilter procedure are shown in Table 1. The recall of the FastFilter procedure is shown in Fig. (a), and the number of survivors in Fig. (b). Note that, as before, we achieve high recall even on this graph with a heavily skewed degree distribution, with reasonable communication cost.
6 Conclusion
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity search. The key idea of the algorithm is a novel LSF scheme. We exhibit an efficient version of this scheme that runs in nearly linear time, utilizing pairwise-independent hash functions. We show that LSF-Join effectively finds low-similarity pairs in high-dimensional datasets with extreme skew. Theoretically, we provide guarantees on the accuracy, communication, and work of LSF-Join. Our algorithm improves over hash-join and LSH-based methods. Experimentally, we show that LSF-Join achieves high accuracy on real and synthetic graphs, even for a low similarity threshold. Moreover, our algorithm succeeds on a graph with extreme skew, where prior approaches fail.
Acknowledgments.
Part of this work was done while D. Woodruff was visiting Google Mountain View. D. Woodruff also acknowledges support from the National Science Foundation Grant No. CCF-1815840.
References
 [1] (2012) Fuzzy Joins using MapReduce. In ICDE, Cited by: §1, §1.
 [2] (2014) Anchor-Points Algorithms for Hamming and Edit Distances Using MapReduce. In ICDT, Cited by: §1.
 [3] (2016) On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 151–164. Cited by: §1.
 [4] (2013) Optimizing parallel algorithms for all pairs similarity search. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 203–212. Cited by: §1.
 [5] (2013) Similarity joins in relational database systems. Synthesis Lectures on Data Management 5 (5), pp. 1–124. Cited by: §1.
 [6] (2010) Document similarity self-join with MapReduce. In 2010 IEEE International Conference on Data Mining, pp. 731–736. Cited by: §1.
 [7] (2007) Scaling up All Pairs Similarity Search. In WWW, Cited by: §1.
 [8] (2013) Communication steps for parallel query processing. In PODS, pp. 273–284. Cited by: §1, §1.
 [9] (2014) Skew in Parallel Query Processing. In PODS, Cited by: §1.
 [10] (2017) Massively-parallel similarity join, edge-isoperimetry, and distance correlations on the hypercube. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 289–306. Cited by: §1.
 [11] (2018) Scalable and robust set similarity join. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1240–1243. Cited by: §1.
 [12] (2017) A Framework for Similarity Search with Space-Time Tradeoffs Using Locality-Sensitive Filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 31–46. Cited by: §1, §1.
 [13] (2005) An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55 (1), pp. 58–75. Cited by: §4.2.
 [14] (2018) Overlap set similarity joins with theoretical guarantees. In Proceedings of the 2018 International Conference on Management of Data, pp. 905–920. Cited by: §1.
 [15] (2018) Set similarity joins on MapReduce: an experimental survey. Proceedings of the VLDB Endowment 11 (10), pp. 1110–1122. Cited by: §1.
 [16] (2012) Approximate nearest neighbor: towards removing the curse of dimensionality. Theory of Computing 8 (1), pp. 321–350. Cited by: §1.
 [17] (2019-04) Output-Optimal Massively Parallel Algorithms for Similarity Joins. ACM Trans. Database Syst. 44 (2), pp. 6:1–6:36. External Links: ISSN 0362-5915, Link, Document Cited by: §1, §4.3.
 [18] (2010) A model of computation for mapreduce. In Proceedings of the twentyfirst annual ACMSIAM symposium on Discrete Algorithms, pp. 938–948. Cited by: §1.
 [19] (2018) Algorithmic aspects of parallel data processing. Foundations and Trends® in Databases 8 (4), pp. 239–370. Cited by: §1.
 [20] (2014-06) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: §5.
 [21] (2014) Mining of Massive Datasets. Cambridge University Press. Cited by: §1.
 [22] (2016) An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment 9 (9), pp. 636–647. Cited by: §1, §1.
 [23] (2018) Set similarity search for skewed data. In Proc. of the 37th Symp. on Principles of Database Systems (PODS), pp. 63–74. Cited by: §1, §1.
 [24] (2018) Adaptive MapReduce similarity joins. In Proc. 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, pp. 4. Cited by: §1.
 [25] (2019) Hardness of bichromatic closest pair with jaccard similarity. In 27th Annual European Symposium on Algorithms (ESA 2019), Cited by: §1.
 [26] (2013-02) Upper and Lower Bounds on the Cost of a MapReduce Computation. Proc. VLDB Endow. 6 (4), pp. 277–288. External Links: ISSN 2150-8097, Link, Document Cited by: §1.
 [27] (2017) When hashes met wedges: a distributed algorithm for finding high similarity vectors. In Proceedings of the 26th International Conference on World Wide Web (WWW), pp. 431–440. Cited by: §1, §1, §1, §1, §5, Table 1.
 [28] (2016) An experimental survey of MapReduce-based similarity joins. In Similarity Search and Applications: 9th International Conference, SISAP 2016, Tokyo, Japan, October 24–26, 2016, Proceedings, L. Amsaleg, M. E. Houle, and E. Schubert (Eds.), Cham, pp. 181–195. External Links: ISBN 978-3-319-46759-7, Document, Link Cited by: §1.
 [29] (2013) Streaming Similarity Search Over One Billion Tweets Using Parallel Locality-Sensitive Hashing. PVLDB 6 (14), pp. 1930–1941. Cited by: §1.
 [30] (2010) Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pp. 495–506. Cited by: §1.
 [31] (2013) Locality Sensitive Hashing Revisited: Filling the Gap Between Theory and Algorithm Analysis. In CIKM, New York, NY, USA, pp. 1969–1978. External Links: ISBN 978-1-4503-2263-8, Link, Document Cited by: §1.
 [32] (2012) Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 85–96. Cited by: §1.
 [33] (2017) Leveraging set relations in exact set similarity join. Proceedings of the VLDB Endowment 10 (9), pp. 925–936. Cited by: §1.
 [34] (2011) Efficient Similarity Joins for Near-duplicate Detection. ACM Transactions on Database Systems 36 (3), pp. 15. Cited by: §1, §1.
 [35] (2016) A generic method for accelerating LSH-based similarity join processing. IEEE Transactions on Knowledge and Data Engineering 29 (4), pp. 712–726. Cited by: §1.
 [36] (2016) LSH ensemble: internet-scale domain search. Proceedings of the VLDB Endowment 9 (12), pp. 1185–1196. Cited by: §1, §1.