1 Introduction
Data analysts often need to identify similar records stored in multiple data collections. This operation is very common for data cleaning [7], element clustering [5] and record linkage [8, 22]. In some scenarios the similarity analysis aims to detect near duplicate records [25], in which slightly different data representations may be caused by input errors, misspelling or use of synonyms [1]. In other scenarios, the goal is to find patterns or common behaviors between real entities, such as purchase patterns [13] and similar user interests [20].
Set Similarity Join is the operation that identifies all similar sets between two collections of sets (or inside the same collection) [13, 12]. In this case, each record is represented as a set of elements (tokens). For example, the text record “Data Warehouse” may be represented as a set of two words {Data, Warehouse} or a set of bigrams {Da, at, ta, a␣, ␣W, Wa, ar, re, eh, ho, ou, us, se}. Two records are considered similar according to a similarity function, such as Overlap, Dice, Cosine or Jaccard [2, 21]. For instance, two records may be considered similar if the overlap (intersection) of their bigrams is greater than or equal to 6, such as in “Databases” and “Databazes”.
Many different solutions have been proposed in the literature for Set Similarity Join [10, 7, 3, 25, 23, 4, 9, 27, 17, 6, 26, 19, 28]. With respect to the guarantee that all similar pairs will be found, these solutions are divided into exact approaches [10, 7, 3, 25, 23, 4, 28] and approximate approaches [9, 27, 17, 6, 26, 19]. The approximate algorithms usually rely on the Locality-Sensitive Hashing (LSH) technique [11] and they are very competitive, but in the scope of this paper only the exact approaches will be considered.
The solutions for Exact Set Similarity Join are usually based on the Filter-Verification Framework [13, 24], which is divided into two stages: a) the candidate generation stage uses a series of filters to produce a reduced list of candidate pairs; b) the verification stage checks each candidate pair in order to determine which ones are similar, with respect to the selected similarity function and threshold.
The filtering stage is commonly based on the Prefix Filter [7, 3] and the Length Filter [10], which are able to prune a considerable number of candidate pairs without compromising the exact result. The Prefix Filter [7, 3] is based on the idea that two similar sets must share at least one token among their first tokens (their prefixes), given a defined token ordering. The Length Filter [10] prunes candidate lists according to their lengths, considering that sets with a large length difference will have low similarity. Other filters have also been proposed, such as the Positional Filter [25], which prunes candidate pairs based on the position where the matched token occurs in the set, and the Suffix Filter [25], which prunes pairs by applying a binary search on the last elements of the sets.
There are many algorithms in the literature combining the aforementioned filters. AllPairs [3] uses the Prefix and Length filters, PPJoin [25] extends AllPairs with the Positional Filter, and PPJoin+ [25] extends PPJoin with the Suffix Filter. AdaptJoin [23] extends PPJoin using a variable schema with adaptive prefix lengths and GroupJoin [4] applies the filters to groups of similar sets, avoiding redundant computations. These algorithms usually rely on the assumption that the verification stage contributes significantly to the overall running time. However, in [13] a new verification procedure using an early termination condition was proposed that significantly reduced the execution time of the verification phase. With this, the proportion of execution time spent in the candidate generation phase increased and, as a consequence, the simplest filtering schemes are now able to produce the best performance. In [13], the best results were obtained by AllPairs, PPJoin, GroupJoin and AdaptJoin, with AllPairs being the best algorithm in the largest number of experiments [13]. In contrast, PPJoin+ [25], which uses a sophisticated suffix filter technique, was the slowest algorithm in [13].
The state-of-the-art set similarity join algorithms still need to be improved in order to allow faster joins, otherwise very large collections may not be processed in a reasonable time. Based on the recent findings on the filter-verification tradeoff [13], the authors of that study claimed that future filtering techniques should invest in fast and simple candidate generation methods instead of sophisticated filter-based algorithms, otherwise the filters' effectiveness will not pay off their overhead when compared to the verification procedure.
The main contribution of this paper is a new low-overhead filter called Bitmap Filter, which is able to efficiently prune candidate pairs and improve the performance of many state-of-the-art exact set similarity join algorithms such as AllPairs [3], PPJoin [25], AdaptJoin [23] and GroupJoin [4]. Using bitwise operations implemented by most modern processors, the Bitmap Filter is able to speed up the AllPairs, PPJoin, AdaptJoin and GroupJoin algorithms by up to 4.50× (1.43× on average), considering 9 different collections and a range of Jaccard thresholds. As far as we know, no other filter for the Exact Set Similarity Join problem is based on efficient bitwise operations.
The Bitmap Filter is sufficiently flexible to be used in the candidate generation or verification stages. Furthermore, it can be efficiently implemented in devices such as Graphics Processing Units (GPUs). As a secondary contribution of this paper, we thus propose a GPU implementation of the Bitmap Filter that is able to further speed up the join using an Nvidia GeForce GTX 980 Ti card, considering 6 collections of up to 606,770 sets.
The rest of this paper is structured as follows. Section 2 presents the background of the Set Similarity Join problem. Then, Section 3 proposes the Bitmap Filter and explains it in detail. Section 4 shows the design of the proposed CPU and GPU implementations. Section 5 presents the experimental results and Section 6 concludes the paper.
2 Background
In this section, we discuss the Set Similarity Join problem. First, the problem is formalized (Section 2.1). Then, the Filter-Verification Framework is presented (Section 2.2). After that, some commonly used filters employed in the Filter-Verification Framework are explained (Section 2.3). Finally, state-of-the-art algorithms are described (Section 2.4).
2.1 Problem Definition
Given two collections R and S, where each set r ∈ R and s ∈ S is made of tokens from a universe U (r ⊆ U and s ⊆ U), the Set Similarity Join problem aims to find all pairs (r, s) from R and S that are considered similar with respect to a given similarity function Sim, i.e., Sim(r, s) ≥ τ, where τ is a user-defined threshold [13, 25]. When R and S are the same collection, the problem is defined as a self-join; otherwise, it is called an R × S join. The set similarity join operation can be written as presented in Equation 1 [13].
R ⋈_τ S = {(r, s) ∈ R × S | Sim(r, s) ≥ τ}    (1)
The most used similarity functions for comparing sets are the Overlap, Jaccard, Cosine and Dice [2, 21], as presented in Table 1. The Overlap function returns the number of elements in the intersection of sets r and s. Jaccard, Cosine and Dice are similarity functions that normalize the overlap by dividing the intersection by the union (Jaccard), or by a combination of the lengths |r| and |s| (Cosine and Dice). The similarity functions can be converted to each other using the threshold equivalences presented in Table 1 [12, 16], where τ_O, τ_J, τ_C and τ_D are the overlap, Jaccard, cosine and dice thresholds, respectively.
Sim. Function  Definition  Equiv. Overlap (τ_O)

Overlap  O(r, s) = |r ∩ s|  τ_O
Jaccard  J(r, s) = |r ∩ s| / |r ∪ s|  (τ_J / (1 + τ_J)) · (|r| + |s|)
Cosine  C(r, s) = |r ∩ s| / √(|r| · |s|)  τ_C · √(|r| · |s|)
Dice  D(r, s) = 2 · |r ∩ s| / (|r| + |s|)  (τ_D / 2) · (|r| + |s|)
To solve the set similarity join problem, the naïve algorithm (Algorithm 1) compares all pairs from collections R and S and verifies whether the similarity of each pair (with respect to the similarity function) is greater than or equal to the desired threshold. Since this method performs |R| · |S| verifications, it does not scale to a large number of sets.
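The naïve procedure can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names (`jaccard`, `naive_self_join`) and the example collection are chosen here for demonstration.

```python
# Naive set similarity self-join sketch: compares all pairs of the collection,
# so the number of verifications grows quadratically with the collection size.

def jaccard(r, s):
    """Jaccard similarity of two token sets."""
    inter = len(r & s)
    return inter / (len(r) + len(s) - inter)

def naive_self_join(collection, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard similarity reaches the threshold."""
    result = []
    for i in range(len(collection)):
        for j in range(i + 1, len(collection)):
            if jaccard(collection[i], collection[j]) >= threshold:
                result.append((i, j))
    return result

R = [{"data", "warehouse"}, {"data", "warehouses"}, {"web", "mining"}]
print(naive_self_join(R, 0.3))  # [(0, 1)]
```

Every later algorithm in this section exists to avoid most of these pairwise verifications.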
2.2 Filter-Verification Framework
Many algorithms, such as AllPairs [3], PPJoin [25], PPJoin+ [25], AdaptJoin [23], and GroupJoin [4], have been proposed to reduce the number of verification operations, aiming to solve the set similarity join problem efficiently. These algorithms typically employ filtering techniques using the Filter-Verification Framework, as shown in Algorithm 2.
In the initialization step, the collection S is indexed (line 4) with respect to the tokens found in each set s ∈ S, such that the index entry I_t stores all sets containing token t. Then, for each set r ∈ R, the algorithm iterates in a two-phase loop: candidate generation and verification.
In the candidate generation loop, the algorithm iterates over the tokens of set r (filtered by a token-level filter in line 8) in order to find, in index I_t, all sets that also contain token t. Each such set s may also be filtered (line 10) and, if it is not pruned, it is inserted into the candidate list.
The verification stage checks, for each unique candidate s, whether the pair (r, s) is similar with respect to the similarity function and threshold (line 15) and, if so, the pair is inserted into the similar-pair list (line 16). A filter may also be applied in the verification loop (line 14) to reduce the number of verified pairs. Finally, the similar pairs are returned (line 17).
2.3 Filtering Strategies
This section explains the most widely used filtering techniques in the literature.
2.3.1 Prefix Filter
The Prefix Filter technique [7, 3] creates an index over the first tokens of each set, considering that the tokens are sorted, usually by token frequency. The sizes of the selected prefixes are such that, if two sets r and s are similar (i.e., Sim(r, s) ≥ τ), then their prefixes will have at least one token in common. For instance, Figure 1(a) shows two sets r and s: if there is no overlap in the prefixes (white cells), then it is impossible to attain the required overlap using only the remaining cells (in gray). In fact, considering the overlap similarity function with threshold τ, the prefix size for a set r is |r| − τ + 1 [3]. The prefix sizes for other similarity functions are presented in Table 2 [12].
After the prefix index is created, the algorithm produces a list of candidate pairs for each set r. This list is composed of all sets stored in the index entries pointed to by the tokens in the prefix of r. The candidate pairs are then verified using the similarity function and, if the threshold is attained, the pair is marked as similar.
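The candidate generation step above can be sketched for a Jaccard self-join, assuming the standard Jaccard prefix length |r| − ⌈τ·|r|⌉ + 1 from Table 2. The function names and the tiny example collection are illustrative, not taken from the paper:

```python
from collections import defaultdict
from math import ceil

def prefix_len(size, t):
    # Jaccard prefix length: |r| - ceil(t * |r|) + 1 (Table 2)
    return size - ceil(t * size) + 1

def prefix_candidates(collection, t):
    """Generate candidate id pairs (j, i), j < i, sharing a prefix token."""
    freq = defaultdict(int)
    for r in collection:
        for tok in r:
            freq[tok] += 1
    order = lambda tok: (freq[tok], tok)           # rare tokens first
    sorted_sets = [sorted(r, key=order) for r in collection]
    index = defaultdict(list)                      # token -> ids already indexed
    candidates = set()
    for i, r in enumerate(sorted_sets):
        for tok in r[:prefix_len(len(r), t)]:
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].append(i)
    return candidates

R = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"w", "x", "y", "z"}]
print(prefix_candidates(R, 0.7))  # {(0, 1)}
```

Only the pair sharing a (rare) prefix token survives; the fully disjoint third set never enters any candidate list.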
2.3.2 Length Filter
The Length Filter [10] benefits from the idea that two sets r and s with very different lengths tend to be dissimilar. Given the size |r| of set r and the threshold τ, Table 2 presents lower and upper bounds for the size |s| such that, outside these bounds, the similarity is surely lower than the threshold, considering the overlap, Jaccard, cosine and dice similarity functions [13].
Function  Lower/Upper Bounds on |s|  Prefix Length

Overlap  τ_O ≤ |s|  |r| − τ_O + 1
Jaccard  τ_J · |r| ≤ |s| ≤ |r| / τ_J  |r| − ⌈τ_J · |r|⌉ + 1
Cosine  τ_C² · |r| ≤ |s| ≤ |r| / τ_C²  |r| − ⌈τ_C² · |r|⌉ + 1
Dice  (τ_D / (2 − τ_D)) · |r| ≤ |s| ≤ ((2 − τ_D) / τ_D) · |r|  |r| − ⌈(τ_D / (2 − τ_D)) · |r|⌉ + 1
Using this filter, the algorithms can ignore any set whose size is outside the defined bounds. Considering that the sets in each index list are ordered by their lengths, whenever a set in a list contains more elements than the defined upper bound, all the remaining sets of that list can be ignored.
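For the Jaccard case, the bounds from Table 2 reduce to a two-line helper. This is an illustrative sketch; the function name is chosen here:

```python
from math import ceil, floor

def jaccard_length_bounds(size, t):
    # t * |r| <= |s| <= |r| / t (Table 2); outside this range J(r, s) < t
    return ceil(t * size), floor(size / t)

print(jaccard_length_bounds(10, 0.8))  # (8, 12)
```

A set of size 10 with threshold 0.8 can only be Jaccard-similar to sets of 8 to 12 tokens, so every index entry outside that window is skipped.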
2.3.3 Positional Filter
The Positional Filter [25] tracks the position where a given token appears in the sets. Using this position, it is possible to determine a tighter upper bound for the similarity metric, which tends to reduce the number of candidate pairs. For example, Figure 1(b) presents two sets r and s whose prefix tokens are shown with their positions in the sets. If the first matching token “D” occurs at position p_r in r and position p_s in s (counting from 1), then the maximum possible overlap is 1 + min(|r| − p_r, |s| − p_s), since only the tokens after the matching positions can still overlap. Using the same idea, the minimum possible union is |r| + |s| minus the maximum overlap, from which an upper bound on the Jaccard coefficient follows. If this upper bound is below the threshold τ, the candidate can be pruned. The same idea can be applied to the cosine and dice similarity functions.
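The positional bound above can be written directly. A minimal sketch with illustrative names and example values (positions are 1-based, as in the prose):

```python
def max_overlap_after_match(len_r, len_s, p_r, p_s):
    # 1 for the match itself, plus the tokens remaining after each position
    return 1 + min(len_r - p_r, len_s - p_s)

def jaccard_upper_bound(len_r, len_s, p_r, p_s):
    o = max_overlap_after_match(len_r, len_s, p_r, p_s)
    return o / (len_r + len_s - o)      # minimum union is |r| + |s| - max overlap

# Illustrative values: first match at position 3 of two size-5 sets.
print(jaccard_upper_bound(5, 5, 3, 3))  # 3/7, so the pair is pruned at t = 0.6
```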
2.3.4 Suffix Filter
The Suffix Filter [25] takes one probe token from the suffix of set s and applies a binary search over set r in order to divide both sets into two parts (left/right). Since the tokens in each set are sorted, tokens in the left side of one set cannot overlap with tokens in the right side of the other. Using this idea, a tighter upper bound for the overlap between the two sets can be inferred. For example, Figure 1(c) presents two sets r and s whose required overlap threshold is 6. In the figure, prefixes are represented by white cells and suffixes by gray cells. Comparing the tokens in the prefixes, it is possible to find one match (token “C”), but 5 additional matches still need to be found. Then, the Suffix Filter chooses the middle token (“X”) from the suffix of s and seeks the same token in r. In the figure, the token “X” divides the suffix of s into two subsets with two elements each, and the suffix of r into two subsets, with one element to the left and three elements to the right. Considering that the elements are sorted, the maximum number of additional overlaps is the probe token itself, plus one element to the left of “X” and two elements to the right. Together with the prefix match, this leads to an upper bound of 5, which is less than the required overlap (6). Thus, this pair can be pruned.
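One partition step of this bound can be sketched as follows. This is a simplified, single-level version (the full filter recursively re-partitions each side); the function name is chosen here for illustration:

```python
import bisect

def suffix_upper_bound(suf_r, suf_s):
    """Upper bound on the overlap of two sorted suffixes after one probe split."""
    if not suf_s:
        return 0
    mid = len(suf_s) // 2
    probe = suf_s[mid]                    # probe token from the suffix of s
    i = bisect.bisect_left(suf_r, probe)  # binary search splits the suffix of r
    left = min(i, mid)                    # left parts can only match each other
    right = min(len(suf_r) - i, len(suf_s) - mid)
    return left + right
```

For sorted suffixes `['a','b','x','y']` and `['c','x','z']`, the probe `'x'` yields a bound of 3, already tighter than the trivial bound min(4, 3).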
2.3.5 VariableLength Prefix Filter
The original Prefix Filter relies on the idea that the prefixes of sets r and s must share at least one token in order to satisfy the threshold. The Variable-Length Prefix Filter [23] extends this idea using adaptive prefix lengths: in an l-prefix schema, at least l common tokens are required in the prefixes. The prefix size for a set r with required overlap c in an l-prefix schema is |r| − c + l. Figure 1(d) shows an example with two sets r and s under a 2-prefix schema: since the prefixes (white cells) present fewer than two overlaps, it is impossible to satisfy the threshold, so the pair is pruned.
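The l-prefix rule can be sketched in a few lines; the helper names are illustrative:

```python
def l_prefix_len(size, required_overlap, l):
    # prefix length |r| - c + l in an l-prefix schema with required overlap c
    return size - required_overlap + l

def passes_l_prefix(prefix_r, prefix_s, l):
    # at least l common tokens must appear in the prefixes
    return len(set(prefix_r) & set(prefix_s)) >= l

print(l_prefix_len(8, 6, 1), l_prefix_len(8, 6, 2))  # 3 4
```

A longer prefix (larger l) costs more index entries but demands more shared tokens, which is the adaptive tradeoff AdaptJoin explores.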
2.4 Algorithms
This section presents an overview of state-of-the-art algorithms that use the Filter-Verification Framework (Algorithm 2). Table 3 shows five algorithms, their acronyms and the filters used by each one. In this paper, we will focus on the four best algorithms as stated by [13]: AllPairs, PPJoin, AdaptJoin, and GroupJoin.
Acronym  Algorithm  Generation Filter  Verification Filters

all  AllPairs [3]  Prefix  Length
ppj  PPJoin [25]  Prefix  Length/Pos.
ppj+  PPJoin+ [25]  Prefix  Length/Pos./Suffix
gro  GroupJoin [4]  Prefix  Length/Pos.
ada  AdaptJoin [23]  Prefix  Length
AllPairs [3] is an algorithm that uses the Prefix Filter (Section 2.3.1) and the Length Filter (Section 2.3.2). In the Filter-Verification Framework, the Prefix Filter is applied in the candidate generation and the Length Filter in the verification stage.
PPJoin [25] extends the AllPairs algorithm with the Positional Filter (Section 2.3.3). In the Filter-Verification Framework, the Prefix Filter is applied in the candidate generation, and the Length and Positional filters in the verification stage.
PPJoin+ [25] extends the PPJoin algorithm with the Suffix Filter (Section 2.3.4) over the candidate pairs. The Prefix Filter is applied in the candidate generation, and the Length, Positional and Suffix filters in the verification stage.
AdaptJoin [23] is an algorithm that uses the Length Filter (Section 2.3.2) and the Variable-Length Prefix Filter (Section 2.3.5). The Filter-Verification Framework is modified so that additional iterations are executed with different prefix sizes, according to the characteristics of the collection. The Variable-Length Prefix Filter is applied in the candidate generation and the Length Filter in the verification stage.
GroupJoin [4] applies the Prefix Filter (Section 2.3.1), Length Filter (Section 2.3.2), and Positional Filter (Section 2.3.3) over groups of sets sharing the same prefixes and sizes. In the verification stage of the Filter-Verification Framework, the grouped candidate pairs are expanded into individual pairs and all possible pairs are verified. With this grouping schema, the filters can be applied to only one representative of each group, which considerably reduces the number of comparisons for groups with many elements. Figure 2 presents two groups of sets, each containing 3 elements sharing the same prefix.
3 Bitmap Filter
In this section we propose the Bitmap Filter, a new bitwise technique capable of speeding up set similarity joins without compromising the exact result. Bitwise operations are widely supported by modern microprocessors and most compilers are able to map them to fast hardware instructions.
First, some preliminaries are given in Section 3.1. Then, Section 3.2 presents how the bitmap is generated. Section 3.3 shows how the overlap upper bound can be calculated using bitmaps, and Section 3.4 analyzes the expected upper bound. Then, in Section 3.5, a cutoff point is defined in order to determine when the Bitmap Filter is effective. Finally, the pseudocode of the Bitmap Filter is given in Section 3.6.
3.1 Preliminaries
The similarity functions between two sets r and s can be calculated over their binary representations. For instance, sets r and s can be mapped to binary sets represented by b_r and b_s, where a bit set at position i represents the occurrence of token t_i in the corresponding set. In this representation, the size of the binary sets must be sufficiently large to map all possible tokens in the collection. Considering that each bit in the binary set is a dimension, collections with a large number of distinct tokens will produce high-dimensionality binary sets that are usually very sparse.
Modern microprocessors implement many bitwise instructions that can be used to speed up the computation of similarity functions over binary sets. One of those instructions is the population count (popcount), which counts the number of “ones” in the binary representation of a numeric word [18]. For example, the overlap of the sets r and s mentioned in the previous paragraph can be calculated by popcount(b_r AND b_s), using only two instructions in general microprocessors (one AND instruction and one POPCNT instruction). The same idea can be extended to the Jaccard coefficient: J(r, s) = popcount(b_r AND b_s) / popcount(b_r OR b_s).
High-dimensionality binary sets may be implemented by the concatenation of two or more fixed-length words but, as the dimensionality increases, more instructions are necessary to calculate the similarities, decreasing the performance of the computation. The Bitmap Filter proposed in this section reduces the dimensionality of the binary sets without compromising the exactness of the Set Similarity Join results. Using a binary set with reduced size, called bitmap from now on, the bitwise instructions may be much more effective.
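The binary-set similarities above can be sketched with plain integers standing in for machine words. In C/C++ each call maps to one AND/OR plus one POPCNT; in this Python sketch `bin(x).count("1")` plays the role of the hardware popcount:

```python
def popcount(x):
    return bin(x).count("1")

def overlap_bits(br, bs):
    return popcount(br & bs)               # one AND + one POPCNT in hardware

def jaccard_bits(br, bs):
    return popcount(br & bs) / popcount(br | bs)

# r = {t0, t2, t3} and s = {t0, t3} as exact (collision-free) binary sets:
print(overlap_bits(0b1101, 0b1001))        # 2
```

With exact (collision-free) bitmaps these values equal the set-based similarities; the Bitmap Filter trades this exactness for small bitmaps, which is why it only yields bounds.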
3.2 Bitmap Generation
Consider a set r composed of tokens t_1, t_2, …, t_{|r|}. Let b_r be a bitmap with b bits representing set r, where the bit at position i is denoted b_r[i]. Let h be a hash function that maps each token t to a specific bit position h(t), with 0 ≤ h(t) < b.
To create the bitmap b_r from the set r, we define a generation function that maps the tokens of r into the bitmap b_r. Three different generation functions are proposed in this paper: BitmapSet, BitmapXor and BitmapNext. The difference between them is how they handle the collisions produced by the hash function h.
BitmapSet: For each element t ∈ r, the bit at position h(t) is set. If multiple elements map to the same position, only the first element effectively changes the bit status (the other elements leave the bit set to ‘one’). Algorithm 3 presents the pseudocode for this function.
BitmapXor: For each element t ∈ r, the bit at position h(t) has its value flipped (0→1 or 1→0). In the end, the bit remains set only if an odd number of elements map to that position. Algorithm 4 presents the pseudocode for this function, where ⊕ is the xor operation.
BitmapNext: For each element t ∈ r, if the bit at position h(t) is unset, it is set; otherwise, the next unset bit after position h(t) is set, wrapping around to the beginning of the bitmap if the last bit is reached. This method ensures that there are exactly |r| ‘ones’ in the bitmap, unless |r| ≥ b; in this case, all bits are set. Algorithm 5 presents the pseudocode for this method.
Figure 3 shows an example of bitmap generation for a set with 10 tokens. The hash function is represented by the arrows pointing to each bitmap slot. The three bitmaps were produced by BitmapSet, BitmapXor and BitmapNext, respectively, from top to bottom. Compared to BitmapSet, the bitmap generated by BitmapXor has 1 different bit and the one generated by BitmapNext has 3 different bits. Each generation method fits better in different circumstances, as explored in Section 3.3.
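The three generation functions can be sketched over integer bitmaps. This mirrors Algorithms 3 to 5 in spirit; Python's built-in `hash` is used as a stand-in for h, since the paper does not prescribe a specific hash function:

```python
def bitmap_set(tokens, b, h=hash):
    bm = 0
    for t in tokens:
        bm |= 1 << (h(t) % b)            # first hit sets the bit; later hits no-op
    return bm

def bitmap_xor(tokens, b, h=hash):
    bm = 0
    for t in tokens:
        bm ^= 1 << (h(t) % b)            # bit stays set only for an odd count
    return bm

def bitmap_next(tokens, b, h=hash):
    bm, full = 0, (1 << b) - 1
    for t in tokens:
        if bm == full:                   # saturated: all bits already set
            break
        i = h(t) % b
        while bm & (1 << i):             # collision: probe the next bit, wrapping
            i = (i + 1) % b
        bm |= 1 << i
    return bm
```

With an adversarial hash that maps everything to bit 0, five tokens yield one set bit under BitmapSet and BitmapXor but five set bits under BitmapNext, illustrating how the methods diverge under collisions.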
3.3 Overlap Upper Bound Using Bitmaps
Given two sets r and s and their respective bitmaps b_r and b_s produced by one of the three generation functions (Section 3.2), this section shows how to infer an upper bound for the overlap similarity O(r, s). Whenever the upper bound is not sufficient to attain the similarity threshold given as input, the pair of sets r and s may be pruned.
Intuitively, each occurrence of a different bit at the same position of bitmaps b_r and b_s (i.e., b_r[i] ≠ b_s[i]) indicates at least one element of r or s that does not occur in the intersection of the sets. Given that, the maximum overlap between the sets is half of the sum of the sets’ sizes minus the number of different bits in their bitmaps (also known as the Hamming distance).
In order to prove that the overlap upper bound holds for BitmapSet, BitmapXor and BitmapNext, note that the bitmap generation functions may be represented as a series of algebraic operations. Let e_t be the bitmap with a single bit set at position h(t). Each generation function then corresponds to a series of binary operations over these single-bit bitmaps, b_r = e_{t_1} ∘ e_{t_2} ∘ … ∘ e_{t_|r|}. The operation ∘ is associated with the logic of each loop iteration in Algorithms 3, 4 and 5. For instance, in BitmapXor, ∘ is the xor operation and, in BitmapSet, ∘ is the or operation. For BitmapNext, ∘ is constructed with the logic intrinsic to lines 7 to 10 of Algorithm 5, where collisions are chained to the next unset bit until the bitmap is saturated. It must be noted that, in the three methods, ∘ is commutative (x ∘ y = y ∘ x) and associative ((x ∘ y) ∘ z = x ∘ (y ∘ z)), so modifying the order of the operands does not change the resultant bitmap. Given this observation, we have Theorem 1.
Theorem 1
Let r and s be two sets and let b_r and b_s be their bitmaps produced by BitmapSet, BitmapXor or BitmapNext. The overlap O(r, s) is restricted to the upper bound defined by Equation 2, where ⊕ denotes the bitwise xor:

O(r, s) ≤ (|r| + |s| − popcount(b_r ⊕ b_s)) / 2    (2)
Let the number of common elements of sets r and s be c = |r ∩ s|, so that the number of non-common elements is |r| + |s| − 2c. Since the bitmap generation functions can be represented by an associative and commutative operation ∘, the order of the operands does not affect the final bitmap. So, without loss of generality, the elements of sets r and s are rearranged such that the first c elements are the common ones. Since the first c elements of both sets are the same, their partial bitmaps are also the same (i.e., there are no differing bits so far).
Then, each remaining operand (a non-common element) changes at most one bit in the partial bitmaps b_r and b_s. So, if a bit differs between the final bitmaps b_r and b_s, it was surely caused by a non-common element and, since each element changes at most one bit, the number of differing bits is lower than or equal to the number of non-common elements |r| + |s| − 2c. The number of differing bits between b_r and b_s can be counted by popcount(b_r ⊕ b_s) (the Hamming distance), which yields Equation 3.
popcount(b_r ⊕ b_s) ≤ |r| + |s| − 2 · O(r, s)    (3)
Rearranging Equation 3 for O(r, s) directly yields the upper bound defined by Equation 2.
Figure 4 shows an example of two bitmaps b_r and b_s generated by the BitmapSet method. The Hamming distance is calculated using bitwise operations as popcount(b_r ⊕ b_s). In Figure 4, the xor operation produces a word with 4 ones and, considering Equation 2, the overlap upper bound is (|r| + |s| − 4) / 2. Using the BitmapXor and BitmapNext methods, the overlap upper bound is computed in the same way over their respective bitmaps.
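Equation 2 can be sketched directly; the function name is illustrative, and the example uses collision-free bitmaps so that the bound is tight:

```python
def overlap_upper_bound(len_r, len_s, br, bs):
    hamming = bin(br ^ bs).count("1")      # popcount of the xor (Hamming distance)
    # each differing bit witnesses at least one non-common element (Equation 3)
    return (len_r + len_s - hamming) // 2

# Exact bitmaps for r = {t0, t2, t3} and s = {t0, t3}:
print(overlap_upper_bound(3, 2, 0b1101, 0b1001))  # 2 (the true overlap here)
```

Integer division is safe because, under the three generation methods, the Hamming distance has the same parity as the number of non-common elements.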
3.4 Expected Overlap Upper Bound
Due to collisions in the hash function h, two sets may produce similar bitmaps even if they do not share similar tokens. As the probability of collision increases, the filtering effectiveness is reduced, because the filter loses the capability of distinguishing between similar and dissimilar sets. Intuitively, the chance of a collision increases as the number of tokens in the original sets r and s grows, or as the size b of the bitmaps is reduced. This produces a tradeoff between the filtering efficiency (precision) and the reduction of the bitmap dimension.
Assuming that the hash function distributes the tokens uniformly over the b bits of the bitmaps, a probability analysis can determine, for each bitmap generation method, the expected number of collisions given two sets with n random tokens and, through Equation 2, the expected overlap upper bound. Equations 4, 5 and 6 present the expected overlap upper bound for a given bitmap size b and n tokens, considering BitmapSet, BitmapXor and BitmapNext, respectively. These equations are based on the estimation of collision probabilities in hash functions [15]. Furthermore, the probabilities were verified by a Monte-Carlo simulation, where 100,000 random pairs of sets r and s were generated with n tokens and the overlap upper bound was obtained for each pair. Running the simulation for each method, the average results produced by the simulation were almost identical to those derived from Equations 4, 5 and 6, with an average error below 0.012%.
(4)
(5)
(6)
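A small-scale version of the Monte-Carlo check described above can be sketched as follows. The sampling parameters, the toy hash (token id modulo b) and the function names are illustrative, not the paper's setup:

```python
import random

def bitmap_xor(tokens, b):
    bm = 0
    for t in tokens:
        bm ^= 1 << (t % b)                 # toy hash: token id modulo b
    return bm

def simulate(n, b, trials=1000, seed=42):
    """Average Equation 2 upper bound over random pairs of n-token sets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        r = set(rng.sample(range(10 * n), n))
        s = set(rng.sample(range(10 * n), n))
        hamming = bin(bitmap_xor(r, b) ^ bitmap_xor(s, b)).count("1")
        ub = (len(r) + len(s) - hamming) // 2
        assert ub >= len(r & s)            # the Equation 2 bound always holds
        total += ub
    return total / trials

print(simulate(16, 64))
```

The inline assertion exercises the exactness guarantee: no sampled pair ever exceeds its bound, while the returned average estimates the expected upper bound for random (mostly dissimilar) pairs.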
Figure 5 plots the upper bounds defined by Equations 4, 5, and 6 for n varying from 1 to 256. The left axis presents the normalized overlap, given by the upper bound divided by n, and the right axis presents the equivalent Jaccard upper bound (Table 1). As the normalized value increases, it becomes more difficult to distinguish between similar and non-similar pairs (the lower the value, the better the method). A normalized value of 1 represents an upper bound equal to the number of tokens (a useless filter), and a value of 0 represents a zero upper bound. Consequently, a Jaccard threshold equal to the expected Jaccard upper bound at a given n is expected to filter about 50% of the dissimilar pairs of sets composed of random tokens.
3.5 Cutoff Point for Bitmap Filter
Given a bitmap size b and the overlap threshold, the cutoff point is defined as the maximum number of tokens at which the Bitmap Filter can efficiently distinguish between similar and dissimilar sets. The cutoff point can be defined in terms of the expected upper bound functions (Equations 4, 5 and 6): it is the largest n for which the expected upper bound of a random pair still falls below the required threshold. The Bitmap Filter precision will be effective for sets with up to that many tokens; beyond the cutoff point, the precision drops off significantly, since the filter is no longer able to discard the majority of dissimilar pairs.
Figure 6 presents the cutoff point for BitmapSet, BitmapXor and BitmapNext for several bitmap sizes; a similar plot pattern is observed for any b greater than 64. Analyzing the cutoff difference between BitmapXor and BitmapSet at high thresholds, it is noticeable that BitmapXor presents higher cutoffs. In other words, BitmapXor remains effective for sets with more tokens than BitmapSet.
In Figures 5 and 6, it can be seen that there is a preferable bitmap generation method for each threshold range: BitmapNext presents the highest cutoff for low thresholds, BitmapSet is slightly better for intermediate thresholds, and BitmapXor is preferred for high thresholds. This pattern is observed for any value of b greater than 64. So, a combined bitmap generation can be created using the preferred method for each threshold interval (Algorithm 6). Figure 7 presents the cutoff point for the combined bitmap generation method, where the axis scale is relative to the bitmap size b.
3.6 Bitmap Filter Pseudocode
The pseudocode for the Bitmap Filter is presented in Algorithm 7. This pseudocode is suitable for the Filter-Verification Framework (Algorithm 2) and is divided into two parts: the initialization procedure (lines 1–4), which is run only once for the entire execution, and the Bitmap_Filter function (lines 5–10), which can be executed as the candidate filter or the verification filter of Algorithm 2.
The initialization precomputes the bitmap for every set in the collections R and S. As stated in Algorithm 6, the bitmap generation method is selected according to the threshold. The initialization also calculates the cutoff point (Section 3.5) for the selected bitmap size b and threshold, using the combined method presented in Figure 7.
Then, for a given candidate pair r and s, the function Bitmap_Filter executes the proposed filter. If the set length is above the cutoff point (line 7), the Bitmap Filter is skipped. Otherwise, the filter calculates the overlap upper bound (line 8) as defined in Equation 2. If the upper bound is below the required threshold (line 9), the filter prunes the pair.
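The Bitmap_Filter test can be sketched as a single predicate. This is an illustrative sketch of Algorithm 7's logic, with `cutoff` and `min_overlap` assumed to have been precomputed at initialization; the function name is chosen here:

```python
def bitmap_filter_prunes(len_r, len_s, br, bs, min_overlap, cutoff):
    if len_r > cutoff:                     # beyond the cutoff the filter is skipped
        return False
    hamming = bin(br ^ bs).count("1")
    upper = (len_r + len_s - hamming) // 2 # Equation 2 upper bound
    return upper < min_overlap             # True -> the candidate pair is pruned
```

Returning False both when the filter is skipped and when the bound is attainable keeps the filter conservative: a pair is only pruned when the bound proves it cannot be similar.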
4 Implementation Details
The Bitmap Filter will be evaluated using CPU and GPU implementations, which are detailed in this section.
4.1 CPU Implementation
In order to assess the Bitmap Filter on CPU, we added it to the state-of-the-art implementation of [13], using the available source code. Four algorithms were modified: AllPairs, PPJoin, GroupJoin, and AdaptJoin. These algorithms follow the Filter-Verification Framework (Algorithm 2) and, considering the peculiarities of each algorithm, the Bitmap Filter was introduced either in the candidate generation loop (line 10) or in the verification loop (line 14). If the filter is applied in the candidate generation loop, the bitmap test may be applied more than once to the same candidate pair; if it is applied in the verification loop, the test is applied only once per unique candidate pair.
For AllPairs, PPJoin, and GroupJoin, we chose to insert the Bitmap Filter in the verification loop. In the specific case of the GroupJoin algorithm, the filter is applied after the grouped candidate pairs are expanded, so that the filter is applied to individual pairs.
For AdaptJoin, we verified that the candidate list is relatively small at the verification loop, so the Bitmap Filter is applied during candidate generation instead. The filter is executed only during the first prefix filter iterations (i.e., in the initial prefix schema computation).
The bitmaps were implemented as multiples of 64-bit words and the population count operation was done using the __builtin_popcountll gcc intrinsic. Using the proper compilation flags, gcc converts this function to the POPCNT hardware instruction [18].
4.2 GPU Implementation
The Bitmap Filter relies on bitwise instructions (such as xor and population count) that are efficiently implemented in Graphical Processor Units (GPUs). Compute Unified Device Architecture (CUDA)[14] is the Nvidia general purpose framework that allows parallel execution of procedures (kernels) in Nvidia GPUs. In CUDA, each kernel is executed in parallel by several threads, which are grouped into independent blocks. Threads in the same block share data using a fast shared memory. The CUDA framework also allows the data transfer between host and GPU devices, in order to supply inputs to the kernel and send the output back to the host machine.
In order to show the potential of the Bitmap Filter in the CUDA architecture, we implemented the pseudocode of Algorithm 8 in GPU, where R is the self-joined collection, τ is the Jaccard threshold and L is the returned list of candidate pairs. This code is equivalent to the Naïve Algorithm (Algorithm 1) with the addition of the Length Filter (Section 2.3.2) and the Bitmap Filter (Section 3). The BitmapXor generation method was used and the cutoff point was disabled. The kernel call (line 10) launches a grid of blocks with a fixed number of threads each, and each thread receives a unique thread id (line 2).
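The Length Filter step each thread performs can be sketched as follows (variable names are illustrative, not from Algorithm 8): with sets sorted by size, Jaccard(s_j, s_i) ≤ |s_j|/|s_i| for j < i, so only previous sets of size at least ⌈τ·|s_i|⌉ can reach the threshold τ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the first index j whose set survives the Length Filter when
// comparing against set i, assuming `sizes` is sorted in ascending order.
// A binary search would also work; the linear scan keeps the sketch short.
size_t first_unfiltered(const std::vector<size_t>& sizes, size_t i, double tau) {
    const size_t min_size = static_cast<size_t>(std::ceil(tau * sizes[i]));
    size_t j = 0;
    while (j < i && sizes[j] < min_size) ++j;  // skip sets too small to match
    return j;
}
```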
The kernel may be invoked many times until all sets are processed (line 3). Each thread compares its set with all previous sets, starting from the first index containing a set not filterable by the Length Filter (line 4). If the Bitmap Filter finds a possible candidate pair, this pair is included in a thread-local candidate list (lines 6–7). Each thread-local candidate list can hold up to 2048 candidates. If the thread-local list becomes saturated, all the remaining sets will be considered candidates and will be verified by the CPU.
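The per-thread buffer with its saturation fallback can be modeled as below (a minimal host-side sketch with our own names; the 2048 capacity is the one stated above):

```cpp
#include <cstdint>

// Thread-local candidate buffer. When it saturates, the thread stops
// filtering and raises a flag so the host verifies the remaining pairs
// on the CPU instead of losing candidates.
struct LocalCandidates {
    static constexpr int kCapacity = 2048;
    uint32_t pairs[kCapacity][2];
    int count = 0;
    bool saturated = false;

    void push(uint32_t i, uint32_t j) {
        if (count < kCapacity) {
            pairs[count][0] = i;
            pairs[count][1] = j;
            ++count;
        } else {
            saturated = true;  // remaining pairs go straight to CPU verification
        }
    }
};
```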
At the end of the kernel execution (line 8), the thread-local candidate lists are concatenated into a global candidate list, without empty spaces. This operation, called stream compaction (or reduction), reduces the amount of data transferred to the host. Finally, in the host code, all candidate pairs are verified and the similar pairs are included in the final result list (lines 11–14).
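The compaction step can be modeled serially as follows (a sketch; on the GPU the offset computation would be a parallel scan):

```cpp
#include <cstddef>
#include <vector>

// Packs variable-length per-thread lists into one contiguous list using an
// exclusive prefix sum of the counts, leaving no gaps between lists.
std::vector<int> compact(const std::vector<std::vector<int>>& local_lists) {
    std::vector<size_t> offsets(local_lists.size() + 1, 0);
    for (size_t t = 0; t < local_lists.size(); ++t)
        offsets[t + 1] = offsets[t] + local_lists[t].size();  // exclusive scan
    std::vector<int> global(offsets.back());
    for (size_t t = 0; t < local_lists.size(); ++t)
        for (size_t k = 0; k < local_lists[t].size(); ++k)
            global[offsets[t] + k] = local_lists[t][k];       // gap-free scatter
    return global;
}
```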
The GPU algorithm may produce a quadratic number of Bitmap Filter computations, so its implementation was not intended for collections with more than a million elements. Nevertheless, it was implemented in order to show that even a simple algorithm can take advantage of the Bitmap Filter and become very competitive when compared to the state-of-the-art algorithms. So, we claim that more sophisticated GPU implementations may benefit even more from the Bitmap Filter, opening the opportunity for new improvements for the Set Similarity Join problem.
5 Experimental Results
In order to create a baseline, we replicated the experiments conducted by [13]. Table 4 presents the characteristics of the chosen collections. Comparing this table with [13], slight variations can be seen in the dblp collection, probably due to a different charset. Also, the flickr, netflix and spot collections were no longer available for download at the links provided by [13]. We generated new uniform and zipf collections with the same methodology as [13], using a Poisson distribution for the set sizes. All collections were preprocessed: the sets were sorted by size and, in case of a tie, lexicographically by the token ids. We observed that the lexicographical ordering speeds up all the algorithms, as stated by [13].

Collection  # of sets  set size (max / avg / med)  # uniq. tokens
aol  10154743  245 / 3.01 / 3  3873246
bmspos  320285  164 / 9.30 / 7  1657
dblp  100000  717 / 106.28 / 103  3801
enron  245615  3162 / 135.19 / 86  1113220
kosarak  606770  2498 / 11.93 / 5  41275
livej  3061271  300 / 36.44 / 17  7489073
orkut  2732271  40425 / 119.67 / 29  8730857
uniform  100000  25 / 9.99 / 10  220
zipf  100000  86 / 49.99 / 50  101584
Figure 8 shows the set size distribution of the collections. Orkut, enron and kosarak present very long right tails, whereas the dblp, uniform and zipf collections present more symmetrical frequency distributions. Figure 9 presents the number of occurrences of the tokens in some collections, where a Zipf-like distribution is often observed. This kind of distribution increases the efficiency of the Prefix Filter [13].
The experiments were conducted on a dedicated machine with an Intel i7-3770 CPU (4 cores) at 3.40GHz with 8 GB RAM and an Nvidia GeForce GTX 980 Ti (2048 CUDA cores) with 6 GB RAM, running CentOS Linux 7.2. The gcc compiler was used with the flags -O3 and -march=native. As in [13], all the presented runtimes were taken as the average of 5 runs.
5.1 CPU Experiments
The four best state-of-the-art algorithms reported by [13] were selected for the experiments: AllPairs (all), AdaptJoin (ada), GroupJoin (gro) and PPJoin (ppj). For the baseline experiments, we used the original source code provided by [13]. For the Bitmap Filter experiments, we used the modified source code proposed in Section 4, with the BitmapCombined generation method (Algorithm 6). A default bitmap size was used for most collections, but for the collections with large median set sizes (dblp, enron and zipf) the bitmap size was increased; the same hash function was selected for all experiments. In order to distinguish the modified algorithms that apply the Bitmap Filter, they will be referenced as: AllPairsBF (allbf), AdaptJoinBF (adabf), GroupJoinBF (grobf) and PPJoinBF (ppjbf).
The experiments were conducted over 9 collections (self-joined) and 8 thresholds varying from 0.5 to 0.95, resulting in 72 different input combinations. Each input was assessed by two groups of algorithms: the original algorithms (baseline) and the modified algorithms (with Bitmap Filter). In total, there are 288 experiments for each group of algorithms.
5.1.1 CPU Execution times
Table 5 shows the running times of the original algorithms in columns “orig.” (baseline) and the running times of the modified algorithms in columns “+BF” (with Bitmap Filter). The times do not include the preprocessing overhead (loading files and sorting the input).
Threshold  0.5  0.6  0.7  0.75  0.8  0.85  0.9  0.95
(for each threshold: orig.  +BF)
aol  ADA  437.3  *113.1  119.3  *31.6  24.15  7.74  20.27  6.50  12.22  4.80  6.40  3.80  5.84  3.63  5.81  3.60 
ALL  296.1  183.4  75.9  46.1  9.81  *6.49  6.30  *4.60  3.16  *2.40  1.54  1.28  1.34  1.16  1.33  1.15  
GRO  219.2  207.3  *54.7  51.1  *8.89  7.90  *6.16  5.40  *2.82  2.41  *1.10  *0.93  *0.86  *0.74  *0.84  *0.72  
PPJ  *212.2  166.7  57.1  44.6  10.72  8.20  8.07  6.19  4.27  3.28  2.19  1.71  1.89  1.54  1.87  1.53  
bmspos  ADA  *24.49  *8.79  *8.47  *2.96  *2.61  *0.98  1.63  *0.61  0.88  *0.34  0.42  *0.18  0.26  0.12  0.20  0.10 
ALL  36.63  15.72  10.63  5.75  3.04  1.75  1.69  1.03  0.80  0.50  0.29  0.20  0.13  0.09  0.08  0.06  
GRO  36.11  33.04  10.39  9.32  2.88  2.41  *1.62  1.32  *0.77  0.61  *0.28  0.22  *0.10  *0.08  *0.04  *0.03  
PPJ  31.99  19.86  9.97  6.70  3.06  2.15  1.81  1.29  0.93  0.66  0.38  0.28  0.18  0.13  0.12  0.09  
dblp  ADA  *161.6  *172.7  *75.3  *64.6  *31.01  *14.20  *17.47  *7.06  *8.21  *3.37  *3.23  *1.37  *0.96  *0.42  0.23  0.10 
ALL  177.1  190.7  83.4  74.4  37.45  27.00  23.17  15.34  11.64  7.35  4.60  2.73  1.24  0.71  *0.16  *0.10  
GRO  216.0  228.4  107.7  103.0  39.72  36.72  22.10  20.24  10.42  9.48  4.01  3.74  1.12  1.05  0.17  0.15  
PPJ  199.3  206.7  102.9  93.2  38.86  31.50  21.72  16.94  10.21  7.91  4.00  3.00  1.11  0.82  0.17  0.12  
enron  ADA  *30.35  *31.47  11.05  10.41  4.46  3.66  2.93  2.21  1.91  1.36  1.18  0.86  0.63  0.49  0.32  0.30 
ALL  43.51  45.27  13.02  12.19  3.69  3.05  1.95  1.56  1.06  *0.88  0.59  *0.51  *0.26  *0.24  *0.12  *0.12  
GRO  32.96  33.37  *9.90  9.70  *2.93  2.79  *1.69  1.58  *0.98  0.93  *0.58  0.55  0.27  0.25  0.13  0.13  
PPJ  33.25  32.72  10.35  *9.41  3.07  *2.72  1.77  *1.54  1.03  0.90  0.59  0.53  0.27  0.25  0.12  0.12  
kosarak  ADA  93.83  *27.53  13.59  *4.12  1.53  0.78  1.03  0.56  0.71  0.41  0.47  0.31  0.36  0.25  0.28  0.22 
ALL  58.50  35.59  8.77  5.43  1.04  *0.67  *0.59  *0.41  *0.32  *0.23  *0.15  *0.12  0.09  *0.08  0.06  0.06  
GRO  *37.54  35.81  *5.82  5.50  *1.00  0.90  0.61  0.54  0.33  0.29  0.16  0.14  *0.09  0.08  *0.06  *0.05  
PPJ  39.90  34.95  6.49  5.31  1.08  0.81  0.68  0.51  0.39  0.30  0.19  0.15  0.12  0.10  0.09  0.07  
livej  ADA  *219.4  *192.3  64.59  *56.16  21.39  17.26  14.10  10.77  9.46  6.81  6.14  4.35  3.90  3.18  2.86  2.53 
ALL  365.8  312.6  94.86  79.57  21.01  16.75  10.08  7.90  4.83  3.88  2.30  *1.92  *1.18  *1.08  *0.64  *0.62  
GRO  260.8  263.3  66.41  66.35  16.07  15.91  8.40  8.21  *4.33  4.21  *2.25  2.16  1.24  1.21  0.72  0.70  
PPJ  245.8  225.7  *64.17  57.97  *15.86  *13.91  *8.38  *7.30  4.38  *3.85  2.30  2.04  1.23  1.15  0.70  0.66  
orkut  ADA  162.1  178.9  79.96  81.73  41.89  40.92  29.80  28.72  19.71  18.88  11.91  11.46  7.16  6.87  4.27  4.14 
ALL  220.2  238.6  70.55  74.57  25.10  25.65  15.42  15.71  9.10  9.38  5.12  5.17  *2.81  *2.84  *1.35  *1.36  
GRO  155.2  160.7  56.43  57.34  *22.09  22.53  *14.00  14.02  8.71  8.62  5.18  5.17  3.03  3.11  1.52  1.51  
PPJ  *150.1  *155.4  *54.34  *55.25  22.14  *21.96  14.15  *13.91  *8.68  *8.52  *5.10  *5.11  2.93  2.89  1.42  1.40  
uniform  ADA  *56.58  *15.85  29.19  *6.80  11.03  *2.45  7.21  *1.62  3.93  0.89  1.08  0.28  0.54  0.16  0.17  0.06 
ALL  58.78  27.33  25.36  14.05  8.44  5.30  5.38  3.55  2.81  1.92  0.86  0.58  0.44  0.31  0.15  0.12  
GRO  57.30  38.75  *19.49  12.39  *5.48  3.43  *2.95  1.77  *1.29  *0.72  *0.44  *0.23  *0.22  *0.12  *0.06  *0.03  
PPJ  61.14  33.92  25.35  15.93  8.74  6.14  5.59  4.15  3.00  2.26  1.16  0.86  0.65  0.49  0.21  0.16  
zipf  ADA  *1.09  *0.74  *0.61  0.45  0.40  0.32  0.34  0.28  0.29  0.23  0.23  0.19  0.18  0.15  0.13  0.12 
ALL  1.97  0.85  0.79  *0.38  0.36  *0.20  0.25  *0.15  0.17  *0.11  *0.09  *0.07  *0.05  *0.04  *0.02  *0.02  
GRO  2.15  2.19  0.86  0.85  0.39  0.39  0.27  0.27  0.18  0.17  0.11  0.11  0.06  0.06  0.03  0.03  
PPJ  1.68  1.06  0.70  0.48  *0.33  0.26  *0.23  0.19  *0.16  0.13  0.10  0.08  0.05  0.05  0.03  0.03 
For each of the 72 different inputs in Table 5, an asterisk (*) marks the algorithm that achieved the best runtime within each group of 4 algorithms (separately for the original and the modified groups). Considering only the original algorithms, gro achieved the best runtimes for 33 out of 72 inputs (46%), followed by ada with 15 (21%), all with 13 (18%) and ppj with 11 (15%). Considering only the modified algorithms, adabf and allbf were the best in 25 out of 72 inputs (35% each), followed by ppjbf with 12 (17%) and grobf with 10 (14%). Considering both groups together, adabf and allbf obtained the best runtime in 23 out of 72 inputs (32% each), followed by grobf with 10 (14%), ppjbf with 9 (13%), ppj with 3 (4%), and all and ada with 2 each (3% each). In total, 90% of the inputs achieved their best runtime when using algorithms with the Bitmap Filter.
The runtime sum of all 72 experiments for each original algorithm was ppj:1535s, gro:1561s, all:1878s, ada:1944s. Considering the modified algorithms with Bitmap Filter, the runtime sums and the related reduction when compared to the original algorithms were: adabf:1233s (37%), ppjbf:1359s (11%), grobf:1515s (3%) and allbf:1549s (18%).
Table 6 presents the runtime improvement produced by the Bitmap Filter, computed as (t_orig/t_BF − 1) · 100%, where t_BF is the runtime with the Bitmap Filter and t_orig the original baseline runtime. The Bitmap Filter was able to improve the running times of most experiments, in some situations by more than 300%. Some experiments presented a small slowdown (negative values), but the maximum slowdown was not greater than 6%.
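As a sanity check of the improvement formula against one entry of Table 5: aol with ADA at threshold 0.5 drops from 437.3s to 113.1s, which the formula maps to the ~287% shown in Table 6.

```cpp
// Runtime improvement of the Bitmap Filter, in percent:
// positive values are speedups, negative values are slowdowns.
double improvement_pct(double t_orig, double t_bf) {
    return (t_orig / t_bf - 1.0) * 100.0;
}
```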
Threshold  0.5  0.6  0.7  0.75  0.8  0.85  0.9  0.95
aol  ADA  287%  278%  212%  212%  154%  69%  61%  62% 
ALL  61%  64%  51%  37%  32%  20%  15%  16%  
GRO  6%  7%  13%  14%  17%  18%  18%  18%  
PPJ  27%  28%  31%  30%  30%  28%  22%  23%  
bmspos  ADA  179%  186%  166%  166%  155%  130%  108%  94% 
ALL  133%  85%  73%  65%  59%  50%  36%  27%  
GRO  9%  11%  19%  22%  25%  26%  30%  39%  
PPJ  61%  49%  43%  40%  40%  37%  35%  30%  
dblp  ADA  -6%  16%  118%  148%  144%  137%  125%  121% 
ALL  -7%  12%  39%  51%  58%  69%  75%  63%  
GRO  -5%  5%  8%  9%  10%  7%  8%  11%  
PPJ  -4%  10%  23%  28%  29%  34%  36%  39%  
enron  ADA  -4%  6%  22%  33%  41%  38%  29%  7% 
ALL  -4%  7%  21%  25%  21%  16%  8%  1%  
GRO  -1%  2%  5%  6%  5%  7%  7%  2%  
PPJ  2%  10%  13%  15%  14%  12%  6%  6%  
kosarak  ADA  241%  230%  98%  84%  73%  54%  42%  30% 
ALL  64%  62%  54%  44%  38%  26%  19%  11%  
GRO  5%  6%  11%  12%  11%  7%  12%  12%  
PPJ  14%  22%  34%  32%  29%  24%  19%  19%  
livej  ADA  14%  15%  24%  31%  39%  41%  23%  13% 
ALL  17%  19%  26%  28%  24%  20%  9%  3%  
GRO  -1%  0%  1%  2%  3%  4%  2%  3%  
PPJ  9%  11%  14%  15%  14%  13%  8%  7%  
orkut  ADA  -9%  -2%  2%  4%  4%  4%  4%  3% 
ALL  -8%  -5%  -2%  -2%  -3%  -1%  -1%  -1%  
GRO  -3%  -2%  -2%  0%  1%  0%  -2%  1%  
PPJ  -4%  -2%  1%  2%  2%  0%  2%  2%  
uniform  ADA  257%  329%  350%  344%  342%  282%  231%  170% 
ALL  115%  81%  59%  52%  47%  47%  41%  28%  
GRO  48%  57%  60%  67%  78%  92%  81%  80%  
PPJ  80%  59%  42%  35%  33%  35%  34%  30%  
zipf  ADA  48%  35%  24%  25%  24%  24%  20%  11% 
ALL  130%  108%  78%  69%  54%  30%  19%  0%  
GRO  -2%  1%  2%  0%  6%  2%  4%  0%  
PPJ  59%  45%  30%  21%  19%  16%  8%  0% 
Table 7 shows the runtime improvement of the algorithms with respect to each threshold, considering the runtime average of all collections. All the algorithms presented improvements: adabf was the one with highest gain, varying from 57% up to 121%; allbf presented improvements ranging from 17% to 56%; ppjbf showed improvements from 17% to 27%; grobf was the one with lowest gains, varying from 6% up to 18%.
Threshold  0.5  0.6  0.7  0.75  0.8  0.85  0.9  0.95
Average  ADA  112%  121%  113%  116%  109%  86%  72%  57% 
ALL  56%  48%  44%  41%  37%  31%  25%  17%  
GRO  6%  10%  13%  15%  17%  18%  18%  18%  
PPJ  27%  26%  26%  24%  23%  22%  19%  17% 
Table 8 presents the average improvement for each collection and threshold, considering the average runtimes over the algorithms. A very good improvement is noted for the collections uniform, bmspos, aol, dblp and kosarak, followed by intermediate gains in livej and zipf. The smallest improvements were observed in the collections enron and orkut. Small slowdowns of up to 6% were detected in the collections dblp, enron and orkut at low thresholds (0.5 and 0.6). The gain for each collection is related to the set size distribution (Figure 8) and the number of distinct tokens (Table 4): collections with many small sets and few unique tokens (e.g. uniform and bmspos) present higher improvements than collections with many large sets and many unique tokens (e.g. enron and orkut).
Threshold  0.5  0.6  0.7  0.75  0.8  0.85  0.9  0.95
aol  95%  94%  77%  73%  58%  34%  29%  29% 
bmspos  96%  83%  75%  73%  70%  61%  53%  48% 
dblp  -6%  11%  47%  59%  60%  61%  61%  58% 
enron  -2%  6%  15%  20%  20%  18%  13%  4% 
kosarak  81%  80%  49%  43%  38%  28%  23%  18% 
livej  10%  11%  16%  19%  20%  19%  10%  6% 
orkut  -6%  -3%  0%  1%  1%  1%  1%  1% 
uniform  125%  132%  128%  124%  125%  114%  97%  77% 
zipf  59%  47%  33%  29%  26%  18%  13%  3% 
5.1.2 Filtering Ratio
In order to assess the efficiency of the Bitmap Filter, we collected the number of candidate pairs that were filtered out by the Bitmap Filter. Table 9 presents the filtering ratio, defined as the number of filtered pairs divided by the total number of candidates. The filtering ratio is strongly related to the runtime improvement (Table 6).
Threshold  0.5  0.6  0.7  0.75  0.8  0.85  0.9  0.95
aol  97%  98%  99%  99%  99%  99%  99%  99% 
bmspos  98%  99%  99%  99%  99%  99%  99%  99% 
dblp  11%  54%  97%  99%  99%  99%  99%  99% 
enron  14%  29%  56%  71%  83%  86%  89%  85% 
kosarak  86%  90%  93%  95%  96%  98%  99%  99% 
livej  45%  52%  64%  74%  86%  99%  99%  99% 
orkut  4%  5%  8%  10%  13%  18%  27%  54% 
uniform  99%  99%  99%  99%  99%  99%  99%  99% 
zipf  99%  99%  100%  100%  100%  100%  100%  100% 
Figure 10 presents the filtering ratio for the three bitmap creation algorithms, using a fixed bitmap size and without the cutoff point (Section 3.5). In the figure, BitmapXor was consistently the best option for the three analyzed collections, followed by BitmapSet and BitmapNext. The only situation where BitmapSet was the best option was for the enron collection with Jaccard threshold 0.5; as stated in Section 3.5, BitmapSet is slightly better around this threshold. In the dblp collection, BitmapXor presented filtering ratios up to 6.37 times better than BitmapSet, due to the large set sizes found in dblp. This high filtering ratio difference produced a correspondingly large runtime reduction.
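The difference between the two leading generation methods can be sketched over a single 64-bit word (a hedged illustration: the hash h(t) = t mod 64 is our assumption, and the real bitmaps span multiple words as described in Section 4.1). BitmapSet ORs the hashed bit in, so colliding tokens are absorbed; BitmapXor toggles it, so colliding tokens cancel in pairs, which keeps the XOR-based distance estimate tighter for large sets.

```cpp
#include <cstdint>
#include <vector>

// BitmapSet-style generation: set the hashed bit (collisions absorbed).
uint64_t bitmap_set(const std::vector<int>& tokens) {
    uint64_t b = 0;
    for (int t : tokens) b |= (1ULL << (t % 64));
    return b;
}

// BitmapXor-style generation: toggle the hashed bit (collisions cancel).
uint64_t bitmap_xor(const std::vector<int>& tokens) {
    uint64_t b = 0;
    for (int t : tokens) b ^= (1ULL << (t % 64));
    return b;
}
```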
5.1.3 Filtering Precision
The Bitmap Filter can be considered a binary classification rule that classifies a candidate pair as unfiltered (positive class) or filtered (negative class). Since the Bitmap Filter is an exact method, there is no wrongly filtered pair (false negative), so the Bitmap Filter has a 100% recall (true positives divided by true positives plus false negatives). Nevertheless, there may be some dissimilar pairs that are not filtered (false positives), leading to an extra verification cost. We define the filtering precision as the ratio between similar pairs (true positives) and unfiltered pairs (false positives plus true positives). The number of false positives tends to increase when the bitmaps are too small or when the number of unique tokens is very large.

The cutoff point defined in Section 3.5 disables the Bitmap Filter whenever the filtering precision falls drastically. In order to show this precision drop-off, the filtering precision was measured for the self-joins of the dblp and enron collections, without using the cutoff point and with a fixed bitmap size. Figure 11 shows the average filtering precision for the different set sizes found in each collection, considering Jaccard thresholds varying from 0.50 to 0.80. Although each collection has its own characteristics, the precision drop-off appears at almost the same position in the plots. This position represents the cutoff point: beyond it the filter becomes inefficient. Hence, higher runtime improvements tend to occur in collections in which the majority of sets contain fewer tokens than the observed precision drop-off point.
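The terminology above reduces to a simple computation (a toy illustration in our own names): an exact filter produces no false negatives, so recall is fixed at 100% and only the precision TP / (TP + FP) varies with the bitmap size and token distribution.

```cpp
// Filtering precision: fraction of unfiltered pairs that are truly similar.
// Recall is omitted because an exact filter never discards a similar pair.
double filtering_precision(long true_pos, long false_pos) {
    return static_cast<double>(true_pos) /
           static_cast<double>(true_pos + false_pos);
}
```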
5.2 GPU Experiments
The GPU algorithm was executed over the collections with up to 606,770 sets: bmspos, dblp, enron, kosarak, uniform and zipf. The algorithm was tested with several bitmap sizes and with thresholds from 0.5 to 0.75. The number of CUDA blocks was fixed at 512, each one with 256 threads. Table 10 presents the running times of the GPU algorithm considering, for each input, the bitmap size that presented the best execution time. The running times do not include the file loading time and device initialization, but include the GPU memory allocation, the data transfers from/to the GPU, the GPU kernel execution and the CPU procedures related to the join algorithm. Table 10 also gives the speedup of the GPU implementation compared to the best runtime obtained in the experiments with the baseline CPU algorithms (Table 5).
Threshold  0.5  0.6  0.7  0.75
bmspos  GPU time  0.77  0.41  0.31  0.28 
CPU time  (ada)24.49  (ada)8.47  (ada)2.61  (gro)1.62  
speedup  31.7  20.6  8.3  5.7  
bitset  192bits  128bits  128bits  128bits  
dblp  GPU time  0.28  0.18  0.12  0.09 
CPU time  (ada)161.59  (ada)75.25  (ada)31.01  (ada)17.47  
speedup  577.9  414.8  250.5  190.3  
bitset  512bits  320bits  192bits  192bits  
enron  GPU time  3.27  1.92  1.26  1.00 
CPU time  (ada)30.35  (gro)9.90  (gro)2.93  (gro)1.69  
speedup  9.3  5.2  2.3  1.7  
bitset  2560bits  2560bits  2048bits  1280bits  
kosarak  GPU time  7.65  2.64  1.21  0.95 
CPU time  (gro)37.54  (gro)5.82  (gro)1.00  (all)0.59  
speedup  4.9  2.2  0.8  0.6  
bitset  512bits  256bits  192bits  128bits  
uniform  GPU time  1.96  0.19  0.07  0.06 
CPU time  (ada)56.58  (gro)19.49  (gro)5.48  (gro)2.95  
speedup  28.9  100.5  79.9  47.8  
bitset  192bits  128bits  128bits  128bits  
zipf  GPU time  0.14  0.11  0.09  0.09 
CPU time  (ada)1.09  (ada)0.61 