Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

11/20/2017 · by Edans F. O. Sandes, et al. · University of Brasilia

The Exact Set Similarity Join problem aims to find all similar sets between two collections of sets, with respect to a threshold and a similarity function such as overlap, Jaccard, dice or cosine. The naive approach verifies all pairs of sets and is often considered impractical due to the high number of combinations. So, Exact Set Similarity Join algorithms are usually based on the Filter-Verification Framework, which applies a series of filters to reduce the number of verified pairs. This paper presents a new filtering technique called Bitmap Filter, which is able to accelerate state-of-the-art algorithms for the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create fixed-size bitmaps of b bits, representing characteristics of the sets. Then, it applies bitwise operations (such as xor and population count) on the bitmaps in order to infer a similarity upper bound for each pair of sets. If the upper bound is below a given similarity threshold, the pair of sets is pruned. The Bitmap Filter benefits from the fact that bitwise operations are efficiently implemented by many modern general-purpose processors and was easily applied to four state-of-the-art algorithms implemented in CPU: AllPairs, PPJoin, AdaptJoin and GroupJoin. Furthermore, we propose a Graphics Processing Unit (GPU) algorithm based on the naive approach but using the Bitmap Filter to speed up the computation. The experiments considered 9 collections containing from 100 thousand up to 10 million sets and the joins were made using Jaccard thresholds from 0.50 to 0.95. The Bitmap Filter was able to improve 90% of the experiments in CPU, with speedups of up to 4.50x and 1.43x on average. Using the GPU algorithm, the experiments were able to speed up the original CPU algorithms by up to 577x using an Nvidia GeForce GTX 980 Ti.

1 Introduction

Data analysts often need to identify similar records stored in multiple data collections. This operation is very common for data cleaning [7], element clustering [5] and record linkage [8, 22]. In some scenarios the similarity analysis aims to detect near duplicate records [25], in which slightly different data representations may be caused by input errors, misspelling or use of synonyms [1]. In other scenarios, the goal is to find patterns or common behaviors between real entities, such as purchase patterns [13] and similar user interests [20].

Set Similarity Join is the operation that identifies all similar sets between two collections of sets (or within the same collection) [13, 12]. In this case, each record is represented as a set of elements (tokens). For example, the text record “Data Warehouse” may be represented as a set of two words {Data, Warehouse} or a set of bigrams {Da, at, ta, a␣, ␣W, Wa, ar, re, eh, ho, ou, us, se}. Two records are considered similar according to a similarity function, such as Overlap, Dice, Cosine or Jaccard [2, 21]. For instance, two records may be considered similar if the overlap (intersection) of their bigrams is greater than or equal to 6, as in “Databases” and “Databazes”.
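To make the example concrete, the short C++ sketch below (our own illustration, not code from the paper) tokenizes two strings into bigram sets and computes their overlap and Jaccard similarity; for “Databases” and “Databazes” it reports an overlap of 6, matching the example above.

#include <iostream>
#include <set>
#include <string>

// Tokenize a string into its set of character bigrams.
std::set<std::string> bigrams(const std::string& text) {
    std::set<std::string> tokens;
    for (size_t i = 0; i + 1 < text.size(); ++i)
        tokens.insert(text.substr(i, 2));
    return tokens;
}

// Overlap (intersection size) between two token sets.
size_t overlap(const std::set<std::string>& r, const std::set<std::string>& s) {
    size_t count = 0;
    for (const auto& token : r)
        if (s.count(token)) ++count;
    return count;
}

int main() {
    auto r = bigrams("Databases");
    auto s = bigrams("Databazes");
    size_t o = overlap(r, s);
    double jaccard = static_cast<double>(o) / (r.size() + s.size() - o);
    std::cout << "overlap = " << o << ", Jaccard = " << jaccard << "\n";
}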

Many different solutions have been proposed in the literature for Set Similarity Join [10, 7, 3, 25, 23, 4, 9, 27, 17, 6, 26, 19, 28]. With respect to the guarantee that all similar pairs will be found, these solutions are divided into exact approaches [10, 7, 3, 25, 23, 4, 28] and approximate approaches [9, 27, 17, 6, 26, 19]. The approximate algorithms usually rely on the Locality Sensitive Hashing (LSH) technique [11] and they are very competitive, but in the scope of this paper only the exact approaches will be considered.

The solutions for Exact Set Similarity Join are usually based on the Filter-Verification Framework [13, 24], which is divided into two stages: a) the candidate generation uses a series of filters to produce a reduced list of candidate pairs; b) the verification stage verifies each candidate pair in order to check which ones are similar, with respect to the selected similarity function and threshold.

The filtering stage is commonly based on the Prefix Filter [7, 3] and the Length Filter [10], which are able to prune a considerable number of candidate pairs without compromising the exact result. The Prefix Filter [7, 3] is based on the idea that two similar sets must share at least one token among their initial elements, given a defined element ordering. The Length Filter [10] prunes candidates according to their lengths, since sets with a large difference in length cannot be highly similar. Other filters have also been proposed, such as the Positional Filter [25], which is able to prune candidate pairs based on the position where the token occurs in the set, and the Suffix Filter [25], which prunes pairs by applying a binary search on the last elements of the sets.

There are many algorithms in the literature combining the aforementioned filters. AllPairs [3] uses the Prefix and Length filters; PPJoin [25] extends AllPairs with the Positional Filter; PPJoin+ [25] extends PPJoin with the Suffix Filter; AdaptJoin [23] extends PPJoin using a variable schema with adaptive prefix lengths; and GroupJoin [4] applies the filters to groups of similar sets, avoiding redundant computations. These algorithms usually rely on the assumption that the verification stage contributes significantly to the overall running time. However, in [13] a new verification procedure using an early termination condition was proposed that significantly reduced the execution time of the verification phase. With this, the proportion of execution time spent in the candidate generation phase increased and, as a consequence, the simplest filtering schemes are now able to produce the best performance. In [13], the best results were obtained by AllPairs, PPJoin, GroupJoin and AdaptJoin, with AllPairs being the best algorithm in the largest number of experiments [13]. In contrast, PPJoin+ [25], which uses a sophisticated suffix filtering technique, was the slowest algorithm in [13].

The state-of-the-art set similarity join algorithms still need to be improved in order to allow faster joins, otherwise very large collections may not be processed in a reasonable time. Based on recent findings on the filter-verification trade-off [13], the authors of that study argue that future filtering techniques should invest in fast and simple candidate generation methods instead of sophisticated filter-based algorithms; otherwise, the filters' effectiveness will not pay off their overhead when compared to the verification procedure.

The main contribution of this paper is a new low-overhead filter called Bitmap Filter, which is able to efficiently prune candidate pairs and improve the performance of many state-of-the-art exact set similarity join algorithms such as AllPairs [3], PPJoin [25], AdaptJoin [23] and GroupJoin [4]. Using bitwise operations implemented by most modern processors, the Bitmap Filter is able to speed up the AllPairs, PPJoin, AdaptJoin and GroupJoin algorithms by up to 4.50x (1.43x on average), considering 9 different collections and Jaccard thresholds varying from 0.50 to 0.95. As far as we know, there is no other filter for the Exact Set Similarity Join problem that is based on efficient bitwise operations.

The Bitmap Filter is sufficiently flexible to be used in the candidate generation or verification stages. Furthermore, it can be efficiently implemented in devices such as Graphics Processing Units (GPUs). As a secondary contribution of this paper, we thus propose a GPU implementation of the Bitmap Filter that is able to speed up the join by up to 577x using an Nvidia GeForce GTX 980 Ti card, considering 6 collections of up to 606,770 sets.

The rest of this paper is structured as follows. Section 2 presents the background of the Set Similarity Join problem. Then, Section 3 proposes the Bitmap Filter and explains it in detail. Section 4 shows the design of the proposed CPU and GPU implementations. Section 5 presents the experimental results and Section 6 concludes the paper.

2 Background

In this section, we discuss the Set Similarity Join problem. First, the problem is formalized (Section 2.1). Then, the Filter-Verification Framework is presented (Section 2.2). After that, some commonly used filters employed in the Filter-Verification Framework are explained (Section 2.3). Finally, state-of-the-art algorithms are described (Section 2.4).

2.1 Problem Definition

Given two collections R and S, where the sets r ∈ R and s ∈ S are composed of tokens from a universe U (r ⊆ U and s ⊆ U), the Set Similarity Join problem aims to find all pairs ⟨r, s⟩ from R and S that are considered similar with respect to a given similarity function Sim, i.e. Sim(r, s) ≥ t, where t is a user-defined threshold [13, 25]. When R and S are the same collection, the problem is defined as a self-join; otherwise it is called an R×S join. The set similarity join operation can be written as presented in Equation 1 [13].

R ⋈ S = {⟨r, s⟩ ∈ R × S | Sim(r, s) ≥ t}     (1)

The most used similarity functions for comparing sets are Overlap, Jaccard, Cosine and Dice [2, 21], as presented in Table 1. The Overlap function O(r, s) returns the number of elements in the intersection of sets r and s, i.e. |r ∩ s|. Jaccard, Cosine and Dice are similarity functions that normalize the overlap by dividing the intersection by the union (Jaccard) or by a combination of the lengths |r| and |s| (Cosine and Dice). The similarity functions can be converted to each other using the threshold equivalences presented in Table 1 [12, 16], where t_O, t_J, t_C and t_D are the overlap, Jaccard, cosine and dice thresholds, respectively.

Sim. Function | Definition | Equivalent Overlap Threshold
Overlap | O(r, s) = |r ∩ s| | t_O
Jaccard | J(r, s) = |r ∩ s| / |r ∪ s| | t_O = (t_J / (1 + t_J)) · (|r| + |s|)
Cosine | C(r, s) = |r ∩ s| / √(|r| · |s|) | t_O = t_C · √(|r| · |s|)
Dice | D(r, s) = 2 · |r ∩ s| / (|r| + |s|) | t_O = (t_D / 2) · (|r| + |s|)
Table 1: Similarity functions and their equivalence to overlap [12, 16]
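Assuming the standard equivalences listed in Table 1, the following C++ sketch converts a normalized threshold into the minimum overlap a pair must reach; the function names and example sizes are our own.

#include <cmath>
#include <cstdio>

// Equivalent minimum overlap required for a pair with sizes (|r|, |s|) to reach
// a normalized similarity threshold t, following the equivalences of Table 1.
double min_overlap_jaccard(double t, double lr, double ls) { return t / (1.0 + t) * (lr + ls); }
double min_overlap_cosine (double t, double lr, double ls) { return t * std::sqrt(lr * ls); }
double min_overlap_dice   (double t, double lr, double ls) { return t / 2.0 * (lr + ls); }

int main() {
    double lr = 10, ls = 12, t = 0.8;
    std::printf("Jaccard >= %.2f needs overlap >= %.2f\n", t, min_overlap_jaccard(t, lr, ls));
    std::printf("Cosine  >= %.2f needs overlap >= %.2f\n", t, min_overlap_cosine(t, lr, ls));
    std::printf("Dice    >= %.2f needs overlap >= %.2f\n", t, min_overlap_dice(t, lr, ls));
}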

To solve the set similarity join problem, the naïve algorithm (Algorithm 1) compares all pairs from collections R and S and verifies whether the similarity of each pair (with respect to the similarity function) is greater than or equal to the desired threshold. Since this method verifies |R| · |S| pairs, it does not scale to a large number of sets.

1: Input: set collections R and S; threshold t
2: Output: set P of similar pairs
3: for r ∈ R do
4:     for s ∈ S do
5:         if Sim(r, s) ≥ t then
6:             P ← P ∪ {⟨r, s⟩}
7: return P
Algorithm 1 Naïve Algorithm
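For reference, a straightforward C++ rendering of Algorithm 1 for the Jaccard function is sketched below (identifiers are our own); it verifies every pair and therefore performs |R| · |S| comparisons.

#include <utility>
#include <vector>

using Set = std::vector<int>;  // tokens stored as sorted integer ids

// Overlap between two sorted token vectors (merge-style intersection count).
int overlap(const Set& r, const Set& s) {
    int count = 0; size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if (r[i] == s[j]) { ++count; ++i; ++j; }
        else if (r[i] < s[j]) ++i;
        else ++j;
    }
    return count;
}

double jaccard(const Set& r, const Set& s) {
    int o = overlap(r, s);
    return static_cast<double>(o) / (r.size() + s.size() - o);
}

// Naive join: verify every pair <r, s> against the Jaccard threshold t.
std::vector<std::pair<int, int>> naive_join(const std::vector<Set>& R,
                                            const std::vector<Set>& S, double t) {
    std::vector<std::pair<int, int>> result;
    for (size_t i = 0; i < R.size(); ++i)
        for (size_t j = 0; j < S.size(); ++j)
            if (jaccard(R[i], S[j]) >= t) result.emplace_back(i, j);
    return result;
}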

2.2 Filter-Verification Framework

Many algorithms, such as AllPairs[3], PPJoin[25], PPJoin+[25], AdaptJoin[23], and GroupJoin [4], have been proposed to reduce the number of verification operations, aiming to solve the set similarity join problem efficiently. These algorithms typically employ filtering techniques using the Filter-Verification Framework, as shown in Algorithm 2.

1: Input: set collections R and S; threshold t
2: Output: set P of similar pairs
3: /* Initialization */
4: I ← index(S)
5: for r ∈ R do
6:     /* Candidate Generation Stage */
7:     C ← ∅
8:     for token e ∈ filter₁(r) do
9:         for s ∈ I[e] do
10:             if not filter₂(r, s) then
11:                 C ← C ∪ {s}
12:     /* Verification Stage */
13:     for s ∈ C do
14:         if not filter₃(r, s) then
15:             if Sim(r, s) ≥ t then
16:                 P ← P ∪ {⟨r, s⟩}
17: return P
Algorithm 2 Filter-Verification Framework

In the initialization step, the collection S is indexed (line 4) with respect to the tokens found in each set s ∈ S, such that I[e] stores all sets containing token e. Then, for each set r ∈ R, the algorithm iterates in a two-phase loop: candidate generation and verification.

In the candidate generation loop (lines 8-11), the algorithm iterates over the tokens of set r (restricted by function filter₁ in line 8) in order to find, in the index entry I[e], all sets s that also contain token e. Each set s may also be filtered (function filter₂ in line 10) and, if it is not pruned, it is inserted into the candidate list C.

The verification stage (lines 13-16) checks, for each unique candidate s ∈ C, whether the pair ⟨r, s⟩ is similar with respect to the similarity function and threshold t (line 15) and, if so, the pair is inserted into the list P of similar pairs (line 16). A filter may also be applied in the verification loop (function filter₃ in line 14) to reduce the number of verified pairs. Finally, the similar pairs are returned (line 17).
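The skeleton below is a condensed C++ sketch of Algorithm 2, with the three filters left as caller-supplied hooks; it illustrates only the control flow, and the hook signatures are our own simplification, not the interface used in [13].

#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Set = std::vector<int>;

// Condensed sketch of the Filter-Verification Framework (Algorithm 2).
// A filter hook returning true means "prune this candidate".
std::vector<std::pair<int, int>> filter_verify(
        const std::vector<Set>& R, const std::vector<Set>& S, double t,
        std::function<size_t(const Set&)> probe_len,                 // filter1: how many tokens of r to probe
        std::function<bool(const Set&, const Set&)> filter2,         // candidate-generation filter
        std::function<bool(const Set&, const Set&)> filter3,         // verification filter
        std::function<bool(const Set&, const Set&, double)> verify)  // exact similarity check
{
    // Initialization: inverted index of S (token -> ids of sets containing it).
    std::unordered_map<int, std::vector<int>> index;
    for (size_t j = 0; j < S.size(); ++j)
        for (int e : S[j]) index[e].push_back(static_cast<int>(j));

    std::vector<std::pair<int, int>> P;
    for (size_t i = 0; i < R.size(); ++i) {
        const Set& r = R[i];
        // Candidate generation stage: probe only the selected tokens of r.
        std::unordered_set<int> C;
        for (size_t p = 0; p < probe_len(r) && p < r.size(); ++p)
            for (int j : index[r[p]])
                if (!filter2(r, S[j])) C.insert(j);
        // Verification stage.
        for (int j : C)
            if (!filter3(r, S[j]) && verify(r, S[j], t))
                P.emplace_back(static_cast<int>(i), j);
    }
    return P;
}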

2.3 Filtering Strategies

This section explains the most widely used filtering techniques in the literature.

(a) Prefix Filter Technique
(b) Positional Filter Technique
(c) Suffix Filter Technique
(d) l-Prefix Filter
Figure 1: Filtering strategies

2.3.1 Prefix Filter

The Prefix Filter technique [7, 3] creates an index over the first tokens of each set, considering that the tokens are sorted, usually by token frequency. The prefix of each set is sized such that, if two sets r and s are similar (i.e. O(r, s) ≥ t), then their prefixes must share at least one token. For instance, Figure 1(a) shows two sets r and s: if the overlap must be greater than or equal to t and there is no match in the prefixes (white cells), then it is impossible to reach t overlapping tokens using only the remaining cells (in gray). In fact, considering the overlap similarity function, the prefix size for a set r with size |r| is prefix(r) = |r| − t + 1 [3]. The prefix sizes for the other similarity functions are presented in Table 2 [12].

After the prefix index is created, the algorithm produces a list of candidate pairs for each set r. This list is composed of all sets stored in the index entries pointed to by the tokens in prefix(r). The candidate pairs are then verified using the similarity function and, if the threshold is attained, the pair is marked as similar.
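A minimal sketch of the prefix-length computation for the overlap case is shown below; the example numbers are our own.

#include <cstdio>

// Prefix length for the overlap threshold t: if two sets must share at least
// t tokens, a prefix of |r| - t + 1 sorted tokens of r must contain at least
// one of the shared tokens.
int prefix_length_overlap(int set_size, int t) { return set_size - t + 1; }

int main() {
    // Example: |r| = 10 tokens, required overlap t = 6 -> probe only the
    // first 5 tokens of r during candidate generation.
    std::printf("prefix length = %d\n", prefix_length_overlap(10, 6));
}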

2.3.2 Length Filter

The Length Filter [10] builds on the idea that two sets r and s with very different lengths tend to be dissimilar. Given the size |r| of set r and the threshold t, Table 2 presents the lower and upper bounds for the size |s| such that, outside of these bounds, the similarity is surely lower than the threshold. The bounds are given for the overlap, Jaccard, cosine and dice similarity functions [13].

Function | Lower/Upper Bounds for |s| | Prefix Length
Overlap | t ≤ |s| | |r| − t + 1
Jaccard | t·|r| ≤ |s| ≤ |r|/t | |r| − ⌈t·|r|⌉ + 1
Cosine | t²·|r| ≤ |s| ≤ |r|/t² | |r| − ⌈t²·|r|⌉ + 1
Dice | (t/(2−t))·|r| ≤ |s| ≤ ((2−t)/t)·|r| | |r| − ⌈(t/(2−t))·|r|⌉ + 1
Table 2: Filter Parameters per Similarity Function

Using this filter, the algorithms can ignore any set whose size is outside the defined bounds. Considering that the sets in each index list are ordered by length, whenever a set from an index list contains more elements than the defined upper bound, all the remaining sets of that list can be ignored.
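The sketch below applies the Jaccard length bounds of Table 2 to decide whether a candidate can be skipped; the helper name and the example numbers are our own.

#include <cstdio>

// Length Filter for the Jaccard threshold t: any set s with |s| < t * |r|
// or |s| > |r| / t cannot reach Jaccard(r, s) >= t.
bool length_filter_prune_jaccard(int lr, int ls, double t) {
    double lower = t * lr;
    double upper = lr / t;
    return ls < lower || ls > upper;
}

int main() {
    // |r| = 20, t = 0.8 -> candidates must have 16 <= |s| <= 25.
    std::printf("|s| = 10 pruned? %d\n", length_filter_prune_jaccard(20, 10, 0.8));
    std::printf("|s| = 22 pruned? %d\n", length_filter_prune_jaccard(20, 22, 0.8));
}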

2.3.3 Positional Filter

The Positional Filter [25] tracks the position where a given token appears in the sets. Using this position, it is possible to determine a tighter upper bound for the similarity metric, which tends to reduce the number of candidate pairs. For example, Figure 1(b) presents two sets r and s whose prefix tokens are shown with their positions in the sets. The first matching token (“D”) occurs at position 3 in set r; given this position, the position of the match in s, and the sizes |r| and |s|, it can be inferred that the maximum possible overlap is 1 + min(|r| − 3, |s| − p_s), where p_s is the position of the match in s. Using the same idea, the minimum possible union is |r| + |s| minus this maximum overlap and, thus, an upper bound for the Jaccard coefficient is obtained. Since, in the example, this upper bound is below the threshold t_J, the candidate can be pruned. The same idea can be applied to the cosine and dice similarity functions.
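The bound described above can be computed as follows; the positions and sizes in the example are hypothetical, chosen only to illustrate the formula.

#include <algorithm>
#include <cstdio>

// Positional Filter bound: if the first common token appears at (1-based)
// position pr in r and ps in s, at most 1 + min(|r| - pr, |s| - ps) tokens
// can still overlap; the minimum union follows, giving a Jaccard upper bound.
double jaccard_upper_bound(int lr, int ls, int pr, int ps) {
    int max_overlap = 1 + std::min(lr - pr, ls - ps);
    int min_union = lr + ls - max_overlap;
    return static_cast<double>(max_overlap) / min_union;
}

int main() {
    // Hypothetical example: |r| = 10, |s| = 10, first match at positions 3 and 1.
    double ub = jaccard_upper_bound(10, 10, 3, 1);
    std::printf("Jaccard upper bound = %.3f\n", ub);  // 8 / 12 = 0.667
}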

2.3.4 Suffix Filter

The Suffix Filter [25] takes one probe token from the suffix of set r and applies a binary search over set s in order to divide both sets into two parts (left/right). Since the tokens in each set are sorted, tokens in the left part of one set cannot overlap with tokens in the right part of the other set. Using this idea, a tighter upper bound for the overlap between two sets can be inferred. For example, Figure 1(c) shows two sets r and s for which the required overlap is 6. In the figure, prefixes are represented by white cells and suffixes by gray cells. Comparing the tokens in the prefixes, it is possible to find one match (token “C”), but 5 additional matches still need to be found. The Suffix Filter then chooses the middle token (“X”) from the suffix of set r and seeks the same token in set s. In the figure, the token “X” divides the suffix of r into two subsets with two elements each, and the set s is divided into two subsets, with one element to the left and three elements to the right. Considering that the elements are sorted, the maximum number of additional overlaps is one element to the left of “X” and two elements to the right which, together with the prefix match and the probe token itself, leads to an upper bound of 5. Since this is less than the required overlap (6), the pair can be pruned.

2.3.5 Variable-Length Prefix Filter

The original Prefix Filter relies on the idea that the prefixes of sets r and s must share at least one token in order to satisfy the threshold. The Variable-Length Prefix Filter [23] extends this idea using adaptive prefix lengths. For example, using an l-prefix schema, at least l common tokens are required in the prefixes. The prefix size for a set r with size |r| in an l-prefix schema is prefix_l(r) = |r| − t + l, where t is the required overlap. Figure 1(d) shows an example with sets r and s: considering the required overlap t, the 2-prefix schema chooses prefixes of sizes |r| − t + 2 and |s| − t + 2. Since the prefixes (white cells) share fewer than two tokens, it is impossible to satisfy the threshold, so the pair is pruned.

2.4 Algorithms

This section presents an overview of state-of-the-art algorithms that use the Filter-Verification Framework (Algorithm 2). Table 3 shows five algorithms, their acronyms and the filters used by each one. In this paper, we focus on the four best algorithms as stated by [13]: AllPairs, PPJoin, AdaptJoin, and GroupJoin.

Acronym | Algorithm | Filters
all | AllPairs [3] | Prefix, Length
ppj | PPJoin [25] | Prefix, Length/Pos.
ppj+ | PPJoin+ [25] | Prefix, Length/Pos./Suffix
gro | GroupJoin [4] | Prefix, Length/Pos.
ada | AdaptJoin [23] | l-Prefix, Length
Table 3: Algorithms and the selected filters.

AllPairs [3] is an algorithm that uses the Prefix Filter (Section 2.3.1) and the Length Filter (Section 2.3.2). In the Filter-Verification Framework, the Prefix Filter is applied as filter₁ and the Length Filter is applied as filter₂.

PPJoin [25] extends the AllPairs algorithm with the Positional Filter (Section 2.3.3). In the Filter-Verification Framework, the Prefix Filter is applied as filter₁ and the Length and Positional filters are applied as filter₂.

PPJoin+ [25] extends the PPJoin algorithm with the Suffix Filter (Section 2.3.4) over the candidate pairs. In the Filter-Verification Framework, the Prefix Filter is applied as filter₁ and the Length, Positional and Suffix filters are applied as filter₂.

AdaptJoin [23] is an algorithm that uses the Length Filter (Section 2.3.2) and the Variable-Length Prefix Filter (Section 2.3.5). The Filter-Verification Framework is modified such that additional iterations are executed with different prefix sizes, according to the characteristics of the collection. The Variable-Length Prefix Filter is applied as filter₁ and the Length Filter is applied as filter₂.

GroupJoin [4] applies the Prefix Filter (Section 2.3.1), Length Filter (Section 2.3.2) and Positional Filter (Section 2.3.3) over groups of sets sharing the same prefixes and sizes. In the verification stage of the Filter-Verification Framework, the grouped candidate pairs are expanded into individual sets and all possible pairs are verified. With this grouping schema, the filters can be applied to only one representative of each group, which considerably reduces the number of comparisons for groups with many elements. Figure 2 presents two groups, each containing 3 sets sharing the same prefix.

Figure 2: Grouped elements in GroupJoin

3 Bitmap Filter

In this section we propose the Bitmap Filter, a new bitwise technique capable of speeding up set similarity joins without compromising the exact result. Bitwise operations are widely supported by modern microprocessors and most compilers are able to map them to fast hardware instructions.

First, some preliminaries are given in Section 3.1. Then, Section 3.2 presents how the bitmap is generated. Section 3.3 shows how the overlap upper bound can be calculated using bitmaps and Section 3.4 derives the expected upper bound. Then, in Section 3.5, a cutoff point is defined in order to determine when the Bitmap Filter is effective. Finally, the pseudocode of the Bitmap Filter is given in Section 3.6.

3.1 Preliminaries

The similarity functions between two sets r and s can be calculated over their binary representations. For instance, sets r and s can be mapped to binary sets (bit vectors) b_r and b_s, where a bit set at position i represents the occurrence of token i in the corresponding set. In this representation, the size of the binary sets must be sufficiently large to map all possible tokens in the collection. Considering that each bit in the binary set is a dimension, collections with a large number of distinct tokens will produce high-dimensional binary sets that are usually very sparse.

Modern microprocessors implement many bitwise instructions that can be used to speed up the computation of similarity functions over binary sets. One of those instructions is the population count (popcount), which counts the number of “ones” in the binary representation of a numeric word [18]. For example, the overlap of the sets r and s mentioned in the previous paragraph can be calculated as popcount(b_r AND b_s), using only two instructions on general microprocessors (one AND instruction and one POPCNT instruction). The same idea can be extended to the Jaccard coefficient: J(r, s) = popcount(b_r AND b_s) / popcount(b_r OR b_s).
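As an illustration, the snippet below computes the overlap and the Jaccard coefficient of two binary sets that fit in a single 64-bit word, using the gcc __builtin_popcountll intrinsic mentioned in Section 4.1; the token-to-bit mapping is a toy assumption of ours.

#include <cstdint>
#include <cstdio>

// Overlap and Jaccard over binary (bit-vector) set representations of up to
// 64 tokens, using one AND/OR plus one population count each.
int overlap_bits(uint64_t br, uint64_t bs) { return __builtin_popcountll(br & bs); }

double jaccard_bits(uint64_t br, uint64_t bs) {
    return static_cast<double>(__builtin_popcountll(br & bs)) /
           __builtin_popcountll(br | bs);
}

int main() {
    // In this toy example, tokens 0..63 map directly to bit positions.
    uint64_t br = 0b101101;  // tokens {0, 2, 3, 5}
    uint64_t bs = 0b100101;  // tokens {0, 2, 5}
    std::printf("overlap = %d, Jaccard = %.3f\n", overlap_bits(br, bs), jaccard_bits(br, bs));
}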

High dimensionality binary sets may be implemented by the concatenation of two or more fixed-length words, but as the dimensionality increases, more instructions are necessary to calculate the similarities, decreasing the performance of the computation. The Bitmap Filter proposed in this section reduces the dimensionality of the binary sets without compromising the exactness of the Set Similarity Join results. Using a binary set with reduced size, called Bitmap from now on, the bitwise instructions may be much more effective.

3.2 Bitmap Generation

Consider a set s composed of tokens e₁, e₂, …, e_{|s|}. Let b_s be a bitmap with b bits representing set s, and let b_s[i] denote the bit at position i. Let h be a hash function that maps each token e to a specific bit position h(e) ∈ [0, b).

To create the bitmap b_s from the set s, we define a generation function that maps the tokens of s into the bitmap b_s. Three different generation functions are proposed in this paper: Bitmap-Set, Bitmap-Xor and Bitmap-Next. The difference between them is in how they handle collisions produced by the hash function h.

Bitmap-Set: For each element e ∈ s, the bit at position h(e) is set. If multiple elements map to the same position, only the first element effectively changes the bit status (the other elements leave the bit set to ‘one’). Algorithm 3 presents the pseudocode for this function.

1: function Bitmap-Set(s)
2:     b_s ← 0
3:     for e ∈ s do
4:         b_s[h(e)] ← 1
5:     return b_s
Algorithm 3 Bitmap-Set

Bitmap-Xor: For each element e ∈ s, the bit at position h(e) has its value flipped (0 → 1 or 1 → 0). In the end, the bit b_s[i] will remain set only if an odd number of elements is mapped to position i. Algorithm 4 presents the pseudocode for this function, where ⊕ is the xor operation.

1: function Bitmap-Xor(s)
2:     b_s ← 0
3:     for e ∈ s do
4:         b_s[h(e)] ← b_s[h(e)] ⊕ 1
5:     return b_s
Algorithm 4 Bitmap-Xor

Bitmap-Next: For each element e ∈ s, if the bit at position h(e) is unset, set it to 1; otherwise, set the next unset bit after position h(e), wrapping around to the beginning of the bitmap if it reaches the last bit. This method ensures that there will be exactly |s| ones in the bitmap, unless |s| ≥ b, in which case all bits will be set. Algorithm 5 presents the pseudocode for this method.

1: function Bitmap-Next(s)
2:     if |s| ≥ b then
3:         b_s ← bitmap with all b bits set
4:     else
5:         b_s ← 0
6:         for e ∈ s do
7:             i ← h(e)
8:             while b_s[i] = 1 do
9:                 i ← (i + 1) mod b
10:             b_s[i] ← 1
11:     return b_s
Algorithm 5 Bitmap-Next
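The three generation functions can be sketched in C++ as follows, assuming a single 64-bit word (b = 64) and a placeholder hash function; the actual implementation described in Section 4 uses bitmaps made of multiples of 64-bit words.

#include <cstdint>
#include <vector>

using Bitmap = uint64_t;
constexpr int kBits = 64;

inline int h(int token) { return token % kBits; }   // placeholder hash function

Bitmap bitmap_set(const std::vector<int>& s) {
    Bitmap b = 0;
    for (int e : s) b |= (Bitmap{1} << h(e));        // collisions leave the bit set
    return b;
}

Bitmap bitmap_xor(const std::vector<int>& s) {
    Bitmap b = 0;
    for (int e : s) b ^= (Bitmap{1} << h(e));        // bit stays set iff an odd number of tokens map to it
    return b;
}

Bitmap bitmap_next(const std::vector<int>& s) {
    if (s.size() >= kBits) return ~Bitmap{0};        // saturated: all bits set
    Bitmap b = 0;
    for (int e : s) {
        int i = h(e);
        while (b & (Bitmap{1} << i)) i = (i + 1) % kBits;  // chain collisions to the next unset bit
        b |= (Bitmap{1} << i);
    }
    return b;
}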

Figure 3 shows an example of bitmap generation considering a bitmap of b bits and a set with 10 tokens. The hash function is represented by the arrows pointing to each bitmap slot. The three bitmaps were produced by Bitmap-Set, Bitmap-Xor and Bitmap-Next, respectively from top to bottom. Compared to Bitmap-Set, the bitmap generated by Bitmap-Xor has 1 different bit and the one generated by Bitmap-Next has 3 different bits. Each generation method fits better in different circumstances, as explored in Sections 3.4 and 3.5.

Figure 3: Bitmap example

3.3 Overlap Upper-bound using Bitmaps

Given two sets r and s and their respective bitmaps b_r and b_s produced by one of the three functions (Section 3.2), this section shows how to infer an upper bound for the overlap similarity function O(r, s). Whenever the upper bound is not sufficient to attain the similarity threshold given as input, the pair of sets r and s may be pruned.

Intuitively, for each occurrence of a different bit at the same position in bitmaps b_r and b_s (i.e. b_r[i] ≠ b_s[i]), we can infer that there is at least one element from r or s that does not occur in the intersection of the sets. Given that, the maximum overlap between these sets is (|r| + |s| − H)/2, where H is the number of differing bits (xor) between their bitmaps, also known as the Hamming distance.

In order to prove that the overlap upper bound holds for Bitmap-Set, Bitmap-Xor and Bitmap-Next, note that the bitmap generation functions may be represented as a series of algebraic operations. Let B_e be the bitmap with a single bit set at position h(e). Each of the functions Bitmap-Set, Bitmap-Xor and Bitmap-Next corresponds to a series of binary operations over these bitmaps, as in b_s = B_{e1} ∘ B_{e2} ∘ … ∘ B_{e|s|}; for simplicity, this series will also be represented by the prefix notation ∘(B_{e1}, B_{e2}, …, B_{e|s|}). The operation ∘ is associated with the logic of each loop iteration in Algorithms 3, 4 and 5. For instance, in Bitmap-Xor, ∘ is the xor operation and, in Bitmap-Set, ∘ is the or operation. The Bitmap-Next operation is constructed with the same logic intrinsic to lines 7 to 10 in Algorithm 5, where collisions are chained to the next unset bit until the bitmap is saturated. It must be noted that, in the three methods, ∘ is commutative and associative, so modifying the order of the operands B_{ei} does not change the resultant bitmap. Given this observation, we have Theorem 1.

Theorem 1

Let r and s be two sets and b_r and b_s their bitmaps produced by Bitmap-Set, Bitmap-Xor or Bitmap-Next. The overlap O(r, s) is restricted to the upper bound defined by Equation 2:

O(r, s) ≤ (|r| + |s| − popcount(b_r ⊕ b_s)) / 2     (2)

Let the common elements of sets r and s be those in r ∩ s and the non-common elements be those in (r ∪ s) \ (r ∩ s). Since the bitmap generation functions can be represented by an associative and commutative operation ∘, the order of the operands does not affect the final bitmap. So, without loss of generality, the elements of sets r and s are rearranged such that the first elements of each set are the common elements. Since the first elements of sets r and s are the same, their partial bitmaps will also be the same (i.e. there are no differing bits so far).

Then, each remaining operand (a non-common element of r or s) is able to change at most one bit in the partial bitmaps. So, if a bit differs between the final bitmaps b_r and b_s, it was surely caused by a non-common element and, since each element can change at most one bit, the number of differing bits is less than or equal to the number of non-common elements, |r| + |s| − 2·O(r, s). The number of differing bits between b_r and b_s can be counted by popcount(b_r ⊕ b_s) (the Hamming distance). So, a lower bound for the number of non-common elements is given by Equation 3:

popcount(b_r ⊕ b_s) ≤ |r| + |s| − 2·O(r, s)     (3)

Rearranging Equation 3 for O(r, s) directly yields the upper bound defined by Equation 2.

Figure 4 shows an example of two bitmaps b_r and b_s generated by the Bitmap-Set method, where the Hamming distance is calculated using bitwise operations (popcount(b_r ⊕ b_s)). In Figure 4, the xor operation produces a word with 4 ones and, considering Equation 2, the overlap upper bound is (|r| + |s| − 4)/2. Using the Bitmap-Xor and Bitmap-Next methods, the bitmaps, and thus the resulting upper bound, would be slightly different.

Figure 4: Overlap Upper-bound
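A minimal sketch of the Equation 2 computation for single-word bitmaps is shown below; the multi-word case simply sums the per-word population counts.

#include <cstdint>

// Equation 2 for single-word bitmaps: the overlap cannot exceed
// (|r| + |s| - hamming(br, bs)) / 2. Integer division is safe because the
// true overlap is an integer, so flooring only tightens the bound.
int overlap_upper_bound(uint64_t br, int r_size, uint64_t bs, int s_size) {
    int hamming = __builtin_popcountll(br ^ bs);   // number of differing bits
    return (r_size + s_size - hamming) / 2;
}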

3.4 Expected Overlap Upper-bound

Due to collisions in the hash function h, two sets may produce similar bitmaps even if they do not share similar tokens. As the probability of collision increases, the filtering effectiveness is reduced, because the filter loses the capability of distinguishing between similar and dissimilar sets. Intuitively, the chance of a collision increases as the number of tokens in the original sets r and s increases or as the size b of the bitmaps is reduced. This produces a trade-off between the filtering efficiency (precision) and the reduction of the bitmap dimension.

Assuming that the hash function distributes the tokens uniformly over the b bits of the bitmaps, a probability analysis is able to determine, for each bitmap generation method, the expected number of collisions given two sets with random tokens. Given the expected number of collisions, it is also possible to infer the expected overlap upper bound using Equation 2. Equations 4, 5 and 6 present the expected overlap upper bound for a given bitmap size b and number of tokens n, considering Bitmap-Set, Bitmap-Xor and Bitmap-Next respectively. These equations are based on the estimation of collision probabilities in hash functions [15]. Furthermore, the probabilities were verified by a Monte-Carlo simulation, where 100,000 random pairs of sets r and s were generated with n tokens each and the overlap upper bound was obtained for each pair. Running the simulation for each n and bitmap size b, the average results produced by the simulation were almost identical to those derived from Equations 4, 5 and 6, with an average error below 0.012%.

(4)
(5)
(6)

Figure 5 plots the upper bounds defined by Equations 4, 5 and 6 for a fixed bitmap size b and the number of tokens n varying from 1 to 256. The left y-axis presents the normalized overlap, given by the overlap upper bound divided by n, and the right y-axis presents the equivalent Jaccard upper bound (Table 1). The higher the y-value, the more difficult it is to distinguish between similar and non-similar pairs (the lower the value, the better the method). A normalized value of 1 represents an upper bound equal to the number of tokens (a useless filter), and a value of 0 represents a zero upper bound. For example, considering Bitmap-Set and Bitmap-Xor, the expected normalized overlap upper bound at a given n corresponds to an equivalent Jaccard value on the right axis; a Jaccard threshold at that level is expected to filter about 50% of dissimilar pairs of sets composed of n random tokens.

Figure 5: Expected upper bounds

3.5 Cutoff Point for Bitmap Filter

Given a bitmap size b and the overlap threshold t, the cutoff point is defined as the maximum number of tokens at which the Bitmap Filter can still efficiently distinguish between similar and dissimilar sets. The cutoff point can be defined in terms of the expected upper bound (Equations 4, 5 and 6) as the largest number of tokens n for which the expected upper bound of two random sets is still below the required overlap. In Figure 5, for instance, the expected normalized upper bound crosses the level equivalent to a given Jaccard threshold at some number of tokens; the Bitmap Filter will be effective for sets with up to that many tokens and, beyond that point, the precision drops off significantly, since the filter will no longer be able to discard the majority of dissimilar pairs. That crossing point is taken as the cutoff.

Figure 6: Cutoff values (higher is better)

Figure 6 presents the cutoff point for Bitmap-Set, Bitmap-Xor and Bitmap-Next for different bitmap sizes b. A similar plot pattern is observed for any b greater than 64. Analyzing the cutoff difference between Bitmap-Xor and Bitmap-Set at high thresholds, it is noticeable that Bitmap-Xor presents higher cutoffs; in other words, Bitmap-Xor is still effective for sets with more tokens than Bitmap-Set, and this advantage grows as the Jaccard threshold increases.

In Figures 5 and 6, it can be seen that there is a preferable bitmap generation method for each threshold range: Bitmap-Next presents the highest cutoff for low thresholds; Bitmap-Set is slightly better in an intermediate range; and Bitmap-Xor is preferred for high thresholds. This pattern is observed for any value of b greater than 64. So, a combined bitmap generation can be created using the preferred bitmap generation method for each threshold interval (Algorithm 6). Figure 7 presents the cutoff point for the combined bitmap generation method, where the y-axis scale is relative to the bitmap size b.

1: function Bitmap-Combined(s)
2:     if t is in the low-threshold range then
3:         return Bitmap-Next(s)
4:     else if t is in the high-threshold range then
5:         return Bitmap-Xor(s)
6:     else
7:         return Bitmap-Set(s)
Algorithm 6 Bitmap-Combined
Figure 7: Best cutoffs

3.6 Bitmap Filter Pseudocode

The pseudocode for the Bitmap Filter is presented in Algorithm 7. This pseudocode is suitable for the Filter-Verification Framework (Algorithm 2) and is divided into two parts: the initialization procedure (lines 1-4) is run only once for the entire execution; the Bitmap_Filter function (lines 5-10) can be executed as filter₂ or filter₃ in Algorithm 2.

The initialization precomputes the bitmap for every set in the collections R and S. As stated in Algorithm 6, the bitmap generation method is selected according to the threshold t. The initialization also calculates the cutoff point (Section 3.5) for the selected bitmap size b and threshold t, using the combined method presented in Figure 7.

Then, for a given candidate pair r and s, the function Bitmap_Filter executes the proposed filter. If the set length is above the cutoff point (line 7), the Bitmap Filter is skipped. Otherwise, the filter calculates the overlap upper bound (line 8) as defined in Equation 2. If the upper bound is below the required overlap threshold (line 9), the filter prunes the pair.

1: procedure initialization(R, S, b, t)
2:     for each set s ∈ R ∪ S do
3:         b_s ← Bitmap-Combined(s)
4:     cutoff ← cutoff point for bitmap size b and threshold t
5: function bitmap_filter(r, s, t)
6:     pruned ← false
7:     if |r| ≤ cutoff and |s| ≤ cutoff then
8:         ub ← (|r| + |s| − popcount(b_r ⊕ b_s)) / 2
9:         if ub < t_O then pruned ← true
10:     return pruned
Algorithm 7 Proposed Algorithm - Bitmap Filter
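A possible C++ rendering of the bitmap_filter function for the Jaccard case is sketched below; the argument layout and the cutoff test on both set sizes are our own choices, not necessarily those of the original implementation.

#include <cstdint>
#include <vector>

// Sketch of Algorithm 7 for a Jaccard threshold t: prune a candidate pair when
// the bitmap-based overlap upper bound (Equation 2) cannot reach the equivalent
// overlap threshold of Table 1. `cutoff` is the value precomputed in the
// initialization; both bitmaps are assumed to have the same number of words.
bool bitmap_filter_prune(const std::vector<uint64_t>& br, int r_size,
                         const std::vector<uint64_t>& bs, int s_size,
                         double t, int cutoff) {
    if (r_size > cutoff || s_size > cutoff)
        return false;                                        // filter disabled beyond the cutoff
    int hamming = 0;
    for (size_t w = 0; w < br.size(); ++w)
        hamming += __builtin_popcountll(br[w] ^ bs[w]);      // differing bits per word
    double upper_bound = (r_size + s_size - hamming) / 2.0;
    double min_overlap = t / (1.0 + t) * (r_size + s_size);  // Jaccard -> overlap (Table 1)
    return upper_bound < min_overlap;                        // true = prune the pair
}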

4 Implementation Details

The Bitmap Filter will be evaluated using CPU and GPU implementations, which are detailed in this section.

4.1 CPU Implementation

In order to assess the Bitmap Filter in CPU, we added it to the state-of-the-art implementation of [13] using the available source code. Four algorithms were modified: AllPairs, PPJoin, GroupJoin, and AdaptJoin. These algorithms follow the Filter-Verification Framework (Algorithm 2) and, considering the peculiarities of each algorithm, the Bitmap Filter was introduced as filter₂ (line 10) or filter₃ (line 14). If the filter is applied in the candidate generation loop (filter₂), the bitmap test may be applied more than once for the same candidate pair. If the filter is applied in the verification loop (filter₃), the bitmap test is applied only once for each unique candidate pair.

For AllPairs, PPJoin, and GroupJoin we chose to insert the Bitmap Filter in the verification loop (filter₃). In the specific case of the GroupJoin algorithm, the filter is applied after the grouped candidate pairs are expanded, so that the filter operates on individual sets.

For AdaptJoin, we verified that the candidate list is relatively small at the verification loop, so the Bitmap Filter is applied during candidate generation (filter₂). The filter is executed only during the first iteration of the variable-length prefix filter (i.e. in the 1-prefix schema computation).

The bitmaps were implemented as multiples of 64-bit words and the population count operation was done using the __builtin_popcountll gcc intrinsic. With the proper compilation flags, gcc converts this intrinsic to the POPCNT hardware instruction for population count [18].

4.2 GPU Implementation

The Bitmap Filter relies on bitwise instructions (such as xor and population count) that are efficiently implemented in Graphics Processing Units (GPUs). Compute Unified Device Architecture (CUDA) [14] is Nvidia's general-purpose framework for the parallel execution of procedures (kernels) on Nvidia GPUs. In CUDA, each kernel is executed in parallel by several threads, which are grouped into independent blocks. Threads in the same block share data using a fast shared memory. The CUDA framework also handles data transfers between the host and the GPU device, in order to supply inputs to the kernel and send the output back to the host machine.

In order to show the potential of the Bitmap Filter in the CUDA architecture, we implemented the pseudocode of Algorithm 8 on the GPU, where R is the self-joined collection, t is the Jaccard threshold and C is the returned list of candidate pairs. This code is equivalent to the Naïve Algorithm (Algorithm 1) with the addition of the Length Filter (Section 2.3.2) and the Bitmap Filter (Section 3). The Bitmap-Xor generation method was used and the cutoff point was disabled. The kernel call (line 10) launches a grid of blocks with a fixed number of threads each, and each thread receives a unique thread id (line 2).

The kernel may be invoked many times until all sets are processed (line 3). Each thread compares its set r_i with all previous sets r_j, starting at the first index whose set is not filterable by the Length Filter (line 4). If the Bitmap Filter indicates a possible candidate pair, the pair is included in a thread-local candidate list (lines 6-7). Each thread-local candidate list can hold up to 2048 candidates; if the list becomes saturated, all the remaining sets are considered candidates and are verified by the CPU.

At the end of the kernel execution (line 8), the thread-local candidate lists are concatenated in a global candidate list, without empty spaces. This operation, called stream compaction or reduction, reduces the amount of data transferred to the host. Finally, in the host code, all candidate pairs are verified and the similar pairs are included in the final result list (lines 11-14).

1: procedure GPUKernel(R, t, C)
2:     i ← unique thread id
3:     if i ≥ |R| then return
4:     j₀ ← first index not pruned by the length filter for r_i   // length filter
5:     for j ← j₀ to i − 1 do
6:         if Bitmap_Filter(r_i, r_j, t) does not prune the pair then
7:             append ⟨i, j⟩ to the thread-local candidate list
8:     compact the thread-local candidate lists into C
9: procedure HostCode(R, t)
10:     GPUKernel(R, t, C)
11:     for ⟨i, j⟩ ∈ C do
12:         if Jaccard(r_i, r_j) ≥ t then
13:             P ← P ∪ {⟨r_i, r_j⟩}
14:     return P
Algorithm 8 GPU Algorithm
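For illustration, the per-thread logic of Algorithm 8 can be written as the serial C++ sketch below (the actual implementation runs one outer iteration per CUDA thread and keeps bounded thread-local candidate lists); identifiers are our own.

#include <cstdint>
#include <utility>
#include <vector>

// Serial sketch of the per-thread logic of Algorithm 8 for a self-join.
// sizes[i] is |r_i|, bitmaps[i] its multi-word bitmap, t the Jaccard threshold.
std::vector<std::pair<int, int>> gpu_style_candidates(
        const std::vector<std::vector<uint64_t>>& bitmaps,
        const std::vector<int>& sizes, double t) {
    std::vector<std::pair<int, int>> candidates;
    for (size_t i = 0; i < bitmaps.size(); ++i) {              // one "thread" per set r_i
        // Length filter: sets are sorted by size, so skip every r_j with
        // |r_j| < t * |r_i| by starting at the first non-filterable index.
        size_t j0 = 0;
        while (j0 < i && sizes[j0] < t * sizes[i]) ++j0;
        for (size_t j = j0; j < i; ++j) {
            // Bitmap Filter (Equation 2 plus the Jaccard-to-overlap conversion).
            int hamming = 0;
            for (size_t w = 0; w < bitmaps[i].size(); ++w)
                hamming += __builtin_popcountll(bitmaps[i][w] ^ bitmaps[j][w]);
            double ub = (sizes[i] + sizes[j] - hamming) / 2.0;
            double min_overlap = t / (1.0 + t) * (sizes[i] + sizes[j]);
            if (ub >= min_overlap) candidates.emplace_back(i, j);  // survives the filter
        }
    }
    return candidates;   // candidate pairs are verified exactly on the host (CPU)
}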

The GPU algorithm may perform a quadratic number of Bitmap Filter computations, so its implementation is not intended for collections with more than a million sets. Nevertheless, it was implemented in order to show that even a simple algorithm can take advantage of the Bitmap Filter and become very competitive when compared to the state-of-the-art algorithms. We therefore claim that more sophisticated GPU implementations may benefit even more from the Bitmap Filter, opening the opportunity for new improvements to the Set Similarity Join problem.

5 Experimental Results

In order to create a baseline, we replicated the experiments conducted by [13]. Table 4 presents the characteristics of the chosen collections. Comparing this table with [13], slight variations can be seen in the dblp collection, probably due to a different charset. Also, the flickr, netflix and spot collections were no longer available for download at the links provided by [13]. We generated new uniform and zipf collections with the same methodology as [13], using a Poisson distribution for the set sizes. All collections were preprocessed: the sets were sorted by size and, in case of a tie, lexicographically by token ids. We observed that the lexicographical ordering speeds up all the algorithms, as stated by [13].

Collection name | # of sets | set size (max / avg / med) | # uniq. tokens
aol 10154743 245 3.01 3 3873246
bms-pos 320285 164 9.30 7 1657
dblp 100000 717 106.28 103 3801
enron 245615 3162 135.19 86 1113220
kosarak 606770 2498 11.93 5 41275
livej 3061271 300 36.44 17 7489073
orkut 2732271 40425 119.67 29 8730857
uniform 100000 25 9.99 10 220
zipf 100000 86 49.99 50 101584
Table 4: Collections used in experiments

Figure 8 shows the set size distribution of the collections. Orkut, enron and kosarak present very long right tails, whereas the dblp, uniform and zipf collections present a more symmetrical frequency distribution. Figure 9 presents the number of occurrences of the tokens in some collections, where a Zipf-like distribution is often observed. This kind of distribution increases the efficiency of the Prefix Filter [13].

Figure 8: Set size frequency
Figure 9: Token occurrence histogram

The experiments were conducted on a dedicated machine with an Intel i7-3770 CPU (4 cores) at 3.40GHz with 8 GB RAM and an Nvidia GeForce GTX 980 Ti (2048 CUDA cores) with 6 GB RAM, running CentOS Linux 7.2. The gcc compiler was used with the flags -O3 and -march=native. As in [13], all the presented runtimes are the average of 5 runs.

5.1 CPU Experiments

The four best state-of-the-art algorithms reported by [13] were selected for the experiments: AllPairs (all), AdaptJoin (ada), GroupJoin (gro) and PPJoin (ppj). For the baseline experiments, we used the original source code provided by [13]. For the Bitmap Filter experiments, we used the modified source code proposed in Section 4, with the Bitmap-Combined generation method (Algorithm 6). A default bitmap size was used but, for the collections with large median set sizes (dblp, enron and zipf), the bitmap size was increased. The selected hash function maps each token id to a bit position (Section 3.2). In order to distinguish the modified algorithms that apply the Bitmap Filter, they are referenced as: AllPairs-BF (all-bf), AdaptJoin-BF (ada-bf), GroupJoin-BF (gro-bf) and PPJoin-BF (ppj-bf).

The experiments were conducted over 9 collections (self-joined) and 8 thresholds varying from 0.50 to 0.95, resulting in 72 different input combinations. Each input was assessed by the two groups of algorithms: the original algorithms (baseline) and the modified algorithms (with Bitmap Filter). In total, there are 288 experiments for each group of algorithms.

5.1.1 CPU Execution times

Table 5 shows the running times of the original algorithms in columns “orig.” (baseline) and the running times of the modified algorithms in columns “+BF” (with Bitmap Filter). The times do not include the preprocessing overhead (loading files and sorting the input).

Threshold
0.5 0.6 0.7 0.75 0.8 0.85 0.9 0.95
orig. +BF orig. +BF orig. +BF orig. +BF orig. +BF orig. +BF orig. +BF orig. +BF
   aol ADA 437.3 *113.1 119.3 *31.6 24.15 7.74 20.27 6.50 12.22 4.80 6.40 3.80 5.84 3.63 5.81 3.60
ALL 296.1 183.4 75.9 46.1 9.81 *6.49 6.30 *4.60 3.16 *2.40 1.54 1.28 1.34 1.16 1.33 1.15
GRO 219.2 207.3 *54.7 51.1 *8.89 7.90 *6.16 5.40 *2.82 2.41 *1.10 *0.93 *0.86 *0.74 *0.84 *0.72
PPJ *212.2 166.7 57.1 44.6 10.72 8.20 8.07 6.19 4.27 3.28 2.19 1.71 1.89 1.54 1.87 1.53
   bms-pos ADA *24.49 *8.79 *8.47 *2.96 *2.61 *0.98 1.63 *0.61 0.88 *0.34 0.42 *0.18 0.26 0.12 0.20 0.10
ALL 36.63 15.72 10.63 5.75 3.04 1.75 1.69 1.03 0.80 0.50 0.29 0.20 0.13 0.09 0.08 0.06
GRO 36.11 33.04 10.39 9.32 2.88 2.41 *1.62 1.32 *0.77 0.61 *0.28 0.22 *0.10 *0.08 *0.04 *0.03
PPJ 31.99 19.86 9.97 6.70 3.06 2.15 1.81 1.29 0.93 0.66 0.38 0.28 0.18 0.13 0.12 0.09
   dblp ADA *161.6 *172.7 *75.3 *64.6 *31.01 *14.20 *17.47 *7.06 *8.21 *3.37 *3.23 *1.37 *0.96 *0.42 0.23 0.10
ALL 177.1 190.7 83.4 74.4 37.45 27.00 23.17 15.34 11.64 7.35 4.60 2.73 1.24 0.71 *0.16 *0.10
GRO 216.0 228.4 107.7 103.0 39.72 36.72 22.10 20.24 10.42 9.48 4.01 3.74 1.12 1.05 0.17 0.15
PPJ 199.3 206.7 102.9 93.2 38.86 31.50 21.72 16.94 10.21 7.91 4.00 3.00 1.11 0.82 0.17 0.12
   enron ADA *30.35 *31.47 11.05 10.41 4.46 3.66 2.93 2.21 1.91 1.36 1.18 0.86 0.63 0.49 0.32 0.30
ALL 43.51 45.27 13.02 12.19 3.69 3.05 1.95 1.56 1.06 *0.88 0.59 *0.51 *0.26 *0.24 *0.12 *0.12
GRO 32.96 33.37 *9.90 9.70 *2.93 2.79 *1.69 1.58 *0.98 0.93 *0.58 0.55 0.27 0.25 0.13 0.13
PPJ 33.25 32.72 10.35 *9.41 3.07 *2.72 1.77 *1.54 1.03 0.90 0.59 0.53 0.27 0.25 0.12 0.12
   kosarak ADA 93.83 *27.53 13.59 *4.12 1.53 0.78 1.03 0.56 0.71 0.41 0.47 0.31 0.36 0.25 0.28 0.22
ALL 58.50 35.59 8.77 5.43 1.04 *0.67 *0.59 *0.41 *0.32 *0.23 *0.15 *0.12 0.09 *0.08 0.06 0.06
GRO *37.54 35.81 *5.82 5.50 *1.00 0.90 0.61 0.54 0.33 0.29 0.16 0.14 *0.09 0.08 *0.06 *0.05
PPJ 39.90 34.95 6.49 5.31 1.08 0.81 0.68 0.51 0.39 0.30 0.19 0.15 0.12 0.10 0.09 0.07
   livej ADA *219.4 *192.3 64.59 *56.16 21.39 17.26 14.10 10.77 9.46 6.81 6.14 4.35 3.90 3.18 2.86 2.53
ALL 365.8 312.6 94.86 79.57 21.01 16.75 10.08 7.90 4.83 3.88 2.30 *1.92 *1.18 *1.08 *0.64 *0.62
GRO 260.8 263.3 66.41 66.35 16.07 15.91 8.40 8.21 *4.33 4.21 *2.25 2.16 1.24 1.21 0.72 0.70
PPJ 245.8 225.7 *64.17 57.97 *15.86 *13.91 *8.38 *7.30 4.38 *3.85 2.30 2.04 1.23 1.15 0.70 0.66
   orkut ADA 162.1 178.9 79.96 81.73 41.89 40.92 29.80 28.72 19.71 18.88 11.91 11.46 7.16 6.87 4.27 4.14
ALL 220.2 238.6 70.55 74.57 25.10 25.65 15.42 15.71 9.10 9.38 5.12 5.17 *2.81 *2.84 *1.35 *1.36
GRO 155.2 160.7 56.43 57.34 *22.09 22.53 *14.00 14.02 8.71 8.62 5.18 5.17 3.03 3.11 1.52 1.51
PPJ *150.1 *155.4 *54.34 *55.25 22.14 *21.96 14.15 *13.91 *8.68 *8.52 *5.10 *5.11 2.93 2.89 1.42 1.40
   uniform ADA *56.58 *15.85 29.19 *6.80 11.03 *2.45 7.21 *1.62 3.93 0.89 1.08 0.28 0.54 0.16 0.17 0.06
ALL 58.78 27.33 25.36 14.05 8.44 5.30 5.38 3.55 2.81 1.92 0.86 0.58 0.44 0.31 0.15 0.12
GRO 57.30 38.75 *19.49 12.39 *5.48 3.43 *2.95 1.77 *1.29 *0.72 *0.44 *0.23 *0.22 *0.12 *0.06 *0.03
PPJ 61.14 33.92 25.35 15.93 8.74 6.14 5.59 4.15 3.00 2.26 1.16 0.86 0.65 0.49 0.21 0.16
   zipf ADA *1.09 *0.74 *0.61 0.45 0.40 0.32 0.34 0.28 0.29 0.23 0.23 0.19 0.18 0.15 0.13 0.12
ALL 1.97 0.85 0.79 *0.38 0.36 *0.20 0.25 *0.15 0.17 *0.11 *0.09 *0.07 *0.05 *0.04 *0.02 *0.02
GRO 2.15 2.19 0.86 0.85 0.39 0.39 0.27 0.27 0.18 0.17 0.11 0.11 0.06 0.06 0.03 0.03
PPJ 1.68 1.06 0.70 0.48 *0.33 0.26 *0.23 0.19 *0.16 0.13 0.10 0.08 0.05 0.05 0.03 0.03
Table 5: CPU runtimes (in seconds) comparing the baseline execution (orig.) with the Bitmap Filter (BF) execution. Speedups are highlighted in bold and slowdowns in gray. Inside each group with four algorithm executions, the best runtime is marked with asterisk (*).

For each of the 72 different inputs in Table 5, there is an asterisk (*) in the algorithm that achieved the best runtime, regarding each group of 4 algorithms (separated for the original and modified algorithm groups). Considering only the original algorithms, it can be seen that gro achieved the best runtimes for 33 out of 72 inputs (46%), followed by ada with 15 (21%), all with 13 (18%) and ppj with 11 (15%). Considering only the modified algorithms, ada-bf and all-bf were the best in 25 out of 72 inputs (35% each), followed by ppj-bf with 12 (17%) and gro-bf with 10 (14%). Considering both groups together, ada-bf and all-bf obtained the best runtime in 23 out of 72 inputs (32% each), followed by gro-bf with 10 (14%), ppj-bf with 9 (13%), ppj with 3 (4%), all and ada with 2 (3% each). In total, 90% of the inputs presented best runtime when using algorithms with Bitmap Filter.

The runtime sum of all 72 experiments for each original algorithm was ppj:1535s, gro:1561s, all:1878s, ada:1944s. Considering the modified algorithms with Bitmap Filter, the runtime sums and the related reduction when compared to the original algorithms were: ada-bf:1233s (37%), ppj-bf:1359s (11%), gro-bf:1515s (3%) and all-bf:1549s (18%).

Table 6 presents the runtime improvement due to the Bitmap Filter, computed as (t_orig / t_BF − 1) × 100%, where t_BF is the runtime with the Bitmap Filter and t_orig the original baseline runtime. The Bitmap Filter was able to improve the running times by 43% on average, and in some situations it improved them by up to 350%. Some experiments presented a small slowdown (negative values), but the maximum slowdown was not greater than 9%.

Threshold: 0.5 | 0.6 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95
   aol ADA 287% 278% 212% 212% 154% 69% 61% 62%
ALL 61% 64% 51% 37% 32% 20% 15% 16%
GRO 6% 7% 13% 14% 17% 18% 18% 18%
PPJ 27% 28% 31% 30% 30% 28% 22% 23%
   bms-pos ADA 179% 186% 166% 166% 155% 130% 108% 94%
ALL 133% 85% 73% 65% 59% 50% 36% 27%
GRO 9% 11% 19% 22% 25% 26% 30% 39%
PPJ 61% 49% 43% 40% 40% 37% 35% 30%
   dblp ADA -6% 16% 118% 148% 144% 137% 125% 121%
ALL -7% 12% 39% 51% 58% 69% 75% 63%
GRO -5% 5% 8% 9% 10% 7% 8% 11%
PPJ -4% 10% 23% 28% 29% 34% 36% 39%
   enron ADA -4% 6% 22% 33% 41% 38% 29% 7%
ALL -4% 7% 21% 25% 21% 16% 8% 1%
GRO -1% 2% 5% 6% 5% 7% 7% 2%
PPJ 2% 10% 13% 15% 14% 12% 6% 6%
   kosarak ADA 241% 230% 98% 84% 73% 54% 42% 30%
ALL 64% 62% 54% 44% 38% 26% 19% 11%
GRO 5% 6% 11% 12% 11% 7% 12% 12%
PPJ 14% 22% 34% 32% 29% 24% 19% 19%
   livej ADA 14% 15% 24% 31% 39% 41% 23% 13%
ALL 17% 19% 26% 28% 24% 20% 9% 3%
GRO -1% 0% 1% 2% 3% 4% 2% 3%
PPJ 9% 11% 14% 15% 14% 13% 8% 7%
   orkut ADA -9% -2% 2% 4% 4% 4% 4% 3%
ALL -8% -5% -2% -2% -3% -1% -1% -1%
GRO -3% -2% -2% 0% 1% 0% -2% 1%
PPJ -4% -2% 1% 2% 2% 0% 2% 2%
   uniform ADA 257% 329% 350% 344% 342% 282% 231% 170%
ALL 115% 81% 59% 52% 47% 47% 41% 28%
GRO 48% 57% 60% 67% 78% 92% 81% 80%
PPJ 80% 59% 42% 35% 33% 35% 34% 30%
   zipf ADA 48% 35% 24% 25% 24% 24% 20% 11%
ALL 130% 108% 78% 69% 54% 30% 19% 0%
GRO -2% 1% 2% 0% 6% 2% 4% 0%
PPJ 59% 45% 30% 21% 19% 16% 8% 0%
Table 6: Runtime improvement using Bitmap Filter

Table 7 shows the runtime improvement of the algorithms with respect to each threshold, considering the runtime average of all collections. All the algorithms presented improvements: ada-bf was the one with highest gain, varying from 57% up to 121%; all-bf presented improvements ranging from 17% to 56%; ppj-bf showed improvements from 17% to 27%; gro-bf was the one with lowest gains, varying from 6% up to 18%.

Threshold: 0.5 | 0.6 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95
   Average ADA 112% 121% 113% 116% 109% 86% 72% 57%
ALL 56% 48% 44% 41% 37% 31% 25% 17%
GRO 6% 10% 13% 15% 17% 18% 18% 18%
PPJ 27% 26% 26% 24% 23% 22% 19% 17%
Table 7: Average Bitmap Filter improvement per algorithm

Table 8 presents the average improvement for each collection and threshold, considering the average runtimes over the algorithms. A very good improvement is noted for the collections uniform, bms-pos, aol, dblp and kosarak, followed by intermediate gains in livej and zipf. The smallest improvements were observed in the collections enron and orkut. Small slowdowns of up to 6% were detected in the collections dblp, enron and orkut at low thresholds (0.5 and 0.6). The collection gain is related to the set size distribution (Figure 8) and the number of distinct tokens (Table 4), such that collections with many small sets and few unique tokens (e.g. uniform and bms-pos) present higher improvements than collections with many large sets and many unique tokens (e.g. enron and orkut).

Threshold: 0.5 | 0.6 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95
  aol 95% 94% 77% 73% 58% 34% 29% 29%
  bms-pos 96% 83% 75% 73% 70% 61% 53% 48%
  dblp -6% 11% 47% 59% 60% 61% 61% 58%
  enron -2% 6% 15% 20% 20% 18% 13% 4%
  kosarak 81% 80% 49% 43% 38% 28% 23% 18%
  livej 10% 11% 16% 19% 20% 19% 10% 6%
  orkut -6% -3% 0% 1% 1% 1% 1% 1%
  uniform 125% 132% 128% 124% 125% 114% 97% 77%
  zipf 59% 47% 33% 29% 26% 18% 13% 3%
Table 8: Average Bitmap Filter improvement per collection

5.1.2 Filtering Ratio

In order to assess the efficiency of the Bitmap Filter, we collected the number of candidate pairs that were filtered out by it. Table 9 presents the filtering ratio, defined as the number of filtered pairs divided by the total number of candidates. The filtering ratio is strongly related to the runtime improvement (Table 6).

Threshold: 0.5 | 0.6 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95
  aol 97% 98% 99% 99% 99% 99% 99% 99%
  bms-pos 98% 99% 99% 99% 99% 99% 99% 99%
  dblp 11% 54% 97% 99% 99% 99% 99% 99%
  enron 14% 29% 56% 71% 83% 86% 89% 85%
  kosarak 86% 90% 93% 95% 96% 98% 99% 99%
  livej 45% 52% 64% 74% 86% 99% 99% 99%
  orkut 4% 5% 8% 10% 13% 18% 27% 54%
  uniform 99% 99% 99% 99% 99% 99% 99% 99%
  zipf 99% 99% 100% 100% 100% 100% 100% 100%
Table 9: Bitmap Filter ratio per collection (AllPairs Algorithm)

Figure 10 presents the filtering ratio for the three bitmap creation methods, using a fixed bitmap size and no cutoff point (Section 3.5). In the figure, Bitmap-Xor was consistently the best option for the three analyzed collections, followed by Bitmap-Set and Bitmap-Next. The only situation where Bitmap-Set was the best option was the enron collection with Jaccard threshold 0.5; as stated in Section 3.5, Bitmap-Set is slightly better around this threshold range. In the dblp collection, Bitmap-Xor presented filtering ratios up to 6.37 times better than Bitmap-Set, due to the large set sizes present in dblp. This higher filtering ratio translated into a correspondingly large runtime reduction.

Figure 10: Bitmap Filter ratio for different bitmap generation methods and Jaccard thresholds

5.1.3 Filtering Precision

The Bitmap Filter can be considered as a binary classification rule that classifies a candidate pair as unfiltered (positive class) or filtered (negative class). Since the Bitmap Filter is an exact method, there is no wrongly filtered pair (false negative), so the Bitmap Filter has a recall of 100% (true positives divided by true positives plus false negatives). Nevertheless, there may be some dissimilar pairs that are not filtered (false positives), leading to an extra verification cost. We define the filtering precision as the ratio between similar pairs (true positives) and unfiltered pairs (false positives plus true positives). The number of false positives tends to increase when the bitmaps are too small or when the number of unique tokens is very large.

The cutoff point defined in Section 3.5 disables the Bitmap Filter whenever the filtering precision falls drastically. In order to show this precision drop-off, the filtering precision was measured for the self-joins of the dblp and enron collections without using the cutoff point, for a fixed bitmap size. Figure 11 shows the average filtering precision for the different set sizes found in each collection, considering Jaccard thresholds varying from 0.50 to 0.80. Although each collection has its own characteristics, the precision drop-off appears at almost the same position in the plots. This position corresponds to the cutoff point, beyond which the filter becomes inefficient. So, higher runtime improvements tend to occur in collections in which the majority of sets contain fewer tokens than the observed precision drop-off point.

Figure 11: Bitmap Filter precision for different numbers of tokens and Jaccard thresholds

5.2 GPU Experiments

The GPU algorithm was executed over the collections with up to 606,770 sets: bms-pos, dblp, enron, kosarak, uniform and zipf. The algorithm was tested with several bitmap sizes (reported as "bitset" in Table 10) and thresholds from 0.5 to 0.75. The number of CUDA blocks was fixed at 512, each one with 256 threads. Table 10 presents the running times of the GPU algorithm considering, for each input, the bitmap size that presented the best execution time. The running times do not include the file loading time and device initialization, but include the GPU memory allocation, the data transfers from/to the GPU, the GPU kernel execution and the CPU procedures related to the join algorithm. Table 10 also gives the speedup of the GPU implementation compared to the best runtime obtained by the baseline CPU algorithms (Table 5).

Threshold: 0.5 | 0.6 | 0.7 | 0.75
bms-pos GPU time 0.77 0.41 0.31 0.28
CPU time (ada)24.49 (ada)8.47 (ada)2.61 (gro)1.62
speedup 31.7 20.6 8.3 5.7
bitset 192bits 128bits 128bits 128bits
dblp GPU time 0.28 0.18 0.12 0.09
CPU time (ada)161.59 (ada)75.25 (ada)31.01 (ada)17.47
speedup 577.9 414.8 250.5 190.3
bitset 512bits 320bits 192bits 192bits
enron GPU time 3.27 1.92 1.26 1.00
CPU time (ada)30.35 (gro)9.90 (gro)2.93 (gro)1.69
speedup 9.3 5.2 2.3 1.7
bitset 2560bits 2560bits 2048bits 1280bits
kosarak GPU time 7.65 2.64 1.21 0.95
CPU time (gro)37.54 (gro)5.82 (gro)1.00 (all)0.59
speedup 4.9 2.2 0.8 0.6
bitset 512bits 256bits 192bits 128bits
uniform GPU time 1.96 0.19 0.07 0.06
CPU time (ada)56.58 (gro)19.49 (gro)5.48 (gro)2.95
speedup 28.9 100.5 79.9 47.8
bitset 192bits 128bits 128bits 128bits
zipf GPU time 0.14 0.11 0.09 0.09
CPU time (ada)1.09 (ada)0.61