1 Introduction
The problem of near-duplicates, or similar record pairs, arises when a database contains multiple representations or records of the same entity or concept. Detecting near-duplicates has been studied in the past under different names such as data deduplication, merge/purge, record linkage, etc. [1].
Analyzing the similarity self-join size provides important insight when the semantics of rows and columns are less known, a commonly expected case in open data [2]. Consider, for example, an open dataset of bibliographic records with untagged attributes title, author, journal, volume and year. The similarity self-join size with a match expected in 4 out of 5 columns (i.e. 80% similarity) gives the degree of uniformity under any projection of 4 attributes. The title field is not expected to have many duplicates, whereas the author field may have a limited number of duplicates, since two authors can have the same name or an author can have more than one record. More duplicates are expected in the journal, volume and year columns. For the same reason, the similarity self-join size under projections of 4 attributes is not expected to differ much from that of 5, but the similarity self-join size for projections of size 3 is expected to be much larger; the same is observed for real DBLP records in our experiments (see Table III). Such information about columns and their relationships can be obtained by analyzing the sizes of similarity self-joins. In other words, the similarity self-join size describes not only the frequency distribution of the rows but also the soft dependencies and functional relationships between the columns [3].
Many other applications can be listed for similarity join size estimation. The self-join size gives the degree of uniformity or skew, and the similarity self-join size gives the degree of skew under some projections. The degree of skew is an important statistic in parallel and distributed database applications and may determine the choice of data partitioning strategies (e.g. vertical or horizontal) and algorithms being employed [4, 5]. For example, the Hadoop-based algorithm that won the Terasort benchmark in 2008 included a partitioning function that heavily relied on the key distributions present in the 2008 benchmark, which may not be present in other datasets [6]. A more general solution is expected to detect the skew, which often arises in the 'reduce' phase, and to balance the load accordingly. In projected clustering [7], one needs to find the set of dimensions such that the spread (or dissimilarity) is the least, and a similarity join size can be an important statistic in detecting those dimensions. In general, estimating the skew can also help with data storage and indexing [8], data cleaning [9] and perhaps homogeneity analysis [10]. For example, before running a data deduplication that can take days or even weeks, one may want to quickly find out if there are enough near-duplicates to justify an expensive cleaning operation.
The setting we assume in this paper is that the synopsis of a table must be constructed in one pass. This is desirable for rapidly growing tables where a multi-pass method can be costly. Also, the input data to a similarity join sometimes consists of data streams, which may not be fully available when a synopsis is constructed. For example, a similarity join placed in a join tree may take its input from other operators, and while the input to the join is streaming in, estimates of the join size may be sought. It has been noted that cardinality estimation errors in a query cost model can easily be off by multiple orders of magnitude, and join queries are usually the largest contributors to those misestimates [11].
Morales and Gionis [12] cite trend detection and near-duplicate filtering in a microblogging site as applications of similarity self-join in a streaming setting.
The problem addressed in this paper can be stated as follows: given a collection of records and a threshold, estimate the number of record pairs that have a Hamming distance equal to or less than the threshold. The Hamming distance function is well-defined on bit strings and binary vectors and naturally extends to more general strings, vectors, multi-field records, etc. For example, the Hamming distance between two records gives the minimum number of substitutions that would transform one record into the other. This quantity can be divided by the number of fields to get the fraction of fields or features in which two records differ; one minus that fraction gives what may be referred to as the Hamming similarity. Other distance functions may also be mapped to the Hamming distance, and our method can be used under those mappings. Many applications of the Hamming distance are reported in the literature, for example detecting duplicate Web pages [13], duplicate records in academic digital libraries [14], duplicate images [15], etc.
A naive algorithm to estimate the number of near-duplicates is to compare every pair of records, which requires storing the entire dataset and has a quadratic time complexity. However, an exact count is often not needed, and an algorithm that obtains an estimate more efficiently may be preferred [16, 17, 18]. Also, the memory available for computing an estimate is usually limited, and storing the data structures in external memory has additional overhead which grows with the dataset size and should be avoided. To the best of our knowledge, our method is the first that computes an estimate of the similarity self-join size within only one scan over the data and with limited storage. Our experimental results also show that the error of our estimates is often less than or comparable to that of some of the latest multi-pass algorithms.
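The distance-to-similarity conversion above can be sketched in a few lines of Python (an illustration on multi-field records; the function names are ours):

```python
def hamming_distance(r1, r2):
    """Number of fields in which two equal-length records differ."""
    assert len(r1) == len(r2)
    return sum(1 for a, b in zip(r1, r2) if a != b)

def hamming_similarity(r1, r2):
    """One minus the fraction of differing fields."""
    return 1.0 - hamming_distance(r1, r2) / len(r1)
```

For two 5-field bibliographic records that differ only in one field, the distance is 1 and the Hamming similarity is 0.8.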
Our contributions can be summarized as follows:

We study and address the problem of similarity self-join size estimation in a streaming setting where the input, given one record at a time, cannot be fully stored. While the problem has been studied for input that consists of one-dimensional records such as numbers [19, 20, 21], we are not aware of any previous work that addresses the problem on input that has more than one dimension. This paper studies the general problem and presents an efficient and elegant probabilistic algorithm, extending previous techniques to streaming input with more than one dimension. Extending our algorithm to similarity join size estimation is straightforward (as discussed in Section 6).

We analyze the time and space costs of our algorithm, showing that the join size can be accurately estimated in logarithmic space as long as the input dimensionality is kept low (10 or less, as shown in our analysis and experiments). More precisely, we prove that our algorithm gives an unbiased accurate estimate with high probability using a set of counters, that the number of those counters is independent of $N$, the number of records in the dataset, and only depends on $d$, the dimensionality, and that the space needed for each counter is bounded by a logarithmic number of bits.
We evaluate the performance of our method in terms of the accuracy of the estimates and the running time. Compared to random sampling, which is the only competitor in a streaming setting, our algorithm gives significantly more accurate results and scales better for large input sizes. We also compare the performance of an “offline” version of our algorithm, in which the intermediate results are materialized (and not sketched), to two recently proposed non-streaming algorithms [22, 18]. Our experiments on real data show that the proposed algorithm (under a comparable space requirement) is much more accurate than these alternative algorithms; it has to be pointed out that both these methods, unlike ours, require more than one pass over the data.
1.1 Definitions and the problem statement
Given a bag of records where $n$ denotes the number of distinct records and $f_i$ denotes the multiplicity of record $i$, the self-join size of the bag (also referred to as the second frequency moment) is defined as

$$SJ = \sum_{i=1}^{n} f_i^2.$$ (1)
Given a pair of records with the same schema, we call the pair $s$-similar if the number of attributes where the pair have the same values is $s$. For example, Table I contains two pairs of records that are 2-similar. Compared to the Hamming distance, which is defined on binary vectors (i.e. vectors with 0/1 values), $s$-similarity is defined on general records (e.g. of employees or students).
Extending Eq. 1, consider the self-join of a collection of $N$ $d$-dimensional records, and let $J_t$ denote the number of different record pairs that are $t$-similar. As in a self-join, similar pairs $(u,v)$ and $(v,u)$ contribute twice to $J_t$, but self-pairs $(u,u)$ are not counted. Hence, the number of self-pairs is added to the similarity self-join size, $SJ_s$, defined as

$$SJ_s = N + \sum_{t=s}^{d} J_t$$ (2)

for $N$ records, each of dimensionality $d$. $SJ_s$ gives the number of record pairs that are at least $s$-similar. In a streaming setting, the similarity join may be computed at any point while the stream is being received (aka continuous queries).
Problem statement. Given $N$ records, each with $d$ attributes, and a similarity threshold $s$, we want to estimate $SJ_s$, the similarity self-join size, in one scan of the input with a limited memory sublinear in the input size. We also want to extend our estimation to similarity join sizes.
Organization. The outline of the rest of the paper is as follows: Three baseline algorithms are briefly presented in Section 2, and our similarity selfjoin size estimation algorithm is discussed in Section 3. We present an analysis of our algorithm in terms of the estimation error and time and space bound in Section 4. Running time is analyzed in Section 5, and an extension of our algorithm for similarity join size estimation is studied in Section 6. Experimental results are reported in Section 7, and related work is reviewed in Section 8. Section 9 concludes the paper.
2 Baseline Algorithms
One widely used baseline is random sampling; it is applied in contexts similar to ours [20, 18] and is very easy to implement. We also present signature pattern counting [22] and LSH-based bucketing [18] as two more baselines.
2.1 Random sampling
For a set of $N$ records, one can pick $R$ different records uniformly at random (sampling without replacement), and then use straightforward pairwise comparison between the selected records to find $J_t$, for the sample. Next, the estimates can be scaled by the ratio of the size of the population (our population here is the set of record pairs being joined) to the size of the sample space, i.e. $\binom{N}{2}/\binom{R}{2}$. Finally $SJ_s$ is estimated as in Eq. 2. For example, suppose random sampling selects two rows from Table I that are 2-similar. The estimates of $J_1$, $J_2$ and $J_3$ for the sample are respectively 0, 2 and 0, and scaling these by the population-to-sample ratio gives the estimates for the population.
Alon et al. [20] used a similar random-sampling technique in their experiments for estimating the self-join sizes of data streams. However, their results show that it is not as accurate as other methods. Random sampling is also used in the literature for similarity join size estimation [18].
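The baseline above can be sketched as follows (a minimal illustration; function and variable names are ours, not taken from the cited papers):

```python
import random
from itertools import combinations

def sampled_pair_count(records, R, s, seed=0):
    """Estimate the number of ordered s-similar pairs (self-pairs excluded)
    from a uniform sample of R records, scaled by the ratio of the number
    of record pairs in the population to that in the sample."""
    N = len(records)
    sample = random.Random(seed).sample(records, R)
    # each unordered s-similar pair contributes two ordered pairs
    j_sample = sum(2 for u, v in combinations(sample, 2)
                   if sum(a == b for a, b in zip(u, v)) == s)
    scale = (N * (N - 1)) / (R * (R - 1))
    return j_sample * scale
```

With $R = N$ the estimate is exact; smaller samples trade accuracy for speed, which is exactly the trade-off Lemma 1 quantifies.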
Lemma 1.
Random sampling requires a sample whose size grows with the input to give an estimate of the similarity self-join size with a bounded relative error with high probability.
Proof.
See Appendix A. ∎
2.2 Signature pattern counting
Lee et al. [22] map similarity self-join size estimation into the problem of finding a set of frequent signature patterns and estimating the number of records that match each pattern. Each signature pattern is a record with some constants and some wildcard symbols, say *. A data record matches a signature pattern if both have the same constants in the respective columns; there is no matching constraint on columns marked with *. Clearly two records that match a signature pattern with $s$ constants must be $s'$-similar for some $s' \ge s$. Having a set of signature patterns, each with at least $s$ constants, and the number of records matching each pattern, one can estimate the similarity self-join size for the set of tuples matching each pattern and add up the estimates. For example, this algorithm, applied to Table I, can produce the patterns [a1,b1,*] and [*,b2,c2], each matching two records.
However, this approach has some problems: (1) the estimate does not take into account the overlap between patterns (e.g. [*,3,5,*] and [2,*,5,*]), which can lead to double-counting; (2) the search space for patterns with frequency at least 2 is huge, and the number of such patterns may not be much smaller than the size of the dataset. The authors address the first problem by placing the patterns in a lattice structure and estimating the size of the overlap between patterns for each level of the lattice. The second problem is addressed by modelling the pattern distribution as a power law and estimating (but not actually computing) the number of matches for low-frequency patterns based on the counts obtained for high-frequency ones.
Their algorithm must (1) compute a set of frequent signature patterns and (2) collect the counts of records matching each pattern. A typical frequent-pattern counting is expected to make d passes over the data, where d is the record dimensionality, but one may consider either the partitioning scheme of Savasere et al. [23] or the sampling method of Toivonen [24] to cut the number of passes to two. Manku and Motwani [25] even suggest a one-pass algorithm based on sticky sampling. However, these algorithms are generally useful in detecting very frequent patterns, and this is reflected in their error bounds, which are relative to the input length; they are likely to miss many less-frequent patterns. Also, once the signature patterns are found, one more pass over the data is needed to collect the actual counts.
2.3 LSHbased bucketing
In this approach, Lee et al. [18] map data records into buckets using a locality-sensitive hashing (LSH) scheme. Two strata are defined for the sake of join sampling: (1) record pairs that are mapped to the same bucket, and (2) record pairs that are mapped to two different buckets. Record pairs are sampled from each stratum and their similarities are assessed. Finally, the results are scaled (based on record counts which are kept in each bucket) to derive an estimate of the similarity join size. To construct LSH buckets on disk, the algorithm has to read and write the whole dataset. One more pass is also needed to sample the record pairs.
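The bucketing step can be illustrated with attribute sampling, a standard LSH family for Hamming distance (this is a stand-in sketch, not necessarily the exact scheme of [18]):

```python
import random
from collections import defaultdict

def lsh_buckets(records, k, seed=0):
    """Hash each record by its values on k randomly chosen attributes.
    Records that agree on many attributes are likely to land in the
    same bucket, so within-bucket pairs form a similarity-rich stratum."""
    rng = random.Random(seed)
    dims = rng.sample(range(len(records[0])), k)
    buckets = defaultdict(list)
    for r in records:
        buckets[tuple(r[i] for i in dims)].append(r)
    return buckets
```

Stratified sampling then draws pairs from within buckets and across buckets separately, and scales each stratum's estimate by its size.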
3 Our Estimation Algorithm
The basic idea behind our estimation algorithm is to map the problem of similarity self-join size estimation into a set of self-join size estimations, for which more efficient solutions are available, before putting together the results to obtain an estimate of the similarity self-join size. To illustrate the concept, consider a table $T$ with 3 columns and 4 rows, as shown in Table I.
The self-join size of $T$ equals the number of records, hence $T$ has no 3-similar pairs other than self-pairs (as shown in Table I). Now consider the projections of $T$ on columns (A,B), (A,C) and (B,C), with duplicates kept. This yields tables $T_{AB}$, $T_{AC}$ and $T_{BC}$, as shown with their self-join sizes.



The self-join size of $T_{AB}$ is 6, and that of $T_{AC}$ is 4. Excluding 3-similar pairs, $T$ must have two pairs of rows that are 2-similar on columns (A,B). Similarly, the self-join size of $T_{BC}$ is 6, which indicates that $T$ has two pairs of records that are 2-similar on columns (B,C). No pairs of records in $T$ are 2-similar on columns (A,C). Putting the results together, one can conclude that $T$ has 4 pairs of 2-similar records. This is the exact number of 2-similar pairs, calculated solely based on the join sizes and the size of $T$.
There are three problems associated with this approach: (1) the number of possible projections of a relation with $d$ attributes is $2^d$, and so is the number of self-join size estimates; (2) the join sizes are not independent, simply because the projections of two similar records are expected to have some similar pairs; (3) the sum of the number of records in the projected tables can be much larger than the input (more precisely, larger by up to a factor of $2^d$), and this has implications for the required space usage and per-record processing time.
We address the first problem by reducing the number of self-join size estimations to one per lattice level. This is done by collapsing all projections with $\ell$ attributes into a single stream. To make distinctions between tuples from different projections though, we attach another column to the stream to indicate the projection from which each tuple is drawn. Otherwise, two tuples that have the same values under different projections could join, leading to wrong join sizes. Applying this to the projections with two attributes in our running example yields a table with three columns, twelve rows and a self-join size of 16 (as shown in Table II), of which 12 are self-pairs. This gives 16 − 12 = 4 pairs of 2-similar records, consistent with our previous results.
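The collapsing step can be sketched as follows (without sampling or sketching; names are ours). The tag prefix plays the role of the extra column, so tuples from different projections cannot join:

```python
from collections import Counter
from itertools import combinations

def subvalue_stream(records, level):
    """Collapse all projections with `level` attributes into one stream,
    tagging each tuple with its attribute combination."""
    stream = []
    for r in records:
        for cols in combinations(range(len(r)), level):
            tag = ",".join(map(str, cols))
            vals = ",".join(str(r[i]) for i in cols)
            stream.append(tag + ":" + vals)
    return stream

def self_join_size(stream):
    """Exact self-join size: the sum of squared frequencies."""
    return sum(c * c for c in Counter(stream).values())
```

On a 4-row, 3-column table shaped like the running example (one duplicate pair on (A,B) and one on (B,C)), the level-2 stream has twelve rows and a self-join size of 16, recovering 16 − 12 = 4.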
The second problem is addressed in the next section by calculating the contributions of an $s$-similar pair to the projections and incorporating them in our size estimations. We address the third problem through a combination of sampling and sketching, showing that both the space usage and the error can be bounded, and that the proposed algorithm outperforms our competitive baselines by a large margin.
3.1 Handling doublecounting
Given a table with $d$ attributes, the set of possible groupings of the attributes can be described in a lattice. Suppose the grouping that includes all $d$ attributes is at Level $d$ of the lattice; then Level $d-1$ will have all subsets consisting of $d-1$ attributes, and so on (as shown in Fig. 1 with 4 attributes).
To obtain a relationship between the self-join size and the number of similar pairs, let $SJ^{(\ell)}$ denote the self-join size at Level $\ell$ of the lattice, and $J_\ell$ be the number of record pairs that are $\ell$-similar.
Lemma 2.
Given $\ell \le d$, for a set of $N$ records with no $t$-similar pairs for $t > \ell$,

$$J_\ell = SJ^{(\ell)} - N\binom{d}{\ell}.$$
Proof.
If two records are $\ell$-similar, then there must be a unique projection at Level $\ell$ under which both records produce the same values; hence those values can join and will contribute to $SJ^{(\ell)}$. However, $SJ^{(\ell)}$ also includes self-pairs, where a record joins itself. The number of those pairs is the same as the number of records at Level $\ell$, which is $N\binom{d}{\ell}$. Subtracting the two will give $J_\ell$.∎
Now we consider the scenario where the set can have pairs that are more than $\ell$-similar. Consider two records that join at Level $d$, meaning they have the same values in all $d$ attributes. All projections of these records will also join, and this introduces double-counting in the join-size estimates at levels $\ell < d$. The size of this overlap can be precisely computed with not much effort though.
Lemma 3.
For a set of $N$ records and $\ell \le d$,

$$J_\ell = SJ^{(\ell)} - N\binom{d}{\ell} - \sum_{t=\ell+1}^{d} \binom{t}{\ell} J_t.$$ (3)
Proof.
Consider two records that are $t$-similar for some $t > \ell$. Then there must be a projection at Level $t$ under which the two records emit the same values; all projections of those values at Level $\ell$ will be identical. There are $\binom{t}{\ell}$ such projections. With $J_t$ giving the number of record pairs that are $t$-similar, the expression $\sum_{t=\ell+1}^{d} \binom{t}{\ell} J_t$ gives the contribution of all such pairs to $SJ^{(\ell)}$. The rest follows from Lemma 2.∎
Unlike the approach of Lee et al. [22], which approximates the overlap between signature patterns with no clear bound on the error, our method computes the exact size of the overlap between projections.
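The recurrence of Lemmas 2 and 3 can be applied top-down, one level at a time. A sketch under our notation, where sj maps each level to the exact self-join size of its collapsed stream:

```python
from math import comb

def similar_pair_counts(sj, N, d):
    """Recover J_l, the number of ordered pairs that are exactly l-similar,
    from exact self-join sizes of the collapsed level-l streams, by
    subtracting self-pairs and the overlap contributed by higher levels."""
    J = {}
    for l in sorted(sj, reverse=True):                # top level first
        overlap = sum(comb(t, l) * J[t] for t in J)   # levels t > l
        J[l] = sj[l] - N * comb(d, l) - overlap
    return J
```

For the running example (N = 4, d = 3, level self-join sizes 4 and 16 at Levels 3 and 2), this yields no 3-similar pairs and four 2-similar pairs.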
3.2 Sampling from the projections
To calculate the number of pairs that are $s$-similar, we need the self-join sizes at Levels $s$ to $d$ of the lattice. Each level of the lattice emits a stream that includes all record projections at that level. As discussed earlier in Section 3, attaching a projection identifier to each row in this stream allows the different projections at the same level to all be collapsed into a single stream without introducing an estimation error, hence cutting the number of size estimations. However, as shown in Table II for our running example, each row of our initial stream is listed under multiple projections, and all those projections contribute to our size estimation. Our objective in this section is to cut this size through sampling.
[Table II: the collapsed subvalue streams at Levels 3 and 2 of the running example.]



Let $p$ be our sampling ratio, meaning each row of an emitted stream is selected uniformly into the sample with probability $p$. This is sampling without replacement and is done by uniformly selecting at random $p\binom{d}{\ell}$ projections of each record at Level $\ell$. The sampling here is in the form of inclusion/exclusion (unlike the one discussed in Section 2.1), and one does not need to store the sample to estimate the self-join size. For the same reason, the sample size can grow linearly with the input to avoid the estimation problem discussed in Section 2.1.
Given a sample as discussed above, let the random variables $\hat{J}_\ell$ and $J'_\ell$ be estimates of $J_\ell$ for the population and the sample respectively. Also let the random variable $SJ'^{(\ell)}$ be an estimate of $SJ^{(\ell)}$ for the sample. The relationship between the expected number of $\ell$-similar pairs in the population, $J_\ell$, and the self-join size of a sample from a subvalue stream, $SJ'^{(\ell)}$, can be expressed as follows.
Lemma 4.
For a set of $N$ records, each of arity $d$, and sampling ratio $p$,

$$J_\ell = \frac{E[SJ'^{(\ell)}] - pN\binom{d}{\ell}}{p^2} - \sum_{t=\ell+1}^{d} \binom{t}{\ell} J_t.$$ (4)
Proof.
Let us initially assume that there are no $t$-similar pairs for $t > \ell$. Given that each projected row is included with probability $p$, giving us a sample of expected size $pN\binom{d}{\ell}$, the relationship between $SJ'^{(\ell)}$ and $J'_\ell$ can be expressed as

$$E[SJ'^{(\ell)}] = pN\binom{d}{\ell} + E[J'_\ell].$$

For a pair of $\ell$-similar records, the probability that they both make it into the sample (and are counted in $J'_\ell$) is $p^2$; hence $E[J'_\ell] = p^2 J_\ell$. Replacing this in the equation above, we get $E[SJ'^{(\ell)}] = pN\binom{d}{\ell} + p^2 J_\ell$, and this can be rewritten as

$$J_\ell = \frac{E[SJ'^{(\ell)}] - pN\binom{d}{\ell}}{p^2}.$$ (5)
We can now relax our assumption and show by induction on $\ell$ that the statement of the lemma holds. The basis holds for $\ell = d$. Now suppose the lemma holds for all levels above $\ell$, meaning we can derive the values of $J_{\ell+1}, \ldots, J_d$ using Eq. 4. Then the contributions of more-similar pairs toward $E[SJ'^{(\ell)}]$ can be computed as $p^2 \sum_{t=\ell+1}^{d} \binom{t}{\ell} J_t$ (see the discussion in the proof of Lemma 3). Subtracting this quantity from our earlier estimate in Eq. 5 gives the final result, and this completes our proof.∎
3.3 The algorithm
Our one-pass Similarity Self-Join Pair Count (SJPC) method is depicted in Algorithm 1. The algorithm can be broken down into three main steps.
Step 1: Generate projections and construct sketches (lines 1–20). For each record and each level $\ell$ ($s \le \ell \le d$), the algorithm selects $\ell$ different attributes uniformly at random and projects the record under these attributes (with duplicates kept); the projected attribute values are encoded into a string along with the text of the attribute combination. We call this string a subvalue, and the set of all subvalues at Level $\ell$ a subvalue stream. For example, if the selected attributes for a row are A, B and C and their respective values are a, b and c, then the generated subvalue can be the string 'ABC:abc'. With this coding, all subvalues can be placed on the same stream and no two subvalues from different projections can join. This reduces the number of self-join size estimations at Level $\ell$ of the lattice from $\binom{d}{\ell}$ to one. The process is repeated $p\binom{d}{\ell}$ times per record, as dictated by the sampling ratio.
This step will produce $d - s + 1$ subvalue streams, one for each $\ell \in \{s, \ldots, d\}$, and the number of subvalues in each subvalue stream is controlled by the sampling ratio $p$. Subvalues may be fingerprinted into more concise fixed-length strings [26], and a sketch may be constructed for each subvalue stream instead of directly storing it. There are several sketching algorithms that estimate the self-join size in one pass [19, 21]. We use FastAGMS [21], which maintains $w$ counters (the sketch width) and maps the elements of the stream into one of those counters. Two universal hash functions $g$ and $h$ are used, where $g$ maps each element into either $-1$ or $+1$ and $h$ maps it into $\{1, \ldots, w\}$, both uniformly at random. For each incoming element $x$, the sketch is updated by adding $g(x)$ to the counter at index $h(x)$. Once the stream is processed, the self-join size is estimated by adding up the squares of all counter values. In our case, $d - s + 1$ sketches are needed to estimate the self-join sizes of that many subvalue streams. To provide a better error bound, the process is often repeated $t$ times (the sketch depth) and the median of those estimates is chosen. The sketch requires $w \cdot t$ counters to implement, and we construct $d - s + 1$ such sketches for our estimation.
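A minimal sketch of the update and estimation logic just described (Python's seeded built-in hash stands in for the 4-wise independent hash functions the analysis assumes):

```python
import random

class FastAGMS:
    """w counters per row (width), t independent rows (depth); the
    self-join size estimate is the median over rows of the sum of
    squared counters."""
    def __init__(self, w, t, seed=0):
        rng = random.Random(seed)
        self.w, self.t = w, t
        self.counters = [[0] * w for _ in range(t)]
        self.seeds = [rng.getrandbits(64) for _ in range(t)]

    def update(self, x):
        for row, seed in enumerate(self.seeds):
            h = hash((seed, x))
            g = 1 if h & 1 else -1                       # sign hash g(x)
            self.counters[row][(h >> 1) % self.w] += g   # bucket hash h(x)

    def self_join_estimate(self):
        sums = sorted(sum(c * c for c in row) for row in self.counters)
        return sums[self.t // 2]                         # median across rows
```

A stream of n copies of one element drives a single counter to ±n in each row, so the estimate is exactly n², matching the true self-join size.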
Step 2: Find the self-join sizes (lines 21–23). The algorithm finds the self-join size of each subvalue stream, using standard self-join size estimation methods [21].
Step 3: Estimate the similarity self-join size (lines 24–28). With the self-join sizes computed for Levels $s$ to $d$ in the previous step, the similarity self-join size can be computed using Equation 4.
As an example, let $d = 3$ and $s = 1$, and suppose $SJ'^{(1)}$, $SJ'^{(2)}$ and $SJ'^{(3)}$ are computed; we can compute the similarity self-join size by adding up $N$, $J_1$, $J_2$ and $J_3$, where the latter three can be obtained by solving the following equation system:

$$J_3 = \frac{SJ'^{(3)} - pN}{p^2}, \qquad J_2 = \frac{SJ'^{(2)} - 3pN}{p^2} - 3J_3, \qquad J_1 = \frac{SJ'^{(1)} - 3pN}{p^2} - 2J_2 - 3J_3.$$ (6)
Step 1 can be done while the input is being read, and Step 3 simply takes the self-join size estimates and computes $J_\ell$ using Eq. 4, which is straightforward. This leaves us with the self-join size estimation in Step 2, for which one-pass algorithms are available.
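The three steps can be put together as follows in the offline case, where exact self-join sizes of the sampled streams are available (a sketch under our notation and our reconstruction of Eq. 4):

```python
from math import comb

def estimate_similarity_self_join(sj_sample, N, d, s, p):
    """sj_sample[l]: self-join size of the sampled level-l subvalue stream
    (exact here; a sketch estimate in the online case). Solve for J_l
    top-down via Eq. 4, then return SJ_s = N + sum of J_l for l = s..d."""
    J = {}
    for l in range(d, s - 1, -1):
        overlap = sum(comb(t, l) * J[t] for t in J)  # higher levels
        J[l] = (sj_sample[l] - p * N * comb(d, l)) / (p * p) - overlap
    return N + sum(J.values())
```

With p = 1 on the running example (level self-join sizes 4 and 16 at Levels 3 and 2), the estimate of SJ_2 is 4 + 0 + 4 = 8.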
4 Analysis
There are two sources of randomness in the proposed algorithm: (1) randomness due to the sampling in Step 1, and (2) randomness due to the selfjoin size estimation in Step 2. To get a better insight into the algorithm and its steps, we analyze it both without and with the randomness in Step 2. We refer to the case where an exact selfjoin size is computed in Step 2 as the offline case, and the case where this is estimated using a sketch as the online case.
Theorem 1.
(Unbiased estimate and variance, offline case) The SJPC algorithm gives an unbiased estimate of the similarity self-join size under the offline scenario, i.e. $E[\widehat{SJ}_s] = SJ_s$, and the variance of $\widehat{SJ}_s$ is bounded from above (the bound is derived in Appendix A), where $\widehat{SJ}_s$ is the estimate and $SJ_s$ is the true value.
Proof.
See Appendix A. ∎
Remarks. Since the estimate is unbiased, the variance can be considered a measure of relative error in practice. There are a few observations to be made. First, this is an upper bound on the error, and the actual error is expected to be less. More specifically, when $p = 1$, the estimate has no error (see Lemma 3) whereas the bound can still be large depending on $s$ and $d$. Second, the variance increases significantly as the gap between $s$ and $d$ widens. However, in many practical settings such as duplicate detection, higher similarities (80% or more) are often sought; in those cases, the error is expected to be low, as shown in Figure 2 (left) as well. Third, when the other parameters are fixed, the expected relative error decreases when the true similarity self-join size increases. Assuming $SJ_s$ increases quadratically with $N$ (which is the case in our real dataset), the relative error goes down linearly with $N$. The results shown in the experiments section confirm this observation.
When the dimensionality $d$ increases, the expected error increases, and the time and space costs are also affected. Since the algorithm generates subvalues for each record, if the sampling ratio $p$ is chosen such that the number of subvalues per record is bounded by a constant, then both the space and time costs in an offline case are linear in $N$ for processing all $N$ records. Also, for large $d$, it may be possible to select a subset of the columns and gauge the similarity based on that subset.
Next we report the performance of our algorithm under an online scenario where the selfjoin size in Step 2 is estimated using a sketch.
Theorem 2.
(Unbiased estimate and variance) The SJPC algorithm gives an unbiased estimate of the similarity self-join size in an online scenario, i.e. $E[\widehat{SJ}_s] = SJ_s$, and the variance of $\widehat{SJ}_s$ is bounded from above by a quantity (derived in Appendix A) that depends on $w$, the FastAGMS sketch width, $d$, the number of attributes, $s$, the given similarity threshold, $p$, the sampling ratio, and $SJ_s$, the true value of the similarity self-join size, where $\widehat{SJ}_s$ is the estimated value.
Proof.
See Appendix A. ∎
Remarks. A few observations can be made here. First, as in the offline case, this is an upper bound on the error. In particular, when the sampling ratio $p$ is close to 1, the offline estimates are expected to be accurate and the only source of error is the sketches. Second, the variance takes a hit as the gap between $s$ and $d$ widens. Again, this is not an issue in many practical settings where a high similarity (80% or more) is desired. Third, to bound the variance, the space usage (the width $w$) does not have to increase when $N$ increases as long as $SJ_s$ increases proportionally, which is usually an expected case. Finally, the statement provides a formulation of the interaction between sketching and sampling, and of how the variance changes with $w$ (see the right two columns of Figure 2). A similar (but more extensive) study of the interaction between sketching and sampling in a different context is conducted by Rusu and Dobra [27], who reach the same conclusion: sketching over samples is a viable option, reducing the processing time with not much loss in accuracy.
Theorem 3.
(Space and time costs to bound the selectivity estimation error) The SJPC algorithm guarantees that the estimated selectivity of the similarity self-join deviates from the true value by at most $\epsilon$ with probability at least $1 - \delta$. More precisely, $\Pr(|\hat{\sigma} - \sigma| \ge \epsilon) \le \delta$, where $\hat{\sigma}$ is the estimated selectivity and $\sigma$ is the true value. The space cost and the per-record time cost, derived in Appendix A, are both independent of $N$.
Proof.
See Appendix A. ∎
Remarks. Note that $N$ appears in neither the time nor the space complexity. Although this statement discusses the error of selectivity estimation, which is a relative error, by slightly changing the proof it is not hard to see that the statement also holds for other definitions of relative error. The algorithm constructs $d - s + 1$ sketches, each of a size independent of $N$, meaning that using constant time per record and a constant number of counters, the algorithm can give accurate estimates of the similarity self-join size with high probability. It should be noted though that each counter needs $O(\log m)$ bits to implement, where $m$ is the maximum frequency of a subvalue. In the extreme case where the records all have the same value, $m$ would be proportional to $N$.
The statement also shows that both the time and space costs increase when the required accuracy increases or the gap between $s$ and $d$ widens. There is a clear trade-off between time and space, controlled by the sketch width $w$: if $w$ is large, the time cost will be smaller while the space cost will be larger. Compared with the offline case, the online case requires much less space and returns a final estimate much faster after scanning the dataset once. Our experiments show that the overall error in the online case is still negligible and much less than that of the competitors under typical settings of $p$ and $w$.
5 Asymptotic Time Compared to Random Sampling
In terms of a time comparison, there are two main stages in both SJPC and random sampling: data summarization and size calculation. At the data summarization stage, random sampling takes constant time per record, whereas SJPC has to construct $p\binom{d}{\ell}$ subvalues per record at each level $\ell$, and for each subvalue $t$ counters will be updated, where $p$ is the sampling ratio and $t$ is the sketch depth. Random sampling is clearly faster at the data summarization stage.
At the size calculation stage, having the data summary, SJPC computes the mean or median of the counters and plugs them into Equation 4 to find the estimates. Hence it takes time linear in the size of the data summary, which is basically the time of scanning the summary once, while random sampling takes $O(R^2)$ time for a sample of size $R$, quadratic in the sample size. Thus, online SJPC is faster at the size calculation stage.
The total time for SJPC is linear in $N$, with the sampling ratio, the dimensionality and the sketch depth treated as constants, and this linear bound can be used as an upper bound on the running time.
The cost of random sampling depends on the sample size, which is a function of $N$. We know from Lemma 1 that the sample must be large enough to obtain an estimate that has an error less than 100%, and random sampling then needs time quadratic in that sample size to obtain an estimate. Figure 3 shows how the two methods cope as the dataset size and the dimensionality increase. As expected, random sampling suffers when $N$ increases, whereas SJPC suffers when $d$ increases or the gap between $s$ and $d$ widens. On the other hand, SJPC scales linearly with $N$.
6 Similarity Join Size Estimation
An important problem related to similarity self-join size estimation is estimating similarity join sizes. First we show an estimate that does not hold for similarity join sizes. A simple estimate for the join size is based on the self-join sizes. Alon et al. [20] show that for two relations R1 and R2, |R1 ⋈ R2| ≤ √(SJ(R1) · SJ(R2)),
where SJ(R1) and SJ(R2) are the self-join sizes of R1 and R2 respectively on the joining attributes. This does not hold for a similarity join though. Here is a counterexample: let R1 consist of a single row and R2 consist of two rows such that each row of R2 is similar to the row of R1 but the two rows of R2 are not similar to each other. Every row of R2 joins with the row of R1, so the similarity join size is 2, but the similarity self-join sizes of R1 and R2 contain only the self-pairs, and the bound on the join size does not hold. The same can be shown for larger thresholds; for example, R1 can be kept the same and R2 replaced with a set of three rows, each similar to the row of R1 but not to one another, and again the bound does not hold.
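Since the concrete rows of the counterexample did not survive in the text, here is one assumed instance exhibiting the same failure (two columns, a threshold of one matching column; the row values are illustrative):

```python
import math

def similar(r1, r2, s):
    """Rows are s-similar if they match in at least s columns."""
    return sum(a == b for a, b in zip(r1, r2)) >= s

def sim_join_size(A, B, s):
    """Number of (a, b) pairs with a in A, b in B that are s-similar."""
    return sum(similar(a, b, s) for a in A for b in B)

R1 = [(0, 0)]
R2 = [(0, 1), (1, 0)]
s = 1
J = sim_join_size(R1, R2, s)       # both rows of R2 join with R1's row: J = 2
bound = math.sqrt(sim_join_size(R1, R1, s) * sim_join_size(R2, R2, s))
# self-joins contain only self-pairs (1 and 2), so bound = sqrt(2) < J
```

Each relation's similarity self-join consists only of self-pairs here, so the square-root bound that holds for exact joins is violated.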
A well-known fact in both join and self-join size estimation is that an estimation is generally ineffective when the size to be estimated is small compared to the sizes of the relations being joined, and a sanity check may be performed to avoid such cases [20]. Consider the problem of similarity join size estimation between two relations R1 and R2 in the presence of one such sanity check. An estimation algorithm may look like this: (1) project the records of R1 and R2 independently into subvalue streams (as discussed in Sec. 3), (2) construct a sketch for each subvalue stream, for a total of 2(d − s + 1) sketches, (3) estimate the join sizes between the subvalue streams of R1 and R2 at each of the levels l = s, …, d, and (4) estimate the join size based on the join sizes of the subvalue streams. As discussed for self-join sizes, the computation in Step 4 is exact, meaning that given exact join sizes of the subvalue streams, no error can be introduced in Step 4. The only sources of error here are (a) the error due to sampling from projections in Steps 1-2, and (b) the sketch error in estimating join sizes between subvalue streams.
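Steps 1-2 can be sketched as follows; a minimal illustration assuming Hamming-style similarity over columns, with each subvalue keyed by its column subset so that projections onto different subsets never collide (names are illustrative):

```python
import itertools
import random

def subvalue_streams(records, s, d, p, seed=0):
    """Steps 1-2 (sketch): project each record onto every subset of l
    columns (l = s..d) and keep each resulting subvalue with
    probability p.  Yields (level, subvalue) pairs; the subvalue
    carries the column positions it was projected on."""
    rng = random.Random(seed)
    for rec in records:
        for l in range(s, d + 1):
            for cols in itertools.combinations(range(d), l):
                if rng.random() < p:
                    yield l, (cols, tuple(rec[c] for c in cols))
```

Each yielded pair would then be routed to the sketch of its level; with p = 1, a d-column record produces C(d, l) subvalues at each level l.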
The join size estimation in Step 3 uses the product of the sketches for subvalue streams; in particular, given AGMS or Fast-AGMS sketches X and Y of two relations R1 and R2 respectively, an estimator for the join size is the inner product X · Y. It is easy to show that this estimate is unbiased, since the expected contribution of non-matching values to the product is zero when the sketch mapping functions are 2-wise (in the case of AGMS) or 4-wise (in the case of Fast-AGMS) independent. Alon et al. [20] show that this estimate has a variance which does not exceed two times the product of the self-join sizes of R1 and R2.
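The inner-product estimator can be sketched as follows; a minimal Python illustration in the style of Fast-AGMS, using a seeded pseudo-random hash per sketch row as a stand-in for the 2-wise/4-wise independent hash families a real implementation would use (names are illustrative):

```python
import random
from statistics import median

def _hash_pair(row, v, width):
    """Bucket and sign for value v in sketch row `row` (a stand-in for
    the independent hash families of a real Fast-AGMS sketch)."""
    rng = random.Random(f"{row}:{v!r}")
    return rng.randrange(width), 1 if rng.random() < 0.5 else -1

def fast_agms(stream, width, depth):
    """Build a depth x width counter array; each value updates one
    signed counter per row."""
    sk = [[0] * width for _ in range(depth)]
    for v in stream:
        for j in range(depth):
            b, sign = _hash_pair(j, v, width)
            sk[j][b] += sign
    return sk

def join_estimate(sk1, sk2):
    """Median over rows of the inner product of corresponding rows."""
    return median(sum(a * b for a, b in zip(r1, r2))
                  for r1, r2 in zip(sk1, sk2))
```

For streams A = [1, 1, 2, 3] and B = [1, 2, 2], the true join size is 2·1 + 1·2 = 4, and the estimate concentrates around that value as the width grows.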
Let two random variables denote respectively the similarity join size and the subvalue-stream join size, both at level l. The relationship between the two variables can be written as
(7) 
where p is the sampling rate, set to the same value for both streams. Note that in the case of a similarity join size estimation there is no self-pair (where a record joins itself), and this gives rise to the slight difference between this estimate and that in Equation 4.
The estimate is unbiased, since the underlying sketch-based join size estimate is unbiased, and it has the expectation
(8) 
and the variance
which can be bounded as
(9) 
7 Experiments
To verify our analytical findings in more practical settings, and to assess both the robustness and the performance of the SJPC algorithm, we conducted a set of experiments on both real and synthetic data under different settings, including different similarity thresholds, dataset sizes, and dimensionalities. When applicable, the performance of our method is compared to that of competitors, including LSH-based bucketing and random sampling (see Sec. 2 for details of these algorithms).
7.1 Experimental Setup
The following three datasets were used in our evaluation (see also Section 7.5 for larger datasets and experiments).
DBLP-5. This was a set of records selected from DBLP (http://dblp.uni-trier.de/xml). The selection criterion was that a record was selected if it had non-empty values in (all of) the following 5 fields: title, author, journal, volume and year. In the first 20,000 records that qualified, there were 19,884 unique titles, 15,917 unique authors, 29 unique journals, 125 unique volumes and 49 unique years.
DBLP-6. This was similar to DBLP-5 except every record here had non-empty values in the following 6 fields: title, author, journal, month, year and volume. The dataset had 2,468 records. There were 2,456 unique titles, 1,601 unique authors, 9 unique journals, 150 unique volumes, 41 unique years and 26 unique months.
DBLP-titles. This was a set of paper titles from DBLP with each title fingerprinted into 6 supershingles, where each supershingle was a 64-bit fingerprint. This resembled the experimental setting of Henzinger [28] and Broder et al. [29], where the goal was to find near-duplicate Web pages. This resulted in 467,468 records, each with 6 attributes. The number of unique values in each column ranged from 27,000 to 30,000.
Table III
  s          DBLP-5       DBLP-6    DBLP-titles
  n          20,000        2,468        200,000
  6             n/a            0         19,356
  5              70           26        210,666
  4             761        7,984      1,900,702
  3       1,827,680       29,405     16,607,104
  2       2,112,300      184,287    103,992,978
  1      39,556,445    1,655,537    521,423,328
Table III gives more stats on these datasets, including the similarity self-join sizes for different similarity thresholds s (rows) and the number of records n. Unless stated otherwise, all experiments are repeated 30 times, and the mean, the standard deviation, or both of the relative error are reported.
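The DBLP-titles construction follows the shingling/supershingle approach of Broder et al. [29]. A minimal sketch of one plausible variant, assuming word-level shingles and blake2b as a stand-in for the 64-bit Rabin fingerprints of [26] (function names, shingle width and min-hash counts are assumptions, not the authors' exact settings):

```python
import hashlib

def fingerprint64(data):
    """64-bit fingerprint of a string (stand-in for Rabin fingerprints)."""
    return int.from_bytes(
        hashlib.blake2b(data.encode(), digest_size=8).digest(), "big")

def supershingles(text, k=4, num_minhashes=84, num_super=6):
    """Broder-style supershingles (sketch): fingerprint every k-word
    shingle, take num_minhashes min-hashes under different seeds, and
    fingerprint each group of min-hashes into one supershingle."""
    words = text.split()
    shingles = [" ".join(words[i:i + k])
                for i in range(max(1, len(words) - k + 1))]
    minhashes = [min(fingerprint64(f"{seed}|{sh}") for sh in shingles)
                 for seed in range(num_minhashes)]
    group = num_minhashes // num_super
    return [fingerprint64(",".join(map(str, minhashes[i * group:(i + 1) * group])))
            for i in range(num_super)]
```

Each title is thus reduced to 6 supershingles, i.e. a record with 6 64-bit attributes, matching the shape of DBLP-titles.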
7.2 Offline scenario
In the first set of our experiments, we wanted (1) to evaluate our method without introducing any error due to the sketching, and (2) to characterize its performance compared to other baselines. This was possible in the offline scenario (as discussed in Section 4), under which the baseline algorithms introduced in Section 2 could be applied as well. Next we compare the performance of our method to some of these non-streaming solutions under the same or similar space requirements.
A note on the signature pattern counting of Lee et al. [22]. We think there is a mistake in the formulation presented in the paper. In particular, with the formulation in the authors' Equation 4, the estimates of the similarity self-join size can be negative. This is what we observed in our experiments when running this algorithm on DBLP-5 and DBLP-6. Also, the estimates were sometimes off by a factor of 2 or larger. We carefully verified our implementation, and it was indeed consistent with the paper. We also noticed that Equation 4 applied to the authors' own example of LC(2) on Page 6 would give a value different from the reported result. After communicating this with the authors, and given the fact that the same authors show LSH-SS outperforms the signature pattern counting, we decided not to report our results for the latter.
Relative error on DBLP-5, DBLP-6 and DBLP-titles. In this experiment, we compare the performance of our method to the LSH-based bucketing of Lee et al. [18]; the selected algorithm for LSH-based bucketing is referred to as LSH-SS by the authors, and is shown to perform the best in their experiments. The sampling ratio for SJPC was set to 0.5, and the parameters of LSH-SS were set based on the size of the dataset, as suggested by the authors.
Figure 6 shows both the mean and the standard deviation (std) of the relative error over 30 runs on DBLP-6. In terms of both the mean and the standard deviation of the error, SJPC outperforms LSH-SS, and has a standard deviation of the error that is sometimes an order of magnitude smaller than that of LSH-SS. The dataset had no 6-similar pairs, and both algorithms detected that correctly. Similar results are observed on DBLP-titles, as shown in Figure 6. In another experiment, we evaluated LSH-SS under two different sampling strategies. In the first strategy, referred to as LSH-SS-v1, the sampling ratio was set as suggested by the authors, and the sample size grew linearly with n. In our second strategy, we set the sampling ratio to a fixed constant, meaning each pair was sampled with a fixed probability and the sample size grew linearly with the number of pairs. The results on DBLP-5 are shown in Figure 6.
Materializing subvalue streams. Limiting our algorithm to the offline case (i.e., without the space- and time-cost optimization due to the sketching) allowed us to compare its performance to multi-pass algorithms that assume the dataset and/or the intermediate data structures can be materialized. This is not usually feasible in a streaming environment and is not the right setting for our algorithm. That said, the offline case can be executed if the intermediate subvalue streams can be materialized. This is what we did in an implementation of both SJPC-offline and LSH-SS, where the memory usage for each method was tracked at various points during the execution (e.g., when a variable is defined or loaded) by calling a task manager function before and after, with the largest difference for each method reported. We verified the accuracy of this method by loading datasets of known sizes and comparing the reported space usage with the actual size; the method was accurate to within kilobytes, especially if the experiment was repeated. As shown in Figure 7, the space needed for materializing subvalue streams was, to our surprise, not much more than that of LSH-SS, especially for large similarity thresholds (which is usually the case in similarity estimations), and this makes SJPC-offline a viable option due to its better error bounds.
7.3 Online scenario
In an online scenario where only one pass can be made over the data, random sampling is the only competitor. In this section, we compare the accuracy of SJPC to random sampling.
Comparison to random sampling. Similar to the offline scenario, we set the sampling ratio to 0.5, and ran our online SJPC on the first 200,000 rows of DBLP-titles. The sketch width (number of counters per sketch row) was set to 1,000, and the sketch depth was set to 3. SJPC needs one sketch for every subvalue stream, and the number of subvalue streams is d − s + 1, where d is the data dimensionality and s is the minimum similarity threshold that is desired. One can cover all useful similarity ranges (e.g., s ≥ 3) by creating 4 sketches on this particular dataset; this translates to 12,000 counters, each implemented as a 32-bit integer, giving a total space of 48,000 bytes. The same amount of space was allocated to random sampling. Every record of DBLP-titles had 6 fields, and each field was a 64-bit fingerprint, adding up to 48 bytes per record. That meant random sampling was given space for 1,000 records.
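The space accounting in this comparison can be reproduced with a few lines of arithmetic, assuming the layout described in this section (one sketch per subvalue level, 32-bit counters, six 64-bit fingerprints per record; variable names are illustrative):

```python
# Space budget for the sketches versus the equivalent sample capacity.
d, s = 6, 3                  # dimensionality and minimum similarity threshold
width, depth = 1000, 3       # sketch width (counters per row) and depth
levels = d - s + 1           # one sketch per subvalue level
counters = levels * width * depth          # total number of counters
sketch_bytes = counters * 4                # 32-bit counters
record_bytes = 6 * 8                       # six 64-bit fingerprints per record
sample_capacity = sketch_bytes // record_bytes   # records random sampling can hold
```

Under these assumptions the budget works out to 12,000 counters, 48,000 bytes, and room for 1,000 sampled records, matching the setting of the experiment.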
Both random sampling and SJPC give unbiased estimates, hence we compare the standard deviations of their estimates. As shown in Figure 8, SJPC outperforms random sampling by a large margin; the standard deviation of the estimates for random sampling is almost an order of magnitude higher. It should also be noted that this was under the setting that the sketches are maintained for all similarity thresholds s ≥ 3. With larger values of s, the space usage of SJPC is reduced (in terms of the number of counters relative to the number of records) while the accuracy remains the same; this cannot be done in random sampling without affecting its accuracy.
Records in both DBLP-5 and DBLP-6 were longer, and a bigger difference in performance between SJPC and random sampling was expected. In particular, the same setting of our sketching could be used on DBLP-5 and DBLP-6, since a 32-bit counter was enough to keep the counts. However, random sampling suffered because of the space limitations. The average length of a record in DBLP-5 was 167 characters and in DBLP-6 was 121 characters. Under ASCII encoding, where each character takes one byte, random sampling would have enough space to store 287 records of DBLP-5 and 397 records of DBLP-6. For this reason, the results are not reported on these two datasets.
7.4 Varying the parameters of SJPC
In this section, we evaluate the performance of the SJPC algorithm under different parameter settings.
Varying the sampling ratio. In all our previous experiments, the sampling ratio was set to 0.5, meaning only half of the subvalues were sampled. The sampling ratio only affects the per-record processing time and not the space usage; hence if the processing time is not a constraint, the sampling ratio should be set to 1 to obtain a better estimate. To study the relationship between the sampling ratio and the accuracy, we varied the sampling ratio while keeping everything else the same as before, i.e., 200K records of DBLP-titles with the sketch width and depth set at 1000 and 3 respectively. Figure 9 (left) shows the effect of the sampling ratio on the standard deviation of the error. The error consistently drops as the sampling ratio increases, with an exception at the similarity threshold 1, where a sampling ratio of 0.5 performs slightly better than the next sampling ratio. We don't have a good explanation here other than confirming that this is due to the interaction between sketching and sampling and that the sketch in this particular case performed better on the sample. A similar behaviour is observed by Rusu and Dobra [27] in some of their experiments on constructing sketches over samples. The mean error also drops (not shown here), but the drop is not as significant as the drop in the standard deviation. An observation that can be made is that the sampling ratio can vary between subvalue streams, for example, to reduce the error at certain values of s. It is easy to incorporate this in our formulation in Equation 4.
Varying the dimensionality. In another experiment, we used the first 100K records of DBLP-titles but varied the number of columns by generating different numbers of supershingles for each title. Other parameters were kept the same (i.e., a sampling ratio of 0.5, sketch width of 1000 and sketch depth of 3). As shown in Figure 9 (middle), the standard deviation of the error takes a hit when the dimensionality increases, which is consistent with our analytical prediction. This is under the condition that all other parameters are kept the same; clearly one can reduce the error by increasing the sketch width/depth and/or the sampling ratio, as shown in our previous experiments.
Varying the dataset size and the number of duplicates. In this set of experiments, we started with 400K records of DBLP-titles and duplicated each record k times, with k taking the values 1, 2, 4 and 8. This gave us datasets of sizes 400K, 800K, 1600K and 3200K. This particular setting allowed us to easily compute the true sizes without doing a join on larger files. We also included the first 200K records of DBLP-titles to see if the trend is the same when records are not duplicated. As before, the sampling ratio of our method was set to 0.5, and the sketch width and depth to 1000 and 3 respectively. Figure 9 (right) shows that SJPC does not suffer when the dataset size increases while the space usage and the sampling ratio are kept the same; in fact, for some similarity thresholds (e.g., 3, 4 and 5), the error drops as the dataset size increases. Compare this to random sampling, where the sample size should increase at least as a square root function of the input size to maintain the same error rate. In contrast, having a larger number of records can even be helpful for the sampling part of our algorithm, as hinted in our analytical results and also revealed in Figure 9.
7.5 Running time
We conducted experiments to evaluate the running time of SJPC and its scalability with the dataset size, compared to random sampling. The evaluation was conducted on larger datasets (more precisely, orders of magnitude larger than those used in earlier sections), as discussed next. One dataset was real, and three datasets were generated synthetically with varying degrees of skew to show how the skew may affect the scalability. Each record in both the real and synthetic data consisted of 5 columns. A synthetic data record with 5 columns may represent, for example, papers with fields such as first author, second author (if any), title, year, and venue; a 4-similar pair in this case can be two copies of the same paper with one mismatched field.
Near-uniform 40-60. This is a set of randomly generated 5-field records, with each field formed by concatenating two long integers (making a 64-bit field). 40% of the records are unique, and each of the remaining 60% has one similar pair.
Skewed 20-80. This is a set of randomly generated records with each field formed by concatenating two long integers. 20% of the records are unique, and each of the remaining 80% has 15 similar pairs. If each set of similar records is treated as an entity, then 20% of the entities make up 80% of the records.
Skewed 10-90. Similar to Skewed 20-80, except 10% of the entities (each described by a set of similar records) make up 90% of the records.
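One plausible way to generate such a skewed dataset is sketched below; this is an assumed reconstruction, not the authors' exact generator (`make_skewed` and its similarity model, agreement on 4 of 5 fields within a group, are illustrative):

```python
import random

def make_skewed(n, dup_frac, group, d=5, seed=0):
    """Generate n d-field records: a dup_frac fraction falls into
    groups of `group` pairwise-similar records (all members share a
    base on d-1 fields and differ in one varying field), the rest are
    unique random records."""
    rng = random.Random(seed)
    records = []
    n_dup = int(n * dup_frac)
    while len(records) < n_dup:
        base = [rng.getrandbits(64) for _ in range(d)]
        j = rng.randrange(d)                     # the one field that varies
        for _ in range(min(group, n_dup - len(records))):
            rec = list(base)
            rec[j] = rng.getrandbits(64)         # members agree on d-1 fields
            records.append(tuple(rec))
    while len(records) < n:                      # the unique remainder
        records.append(tuple(rng.getrandbits(64) for _ in range(d)))
    rng.shuffle(records)
    return records
```

With `group=16` and `dup_frac=0.8`, each duplicated record has 15 similar partners, matching the Skewed 20-80 description.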
YFCC. This is a set of 21 million records from the Flickr 100-million-photo dataset [30]. Each record in our case includes the following 5 fields: user id, date taken, the capturing device, the latitude and the longitude.
The experiments were conducted on a machine with an AMD quad-core 2.3 GHz CPU and 16GB of RAM, running Ubuntu. Both SJPC and random sampling were implemented in the C programming language and compiled using gcc.
For SJPC, the space usage was fixed for all datasets, with the sketch width and depth respectively set to 1000 and 3 as before, and the sampling ratio set to 0.5. For random sampling, we varied the sample size until random sampling could catch up with SJPC in terms of the absolute value of the relative error. That did not happen until the sample size passed a dataset-specific threshold for each of Near-uniform 40-60, Skewed 20-80 and Skewed 10-90. At those sample sizes, SJPC was always faster in our experiments on any dataset we tried that was larger than one million records. Figure 10 (left) shows the mean running time of SJPC over 10 runs, on Skewed 20-80 and YFCC, as the dataset size is varied. First, for each method, the running time on YFCC closely matched that on Skewed 20-80, and as a result the two are not distinguishable in the figure. Second, as expected, the running time of SJPC grows linearly with the input, whereas the running time of random sampling (with the sample size set to give an error not far from that of SJPC) increases quadratically with the input. Each run of random sampling on 8 million records took more than 4 days, and we could not run it for larger datasets. The absolute value of the relative error for both methods is also shown in Figure 10 (right). In terms of the space usage, random sampling requires at least an order of magnitude more space than SJPC, and the space usage of random sampling must increase with the data skewness for its error rate to keep up with that of SJPC.
7.6 Discussions
The objective of our experimental evaluation was to verify our analytical findings in more practical settings and to assess both the robustness and the performance of the SJPC algorithm, compared to competitors (when applicable). We evaluated our method on four real datasets (DBLP-5, DBLP-6, DBLP-titles and YFCC) and some synthetic data (Near-uniform and Skewed), showing that our algorithm outperforms the state-of-the-art methods from the literature (i.e., LSH-based bucketing in the offline case and random sampling in the online case) in terms of the accuracy of the estimates, the space usage and the running time. We experimented with different parameter settings of SJPC, showing that the algorithm is robust and that its performance can be managed with different parameters. Our evaluation also confirmed that the SJPC algorithm scales linearly with the input, making it suitable in settings where only one pass over the data is feasible. A limitation of the SJPC algorithm is that it does not scale so well to large data dimensionality; this is the price paid for the linear scaling with n, for example compared to random sampling.
8 Related Work
Our work relates to the areas of efficient similarity join, selectivity estimation, and sketching techniques.
Efficient similarity join. The problem of similarity join (of tuples) under Hamming distance can be mapped to a set similarity join where each tuple becomes a set, for which many algorithms have been developed (e.g., [31, 32]). A general and often efficient algorithm to evaluate a set similarity join is the index nested loop join, where the inner index returns a set of candidates and the outer loop filters those candidates before performing a pairwise comparison to produce the result. For example, all algorithms recently evaluated by Mann et al. [33] follow this framework and vary in their filtering and candidate generation steps. The time complexity of all these algorithms is quadratic in the input size. To reduce the cost, parallel set similarity joins have been studied using MapReduce [34] and with data represented as arrays [35].
Similarity join is also studied in the context of multi-dimensional points. A common approach is to associate points to cubes or cells and only join points with overlapping cells [36, 37]. EGO-based approaches use a combination of sort and divide operations to identify sets of points that cannot join [38]. These algorithms are quadratic for typical values of dimensions and similarity thresholds.
Selectivity estimation. Selectivity estimation has been an important component of query optimization, and accurate estimates often provide huge savings in cost. Although the problem is widely studied for relational operators with exact predicates (e.g., range predicates [39], substring queries [40], spatiotemporal queries [41], joins [42]) and despite its importance for similarity predicates (e.g. [43]), there has not been much study on estimating the selectivity of similarity predicates. Tata and Patel [44] study the problem in the context of Cosine predicates, discussing some of the difficulties. Hadjieleftheriou et al. [45] study the selectivity estimation for set similarity queries and show that more concise samples can be constructed from the inverted lists of tokens and also report on the performance of different sampling strategies. Lee et al. [22, 18] study the same problem as ours, and Heise et al. [46] use random sampling to estimate the sizes of clusters formed by fuzzy duplicates. We compare our work to both that of Lee et al. and random sampling, when applicable or appropriate.
Sketching techniques. As our work uses sketching to estimate the size of a subvalue stream, there is a sizable body of work on sketching techniques that are applicable. For example, instead of Fast-AGMS [47, 21], which is used in our experiments, Bloom filters may be extended to answer frequency-related queries including join and self-join size estimation [48, 49]. Rusu and Dobra [50] review and evaluate some of these sketches for join size estimation. The same authors also study the problem of sketching over samples and show that a speedup by factors of 10 is achievable without much decrease in accuracy [27]. Our sampling is slightly different in that we are sampling from the space of projections of each record.
Others. Deng et al. [51] study the problem of diversity analysis, where similar randomized techniques are used to estimate an average pairwise similarity. String similarity joins [52] may also be mapped to set similarity (for token-based measures) or Hamming similarity (for character-based measures), where join size estimation techniques will be useful. Our work may also be applicable in data cleaning and record deduplication settings (see Elmagarmid et al. [1] and Christen [53] for extensive surveys).
9 Conclusions and Future Directions
In this paper, we studied the problem of similarity selfjoin size estimation and presented a solution for efficiently finding an estimate within one pass over data. We analyzed the accuracy, time and space usage of our algorithm and experimentally evaluated it on both real and synthetic datasets. Our evaluation showed that the proposed algorithm has a relatively high accuracy (often an order of magnitude better than the competitors) and low time and space cost.
Our algorithm scales linearly with the input, and even larger input sizes can help with the accuracy, but it does not scale so well with the dimensionality, which is the price paid for the linear scale-up with n. Our method is readily applicable in cases where d, the dimensionality of the data (or the number of columns), is low, or the similarity threshold s is high, so that the number of subvalue streams does not explode. On the other hand, when the input has a large number of columns, it is often the case that a subset of the columns is selected in queries or analyzed (this has been the premise in some of the work on projected clustering [7] and detecting unique column combinations [54]). More studies are needed to understand the behaviour of our algorithm applied to high-dimensional data, and the conditions under which more accurate estimates can be obtained. In particular, one area is studying the conditions under which our work can be extended to higher dimensions. For example, one may decompose a table into smaller attribute groupings, and compute the similarity self-join size under each grouping before merging the results. Finding decompositions under which the similarity self-join size can be accurately estimated from that of the decomposed table is an interesting future direction. Another area is studying the problem and the proposed solution under some simplifying assumptions (e.g., on the data distribution) that allow tighter bounds to be obtained and/or a better understanding of the problem to be gained. One more interesting question is whether (the data structure or the estimate of) a similarity join size estimation can be part of a similarity join algorithm, possibly to speed up the join.
Acknowledgments
This research is supported by the Natural Sciences and Engineering Research Council of Canada.
References
 [1] A. Elmagarmid, P. Ipeirotis, and V. Verykios, “Duplicate record detection: a survey,” IEEE TKDE, vol. 19, no. 1, pp. 1–16, 2007.
 [2] R. Miller, “Open data integration,” PVLDB, vol. 11, no. 2, pp. 2130–2139, 2018.
 [3] S. Chaudhuri, V. Ganti, and K. Shriraghav, “Systems and methods for estimating functional relationships in a database,” Jul. 14 2009, US Patent 7,562,067.
 [4] D. Dewitt, J. Naughton, D. Schneider, and S. Seshadri, “Practical skew handling in parallel joins,” in Proc. of the VLDB Conf., 1992, p. 27.
 [5] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold, “Columnstore support for RDF data management: not all swans are white,” PVLDB, pp. 1553–1563, 2008.
 [6] Y. Kwon, K. Ren, M. Balazinska, and B. Howe, “Managing skew in hadoop,” IEEE Data Eng Bulletin, vol. 36, no. 1, pp. 24–33, March 2013.
 [7] C. Aggarwal, J. Han, J. Wang, and P. Yu, “On high dimensional projected clustering of data streams,” Data Min. Knowl. Discov., vol. 10, no. 3, pp. 251–273, 2005.
 [8] J. Bonwick, “(oracle) zfs deduplication,” http://blogs.oracle.com/bonwick/zfs_dedupv2, 2009.
 [9] S. Jeffery, “Payasyougo data cleaning and integration,” Ph.D. dissertation, University of California, Berkeley, 2008.
 [10] I. Good, “Surprise indexes and p-values,” Journal of Statistical Computation and Simulation, vol. 32, pp. 90–92, 1989.
 [11] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann, “How good are query optimizers, really?” Proceedings of the VLDB Endowment, vol. 9, no. 3, pp. 204–215, 2015.
 [12] G. Morales and A. Gionis, “Streaming similarity selfjoin,” PVLDB, pp. 792–803, 2016.
 [13] G. Manku, A. Jain, and A. Sarma, “Detecting nearduplicates for web crawling,” in Proc. of the WWW Conf., 2007, pp. 141–150.
 [14] K. Williams and C. Giles, “Near duplicate detection in an academic digital library,” in Proc. of the Document Eng. Conf., 2013, pp. 91–94.

 [15] Y. Ke, R. Sukthankar, and L. Huston, “An efficient part-based near-duplicate and sub-image retrieval system,” in Proc. of the ACM Multimedia Conf., 2004, pp. 869–876.
 [16] A. Broder, “On the resemblance and containment of documents,” in Proc. of the Compression and Complexity of Sequences Conf., 1997.
 [17] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and efficient fuzzy match for online data cleaning,” in Proc. of the SIGMOD Conf., 2003, pp. 313–324.
 [18] H. Lee, R. Ng, and S. Shim, “Similarity join size estimation using locality sensitive hashing,” in Proc. of the VLDB Conf., 2011, pp. 338–349.
 [19] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating the frequency moments,” Journal of Computer and System Sciences, vol. 58, no. 1, pp. 137–147, 1999.
 [20] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy, “Tracking join and selfjoin sizes in limited storage,” in Proc. of the PODS Conf., 1999, pp. 10–20.
 [21] G. Cormode and M. N. Garofalakis, “Sketching streams through the net: distributed approximate query tracking,” in Proc. of the VLDB Conf., 2005, pp. 13–24.
 [22] H. Lee, R. Ng, and S. Shim, “Powerlaw based estimation of set similarity join size,” in Proc. of the VLDB Conf., 2009, pp. 568–669.
 [23] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases,” in Proc. of the VLDB Conf., 1995, pp. 432–444.
 [24] H. Toivonen, “Sampling large databases for association rules,” in Proc. of the VLDB Conf., 1996, pp. 134–145.
 [25] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” in Proc. of the VLDB Conf., 2002, pp. 346–357.
 [26] A. Broder, “Some applications of rabin’s fingerprinting method,” in Sequences II: Methods in Communications, Security, and Computer Science, 1993, pp. 143–152.
 [27] F. Rusu and A. Dobra, “Sketching sampled data,” in Proc. of the ICDE Conf., 2009, pp. 381–391.
 [28] M. R. Henzinger, “Finding nearduplicate web pages: a largescale evaluation of algorithms.” in Proc. of the SIGIR Conf., 2006, pp. 284–291.
 [29] A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic clustering of the web,” Computer Networks, vol. 29, no. 813, pp. 1157–1166, 1997.
 [30] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
 [31] R. Bayardo, Y. Ma, and R. Srikant, “Scaling up all pairs similarity search,” in Proc. of the WWW Conf., 2007, pp. 131–140.
 [32] C. Xiao, W. Wang, X. Lin, and J. Yu, “Efficient similarity joins for nearduplicate detection,” ACM Trans. Database Syst., vol. 36, no. 3, p. 15, 2011.
 [33] W. Mann, N. Augsten, and P. Bouros, “An empirical evaluation of set similarity join techniques,” Proc. of the VLDB Endowment, vol. 9, no. 9, pp. 636–647, 2016.
 [34] R. Vernica, M. J. Carey, and C. Li, “Efficient parallel setsimilarity joins using mapreduce,” in Proc. of the SIGMOD Conf., 2010, pp. 495–506.
 [35] W. Zhao, F. Rusu, B. Dong, and K. Wu, “Similarity join over array data,” in Proc. of the SIGMOD Conf., 2016, pp. 2007–2022.
 [36] N. Koudas and K. C. Sevcik, “High dimensional similarity joins: Algorithms and performance evaluation,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 1, pp. 3–18, 2000.
 [37] E. H. Jacox and H. Samet, “Spatial join techniques,” ACM Transactions on Database Systems (TODS), vol. 32, no. 1, p. 7, 2007.
 [38] D. V. Kalashnikov, “Superego: fast multidimensional similarity join,” The VLDB Journal, vol. 22, no. 4, pp. 561–585, 2013.
 [39] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita, “Improved histograms for selectivity estimation of range predicates,” in Proc. of the SIGMOD Conf., 1996, pp. 294–305.
 [40] Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan, “Generalized substring selectivity estimation.” Journal of Computer and System Sciences, vol. 66, no. 1, pp. 98–132, 2003.
 [41] Y. Choi and C. Chung, “Selectivity estimation for spatiotemporal queries to moving objects,” in Proc. of the SIGMOD Conf., 2002, pp. 440–451.
 [42] L. Getoor, B. Taskar, and D. Koller, “Selectivity estimation using probabilistic models.” in Proc. of the SIGMOD Conf., 2001, pp. 461–472.
 [43] Y. Silva, W.G.Aref, and M. Ali, “Similarity groupby,” in Proc. of the ICDE Conf., 2009, pp. 904–915.

 [44] S. Tata and J. Patel, “Estimating the selectivity of tf-idf based cosine similarity predicates,” SIGMOD Record, vol. 36, no. 4, pp. 75–80, 2007.
 [45] M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava, “Hashed samples: selectivity estimators for set similarity selection queries,” in Proc. of the VLDB Conf., 2008, pp. 201–212.
 [46] A. Heise, G. Kasneci, and F. Naumann, “Estimating the number and sizes of fuzzy duplicate clusters,” in Proc. of the CIKM Conf., 2014, pp. 959–968.
 [47] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” Theoretical Computer Science, vol. 312, no. 1, 2004.
 [48] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.
 [49] F. Deng and D. Rafiei, “New estimation algorithms for streaming data: Count-min can do more,” http://www.cs.ualberta.ca/~drafiei/papers/cmm.pdf, 2007.
 [50] F. Rusu and A. Dobra, “Sketches for size of join estimation,” ACM Trans. Database Syst., vol. 33, no. 3, 2008.
 [51] F. Deng, S. Siersdorfer, and S. Zerr, “Efficient Jaccard-based diversity analysis of large document collections,” in Proc. of the CIKM Conf., 2011, pp. 1402–1411.
 [52] Y. Jiang, G. Li, J. Feng, and W.S. Li, “String similarity joins: An experimental evaluation,” Proc. of the VLDB Endowment, vol. 7, no. 8, pp. 625–636, 2014.
 [53] P. Christen, “A survey of indexing techniques for scalable record linkage and deduplication,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537–1555, 2012.
 [54] Z. Abedjan, J. Quiané-Ruiz, and F. Naumann, “Detecting unique column combinations on dynamic data,” in Proc. of the ICDE Conf., 2014, pp. 1036–1047.
 [55] E. Weisstein, “Binomial theorem,” from MathWorld–A Wolfram Web Resource, http://mathworld.wolfram.com/BinomialTheorem.html, 2015.
 [56] N. Weiss, A Course in Probability. Addison Wesley, 2005.
Appendix A Proofs
Lemma 1. Random sampling requires a sample of size to give an estimate of the similarity self-join size with a relative error less than with high probability.
Proof.
This is an adaptation of the proof of Lemma 2.3 in [20]. Let be an arbitrary similarity threshold. Construct two datasets and , each with records, such that no record in is similar to any other record for all values of , but has similar pairs of records and there is no other form of similarity between the records. A sampling-based estimate of the similarity self-join size for will be , and that for will also be , using samples of size with high probability. This is simply because the chance that a similar pair (not including self-pairs) makes it into the sample is , and the expected number of such pairs in a sample of size is , which is close to zero for large . However, the similarity self-join sizes for and are and respectively, and the estimate is off by a factor of with high probability. As another instance, suppose has records that are identical on columns and is as defined before. The s-similarity self-join size of is and that of is . However, the chance that one of those s-similar pairs is included in a sample of size is , and the chance that of them are included in a sample of size is
This probability is very close to zero for large values of or . That means random sampling will report, with high probability, an s-similarity self-join size of for both and . ∎
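The intuition behind this lower bound can be checked with a small simulation (an illustrative sketch only; the dataset size, duplicate count, sample size, and trial count below are arbitrary choices, not values from the paper). A dataset containing a few duplicated records has a noticeably larger self-join size than a fully distinct one, yet a small uniform sample almost never catches both copies of a duplicated record:

```python
import random

random.seed(0)
n, k, s, trials = 100_000, 100, 1_000, 200  # arbitrary toy parameters

# d2: n records in which the first k values each appear twice
# (k duplicate pairs); a fully distinct dataset d1 would have none.
d2 = list(range(k)) + list(range(n - k))

captured = 0
for _ in range(trials):
    sample = random.sample(d2, s)  # uniform sample without replacement
    seen = set()
    for v in sample:
        if v in seen and v < k:  # both copies of a duplicate were drawn
            captured += 1
        seen.add(v)

# The expected number of duplicate pairs per sample is about
# k * (s/n)^2 = 0.01, so almost every sample is indistinguishable
# from a sample drawn from the fully distinct dataset d1.
print(captured / trials)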
Theorem 1. The SJPC algorithm gives an unbiased estimate of the similarity self-join size under the offline scenario, i.e. , and the standard deviation of is at most
where is the estimate and is the true value.
Proof.
To find the variance of , we need the variance of (). Eq. 4 gives a recursive expression of as a function of and . First, we show that can be represented as a function of with the recursion removed. Second, we prove the unbiasedness of . Last, we derive an upper bound on the variance of , which allows us to bound the variance of . The details are as follows.
First, we prove by induction that
(10) 
where , and is a constant and hence not important in the expression of the variance. From Eq. 4, we can easily verify that Eq. 10 holds for and . Assuming Eq. 10 holds for an arbitrary , we want to prove that it holds for as well. From Eq. 4 we have
Using the induction hypothesis to replace , we have
If we change the indexes to the filled part in Figure 11(a) and denote the constants with , the right side becomes
It is easy to verify that .
Also
for (see Lemma 5 in the Appendix), hence
Eq. 10 holds for , and thus for all . Now the similarity self-join size, , can be rewritten as follows (with the replacement of indexes to the filled part of Figure 11(b) in the last step):
(11) 
Second, we show that SelfJoinPairCount gives an unbiased estimate. Let be the set of all similar record pairs, excluding self-pairs (i.e. a record joining itself), and let be the value that a similar record pair, denoted by , contributes to (the self-join size of the sub-value stream); then we have
(12) 
Note that , and are all random variables. The expected value of in the sample is
Therefore
(13) 
is the expected value that a similar pair contributes to , and is the true number of similar record pairs. From Eq. 4 we can see that SelfJoinPairCount removes the contributions of {}-similar pairs from , so it is not hard to verify that is an unbiased estimate of .
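The inverse-probability weighting behind this unbiasedness argument can be illustrated with a toy Monte Carlo check (a generic sketch of the expectation argument in Eqs. 12–13, not the SJPC algorithm itself; the Bernoulli sampling scheme and all constants below are assumptions made for the demonstration). Each similar pair survives sampling with probability q·q, so weighting each surviving pair by 1/(q·q) gives an estimate whose expectation is the true pair count:

```python
import random

random.seed(1)
n, k, q, trials = 2_000, 200, 0.1, 1_000  # arbitrary toy parameters

# Dataset with exactly k identical (hence "similar") record pairs,
# excluding self-pairs.
data = list(range(k)) + list(range(n - k))

estimates = []
for _ in range(trials):
    # Bernoulli sample: keep each record independently with prob q,
    # so both members of a pair survive with probability q * q.
    sample = [v for v in data if random.random() < q]
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    estimates.append(pairs / (q * q))  # inverse-probability weight

avg = sum(estimates) / trials
print(avg)  # fluctuates around the true pair count k = 200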
Last, we derive an upper bound on the variance of . Let denote the expected number of times a record appears in a sample at level , i.e. , and let be the probability that a similar record pair contributes to ; then the variance of is
The variance of can be written as
Two pairs and may or may not have a row in common. When they have no row in common, the covariance term is zero. Now suppose there is a row common to the two pairs, and denote the pairs by and . Since the projections of and are chosen independently and uniformly at random, the covariance term is zero even though the projection of is the same for both pairs. Hence the covariance terms can be ignored, and we have
Using Eq. 11, we can bound from above by