GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

09/03/2018 ∙ by Yang Yang, et al. ∙ University of Technology Sydney ∙ UNSW

In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q ∩ X|/|Q|. Given a query record Q and a set of records S, containment similarity search finds the records in S whose containment similarity regarding Q is not less than a given threshold. This problem has many important applications in commercial and scientific fields such as record matching and domain search. Existing solutions rely on the asymmetric LSH method, transforming containment similarity to the well-studied Jaccard similarity. In this paper, we use a different framework that transforms containment similarity to set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and achieves a good trade-off between sketch size and accuracy. We provide a set of theoretical analyses to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on some real-life datasets.


I Introduction

In many applications such as information retrieval, data cleaning, machine learning and user recommendation, an object (e.g., document, image, web page or user) is described by a set of elements (e.g., words, q-grams, and items). One of the most critical components in these applications is to define the set similarity between two objects and to develop the corresponding similarity query processing techniques. Given two records (objects) Q and X, a variety of similarity functions/metrics have been identified in the literature for different scenarios (e.g., [28, 15]). Many indexing techniques have been developed to support efficient exact and approximate lookups and joins based on these similarity functions.

Many of the set similarity functions studied are symmetric, i.e., f(Q, X) = f(X, Q), including the widely used Jaccard similarity and Cosine similarity. In recent years, much research attention has been given to asymmetric set similarity functions, which are more appropriate in some applications. Containment similarity (a.k.a. Jaccard containment similarity) is one of the representative asymmetric set similarity functions, where the similarity between two records Q and X is defined as C(Q, X) = |Q ∩ X|/|Q|, in which |Q ∩ X| and |Q| are the intersection size of Q and X and the size of Q, respectively.

Compared with symmetric similarity such as Jaccard similarity, containment similarity gives special consideration to the query size, which makes it more suitable in some applications. As shown in [35], containment similarity is useful in record matching. Consider two text descriptions of restaurants X_1 and X_2 represented by two "set of words" records: {five, guys, burgers, and, fries, downtown, brooklyn, new, york} and {five, kitchen, berkeley}, respectively. Suppose the query Q is {five, guys}; the Jaccard similarity of Q and X_1 (resp. X_2) is 2/9 (resp. 1/4). Based on the Jaccard similarity, record X_2 matches query Q better, but intuitively X_1 should be the better choice. This is because the Jaccard similarity unnecessarily favors short records. The containment similarity, on the other hand, leads to the desired order with C(Q, X_1) = 1 and C(Q, X_2) = 1/2. Containment similarity search can also support online error-tolerant search for matching user queries against addresses (map services) and products (product search). This is because regular keyword search is usually based on containment search, and containment similarity search provides a natural error-tolerant alternative [5]. In [44], Zhu et al. show that containment similarity search is essential in domain search, which enables users to effectively search Open Data.

The containment similarity is also of interest in applications that compute the fraction of values of one column that are contained in another column. In a dataset, the discovery of all inclusion dependencies is a crucial part of data profiling efforts, with many applications such as foreign-key detection and data integration (e.g., [22, 31, 8, 33, 30]).

Challenges. The problem of containment similarity search has been intensively studied in recent years (e.g., [5, 35, 44]). The key challenges of this problem come from the following three aspects: (1) The number of elements (i.e., the vocabulary size) may be very large. For instance, the vocabulary blows up quickly when higher-order shingles are used [35]. Moreover, a query and a record may contain many elements. To deal with the sheer volume of the data, it is desirable to use sketch techniques to provide effective and efficient approximate solutions. (2) The data distribution (e.g., record size and element frequency) in real-life applications may be highly skewed. This may lead to poor performance in practice for data-independent sketch methods. (3) A subtle difficulty of the approximate solution comes from the asymmetric property of containment similarity. It is shown in [34] that there cannot exist any locality sensitive hashing (LSH) function family for containment similarity search.

To handle large scale data and provide quick responses, most existing solutions for containment similarity search resort to approximate solutions. Although the use of LSH is restricted, a novel asymmetric LSH method was designed in [34] to address the issue by padding techniques. Some enhancements of asymmetric LSH techniques were proposed in follow-up works by introducing different hash functions (e.g., [35]). Observing that the performance of the existing solutions is sensitive to the skewness of the record sizes, Zhu et al. [44] propose a partition-based method built on the MinHash LSH function. By using an optimal partition strategy based on the size distribution of the records, the new approach can achieve a much better time-accuracy trade-off.

We notice that all existing approximate solutions rely on LSH functions and transform the containment similarity to the well-studied Jaccard similarity. That is,

C(Q, X) = J(Q, X) × |Q ∪ X| / |Q|

As the size |Q| of the query is usually readily available, the estimation error comes from the computation of the Jaccard similarity J(Q, X) and the union size |Q ∪ X|. Note that although the union size can be derived from the Jaccard similarity [44], the large variance caused by the combination of two estimations remains. This motivates us to use a different framework that transforms containment similarity to set intersection size estimation, so that the error is contributed only by the estimation of |Q ∩ X|. The well-known KMV sketch [11] has been widely used to estimate set intersection sizes and can be immediately applied to our problem. However, this method is data-independent and hence cannot handle well the skewed distributions of record size and element frequency, which are common in real-life applications. Intuitively, records with larger size and elements with higher frequency should be allocated more resources. In this paper, we theoretically show that the existing KMV-sketch technique cannot accommodate these two perspectives by simple heuristics, e.g., explicitly allocating more resources to records with large size. Consequently, we develop an augmented KMV sketch to exploit both the record size distribution and the element frequency distribution for better space-accuracy and time-accuracy trade-offs. Two techniques are proposed: (1) we impose a global threshold on the KMV sketch, namely the G-KMV sketch, to achieve better estimation accuracy. As discussed in Section IV-A(2), this technique cannot be extended to MinHash LSH. (2) we introduce an extra buffer for each record to take advantage of the skewness of the element frequency. A cost model is proposed to carefully choose the buffer size to optimize the accuracy for a given total space budget and data distribution.

Contributions. Our principal contributions are summarized as follows.

  • We propose a new augmented KMV sketch technique, namely GB-KMV, for the problem of approximate containment similarity search. By imposing a global threshold and an extra buffer on the KMV sketches of the records, we significantly enhance the performance as the new method can better exploit the data distributions.

  • We provide theoretical underpinnings to justify the design of the GB-KMV method. We also theoretically show that GB-KMV outperforms the state-of-the-art technique LSH-E in terms of accuracy under realistic assumptions on the data distributions.

  • Our comprehensive experiments on real-life set-valued data from various applications demonstrate the effectiveness and efficiency of our proposed method.

Road Map. The rest of the paper is organized as follows. Section II presents the preliminaries. Section III introduces the existing solutions. Our approach, GB-KMV sketch, is devised in Section IV. Extensive experiments are reported in Section V, followed by the related work in Section VI. Section VII concludes the paper.

II Preliminaries

In this section, we first formally present the problem of containment similarity search, then introduce some preliminary knowledge. In Table I, we summarize the important mathematical notations appearing throughout this paper.

Notation | Definition
S | a collection of records
X, Q | a record, the query record
|X|, |Q| | record size of X, query size of Q
J(Q, X) | Jaccard similarity between query Q and record X
C(Q, X) | containment similarity of query Q in record X
s | Jaccard similarity threshold
K_X | the KMV signature (i.e., hash values) of record X
L_X | all hash values of the elements in record X
B_X | the buffer of record X
t* | containment similarity threshold
k | sketch space budget, measured by the number of signatures (i.e., hash values or elements)
τ | the global threshold for hash values
b | the buffer size (in bits) of the GB-KMV sketch
N | number of records in dataset S
m | number of distinct elements in dataset S
TABLE I: The summary of notations

II-A Problem Definition

In this paper, the element universe is E = {e_1, e_2, ..., e_m}. Let S be a collection of records (sets) {X_1, X_2, ..., X_N} where each X_i (1 ≤ i ≤ N) is a set of elements from E.

Before giving the definition of containment similarity, we first introduce the Jaccard similarity.

Definition 1 (Jaccard Similarity).

Given two records Q and X from S, the Jaccard similarity between Q and X is defined as the size of the intersection divided by the size of the union, which is expressed as

J(Q, X) = |Q ∩ X| / |Q ∪ X|    (1)

Similar to the Jaccard similarity, the containment similarity (a.k.a. Jaccard containment similarity) is defined as follows.

Definition 2 (Containment Similarity).

Given two records Q and X from S, the containment similarity of Q in X, denoted by C(Q, X), is the size of the intersection divided by the query size |Q|, which is formally defined as

C(Q, X) = |Q ∩ X| / |Q|    (2)

Note that by replacing the union size |Q ∪ X| in Equation 1 with the size |Q|, we get the containment similarity. It is easy to see that the Jaccard similarity is symmetric while the containment similarity is asymmetric.
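The asymmetry is easy to see on the motivating example from Section I. Below is a minimal Python illustration (the set contents are the restaurant example; this is for illustration only):

def jaccard(q, x):
    # Definition 1: |Q ∩ X| / |Q ∪ X|
    return len(q & x) / len(q | x)

def containment(q, x):
    # Definition 2: |Q ∩ X| / |Q|
    return len(q & x) / len(q)

q = {"five", "guys"}
x1 = {"five", "guys", "burgers", "and", "fries",
      "downtown", "brooklyn", "new", "york"}
x2 = {"five", "kitchen", "berkeley"}
print(jaccard(q, x1), jaccard(q, x2))          # 2/9 < 1/4: favours x2
print(containment(q, x1), containment(q, x2))  # 1.0 > 0.5: favours x1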

In this paper, we focus on the problem of containment similarity search, which is to look up the records whose containment similarity towards a given query record is not smaller than a given threshold. The formal definition is as follows.

Definition 3 (Containment Similarity Search).

Given a query Q and a threshold t* on the containment similarity, search for records X from a dataset S such that:

C(Q, X) ≥ t*    (3)

Next, we give an example of the problem of containment similarity search.

Example 1.

Fig. 1 shows a dataset with four records X_1, X_2, X_3, X_4 over the element universe E. Given a query Q and a containment similarity threshold t*, the answer consists of the two records X_i with C(Q, X_i) ≥ t*.

Fig. 1: A four-record dataset and a query Q; C(Q, X) is the containment similarity of Q in X

Problem Statement. In this paper, we investigate the problem of approximate containment similarity search. For a dataset with a large number of records, we aim to build a synopsis of the dataset such that it (1) can efficiently support containment similarity search with high accuracy, (2) can handle records of large size, and (3) has a compact index size.

II-B Minwise Hashing

Minwise hashing was proposed by Broder in [13, 14] for estimating the Jaccard similarity of two records Q and X. Let h be a hash function that maps the elements of Q and X to distinct integers, and define min(h(Q)) and min(h(X)) to be the minimum hash values of records Q and X, respectively. Assuming no hash collision, Broder [13] showed that the Jaccard similarity of Q and X is the probability that the two minimum hash values are equal:

J(Q, X) = Pr[min(h(Q)) = min(h(X))]

Applying n such independent hash functions h_1, ..., h_n to a record Q (X, resp.), the MinHash signature of Q (X, resp.) keeps the n values min(h_i(Q)) (min(h_i(X)), resp.) for the n functions. Let δ_i be the indicator function such that

δ_i = 1 if min(h_i(Q)) = min(h_i(X)), and δ_i = 0 otherwise    (4)

then the Jaccard similarity between records Q and X can be estimated as

Ĵ = (1/n) Σ_{i=1..n} δ_i    (5)

Let J be the Jaccard similarity of sets Q and X; then the expectation of Ĵ is

E[Ĵ] = J    (6)

and the variance of Ĵ is

Var[Ĵ] = J(1 − J)/n    (7)
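To make the estimator concrete, the following minimal Python sketch implements Equations 4 and 5. It is an illustration rather than the paper's implementation: seeded tuple hashing stands in for the n independent hash functions h_1, ..., h_n, and the records are the restaurant example from Section I.

def minhash_signature(record, n=256):
    # One minimum per hash function; hash((seed, e)) emulates the
    # i-th independent function h_i (consistent within one run).
    return [min(hash((seed, e)) for e in record) for seed in range(n)]

def estimate_jaccard(sig_q, sig_x):
    # Equation 5: the fraction of signature positions that collide.
    return sum(a == b for a, b in zip(sig_q, sig_x)) / len(sig_q)

q = {"five", "guys"}
x1 = {"five", "guys", "burgers", "and", "fries",
      "downtown", "brooklyn", "new", "york"}
print(estimate_jaccard(minhash_signature(q), minhash_signature(x1)))
# close to J(q, x1) = 2/9, with variance J(1 - J)/n (Equation 7)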

II-C KMV Sketch

The k minimum values (KMV) technique, introduced by Beyer et al. in [11], estimates the number of distinct elements in a large dataset. Given a collision-free hash function h which maps elements to the range [0, 1], a KMV synopsis of a record X, denoted by K_X, keeps the k minimum hash values of X. Then the number of distinct elements of X can be estimated by D̂ = (k − 1)/U_(k), where U_(k) is the k-th smallest hash value. By L_X, we denote the hash values of all elements in the record X.

In [11], Beyer et al. also methodically analyse the problem of distinct element estimation under multi-set operations. As for the union operation, consider two records X and Y with corresponding KMV synopses K_X and K_Y of sizes k_X and k_Y, respectively. In [11], K_X ⊕ K_Y represents the set consisting of the k smallest hash values in K_X ∪ K_Y, where

k = min(k_X, k_Y)    (8)

Then the KMV synopsis of X ∪ Y is K = K_X ⊕ K_Y. An unbiased estimator for the number of distinct elements in X ∪ Y, denoted by D̂_∪, is as follows.

D̂_∪ = (k − 1)/U_(k)    (9)

For the intersection operation, the KMV synopsis is K = K_X ⊕ K_Y where k = min(k_X, k_Y). Let k_∩ = |{v ∈ K : v ∈ K_X ∩ K_Y}|, i.e., k_∩ is the number of common distinct hash values of K_X and K_Y within K. Then the number of distinct elements in X ∩ Y, denoted by D_∩, can be estimated as follows.

D̂_∩ = (k_∩/k) × (k − 1)/U_(k)    (10)

The variance of D̂_∩ is derived in closed form in [11]; the property we rely on later (Lemma 2 in Section IV-C1) is that it decreases as k grows.
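The following Python sketch is an illustrative rendering of [11] (under the assumption that SHA-1 behaves like a collision-free hash onto (0, 1)); it implements the KMV synopsis and the estimators of Equations 8-10.

import hashlib
import heapq

def h(element):
    # Map an element to (0, 1); SHA-1 stands in for the ideal hash.
    digest = hashlib.sha1(str(element).encode()).hexdigest()
    return int(digest, 16) / 16**40

def kmv(record, k):
    # K_X: the k minimum hash values of the record (ascending).
    return heapq.nsmallest(k, (h(e) for e in set(record)))

def estimate_union_intersection(kx, ky):
    # Equations 8-10: k = min(k_X, k_Y); U_(k) is the k-th smallest
    # value in K_X (+) K_Y; k_cap counts shared hash values within it.
    k = min(len(kx), len(ky))
    merged = sorted(set(kx) | set(ky))[:k]
    u_k = merged[-1]
    k_cap = len(set(kx) & set(ky) & set(merged))
    d_union = (k - 1) / u_k                  # Equation 9
    d_inter = (k_cap / k) * (k - 1) / u_k    # Equation 10
    return d_union, d_inter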

III Existing Solutions

In this section, we present the state-of-the-art technique for the approximate containment similarity search, followed by theoretical analysis on the limits of the existing solution.

III-A LSH Ensemble Method

The LSH Ensemble technique, LSH-E for short, is proposed by Zhu et al. in [44] to tackle the problem of approximate containment similarity search. The key idea is to: (1) transform the containment similarity search to the well-studied Jaccard similarity search; and (2) partition the data by length and then apply the LSH forest [9] technique to each individual partition.

Similarity Transformation. Given a record X with size |X|, a query Q with size |Q|, containment similarity t = C(Q, X) and Jaccard similarity s = J(Q, X), the transformations back and forth are as follows.

s = t|Q| / (|Q| + |X| − t|Q|),    t = s(|Q| + |X|) / ((1 + s)|Q|)    (12)

Given the containment similarity search threshold t* for the query Q, we may compute its corresponding Jaccard similarity threshold by Equation 12. A straightforward solution is to apply an existing approximate Jaccard similarity search technique to each individual record with the Jaccard similarity threshold (e.g., compute the Jaccard similarity between the query and a record based on their MinHash signatures). In order to take advantage of efficient indexing techniques (e.g., LSH forest [9]), LSH-E partitions the dataset S.

Data Partition. By partitioning the dataset according to the record size, LSH-E can replace |X| in Equation 12 with its upper bound u (i.e., the largest record size in the partition) as an approximation. That is, for the given containment similarity threshold t* we have

s* = t*|Q| / (|Q| + u − t*|Q|)    (13)

The use of the upper bound will lead to false positives. In [44], an optimal partition method is designed to minimize the total number of false positives brought by the use of the upper bound in each partition. By assuming that the record size distribution follows a power-law distribution and that the similarity values are uniformly distributed, it is shown that the optimal partition can be achieved by ensuring each partition has an equal number of records (i.e., equal-depth partition).
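For concreteness, the two threshold transformations can be written as follows (a direct transcription of Equations 12 and 13; the function names are ours):

def jaccard_threshold(t_star, q_size, x_size):
    # Equation 12: exact Jaccard threshold for a record of known size.
    return t_star * q_size / (q_size + x_size - t_star * q_size)

def partition_jaccard_threshold(t_star, q_size, upper_bound):
    # Equation 13: LSH-E replaces |X| with the partition's upper bound
    # u, which lowers the threshold and hence admits false positives.
    return t_star * q_size / (q_size + upper_bound - t_star * q_size)

# |Q| = 10, threshold t* = 0.8, partition upper bound u = 200:
print(partition_jaccard_threshold(0.8, 10, 200))  # 8/202, about 0.04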

Containment Similarity Search. For each partition P of the data, LSH-E applies a dynamic LSH technique (e.g., LSH forest [9]). Particularly, the records in P are indexed by a MinHash LSH with parameters (b, r), where b is the number of bands used by the LSH index and r is the number of hash values in each band. For the given query Q, the b and r values are carefully chosen by considering the corresponding numbers of false positives and false negatives regarding the existing records. Then the candidate records in each partition are retrieved from the MinHash index according to the corresponding Jaccard similarity thresholds obtained by Equation 13. The union of the candidate records from all partitions is returned as the result of the containment similarity search.

III-B Analysis

One of LSH-E's advantages is that it converts the containment similarity problem into a Jaccard similarity search problem which can be solved by the mature and efficient MinHash LSH method. Also, LSH-E carefully considers the record size distribution and partitions the records by record size. In this sense, we say LSH-E is a data-dependent method, and it is reported that LSH-E significantly outperforms the existing asymmetric LSH based solutions [34, 35] (i.e., data-independent methods) as LSH-E can exploit the information of the data distribution by partitioning the dataset. However, this benefit is offset by the fact that the upper bound brings extra false positives, in addition to the error from the MinHash technique.

Below we theoretically analyse the performance of LSH-E by studying the expectation and variance of its estimator.

Using the same notation as above, let s = J(Q, X) be the Jaccard similarity between query Q and record X and t = C(Q, X) be the containment similarity of Q in X. By Equation 5, given the MinHash signatures of query Q and record X respectively, an unbiased estimator ŝ of the Jaccard similarity is the ratio of collisions in the signatures, and the variance of ŝ is s(1 − s)/n where n is the signature size of each record. Then, by the transformation in Equation 12, the estimator of the containment similarity by MinHash LSH is

t̂ = ŝ(|Q| + |X|) / ((1 + ŝ)|Q|)    (14)

where ŝ is the estimated Jaccard similarity. The estimator of the containment similarity by LSH-E is

t̂_E = ŝ(|Q| + u) / ((1 + ŝ)|Q|)    (15)

where u is the upper bound of |X| in the partition containing X.

Next, we use Taylor expansions to approximate the expectation and variance of a function of one random variable [26]. We first give a lemma.

Lemma 1.

Given a random variable x with expectation μ and variance σ², the expectation of f(x) can be approximated as

E[f(x)] ≈ f(μ) + f''(μ)σ²/2    (16)

and the variance of f(x) can be approximated as

Var[f(x)] ≈ (f'(μ))²σ²    (17)

According to Equation 14, let f(x) = x(|Q| + |X|)/((1 + x)|Q|) where x = ŝ. We can see that the estimator t̂ is a function of ŝ, with E[ŝ] = s and Var[ŝ] = s(1 − s)/n. Then, based on Lemma 1, the expectation and variance of t̂ are approximated as

E[t̂] ≈ t − (|Q| + |X|)/|Q| × s(1 − s)/(n(1 + s)³)    (18)
Var[t̂] ≈ ((|Q| + |X|)/|Q|)² × s(1 − s)/(n(1 + s)⁴)    (19)

Similarly, the expectation and variance of the LSH-E estimator t̂_E can be approximated as

E[t̂_E] ≈ s(|Q| + u)/((1 + s)|Q|) − (|Q| + u)/|Q| × s(1 − s)/(n(1 + s)³)    (20)
Var[t̂_E] ≈ ((|Q| + u)/|Q|)² × s(1 − s)/(n(1 + s)⁴)    (21)

The computation details are in the technical report [41]. Since u is the upper bound of |X|, the variance of the LSH-E estimator is larger than that of the MinHash LSH estimator. Also, by Equations 18 and 20, we can see that both estimators are biased and that the LSH-E method is quite sensitive to the setting of the upper bound u. Because the presence of the upper bound inflates the estimator beyond the true value, the LSH-E method favours recall while its precision deteriorates. The larger the upper bound is, the worse the precision will be. Our empirical study shows that LSH-E cannot achieve a good trade-off between accuracy and space, compared with our proposed method.
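The following snippet transcribes Equations 19 and 21 so that the two variances can be compared numerically; the sample values in the last line are arbitrary and for illustration only.

def var_minhash(s, n, q_size, x_size):
    # Equation 19: Taylor-approximated variance of the MinHash
    # containment estimator.
    c = (q_size + x_size) / q_size
    return c**2 * s * (1 - s) / (n * (1 + s)**4)

def var_lshe(s, n, q_size, upper_bound):
    # Equation 21: same expression with |X| replaced by the partition
    # upper bound u >= |X|, hence a larger variance.
    c = (q_size + upper_bound) / q_size
    return c**2 * s * (1 - s) / (n * (1 + s)**4)

# |Q| = 10, |X| = 50 in a partition with u = 200, signature size 128:
print(var_minhash(0.1, 128, 10, 50), var_lshe(0.1, 128, 10, 200))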

IV Our Approach

In this section, we introduce an augmented KMV sketch technique to achieve better space-accuracy trade-off for approximate containment similarity search. Section IV-A briefly introduces the motivation and main technique of our method, namely GB-KMV. The detailed implementation is presented in Section IV-B, followed by extensive theoretical analysis in Section IV-C.

IV-A Motivation and Techniques

The key idea of our method is to propose a data-dependent indexing technique such that we can exploit the distribution of the data (i.e., record size distribution and element frequency distribution) for better performance of containment similarity search. We augment the existing KMV technique by introducing a global threshold for sample size allocation and a buffer for frequent elements, namely GB-KMV, to achieve better trade-off between synopses size and accuracy. Then we apply the existing set similarity join/search indexing technique to speed up the containment similarity search.

Below we outline the motivation of the key techniques used in this paper. Detailed algorithms and theoretical analysis will be introduced in Section IV-B and IV-C, respectively.

(1) Directly Apply KMV Sketch

Given a query Q and a threshold t* on the containment similarity, the goal is to find the records X from dataset S such that

C(Q, X) = |Q ∩ X|/|Q| ≥ t*    (22)

Applying a simple transformation to Equation 22, we get

|Q ∩ X| ≥ t*|Q|    (23)

Let c* = t*|Q|; then the containment similarity search problem is converted into finding the records X whose intersection size with the query is not smaller than c*, i.e., |Q ∩ X| ≥ c*.

Therefore, we can directly apply the KMV method introduced in Section II-C. Given the KMV signatures of a record X and a query Q, we can estimate their intersection size |Q ∩ X| according to Equation 10. Then the containment similarity of Q in X is immediately available given the query size |Q|. Below, we show an example of how to apply the KMV method to containment similarity search.

Fig. 2: The KMV sketch of the dataset in Example 1; each signature consists of element-hash value pairs. k_X is the signature size of record X
Example 2.

Fig. 2 shows the KMV sketch of the dataset in Example 1. Given the KMV signatures K_Q of Q and K_X of a record X, we have k = min(k_Q, k_X); the size-k KMV synopsis of Q ∪ X is K_Q ⊕ K_X, whose k-th smallest hash value in Fig. 2 is U_(k) = 0.33. Let k_∩ be the number of common hash values of K_Q and K_X within K_Q ⊕ K_X. The intersection size of Q and X is then estimated by Equation 10 as (k_∩/k)(k − 1)/0.33, and the containment similarity follows after dividing by |Q|. X is returned if the estimate is not smaller than the given containment similarity threshold t*.

Remark 1.

In [44], the size of the query Q is approximated by the MinHash signature of Q; a KMV sketch can also serve the same purpose. However, the exact query size is used in their implementation for performance evaluation. Since in practice the query size is readily available, we assume the query size |Q| is given throughout the paper.

Optimization of KMV Sketch. Given a space budget k, we can keep a size-k_X KMV signature (i.e., the k_X minimal hash values) for each record X with Σ_{X∈S} k_X = k. A natural question is how to allocate the resource (i.e., the setting of the k_X values) to achieve the best overall estimation accuracy. Intuitively, more resources should be allocated to records with more frequent elements or larger record size, i.e., a larger k_X for a record with larger size. However, Theorem 1 (Section IV-C2) suggests that the optimal resource allocation strategy in terms of estimation variance is to use the same signature size for every record. This is because the minimum of the two k-values is used in Equation 8, and hence the best solution is to allocate the resource evenly. Thus, we have the KMV sketch based method for approximate containment similarity search: for the given budget k, we keep the k/N minimal hash values for each record X.

(2) Impose a Global Threshold to KMV Sketch (G-KMV)

The above analysis of the optimal KMV sketch suggests an equal size allocation strategy; that is, each record is associated with a signature of the same size. Intuitively, we should assign more resources (i.e., a larger signature size) to records with large size because they are more likely to appear in the results. However, the estimation accuracy of KMV for the intersection size of two sets is determined by the smaller sketch, since we choose k = min(k_Q, k_X) for the KMV signatures K_Q and K_X of Q and X in Equations 8 and 9; thus it is useless to give more resource to only one of the records. We further explain the reason with the following example.

Before we introduce the global threshold to the KMV sketch, consider the KMV sketch shown in Fig. 2.

Example 3.

Suppose we have k_Q = 2 and k_X = 2. Although there are four hash values in K_Q ∪ K_X, we can only consider the k = min(k_Q, k_X) = 2 smallest hash values by Equation 8, which form K_Q ⊕ K_X, and the k-th (k = 2) minimum hash value is the one used in Equation 9. We cannot use k = 4 (i.e., k = |K_Q ∪ K_X|) to estimate |Q ∪ X| because the 4-th smallest hash value in K_Q ∪ K_X may not be the 4-th smallest hash value in L_Q ∪ L_X: the unseen 3-rd smallest hash value of L_Q might be smaller than it, for example. Recall that L_Q denotes the hash values of all elements in Q.

Nevertheless, if we know that all the hash values smaller than a global threshold, say τ, are kept for every record, we can safely use the 4-th hash value of K_Q ∪ K_X for the estimation. This is because we can then ensure that the 4-th smallest hash value in K_Q ∪ K_X must be the 4-th smallest hash value in L_Q ∪ L_X.

Inspired by the above observation, we can carefully choose a global threshold τ for a given space budget k, and ensure that all hash values smaller than τ are kept in the KMV sketch of every record. By imposing a global threshold, we can identify a better (i.e., larger) k value to use for estimation, compared with Equation 8.

Given a record X and a global threshold τ, the sketch of X is obtained as K_X = {v ∈ L_X : v ≤ τ}, where L_X = h(X) and h is the hash function. The sketch K_Q of the query Q is defined in the same way. In this paper, we say a KMV sketch is a G-KMV sketch if we impose a global threshold when generating the KMV sketch. Then we set the k value of the KMV estimation as follows.

k = |K_Q ∪ K_X|    (24)

Meanwhile, we have K_Q ⊕ K_X = K_Q ∪ K_X. Let U_(k) be the k-th minimal hash value in K_Q ∪ K_X and k_∩ the number of common hash values of K_Q and K_X; then the overlap size of Q and X can be estimated as

Ê = (k_∩/k) × (k − 1)/U_(k)    (25)

Then the containment similarity of Q in X is estimated as

Ĉ(Q, X) = Ê / |Q|    (26)

where |Q| is the query size. We remark that, as a by-product, the global threshold favours records with large size because all elements with hash values smaller than τ are kept for each record.
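Continuing the running Python sketches (h() is the hash-to-(0, 1) function from the KMV example in Section II-C), a G-KMV sketch and the estimators of Equations 24-26 can be written as follows; this is an illustration, not the authors' implementation.

def gkmv(record, tau):
    # G-KMV: keep every hash value not larger than the global
    # threshold tau, so the sketch size adapts to the record size.
    return sorted(v for v in (h(e) for e in set(record)) if v <= tau)

def estimate_containment_gkmv(kq, kx, q_size):
    # Equations 24-26: with a shared threshold, K_Q (+) K_X is simply
    # K_Q ∪ K_X, so k = |K_Q ∪ K_X| and no kept hash value is wasted.
    union = sorted(set(kq) | set(kx))
    k = len(union)
    if k < 2:
        return 0.0  # too few hash values below tau to estimate
    u_k = union[-1]
    k_cap = len(set(kq) & set(kx))
    overlap = (k_cap / k) * (k - 1) / u_k   # Equation 25
    return overlap / q_size                 # Equation 26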

Fig. 3: The G-KMV sketch of the dataset in Example 1 with hash value threshold τ

Below is an example of how to compute the containment similarity based on the G-KMV sketch.

Example 4.

Fig. 3 shows the G-KMV sketch of the dataset in Example 1 with a global threshold τ. Given the signatures K_Q of Q and K_X of a record X, the KMV sketch of Q ∪ X is K_Q ∪ K_X, whose k-th (k = |K_Q ∪ K_X|) smallest hash value is U_(k), and k_∩ is the number of common hash values of K_Q and K_X. Then the intersection size of Q and X is estimated as (k_∩/k)(k − 1)/U_(k) by Equation 25, and the containment similarity follows from Equation 26. X is returned if the estimate is not smaller than the given containment similarity threshold t*.

Correctness of G-KMV sketch. Theorem 2 in Section IV-C3 shows the correctness of the G-KMV sketch.

Comparison with KMV. In Theorem 3 (Section IV-C4), we theoretically show that G-KMV can achieve better accuracy compared with KMV.

Remark 2.

Note that the global threshold technique cannot be applied to MinHash based techniques. In MinHash LSH, the minimum hash values correspond to different independent hash functions, while in a KMV sketch, the k-value sketch is obtained under one hash function. Thus we can only impose a global threshold on the single hash function of the KMV sketch based method.

(3) Use Buffer for KMV Sketch (GB-KMV)

In addition to the skewness of the record size, it is also worthwhile to exploit the skewness of the element frequency. Intuitively, more resources should be assigned to high-frequency elements because they are more likely to appear in the records. However, due to the nature of the hash function used by the KMV sketch, the hash value of an element is independent of its frequency; that is, all elements have the same opportunity to contribute to the KMV sketch.

One possible solution is to divide the elements into multiple disjoint groups according to their frequency (e.g., low-frequency and high-frequency ones), and then apply a KMV sketch to each individual group. The intersection size between two records Q and X can be computed within each group and the results summed up. However, our initial experiments suggest that this leads to poor accuracy because of the summation of the per-group intersection size estimations. In Theorem 4 (Section IV-C5), our theoretical analysis suggests that the combination of estimated results is very likely to make the overall accuracy worse.

To avoid combining multiple estimation results, we use a bitmap buffer of size b for each record to exactly keep track of the b most frequent elements, denoted by E_b. Then we apply the G-KMV technique to the remaining elements, resulting in a new augmented sketch, namely GB-KMV. Now we can estimate |Q ∩ X| by combining the intersection of the bitmap buffers (exact solution) with that of the KMV sketches (estimated solution).

As shown in Fig. 4, given the set E_b of high-frequency elements and the global threshold τ for hash values, the sketch of each record X consists of two parts, B_X and K_X; that is, for each record we use a bitmap B_X to keep the elements of X among the high-frequency elements E_b, and then we store the remaining elements of X whose hash values are less than τ in K_X.

Fig. 4: The GB-KMV sketch of the dataset in Example 1
Example 5.

Given the GB-KMV sketches (B_Q, K_Q) of Q and (B_X, K_X) of a record X, the intersection of the high-frequency parts is B_Q ∩ B_X, whose size is computed exactly; next we consider the G-KMV part. As in Example 4, we compute the intersection of the K parts: the KMV sketch of the union is K_Q ∪ K_X; according to Equation 24, its k-th (k = |K_Q ∪ K_X|) smallest hash value is U_(k), and k_∩ is the number of common hash values of K_Q and K_X. Then the intersection size of Q and X in the G-KMV part is estimated as (k_∩/k)(k − 1)/U_(k); together with the high-frequency part, the intersection size of Q and X is estimated as |B_Q ∩ B_X| + (k_∩/k)(k − 1)/U_(k), and the containment similarity is this value divided by |Q|. X is returned if the estimate is not smaller than the given containment similarity threshold t*.
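The combined estimate (formalized as Equation 27 in Section IV-B) can be sketched in Python as follows, reusing gkmv() from the G-KMV example above; frequent stands for the high-frequency element set E_b.

def gbkmv_sketch(record, frequent, tau):
    # B_X: the record's high-frequency elements, kept exactly;
    # K_X: a G-KMV sketch over the remaining elements.
    return set(record) & frequent, gkmv(set(record) - frequent, tau)

def estimate_containment_gbkmv(sketch_q, sketch_x, q_size):
    # Exact overlap on the buffers plus the G-KMV estimate.
    (bq, kq), (bx, kx) = sketch_q, sketch_x
    exact = len(bq & bx)
    est = 0.0
    union = sorted(set(kq) | set(kx))
    if len(union) > 1:
        k, u_k = len(union), union[-1]
        k_cap = len(set(kq) & set(kx))
        est = (k_cap / k) * (k - 1) / u_k   # Equation 25
    return (exact + est) / q_size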

Optimal Buffer Size. The key challenge is how to set the size b of the bitmap buffer for the best expected performance of the GB-KMV sketch. In Section IV-C6, we provide a theoretical analysis, which is verified in our performance evaluation.

Comparison with G-KMV. As G-KMV is a special case of GB-KMV with buffer size b = 0 and we carefully choose the buffer size with our cost model, the accuracy of GB-KMV is not worse than that of G-KMV.

Comparison with LSH-E. Through theoretical analysis, we show in Theorem 5 (Section IV-C7) that the performance (i.e., the variance of the estimator) of GB-KMV always outperforms that of LSH-E.

IV-B Implementation of GB-KMV

In this section, we introduce the technical details of our proposed GB-KMV method. We first show how to build the GB-KMV sketch of the dataset and then present the containment similarity search algorithm.

GB-KMV Sketch Construction. For each record X, its GB-KMV sketch consists of two components: (1) a buffer B_X which exactly keeps the high-frequency elements of X; and (2) a G-KMV sketch K_X, which is a KMV sketch with a global threshold value τ.

Algorithm 1: GB-KMV Index Construction
Input: S: dataset; k: space budget; h: a hash function
Output: the GB-KMV index of dataset S
1: compute the buffer size b based on the distribution statistics of S and the space budget k
2: E_b ← the b most frequent elements
3: compute the global threshold τ for hash values
4: for each record X in S do
5:   B_X ← elements of X in E_b
6:   K_X ← hash values of the elements of X \ E_b with h(e) ≤ τ

Algorithm 1 illustrates the construction of the GB-KMV sketch. Let the element universe be E, where each element is associated with its frequency in dataset S. Line 1 calculates a buffer size b for all records based on the skewness of the record sizes and the element frequencies as well as the space budget k (measured in elements); details are introduced in Section IV-C6. We use E_b to denote the set of the top-b most frequent elements (Line 2), which will be kept in the buffer of each record. Line 3 identifies the maximal possible global threshold τ for the elements in E \ E_b such that the total size of the GB-KMV sketch meets the space budget k: for each record X, let m_X denote the number of elements in X \ E_b with hash values not larger than τ; we need Σ_{X∈S}(|B_X| + m_X) ≤ k. Then Lines 4-6 build the buffer and the G-KMV sketch for every record X. In Section IV-C3, we show the correctness of our sketch (Theorem 2).

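A simplified Python rendering of Algorithm 1 follows (reusing h() and gbkmv_sketch() from the earlier examples). It takes the buffer size b as given rather than deriving it from the cost model of Section IV-C6, and it ignores hash-value ties when picking τ.

from collections import Counter

def build_gbkmv_index(dataset, budget, b):
    freq = Counter(e for record in dataset for e in record)
    e_b = {e for e, _ in freq.most_common(b)}              # Line 2
    # Budget left for the G-KMV parts once every buffer is filled.
    remaining = budget - sum(len(set(r) & e_b) for r in dataset)
    # Line 3: each (record, element) pair with h(e) <= tau costs one
    # slot, so the largest feasible tau is the remaining-th smallest
    # such hash value.
    costs = sorted(h(e) for r in dataset for e in set(r) - e_b)
    if remaining <= 0:
        tau = 0.0
    elif remaining >= len(costs):
        tau = 1.0
    else:
        tau = costs[remaining - 1]
    return e_b, tau, [gbkmv_sketch(r, e_b, tau) for r in dataset]  # Lines 4-6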

Containment Similarity Search. Given the GB-KMV sketches of the query record Q and of the dataset S, we can conduct approximate containment similarity search as illustrated in Algorithm 2. Given a query Q with size |Q| and the similarity threshold t*, let c* = t*|Q| (Lines 1-2). With the GB-KMV sketch, we calculate the containment similarity based on

Ĉ(Q, X) = (|B_Q ∩ B_X| + Ê) / |Q|    (27)

where Ê is the estimate of the overlap size of K_Q and K_X, calculated by Equation 25 in Section IV-A.

Note that |B_Q ∩ B_X| is the number of common elements of Q and X in E_b.

Algorithm 2: Containment Similarity Search
Input: Q: a query set; t*: containment similarity threshold
Output: records X with C(Q, X) ≥ t*
1: R ← ∅
2: c* ← t* × |Q|
3: for each record X in S do
4:   if Ĉ(Q, X) ≥ t* then R ← R ∪ {X}
5: return R

Implementation of Containment Similarity Search. In our implementation, we use a bitmap of size b to keep the elements in the buffer, where each bit is reserved for one frequent element. We can use the bitwise AND operator to efficiently compute |B_Q ∩ B_X| in Algorithm 2. Note that the estimator of the overlap size by the G-KMV method in Equation 25 is Ê = (k_∩/k)(k − 1)/U_(k), where k_∩ = |K_Q ∩ K_X| is itself the overlap size of two sets of hash values. Since k_∩ is an overlap size, we make use of PPjoin* [40] to speed up the search. Note that in order to make PPjoin*, which is designed for the similarity join problem, applicable to the similarity search problem, we partition the dataset by record size, and in each partition we search for the records satisfying the overlap condition, where the overlap size threshold is derived from the size lower bound of the corresponding partition.
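The buffer intersection can be computed with one bitwise AND when the buffers are packed into machine words. Below is a minimal version using Python integers as bitmaps; frequent_order, mapping each element of E_b to a bit position, is an assumed precomputed dictionary.

def buffer_bitmap(buffer_elements, frequent_order):
    # Pack B_X into an integer bitmap: one bit per element of E_b.
    bits = 0
    for e in buffer_elements:
        bits |= 1 << frequent_order[e]
    return bits

def buffer_overlap(bits_q, bits_x):
    # |B_Q ∩ B_X| via one bitwise AND plus a population count.
    return bin(bits_q & bits_x).count("1")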

Remark 3.

Note that the size-aware overlap set similarity join algorithm in [25] cannot be applied to our GB-KMV method, because it would need to construct c*-subset inverted lists online for each incoming query, which results in very inefficient performance.

Processing Dynamic Data. Note that our algorithm can be adapted to process dynamic data. Particularly, when new records arrive, we compute the new global threshold τ under the fixed space budget k as in Line 3 of Algorithm 1, and with the new global threshold we maintain the sketch of each record as shown in Lines 4-6 of Algorithm 1.

IV-C Theoretical Analysis

In this section, we provide theoretical underpinnings of the claims and observations in this paper.

IV-C1 Background

We need some reasonable assumptions on the record size distribution, the element frequency distribution and the query workload for a comprehensive analysis. The following are three popular assumptions widely used in the literature (e.g., [6, 29, 27, 18, 16, 44, 34]):

  • The element frequency in the dataset follows a power-law distribution.

  • The record size in the dataset follows a power-law distribution.

  • The query is randomly chosen from the records.

Throughout the paper, we use the variance to evaluate the goodness of an estimator. Regarding the KMV based sketch techniques (KMV, G-KMV and GB-KMV), we have

Lemma 2.

In KMV sketch based methods, the larger the k value used in Equation 8 and Equation 24 is, the smaller the variance will be.

It is easy to verify the above lemma by calculating the derivative of the variance of the KMV intersection estimator [11] with respect to the variable k. Thus, in the following analysis of KMV based sketch techniques, we use the k value (i.e., the sketch size used for estimation) to evaluate the goodness of the estimation: the larger, the better.

IV-C2 Optimal KMV Signature Scheme

In this part, we give an optimal resource allocation strategy for the KMV sketch method in similarity search.

Theorem 1.

Given a space budget k, each set X is associated with a size-k_X KMV signature and Σ_{X∈S} k_X = k. For KMV sketch based containment similarity search, the optimal signature scheme is to keep the k/N minimal hash values for each set X.

Proof.

Given a query Q and a dataset S, an optimal signature scheme for containment similarity search is one that minimizes the average variance of the estimates between Q and the sets X ∈ S. Considering the query Q and a set X with a size-k_Q sketch and a size-k_X sketch respectively, the sketch size used for estimation is min(k_Q, k_X) according to Equation 8. By Lemma 2, an optimal signature scheme maximizes the total k value, say T; we thus have the following optimization goal:

maximize T = Σ_{X∈S} min(k_Q, k_X)  subject to  Σ_{X∈S} k_X = k

Rank the k_X in increasing order; w.l.o.g., let k_1 ≤ k_2 ≤ ... ≤ k_N be the sketch size sequence after reordering, and let k_j be the first element in the sequence such that k_j ≥ k_Q. Then T = Σ_{i<j} k_i + (N − j + 1)k_Q, so any budget assigned to a record beyond k_Q is wasted. Since the query Q is randomly selected from the dataset S, every record plays the role of the query with equal probability; the allocation that maximizes the expected T therefore makes all the k_X equal, i.e., k_X = k/N. ∎
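As a small numerical sanity check of Theorem 1 (not part of the proof), one can enumerate all allocations of a toy budget and confirm that the even allocation maximizes the total usable sketch size when the query is drawn from the dataset:

from itertools import product

def total_usable(alloc):
    # Sum over all query/record pairs of min(k_Q, k_X), i.e. the total
    # usable sketch size when the query is one of the records.
    return sum(min(kq, kx) for kq in alloc for kx in alloc)

budget, n = 12, 3
best = max((a for a in product(range(1, budget + 1), repeat=n)
            if sum(a) == budget), key=total_usable)
print(best)  # (4, 4, 4): the even allocation wins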

IV-C3 Correctness of G-KMV Sketch

In this section, we show that the G-KMV sketch is a valid KMV sketch.

Theorem 2.

Given two records X and Y, let K_X and K_Y be the G-KMV sketches of X and Y, respectively. Let k = |K_X ∪ K_Y|; then the size-k KMV synopsis of X ∪ Y is K_X ∪ K_Y.

Proof.

We show that the above K = K_X ∪ K_Y is a valid KMV sketch of X ∪ Y. Let k = |K| and let v be the k-th smallest hash value in K. In order to prove that K is valid, we show that v corresponds to the element with the k-th minimal hash value in L_X ∪ L_Y. If not, there should exist an element e of X ∪ Y with e ∉ K such that h(e) < v. Note that v ≤ τ; then h(e) < τ, thus e is included in K_X ∪ K_Y = K, which contradicts the above statement. ∎

IV-C4 G-KMV: A Better KMV Sketch

In this part, we show that by imposing a global threshold on the KMV sketch, we can achieve better accuracy. Let K_X and K_Y be the KMV sketches of X and Y, respectively. Let