I Introduction
In many applications such as information retrieval, data cleaning, machine learning and user recommendation, an object (e.g., a document, image, web page or user) is described by a set of elements (e.g., words,
grams, and items). One of the most critical components in these applications is to define the set similarity between two objects and to develop the corresponding similarity query processing techniques. Given two records (objects), a variety of similarity functions/metrics have been identified in the literature for different scenarios (e.g., [28], [15]). Many indexing techniques have been developed to support efficient exact and approximate lookups and joins based on these similarity functions. Many of the set similarity functions studied are symmetric functions, i.e.,
sim(X, Y) = sim(Y, X), including the widely used Jaccard similarity and Cosine similarity. In recent years, much research attention has been given to asymmetric set similarity functions, which are more appropriate in some applications. Containment similarity (a.k.a. Jaccard containment similarity) is one of the representative asymmetric set similarity functions, where the similarity of a query record Q in a record X is defined as C(Q, X) = |Q ∩ X| / |Q|, in which |Q ∩ X| and |Q| are the intersection size of Q and X and the size of Q, respectively. Compared with symmetric similarity such as Jaccard similarity, containment similarity gives special consideration to the query size, which makes it more suitable in some applications. As shown in [35], containment similarity is useful in record matching applications. Consider the text descriptions of two restaurants represented by two "set of words" records X1 = {five, guys, burgers, and, fries, downtown, brooklyn, new, york} and X2 = {five, kitchen, berkeley}, respectively. Suppose the query Q is {five, guys}; the Jaccard similarity of Q and X1 (resp. X2) is 2/9 (resp. 1/4). Based on the Jaccard similarity, record X2 matches query Q better, but intuitively X1 should be a better choice. This is because the Jaccard similarity unnecessarily favors short records. On the other hand, the containment similarity leads to the desired order with C(Q, X1) = 1 and C(Q, X2) = 1/2. Containment similarity search can also support online error-tolerant search for matching user queries against addresses (map service) and products (product search). This is because regular keyword search is usually based on containment search, and containment similarity search provides a natural error-tolerant alternative [5]. In [44], Zhu et al. show that containment similarity search is essential in domain search, which enables users to effectively search Open Data.
Containment similarity is also of interest in applications that compute the fraction of values of one column that are contained in another column. In a dataset, the discovery of all inclusion dependencies is a crucial part of data profiling efforts, with many applications such as foreign-key detection and data integration (e.g., [22, 31, 8, 33, 30]).
Challenges. The problem of containment similarity search has been intensively studied in the literature in recent years (e.g., [5, 35, 44]). The key challenges of this problem come from the following three aspects. (i) The number of elements (i.e., the vocabulary size) may be very large. For instance, the vocabulary blows up quickly when higher-order shingles are used [35]. Moreover, a query and a record may contain many elements. To deal with the sheer volume of the data, it is desirable to use sketch techniques to provide effective and efficient approximate solutions. (ii) The data distribution (e.g., record size and element frequency) in real-life applications may be highly skewed. This may lead to poor performance in practice for data-independent sketch methods. (iii) A subtle difficulty of the approximate solution comes from the asymmetric property of the containment similarity. It is shown in [34] that there cannot exist any locality sensitive hashing (LSH) function family for containment similarity search.

To handle large-scale data and provide quick responses, most existing solutions for containment similarity search resort to approximate solutions. Although the use of LSH is restricted, a novel asymmetric LSH method has been designed in [34] to address the issue by padding techniques. Some enhancements of asymmetric LSH techniques have been proposed in follow-up works by introducing different functions (e.g., [35]). Observing that the performance of the existing solutions is sensitive to the skewness of the record size, Zhu et al. [44] propose a partition-based method built on the MinHash LSH function. By using an optimal partition strategy based on the size distribution of the records, the new approach can achieve a much better time-accuracy tradeoff.

We notice that all existing approximate solutions rely on LSH functions by transforming the containment similarity to the well-studied Jaccard similarity. That is, C(Q, X) = J(Q, X) · |Q ∪ X| / |Q|.
As the size of the query is usually readily available, the estimation error comes from the computation of the Jaccard similarity and the union size of Q and X. Note that although the union size can be derived from the Jaccard similarity [44], the large variance caused by the combination of the two estimations remains. This motivates us to use a different framework that transforms the containment similarity to
set intersection size estimation, so that the error is only contributed by the estimation of |Q ∩ X|. The well-known KMV sketch [11] has been widely used to estimate the set intersection size, and it can be immediately applied to our problem. However, this method is data-independent and hence cannot handle well the skewed distributions of record size and element frequency, which are common in real-life applications. Intuitively, records with larger size and elements with high frequency should be allocated more resources. In this paper, we theoretically show that the existing KMV-sketch technique cannot exploit these two aspects via simple heuristics, e.g., explicitly allocating more resource to records with large size. Consequently, we develop an augmented
KMV sketch to exploit both the record size distribution and the element frequency distribution for better space-accuracy and time-accuracy tradeoffs. Two techniques are proposed: (i) we impose a global threshold on the KMV sketch, namely the GKMV sketch, to achieve better estimation accuracy. As discussed in Section IV-A(2), this technique cannot be extended to MinHash LSH. (ii) We introduce an extra buffer for each record to take advantage of the skewness of the element frequency. A cost model is proposed to carefully choose the buffer size to optimize the accuracy for a given total space budget and data distribution.

Contributions. Our principal contributions are summarized as follows.

We propose a new augmented KMV sketch technique, namely GBKMV, for the problem of approximate containment similarity search. By imposing a global threshold and an extra buffer for the KMV sketches of the records, we significantly enhance the performance as the new method can better exploit the data distributions.

We provide theoretical underpinnings to justify the design of the GBKMV method. We also theoretically show that GBKMV outperforms the state-of-the-art technique LSHE in terms of accuracy under realistic assumptions on data distributions.

Our comprehensive experiments on reallife setvalued data from various applications demonstrate the effectiveness and efficiency of our proposed method.
Road Map. The rest of the paper is organized as follows. Section II presents the preliminaries. Section III introduces the existing solutions. Our approach, GBKMV sketch, is devised in Section IV. Extensive experiments are reported in Section V, followed by the related work in Section VI. Section VII concludes the paper.
II Preliminaries
In this section, we first formally present the problem of containment similarity search, then introduce some preliminary knowledge. In Table I, we summarize the important mathematical notations appearing throughout this paper.
Notation  Definition

a collection of records
record, query record
record size, query size
Jaccard similarity between query and set
containment similarity of query in set
Jaccard similarity threshold
the KMV signature (i.e., hash values) of record X
all hash values of the elements in record X
the buffer of record X
containment similarity threshold
sketch space budget, measured by the number of signatures (i.e., hash values or elements)
the global threshold for hash values
the buffer size (in bits) of the GBKMV sketch
number of records in the dataset
number of distinct elements in the dataset
II-A Problem Definition
In this paper, the element universe is E = {e1, e2, …, em}. Let S = {X1, X2, …, Xn} be a collection of records (sets), where each record Xi (1 ≤ i ≤ n) is a set of elements from E.
Before giving the definition of containment similarity, we first introduce the Jaccard similarity.
Definition 1 (Jaccard Similarity).
Given two records Q and X from S, the Jaccard similarity between Q and X is defined as the size of the intersection divided by the size of the union, which is expressed as

J(Q, X) = |Q ∩ X| / |Q ∪ X|.  (1)
Similar to the Jaccard similarity, the containment similarity (a.k.a Jaccard containment similarity) is defined as follows.
Definition 2 (Containment Similarity).
Given two records Q and X from S, the containment similarity of Q in X, denoted by C(Q, X), is the size of the intersection divided by the size of Q, which is formally defined as

C(Q, X) = |Q ∩ X| / |Q|.  (2)
Note that by replacing the union size |Q ∪ X| in Equation 1 with the size |Q|, we get the containment similarity. It is easy to see that the Jaccard similarity is symmetric while the containment similarity is asymmetric.
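To make the two definitions concrete, here is a direct Python rendering of Jaccard and containment similarity, replayed on the restaurant example from the introduction (the variable names q, x1, x2 are ours):

```python
def jaccard(q: set, x: set) -> float:
    """J(Q, X) = |Q ∩ X| / |Q ∪ X| -- symmetric in its arguments."""
    return len(q & x) / len(q | x)

def containment(q: set, x: set) -> float:
    """C(Q, X) = |Q ∩ X| / |Q| -- asymmetric: normalized by the query size."""
    return len(q & x) / len(q)

# The restaurant example: containment ranks X1 above X2, Jaccard does not.
x1 = {"five", "guys", "burgers", "and", "fries",
      "downtown", "brooklyn", "new", "york"}
x2 = {"five", "kitchen", "berkeley"}
q = {"five", "guys"}
assert containment(q, x1) == 1.0        # both query words occur in X1
assert containment(q, x2) == 0.5        # only "five" occurs in X2
assert jaccard(q, x2) > jaccard(q, x1)  # Jaccard favours the short record
```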
In this paper, we focus on the problem of containment similarity search which is to look up a set of records whose containment similarity towards a given query record is not smaller than a given threshold. The formal definition is as follows.
Definition 3 (Containment Similarity Search).
Given a query Q and a threshold t* on the containment similarity, search for all records X from a dataset S such that:

C(Q, X) ≥ t*.  (3)
Next, we give an example to show the problem of containment similarity search.
Example 1.
Fig. 1 shows a dataset with four records , , , , and the element universe is . Given a query and a containment similarity threshold , the records satisfying are , .
Problem Statement. In this paper, we investigate the problem of approximate containment similarity search. For a dataset with a large number of records, we aim to build a synopsis of the dataset such that it (i) can efficiently support containment similarity search with high accuracy, (ii) can handle large records, and (iii) has a compact index size.
II-B Minwise Hashing
Minwise hashing was proposed by Broder in [13, 14] for estimating the Jaccard similarity of two records X and Y. Let h be a hash function that maps the elements of X and Y to distinct integers, and define hmin(X) and hmin(Y) to be the minimum hash values of records X and Y, respectively. Assuming no hash collision, Broder [13] showed that the Jaccard similarity of X and Y
is the probability that the two minimum hash values are equal:
J(X, Y) = Pr[hmin(X) = hmin(Y)]. Applying k such different independent hash functions h1, …, hk to a record X (Y, resp.), the MinHash signature of X (Y, resp.) keeps the k minimum values hmin_i(X) (hmin_i(Y), resp.) for the k functions. Let 1(·) be the indicator function such that

1(hmin_i(X) = hmin_i(Y)) equals 1 if hmin_i(X) = hmin_i(Y), and 0 otherwise.  (4)

Then the Jaccard similarity between records X and Y can be estimated as

Ĵ = (1/k) Σ_{i=1..k} 1(hmin_i(X) = hmin_i(Y)).  (5)

Let J be the Jaccard similarity of sets X and Y; then the expectation of Ĵ is

E[Ĵ] = J,  (6)

and the variance of Ĵ is

Var[Ĵ] = J(1 − J)/k.  (7)
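The estimator in Equation 5 can be sketched in a few lines of Python; the salted SHA-1 hashing below stands in for the k independent hash functions and is an implementation choice of ours, not part of the original method:

```python
import hashlib

def h_i(i: int, element) -> int:
    """The i-th hash function, simulated by salting a SHA-1 digest."""
    d = hashlib.sha1(f"{i}:{element}".encode()).digest()
    return int.from_bytes(d[:8], "big")

def minhash_signature(record, k: int) -> list:
    """Keep the minimum hash value of the record under each of k functions."""
    return [min(h_i(i, e) for e in record) for i in range(k)]

def estimate_jaccard(sig_x: list, sig_y: list) -> float:
    """Eq. (5): the fraction of positions where the minima collide."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)
```

For example, with X = {0, …, 99} and Y = {50, …, 149} (true J = 1/3), k = 512 signatures give an estimate within a few percent, in line with the Var[Ĵ] = J(1 − J)/k bound of Equation 7.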
II-C KMV Sketch
The k minimum values (KMV) technique, introduced by Beyer et al. in [11], estimates the number of distinct elements in a large dataset. Given a no-collision hash function h which maps elements to the range (0, 1), a KMV synopsis of a record X, denoted by L_X, keeps the k minimum hash values of X. Then the number of distinct elements can be estimated by (k − 1)/U(k), where U(k) is the k-th smallest hash value. By h(X), we denote the hash values of all elements in the record X.
In [11], Beyer et al. also methodically analyse the problem of distinct element estimation under multiset operations. As for the union operation, consider two records X and Y with corresponding KMV synopses L_X and L_Y of sizes k_X and k_Y, respectively. In [11], L = L_X ⊕ L_Y represents the set consisting of the k smallest hash values in L_X ∪ L_Y, where

k = min(k_X, k_Y).  (8)

Then the KMV synopsis of X ∪ Y is L = L_X ⊕ L_Y. An unbiased estimator for the number of distinct elements in X ∪ Y, denoted by D̂_∪, is as follows:

D̂_∪ = (k − 1)/U(k).  (9)

For the intersection operation, the KMV synopsis is L = L_X ⊕ L_Y where k = min(k_X, k_Y). Let K = |{v ∈ L : v ∈ L_X ∩ L_Y}|, i.e., K is the number of common distinct hash values of X and Y within L. Then the number of distinct elements in X ∩ Y, denoted by D̂_∩, can be estimated as follows:

D̂_∩ = (K/k) · (k − 1)/U(k).  (10)
The variance of the intersection estimator, as shown in [11], is
(11) 
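The estimators (9)–(10) fit in a short sketch; the SHA-1-based hash into (0, 1) is our stand-in for the no-collision hash function:

```python
import hashlib

def h(element) -> float:
    """A (practically) collision-free hash into the range (0, 1)."""
    d = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv(record, k: int) -> list:
    """KMV synopsis L_X: the k smallest hash values of the record, sorted."""
    return sorted({h(e) for e in record})[:k]

def estimate_distinct(sketch: list) -> float:
    """(k - 1) / U(k), with U(k) the k-th smallest hash value."""
    return (len(sketch) - 1) / sketch[-1]

def estimate_intersection(lx: list, ly: list) -> float:
    """Eq. (10): (K / k) * (k - 1) / U(k), where L holds the k smallest
    values of L_X ∪ L_Y, k = min(k_X, k_Y), and K counts the values of L
    that appear in both L_X and L_Y."""
    k = min(len(lx), len(ly))
    union = sorted(set(lx) | set(ly))[:k]   # L = L_X (+) L_Y
    shared = set(lx) & set(ly)
    K = sum(1 for v in union if v in shared)
    return (K / k) * (k - 1) / union[-1]
```

For instance, with X = {0, …, 1999} and Y = {1000, …, 2999} and k = 512, the intersection estimate lands near the true value 1000.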
III Existing Solutions
In this section, we present the state-of-the-art technique for approximate containment similarity search, followed by a theoretical analysis of the limitations of the existing solution.
III-A LSH Ensemble Method
The LSH Ensemble technique, LSHE for short, is proposed by Zhu et al. in [44] to tackle the problem of approximate containment similarity search. The key idea is: (1) transform the containment similarity search into the well-studied Jaccard similarity search; and (2) partition the data by length and then apply the LSH forest [9] technique to each individual partition.
Similarity Transformation. Given a record X with size |X|, a query Q with size |Q|, containment similarity C(Q, X) and Jaccard similarity J(Q, X), the transformations back and forth are as follows:

J(Q, X) = C(Q, X)·|Q| / (|Q| + |X| − C(Q, X)·|Q|),   C(Q, X) = J(Q, X)·(|Q| + |X|) / ((1 + J(Q, X))·|Q|).  (12)
Given the containment similarity search threshold t* for the query Q, we can derive its corresponding Jaccard similarity threshold by Equation 12. A straightforward solution is to apply an existing approximate Jaccard similarity search technique to each individual record with the Jaccard similarity threshold (e.g., compute the Jaccard similarity between the query and a set based on their MinHash signatures). In order to take advantage of efficient indexing techniques (e.g., LSH forest [9]), LSHE partitions the dataset.
Data Partition. By partitioning the dataset according to the record size, LSHE can replace |X| in Equation 12 with its upper bound u (i.e., the largest record size in the partition) as an approximation. That is, for the given containment similarity threshold t* we have

J(Q, X) ≥ t*·|Q| / (|Q| + u − t*·|Q|).  (13)
The use of upper bound will lead to false positives. In [44]
, an optimal partition method is designed to minimize the total number of false positives brought by the use of the upper bound in each partition. By assuming that the record size distribution follows a power-law distribution and the similarity values are uniformly distributed, it is shown that the optimal partition can be achieved by ensuring each partition has an equal number of records (i.e., equal-depth partition).
Containment Similarity Search. For each partition of the data, LSHE applies a dynamic LSH technique (e.g., LSH forest [9]). Particularly, the records in a partition are indexed by a MinHash LSH with parameters (b, r), where b is the number of bands used by the LSH index and r is the number of hash values in each band. For the given query Q, the b and r values are carefully chosen by considering the corresponding number of false positives and false negatives regarding the existing records. Then the candidate records in each partition can be retrieved from the MinHash index according to the corresponding Jaccard similarity thresholds obtained by Equation 13. The union of the candidate records from all partitions is returned as the result of the containment similarity search.
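The per-partition threshold computation of Equation 13 amounts to a one-liner; u is the partition's largest record size:

```python
def jaccard_threshold(t_star: float, q_size: int, u: int) -> float:
    """Eq. (13): Jaccard threshold for a partition whose largest record
    size is u. Using the upper bound u instead of each |X| can only lower
    the threshold, so it introduces false positives but no false negatives."""
    return (t_star * q_size) / (q_size + u - t_star * q_size)

# A larger upper bound yields a looser (smaller) Jaccard threshold.
assert jaccard_threshold(0.5, 10, 100) < jaccard_threshold(0.5, 10, 10)
```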
III-B Analysis
One of LSHE’s advantages is that it converts the containment similarity problem into the Jaccard similarity search problem, which can be solved by the mature and efficient MinHash LSH method. Also, LSHE carefully considers the record size distribution and partitions the records by record size. In this sense, we say LSHE is a data-dependent method, and it is reported that LSHE significantly outperforms existing asymmetric LSH based solutions [34, 35] (i.e., data-independent methods) as LSHE can exploit the information of the data distribution by partitioning the dataset. However, this benefit is offset by the fact that the upper bound brings extra false positives, in addition to the error from the MinHash technique.
Below we theoretically analyse the performance of LSHE by studying the expectation and variance of its estimator.
Using the same notation as above, let J be the Jaccard similarity between query Q and set X and C be the containment similarity of Q in X. By Equation 5, given the MinHash signatures of Q and X respectively, an unbiased estimator Ĵ of the Jaccard similarity is the ratio of collisions in the signature, and the variance of Ĵ is J(1 − J)/k, where k is the signature size of each record. Then, by the transformation in Equation 12, the estimator of containment similarity by MinHash LSH is

Ĉ = Ĵ·(|Q| + |X|) / ((1 + Ĵ)·|Q|),  (14)

where |Q| and |X| are the sizes of Q and X. The estimator of containment similarity by LSHE is

Ĉ_u = Ĵ·(|Q| + u) / ((1 + Ĵ)·|Q|),  (15)

where u is the upper bound of |X|.
Next, we use Taylor expansions to approximate the expectation and variance of a function of one random variable [26]. We first give a lemma.

Lemma 1.
Given a random variable Z with expectation μ and variance σ², the expectation of f(Z) can be approximated as

E[f(Z)] ≈ f(μ) + f''(μ)·σ²/2,  (16)

and the variance of f(Z) can be approximated as

Var[f(Z)] ≈ (f'(μ))²·σ².  (17)
According to Equation 14, let f(Ĵ) = Ĵ·(|Q| + |X|) / ((1 + Ĵ)·|Q|), where E[Ĵ] = J and Var[Ĵ] = J(1 − J)/k. We can see that the estimator Ĉ is a function of the random variable Ĵ. Then, based on Lemma 1, the expectation and variance of Ĉ are approximated as

E[Ĉ] ≈ J·(|Q| + |X|) / ((1 + J)·|Q|) − ((|Q| + |X|) / ((1 + J)³·|Q|)) · J(1 − J)/k,  (18)

Var[Ĉ] ≈ ((|Q| + |X|)² / ((1 + J)⁴·|Q|²)) · J(1 − J)/k.  (19)

Similarly, the expectation and variance of the LSHE estimator can be approximated by replacing |X| with the upper bound u:

E[Ĉ_u] ≈ J·(|Q| + u) / ((1 + J)·|Q|) − ((|Q| + u) / ((1 + J)³·|Q|)) · J(1 − J)/k,  (20)

Var[Ĉ_u] ≈ ((|Q| + u)² / ((1 + J)⁴·|Q|²)) · J(1 − J)/k.  (21)
The computation details are in the technical report [41]. Since u is the upper bound of |X|, the variance of the LSHE estimator is larger than that of the MinHash LSH estimator. Also, by Equation 18 and Equation 20, we can see that both estimators are biased, and the LSHE method is quite sensitive to the setting of the upper bound by Equation 20. Because the presence of the upper bound inflates the estimator above the true value, the LSHE method favours recall while precision deteriorates. The larger the upper bound is, the worse the precision will be. Our empirical study shows that LSHE cannot achieve a good tradeoff between accuracy and space, compared with our proposed method.
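The bias introduced by the upper bound can be seen numerically. The snippet below inverts Equation 12; the values |Q| = 10, |X| = 50, u = 200, |Q ∩ X| = 5 are illustrative choices of ours:

```python
def containment_from_jaccard(j: float, q_size: int, x_size: int) -> float:
    """Invert Eq. (12): C = J * (|Q| + |X|) / ((1 + J) * |Q|)."""
    return j * (q_size + x_size) / ((1 + j) * q_size)

q, x, u = 10, 50, 200
j = 5 / (q + x - 5)      # Jaccard similarity when |Q ∩ X| = 5
true_c = 5 / q           # true containment C(Q, X) = 0.5

# With the true |X| the transformation is exact; with the partition
# upper bound u the estimate overshoots -- recall is preserved but
# precision degrades.
assert abs(containment_from_jaccard(j, q, x) - true_c) < 1e-9
assert containment_from_jaccard(j, q, u) > true_c
```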
IV Our Approach
In this section, we introduce an augmented KMV sketch technique to achieve a better space-accuracy tradeoff for approximate containment similarity search. Section IV-A briefly introduces the motivation and main techniques of our method, namely GBKMV. The detailed implementation is presented in Section IV-B, followed by extensive theoretical analysis in Section IV-C.
IV-A Motivation and Techniques
The key idea of our method is to propose a data-dependent indexing technique such that we can exploit the distribution of the data (i.e., the record size distribution and the element frequency distribution) for better performance of containment similarity search. We augment the existing KMV technique by introducing a global threshold for sample size allocation and a buffer for frequent elements, namely GBKMV, to achieve a better tradeoff between synopsis size and accuracy. Then we apply existing set similarity join/search indexing techniques to speed up the containment similarity search.
Below we outline the motivation of the key techniques used in this paper. Detailed algorithms and theoretical analysis will be introduced in Sections IV-B and IV-C, respectively.
(1) Directly Apply KMV Sketch
Given a query Q and a threshold t* on the containment similarity, the goal is to find the records X from dataset S such that

C(Q, X) = |Q ∩ X| / |Q| ≥ t*.  (22)

Applying some simple transformation to Equation 22, we get

|Q ∩ X| ≥ t*·|Q|.  (23)

Let T = t*·|Q|; then the containment similarity search problem is converted into finding the records X whose intersection size with the query is not smaller than T, i.e., |Q ∩ X| ≥ T.
Therefore, we can directly apply the KMV method introduced in Section II-C. Given the KMV signatures of a record X and a query Q, we can estimate their intersection size |Q ∩ X| according to Equation 10. Then the containment similarity of Q in X is immediately available given the query size |Q|. Below, we show an example of how to apply the KMV method to containment similarity search.
Example 2.
Fig. 2 shows the KMV sketch on the dataset in Example 1. Given the KMV signatures of () and (), we have , then the size KMV synopsis of is , the th smallest hash value is 0.33 and the size of the intersection of and within is . Then the intersection size of and is estimated as , and the containment similarity is . Then is returned if the given containment similarity threshold is .
Remark 1.
In [44], the size of the query Q is approximated by the MinHash signature of Q, where the KMV sketch can also serve the same purpose. However, the exact query size is used in their implementation for performance evaluation. Since in practice the query size is readily available, we assume the query size is given throughout the paper.
Optimization of KMV Sketch. Given a space budget B, we can keep a KMV signature (i.e., a set of minimal hash values) for each record such that the signature sizes sum to B. A natural question is how to allocate the resource (e.g., the setting of the signature sizes) to achieve the best overall estimation accuracy. Intuitively, more resources should be allocated to records with more frequent elements or larger record size, i.e., a larger signature for a record with larger size. However, Theorem 1 (Section IV-C2) suggests that the optimal resource allocation strategy in terms of estimation variance is to use the same signature size for each record. This is because the minimum of the two k values is used in Equation 8, and hence the best solution is to evenly allocate the resource. Thus, we have the KMV sketch based method for approximate containment similarity search: for the given budget B, we keep B/n minimal hash values for each record.
(2) Impose a Global Threshold to KMV Sketch (GKMV)
The above analysis on the optimal KMV sketch suggests an equal size allocation strategy; that is, each record is associated with a signature of the same size. Intuitively, we should assign more resources (i.e., signature size) to records with large size because they are more likely to appear in the results. However, the estimation accuracy of KMV for the intersection size of two sets is determined by the sketch with the smaller size, since we choose the minimum of the two sketch sizes in Equation 9; thus it is useless to give more resources to only one of the records. We further explain the reason behind this with the following example.
Before we introduce the global threshold to the KMV sketch, consider the KMV sketch shown in Fig. 2.
Example 3.
Suppose we have and . Although there are four hash values in , we can only consider the smallest hash values of by Equation 8, which is , and the th () minimum hash value used in Equation 9 is . We cannot use (i.e., =) to estimate because the th smallest hash value in may not be the th smallest hash value in ; the unseen rd smallest hash value of might be, for example, , which is smaller than . Recall that denotes the hash values of all elements in .
Nevertheless, if we know that all the hash values smaller than a global threshold, say , are kept for every record, we can safely use the th hash value of (i.e., ) for the estimation. This is because we can ensure the th smallest hash value in must be the th smallest hash value in .
Inspired by the above observation, we can carefully choose a global threshold (e.g., in the above example) for a given space budget B, and ensure that all hash values smaller than the threshold are kept in the KMV sketch of each record. By imposing a global threshold, we can identify a better (i.e., larger) k value for the estimation, compared with Equation 8.
Given a record X and a global threshold τ, the sketch of the record is obtained as L_X = {v ∈ h(X) : v < τ}, where h is the hash function. The sketch L_Q of the query Q is defined in the same way. In this paper, we say a KMV sketch is a GKMV sketch if we impose a global threshold to generate the KMV sketch. Then we set the k value of the KMV estimation as follows:

k = |L_Q ∪ L_X|.  (24)

Meanwhile, let K = |L_Q ∩ L_X|. Let U(k) be the k-th minimal hash value in L_Q ∪ L_X; then the overlap size of Q and X can be estimated as

D̂_∩ = (K/k) · (k − 1)/U(k).  (25)

Then the containment similarity of Q in X is

Ĉ = D̂_∩ / |Q|,  (26)
where |Q| is the query size. We remark that, as a byproduct, the global threshold favours records with large size, because all elements with hash values smaller than the threshold are kept for each record.
Below is an example on how to compute the containment similarity based on GKMV sketch.
Example 4.
Fig. 3 shows the KMV sketch of dataset in Example 1 with a global threshold . Given the signature of () and (), the KMV sketch of is , the th() smallest hash value is , and the size of intersection of and within is . Then the intersection size of and is estimated as , and the containment similarity is . Then is returned if the given containment similarity threshold is .
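A minimal sketch of the GKMV estimation (Equations 24–26); the SHA-1-based hash function and the threshold value used below are illustrative choices of ours:

```python
import hashlib

def h(element) -> float:
    """Hash into (0, 1), shared by all records."""
    d = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def gkmv(record, tau: float) -> list:
    """GKMV sketch: ALL hash values below the global threshold tau,
    so larger records automatically receive larger sketches."""
    return sorted(v for v in {h(e) for e in record} if v < tau)

def estimate_containment(lq: list, lx: list, q_size: int) -> float:
    """Eqs. (24)-(26). Because every value below tau is present in both
    sketches, the whole union is a valid KMV synopsis of Q ∪ X, so
    k = |L_Q ∪ L_X| instead of min(k_Q, k_X)."""
    union = sorted(set(lq) | set(lx))
    k = len(union)                          # Eq. (24)
    K = len(set(lq) & set(lx))              # common hash values
    d_cap = (K / k) * (k - 1) / union[-1]   # Eq. (25)
    return d_cap / q_size                   # Eq. (26)
```

For example, with Q = {0, …, 999}, X = {500, …, 1499} and tau = 0.2, the estimate is close to the true containment 0.5.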
Comparison with KMV. In Theorem 3 (Section IV-C4), we theoretically show that GKMV can achieve better accuracy than KMV.
Remark 2.
Note that the global threshold technique cannot be applied to MinHash based techniques. In MinHash LSH, the minimum hash values correspond to different independent hash functions, while in a KMV sketch, the k values are obtained under a single hash function. Thus we can only impose a global threshold under one common hash function, as in the KMV sketch based method.
(3) Use Buffer for KMV Sketch (GBKMV)
In addition to the skewness of the record size, it is also worthwhile to exploit the skewness of the element frequency. Intuitively, more resources should be assigned to high-frequency elements because they are more likely to appear in the records. However, due to the nature of the hash function used by the KMV sketch, the hash value of an element is independent of its frequency; that is, all elements have the same opportunity to contribute to the KMV sketch.
One possible solution is to divide the elements into multiple disjoint groups according to their frequency (e.g., low-frequency and high-frequency ones), and then apply a KMV sketch to each individual group. The intersection size between two records can be computed within each group and then summed up. However, our initial experiments suggest that this leads to poor accuracy because of the summation of the intersection size estimations. In Theorem 4 (Section IV-C5), our theoretical analysis suggests that the combination of estimated results is very likely to make the overall accuracy worse.
To avoid combining multiple estimation results, we use a bitmap buffer of size b for each record to exactly keep track of the most frequent elements. Then we apply the GKMV technique to the remaining elements, resulting in a new augmented sketch, namely GBKMV. Now we can estimate the intersection size by combining the intersection of the bitmap buffers (exact solution) with that of the KMV sketches (estimated solution).
As shown in Fig. 4, suppose we have and the global threshold for hash values is ; then the sketch of each record consists of two parts, the buffer and the GKMV sketch. That is, for each record we use a bitmap to keep the high-frequency elements, and then we store the remaining elements with hash values less than the global threshold.
Example 5.
Given the signatures of () and (), the intersection of the high-frequency part is with intersection size ; next we consider the GKMV part. Similar to Example 4, we compute the intersection of the part. The KMV sketch is . According to Equation 24, the th () smallest hash value is , and the size of the intersection of and within is . Then the intersection size of and in the part is estimated as ; together with the high-frequency part, the intersection size of and is estimated as and the containment similarity is . Then is returned if the given containment similarity threshold is .
Optimal Buffer Size. The key challenge is how to set the size of the bitmap buffer for the best expected performance of the GBKMV sketch. In Section IV-C6, we provide a theoretical analysis, which is verified in our performance evaluation.
Comparison with GKMV. GKMV is a special case of GBKMV with buffer size zero; since we carefully choose the buffer size with our cost model, the accuracy of GBKMV is no worse than that of GKMV.
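Putting the two parts together, a simplified GBKMV sketch might look as follows. This is our illustration, not the paper's Algorithm 1: in particular, the threshold tau is taken as given rather than derived from a space budget, and the buffer is kept as a plain set rather than a bitmap.

```python
import hashlib
from collections import Counter

def h(element) -> float:
    d = hashlib.sha1(str(element).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def build_gbkmv(dataset, b: int, tau: float):
    """Per record: (i) an exact buffer of its elements among the top-b
    most frequent in the dataset, (ii) a GKMV sketch of the rest."""
    freq = Counter(e for rec in dataset for e in rec)
    top = {e for e, _ in freq.most_common(b)}
    sketches = []
    for rec in dataset:
        buf = rec & top                                       # exact part
        low = sorted(v for v in {h(e) for e in rec - top} if v < tau)
        sketches.append((buf, low))
    return sketches

def estimate_containment(sq, sx, q_size: int) -> float:
    """|Q ∩ X| ≈ exact buffer overlap + GKMV estimate on the rest."""
    (buf_q, lq), (buf_x, lx) = sq, sx
    exact = len(buf_q & buf_x)
    union = sorted(set(lq) | set(lx))
    est = 0.0
    if len(union) > 1:
        k, K = len(union), len(set(lq) & set(lx))
        est = (K / k) * (k - 1) / union[-1]
    return (exact + est) / q_size
```

The exact buffer part contributes no estimation error, which is why skewed element frequencies make the buffer worthwhile.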
IV-B Implementation of GBKMV
In this section, we introduce the technical details of our proposed GBKMV method. We first show how to build the GBKMV sketch on the dataset and then present the containment similarity search algorithm.
GBKMV Sketch Construction. For each record, its GBKMV sketch consists of two components: (1) a buffer which exactly keeps the high-frequency elements; and (2) a GKMV sketch, which is a KMV sketch with a global threshold.
Algorithm 1 illustrates the construction of the GBKMV sketch. Let the element universe be E, where each element is associated with its frequency in the dataset. Line 1 calculates a buffer size b for all records based on the skewness of the record sizes and the element frequencies, as well as the space budget B in terms of elements; details will be introduced in Section IV-C6. The set of the top-b most frequent elements (Line 1) will be kept in the buffer of each record. For the remaining elements, Line 1 identifies the maximal possible global threshold such that the total size of the GBKMV sketch, i.e., the sum over all records of the buffer size and the number of elements with hash values below the threshold, meets the space budget B. The following lines then build the buffer and the GKMV sketch for every record. The correctness of our sketch is shown in Theorem 2.
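The threshold-selection step can be sketched as follows, assuming the (unbuffered) hash values of every record have been precomputed; this mirrors the budget constraint described above, not the exact pseudocode of Algorithm 1:

```python
def choose_tau(records_hashes, budget: int) -> float:
    """Largest global threshold such that the total number of stored
    hash values (those strictly below the threshold, summed over all
    records) does not exceed the space budget."""
    pooled = sorted(v for hashes in records_hashes for v in hashes)
    return pooled[budget] if budget < len(pooled) else 1.0

# With budget 3, the threshold 0.25 keeps exactly three values overall.
assert choose_tau([[0.1, 0.2, 0.3], [0.15, 0.25]], 3) == 0.25
```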
Containment Similarity Search. Given the GBKMV sketches of the query record Q and the dataset, we can conduct approximate similarity search as illustrated in Algorithm 2. Given a query Q with size |Q| and the similarity threshold t*, let T = t*·|Q| (Lines 1–2). With the GBKMV sketch, we can calculate the containment similarity based on

Ĉ = (|B_Q ∩ B_X| + D̂_∩) / |Q|,  (27)

where D̂_∩ is the estimation of the overlap size of Q and X outside the buffers, calculated by Equation 25 in Section IV-A, and |B_Q ∩ B_X| is the number of common elements of Q and X in the buffers B_Q and B_X.
Implementation of Containment Similarity Search. In our implementation, we use a bitmap of size b to keep the elements in the buffer, where each bit is reserved for one frequent element. We can use the bitwise intersection operator to efficiently compute the buffer overlap in Line 2 of Algorithm 2. Note that the estimator of the overlap size by the GKMV method in Equation 25 is (K/k)·(k − 1)/U(k). As to the computation of K, we apply some transformations to obtain an equivalent condition on the overlap size. Since it is an overlap size condition, we make use of PPjoin* [40] to speed up the search. Note that, in order to make PPjoin*, which is designed for the similarity join problem, applicable to the similarity search problem, we partition the dataset by record size, and in each partition we search for the records satisfying the overlap condition, where the overlap size threshold is modified by the lower bound in the corresponding partition.
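The bitmap intersection can be sketched as follows (bitmap_of and the element ordering are our illustration):

```python
def bitmap_of(buffered, top_elements) -> int:
    """Bit i is set iff the i-th most frequent element is in the buffer."""
    bits = 0
    for i, e in enumerate(top_elements):
        if e in buffered:
            bits |= 1 << i
    return bits

def buffer_overlap(bits_q: int, bits_x: int) -> int:
    """Exact overlap of the buffered parts: bitwise AND plus popcount."""
    return bin(bits_q & bits_x).count("1")

top = ["e1", "e2", "e3", "e4"]            # hypothetical top-4 elements
assert buffer_overlap(bitmap_of({"e1", "e3"}, top),
                      bitmap_of({"e3", "e4"}, top)) == 1
```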
Remark 3.
Note that the size-aware overlap set similarity join algorithm in [25] cannot be applied to our GBKMV method, because it would need to construct the subset inverted lists online for each incoming query, which results in very inefficient performance.
Processing Dynamic Data. Note that our algorithm can be modified to process dynamic data. Particularly, when new records arrive, we compute the new global threshold under the fixed space budget by Line 1 of Algorithm 1, and with the new global threshold, we maintain the sketch of each record as shown in Line 1 of Algorithm 1.
IV-C Theoretical Analysis
In this section, we provide theoretical underpinnings of the claims and observations in this paper.
IV-C1 Background
We need some reasonable assumptions on the record size distribution, the element frequency distribution and the query workload for a comprehensive analysis. The following are three popular assumptions widely used in the literature (e.g., [6, 29, 27, 18, 16, 44, 34]):

The element frequency in the dataset follows a power-law distribution, with .

The record size in the dataset follows a power-law distribution, with .

The query is randomly chosen from the records.
Throughout the paper, we use the variance to evaluate the goodness of an estimator. Regarding the KMV based sketch techniques (KMV, GKMV and GBKMV), we have
Lemma 2.
The variance in Equation 11 is monotonically decreasing with respect to the sketch size k used for estimation.
It is easy to verify the above lemma by calculating the derivative of Equation 11 with respect to the variable k. Thus, in the following analysis of KMV based sketch techniques, we use the k value (i.e., the sketch size used for estimation) to evaluate the goodness of the estimation: the larger, the better.
IV-C2 Optimal KMV Signature Scheme
In this part, we give an optimal resource allocation strategy for KMV sketch method in similarity search.
Theorem 1.
Given a space budget B, each set is associated with a KMV signature, and the signature sizes sum to B. For KMV sketch based containment similarity search, the optimal signature scheme is to keep the same number, B/n, of minimal hash values for each set.
Proof.
Given a query and a dataset, an optimal signature scheme for containment similarity search is to minimize the average variance between the query and the sets. Considering the query and a set with their respective KMV sketches, the sketch size used for estimation is the minimum of the two sizes according to Equation 8. By Lemma 2, an optimal signature scheme is to maximize the total k value; then we have the following optimization goal.
Rank the in increasing order; w.l.o.g., let be the sketch size sequence after reordering. Let be the first in the sequence such that ; then we have . In order to maximize , we set . Then by , we have . Note that , so we must have . Since the query is randomly selected from the dataset, it follows that all the are equal and . ∎
IV-C3 Correctness of GKMV Sketch
In this section, we show that the GKMV sketch is a valid KMV sketch.
Theorem 2.
Given two records and , let and be the GKMV sketches of and , respectively. Let , then the size KMV synopsis of is .
Proof.
We show that the above is a valid KMV sketch of . Let and let be the th smallest hash value in . In order to prove that is valid, we show that corresponds to the element with the th minimal hash value in . If not, there would exist an element such that and . Note that , then , thus is included in , which contradicts the above statement. ∎
IV-C4 GKMV: A Better KMV Sketch
In this part, we show that by imposing a global threshold on the KMV sketch, we can achieve better accuracy. Let and be the KMV sketches of and respectively. Let