GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

09/03/2018
by Yang Yang, et al.

In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q ∩ X| / |Q|. Given a query record Q and a set of records S, containment similarity search finds the records in S whose containment similarity with respect to Q is not less than a given threshold. This problem has many important applications in commercial and scientific fields, such as record matching and domain search. The existing solution relies on the asymmetric LSH method, which transforms containment similarity into the well-studied Jaccard similarity. In this paper, we use a different framework that transforms containment similarity into set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and achieves a good trade-off between sketch size and accuracy. We provide theoretical analysis to underpin the proposed augmented KMV sketch technique and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on some real-life datasets.
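
To make the definitions above concrete, the following Python sketch first computes containment similarity exactly and then estimates it from a plain KMV (k minimum values) sketch by estimating |Q ∩ X| and dividing by the known |Q|. This is not the GB-KMV method itself, whose augmentation is not described in the abstract. The function names, the SHA-1 based hash, and the use of the standard KMV intersection estimator (K∩/k) · (k − 1)/U(k) are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def containment(q: Set[str], x: Set[str]) -> float:
    """Exact containment similarity of Q in X: |Q ∩ X| / |Q|."""
    if not q:
        raise ValueError("query record Q must be non-empty")
    return len(q & x) / len(q)


def containment_search(q: Set[str], records: Dict[str, Set[str]],
                       threshold: float) -> List[str]:
    """Return ids of records whose containment w.r.t. Q is >= threshold."""
    return [rid for rid, x in records.items() if containment(q, x) >= threshold]


def unit_hash(element: str) -> float:
    """Map an element to a roughly uniform value in the unit interval.

    Assumption: any well-mixed hash would do; SHA-1 is used for determinism.
    """
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(record: Iterable[str], k: int) -> List[float]:
    """Plain KMV sketch: the k smallest distinct hash values, ascending."""
    return sorted({unit_hash(e) for e in record})[:k]


def kmv_containment(sketch_q: List[float], sketch_x: List[float],
                    size_q: int) -> float:
    """Estimate |Q ∩ X| / |Q| from two plain KMV sketches.

    Uses the standard KMV intersection estimate (K∩ / k) * (k - 1) / U_k,
    where U_k is the k-th smallest hash value in the combined sketch.
    """
    k = min(len(sketch_q), len(sketch_x))
    if k < 2 or size_q <= 0:
        raise ValueError("need sketches with at least 2 values and |Q| > 0")
    in_q, in_x = set(sketch_q), set(sketch_x)
    combined = sorted(in_q | in_x)[:k]      # k smallest values of the union
    u_k = combined[-1]                      # k-th smallest value
    shared = sum(1 for h in combined if h in in_q and h in in_x)
    est_intersection = (shared / k) * (k - 1) / u_k
    return est_intersection / size_q
```

For example, with `q = {"1", "2"}` and records `{"a": {"1", "2", "3", "4"}, "b": {"1", "9"}, "c": {"7", "8"}}`, `containment_search(q, records, 0.5)` returns `["a", "b"]`, since the containment similarities are 1.0, 0.5 and 0.0 respectively. The sketch-based estimate trades this exactness for a fixed-size summary per record; its accuracy depends on k.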
