Preference-driven Similarity Join

06/13/2017
by Chuancong Gao, et al.

Similarity join, which finds similar objects (e.g., products, names, addresses) across different sources, is a powerful tool for dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been studied extensively, assumes that a user can specify a similarity threshold, and then focuses on efficiently returning the object pairs whose similarities pass that threshold. We argue that the assumption of a well-set similarity threshold may not hold, for two reasons. First, the optimal thresholds for different similarity join tasks may vary widely. Second, the end-to-end time spent on similarity join is likely to be dominated by a back-and-forth threshold-tuning process. In response, we propose preference-driven similarity join. The key idea is to provide several result-set preferences, rather than a range of thresholds, for a user to choose from. Intuitively, a result-set preference can be viewed as an objective function that captures a user's preference on a similarity join result. Once a preference is chosen, we automatically compute the similarity join result that optimizes the preference objective. As a proof of concept, we devise two useful preferences and propose a novel preference-driven similarity join framework coupled with effective optimization techniques. Our approaches are evaluated on four real-world web datasets from a diverse range of application scenarios. The experiments show that preference-driven similarity join can achieve high-quality results without a tedious threshold-tuning process.
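To make the contrast concrete, here is a minimal sketch of the idea, not the framework or the preferences from the paper. It assumes Jaccard similarity over whitespace tokens, a brute-force all-pairs comparison, and a purely hypothetical objective, `count_groups`, that counts connected groups of at least two records. All function names are illustrative. `threshold_join` is the classic threshold-driven join; `preference_join` tries each distinct similarity value as a candidate threshold and keeps the result that maximizes the chosen objective, so the user never supplies a threshold.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def all_pair_similarities(records):
    """Similarity of every record pair (brute force, for illustration only)."""
    tokens = [set(r.lower().split()) for r in records]
    return {
        (i, j): jaccard(tokens[i], tokens[j])
        for i, j in combinations(range(len(records)), 2)
    }


def threshold_join(sims, threshold):
    """Threshold-driven join: keep the pairs whose similarity passes the threshold."""
    return {pair for pair, s in sims.items() if s >= threshold}


def count_groups(n, pairs):
    """Hypothetical objective: number of connected groups with at least 2 records."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= 2)


def preference_join(records, objective=count_groups):
    """Pick the join result that maximizes the objective over candidate thresholds.

    Candidate thresholds are the distinct non-zero similarity values, tried from
    highest to lowest; ties keep the higher threshold.
    Returns (threshold, pairs, score), or (None, set(), -1) if no pair overlaps.
    """
    sims = all_pair_similarities(records)
    best = (None, set(), -1)
    for t in sorted(set(sims.values()), reverse=True):
        if t == 0:
            continue  # a zero threshold would join every pair
        pairs = threshold_join(sims, t)
        score = objective(len(records), pairs)
        if score > best[2]:
            best = (t, pairs, score)
    return best
```

Calling `preference_join` on a list of record strings returns the selected threshold, the joined index pairs, and the objective score. Swapping in a different `objective` function changes which result is preferred without touching the join itself.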

research
06/08/2018

Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment

We study the problem of similarity self-join and similarity join size es...
research
06/08/2018

Similarity Join and Self-Join Size Estimation in a Streaming Environment

We study the problem of similarity self-join and similarity join size es...
research
04/16/2011

Similarity Join Size Estimation using Locality Sensitive Hashing

Similarity joins are important operations with a broad range of applicat...
research
07/21/2017

Scalable and robust set similarity join

Set similarity join is a fundamental and well-studied database operator....
research
11/20/2017

Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

The Exact Set Similarity Join problem aims to find all similar sets betw...
research
12/18/2017

Error-Tolerant Big Data Processing

Real-world data contains various kinds of errors. Before analyzing data,...
research
03/12/2018

GPU Accelerated Self-join for the Distance Similarity Metric

The self-join finds all objects in a dataset within a threshold of each ...
