LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches to set similarity search rely either on heavy indexing structures, which incur large search costs, or on indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups, and then uses a lightweight bitmap-like indexing structure, called the token-group matrix (TGM), to organize the groups and prune candidates for a given query set. To optimize pruning with the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach called L2P, together with an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic datasets to study LES3 thoroughly, establishing its effectiveness and its superiority over other applicable approaches.
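The abstract does not spell out how the TGM is laid out or which similarity measure is used, so the following Python code is only a minimal sketch of group-level pruning under stated assumptions. It assumes Jaccard similarity, a fixed partition given by hand (not one learned by L2P), and a matrix stored as one integer bitmask per token, where bit g is set if the token occurs in any set of group g. The names `build_tgm` and `search` are hypothetical and do not come from the paper. The pruning relies on a simple bound: for any set S in group g, the overlap with query Q is at most the number of query tokens present in the group, and the union has at least |Q| elements, so the Jaccard similarity cannot exceed that count divided by |Q|.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (defined as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_tgm(groups: List[List[Set[str]]]) -> Dict[str, int]:
    """Assumed TGM layout: token -> bitmask, bit g set if the token occurs in group g."""
    tgm: Dict[str, int] = {}
    for g, sets in enumerate(groups):
        for s in sets:
            for tok in s:
                tgm[tok] = tgm.get(tok, 0) | (1 << g)
    return tgm


def search(query: Set[str], groups: List[List[Set[str]]],
           tgm: Dict[str, int], threshold: float) -> List[Tuple[int, int]]:
    """Return (group, position) pairs of all sets with Jaccard >= threshold."""
    results: List[Tuple[int, int]] = []
    if not query:
        return results
    for g, sets in enumerate(groups):
        # Number of query tokens that appear anywhere in group g.
        shared = sum(1 for tok in query if (tgm.get(tok, 0) >> g) & 1)
        # Upper bound on Jaccard for every set in the group: shared / |Q|.
        if shared / len(query) < threshold:
            continue  # the whole group is pruned without touching its sets
        # Verify the surviving candidates exactly, so the search stays exact.
        for i, s in enumerate(sets):
            if jaccard(query, s) >= threshold:
                results.append((g, i))
    return results
```

Because the bound never underestimates the true similarity, pruning a group cannot discard a qualifying set; the quality of the partition only affects how many groups survive to the verification step.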

08/17/2019

Comparison-Based Indexing From First Principles

Basic assumptions about comparison-based indexing are laid down and a ge...
03/19/2020

Top-k queries over digital traces

Recent advances in social and mobile technology have enabled an abundanc...
10/07/2022

Set2Box: Similarity Preserving Representation Learning of Sets

Sets have been used for modeling various types of objects (e.g., a docum...
01/30/2022

Similarity Search on Computational Notebooks

Computational notebook software such as Jupyter Notebook is popular for ...
04/09/2018

Set Similarity Search for Skewed Data

Set similarity join, as well as the corresponding indexing problem set s...
04/22/2021

HINT: A Hierarchical Index for Intervals in Main Memory

Indexing intervals is a fundamental problem, finding a wide range of app...
05/07/2020

Indexing Metric Spaces for Exact Similarity Search

With the continued digitalization of societal processes, we are seeing a...