LES3: Learning-based Exact Set Similarity Search

07/22/2021
by Yifan Li, et al.

Set similarity search is of central interest to a wide variety of applications, such as data cleaning and web search. Past approaches to set similarity search rely either on heavy indexing structures that incur large search costs or on indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups and then uses a lightweight, bitmap-like indexing structure, called the token-group matrix (TGM), to organize the groups and prune candidates for a given query set. To optimize pruning with the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach, L2P, and an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic datasets to study LES3, establishing its effectiveness and superiority over other applicable approaches.
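
To make the group-level pruning idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes Jaccard similarity, a positive threshold, and a partitioning that is already given (the abstract's L2P and PTR components are not modeled). The names `build_tgm` and `search` are hypothetical, and the TGM is stored sparsely as a map from each token to the groups that contain it. The pruning bound is that a set S in a group shares at most as many tokens with the query Q as the group as a whole does, so Jaccard(Q, S) can be no larger than that count divided by |Q|.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def build_tgm(groups):
    """Sparse token-group matrix: token -> ids of groups containing it."""
    tgm = defaultdict(set)
    for gid, group in enumerate(groups):
        for s in group:
            for token in s:
                tgm[token].add(gid)
    return tgm


def search(query, groups, tgm, threshold):
    """Return every set with Jaccard(query, set) >= threshold (threshold > 0)."""
    # For each group, count how many query tokens occur anywhere in it.
    hits = defaultdict(int)
    for token in query:
        for gid in tgm.get(token, ()):
            hits[gid] += 1

    results = []
    for gid, count in hits.items():
        # |Q & S| <= count and |Q | S| >= |Q|, so Jaccard(Q, S) <= count / |Q|.
        # A group whose bound is below the threshold cannot hold a match.
        if count / len(query) < threshold:
            continue
        # Verify the surviving group's sets exactly.
        for s in groups[gid]:
            if jaccard(query, s) >= threshold:
                results.append(s)
    return results


# Hypothetical toy data: two groups of token sets.
groups = [[{1, 2, 3}, {2, 3, 4}], [{7, 8, 9}, {8, 9, 10}]]
tgm = build_tgm(groups)
matches = search({1, 2, 3, 4}, groups, tgm, threshold=0.5)
```

Because the bound never underestimates the true similarity, skipping a group cannot drop a qualifying set, so the search stays exact. How many groups get skipped depends on the partitioning, which is the part the abstract proposes to learn.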

research
08/17/2019

Comparison-Based Indexing From First Principles

Basic assumptions about comparison-based indexing are laid down and a ge...
research
03/19/2020

Top-k queries over digital traces

Recent advances in social and mobile technology have enabled an abundanc...
research
01/30/2022

Similarity Search on Computational Notebooks

Computational notebook software such as Jupyter Notebook is popular for ...
research
10/07/2022

Set2Box: Similarity Preserving Representation Learning of Sets

Sets have been used for modeling various types of objects (e.g., a docum...
research
04/09/2018

Set Similarity Search for Skewed Data

Set similarity join, as well as the corresponding indexing problem set s...
research
04/20/2023

KOIOS: Top-k Semantic Overlap Set Search

We study the top-k set similarity search problem using semantic overlap....
research
05/07/2020

Indexing Metric Spaces for Exact Similarity Search

With the continued digitalization of societal processes, we are seeing a...
