
LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches to set similarity search either rely on heavy indexing structures that incur large search costs or use indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups and then uses a lightweight, bitmap-like indexing structure, called the token-group matrix (TGM), to organize the groups and prune candidates for a given query set. To optimize pruning with the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach called L2P and an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic datasets to study LES3 in depth, establishing its effectiveness and superiority over other applicable approaches.
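The group-then-prune idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes Jaccard similarity, a threshold query, a hand-made partition instead of L2P, and a TGM stored as one integer bitmask per token. The function names (`build_tgm`, `group_upper_bounds`, `search`) are hypothetical. The pruning relies on a simple bound: for any set S in group G, J(Q, S) ≤ |Q ∩ tokens(G)| / |Q|, so a group whose bound falls below the threshold cannot contain an answer, and the search stays exact.

```python
from typing import Dict, FrozenSet, List, Tuple

TokenSet = FrozenSet[str]


def jaccard(a: TokenSet, b: TokenSet) -> float:
    """Exact Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_tgm(groups: List[List[TokenSet]]) -> Dict[str, int]:
    """Map each token to a bitmask; bit g is set if some set in group g has the token."""
    tgm: Dict[str, int] = {}
    for g, sets in enumerate(groups):
        for s in sets:
            for tok in s:
                tgm[tok] = tgm.get(tok, 0) | (1 << g)
    return tgm


def group_upper_bounds(tgm: Dict[str, int], query: TokenSet, n_groups: int) -> List[float]:
    """Upper-bound the Jaccard similarity between the query and any set in each group."""
    if not query:
        return [0.0] * n_groups
    counts = [0] * n_groups
    for tok in query:
        mask = tgm.get(tok, 0)
        while mask:
            low = mask & -mask               # isolate the lowest set bit
            counts[low.bit_length() - 1] += 1
            mask ^= low                      # clear that bit
    return [c / len(query) for c in counts]


def search(groups: List[List[TokenSet]], tgm: Dict[str, int],
           query: TokenSet, threshold: float) -> List[Tuple[TokenSet, float]]:
    """Return every set with Jaccard similarity >= threshold, skipping pruned groups."""
    bounds = group_upper_bounds(tgm, query, len(groups))
    results = []
    for g, bound in enumerate(bounds):
        if bound < threshold:
            continue                         # no set in this group can qualify
        for s in groups[g]:
            sim = jaccard(query, s)          # exact verification of candidates
            if sim >= threshold:
                results.append((s, sim))
    return results


# Assumed toy partition: two groups with disjoint vocabularies.
groups = [
    [frozenset("abc"), frozenset("abd")],
    [frozenset("xy"), frozenset("yz")],
]
tgm = build_tgm(groups)
query = frozenset("abc")
bounds = group_upper_bounds(tgm, query, len(groups))
results = search(groups, tgm, query, 0.5)
```

In this toy case the second group shares no tokens with the query, so its bound is zero and it is skipped without any similarity computation. How well such pruning works depends on how the sets are partitioned, which is the question the abstract says L2P addresses.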



Comparison-Based Indexing From First Principles

Basic assumptions about comparison-based indexing are laid down and a ge...

Top-k queries over digital traces

Recent advances in social and mobile technology have enabled an abundanc...

Set2Box: Similarity Preserving Representation Learning of Sets

Sets have been used for modeling various types of objects (e.g., a docum...

Similarity Search on Computational Notebooks

Computational notebook software such as Jupyter Notebook is popular for ...

Set Similarity Search for Skewed Data

Set similarity join, as well as the corresponding indexing problem set s...

HINT: A Hierarchical Index for Intervals in Main Memory

Indexing intervals is a fundamental problem, finding a wide range of app...

Indexing Metric Spaces for Exact Similarity Search

With the continued digitalization of societal processes, we are seeing a...