Set2Box: Similarity Preserving Representation Learning of Sets

by Geon Lee, et al.
KAIST Department of Mathematical Sciences

Sets are used to model many types of objects (e.g., a document as the set of keywords in it, and a customer as the set of items that she has purchased). Measuring similarity between sets (e.g., with the Jaccard Index) is a key building block of a wide range of applications, including plagiarism detection, recommendation, and graph compression. However, as sets have grown in number and size, the computational cost and storage required for set similarity computation have become substantial, which has led to the development of hashing- and sketching-based solutions. In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box+, which yields more concise yet more accurate box representations of sets. Through extensive experiments on 8 real-world datasets, we show that, compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to 40.8X smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8X more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set.
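To make the key idea concrete, the sketch below compares the exact Jaccard Index of two small sets with an estimate computed from box overlaps. It is an illustration only, not the authors' implementation: the `Box` layout (lower and upper corners), the helper names, and the hand-picked box coordinates are all assumptions, whereas in Set2Box the boxes are learned from data.

```python
from typing import List, Set, Tuple

# Hypothetical box layout: (lower corner, upper corner), one float per dimension.
Box = Tuple[List[float], List[float]]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard Index |A ∩ B| / |A ∪ B| of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def volume(box: Box) -> float:
    """Volume of an axis-aligned box; an empty box has volume 0."""
    lo, hi = box
    v = 1.0
    for l, h in zip(lo, hi):
        v *= max(h - l, 0.0)
    return v


def intersect(a: Box, b: Box) -> Box:
    """The intersection of two axis-aligned boxes is itself a box."""
    lo = [max(x, y) for x, y in zip(a[0], b[0])]
    hi = [min(x, y) for x, y in zip(a[1], b[1])]
    return lo, hi


def box_jaccard(a: Box, b: Box) -> float:
    """Jaccard estimate from volumes, treating volume as a proxy for set size."""
    inter = volume(intersect(a, b))
    union = volume(a) + volume(b) - inter  # inclusion-exclusion
    return inter / union if union > 0 else 0.0


# Exact similarity of two small sets.
exact = jaccard({1, 2, 3}, {2, 3, 4})

# Hand-picked (not learned) 2-D boxes standing in for two sets.
box_a: Box = ([0.0, 0.0], [2.0, 2.0])
box_b: Box = ([1.0, 0.0], [3.0, 2.0])
estimate = box_jaccard(box_a, box_b)
```

Each estimate touches only a fixed number of coordinates per box, so its cost depends on the box dimension rather than on the sizes of the underlying sets, which is the sense in which box-based estimation runs in constant time.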



LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety...

Learning Certified Individually Fair Representations

To effectively enforce fairness constraints one needs to define an appro...

Deep Set-to-Set Matching and Learning

Matching two sets of items, called set-to-set matching problem, is being...

Set-to-Set Hashing with Applications in Visual Recognition

Visual data, such as an image or a sequence of video frames, is often na...

A Review for Weighted MinHash Algorithms

Data similarity (or distance) computation is a fundamental research topi...

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Estimating set similarity and detecting highly similar sets are fundamen...

SetSketch: Filling the Gap between MinHash and HyperLogLog

MinHash and HyperLogLog are sketching algorithms that have become indisp...