Set2Box: Similarity Preserving Representation Learning of Sets

by Geon Lee, et al.

Sets have been used for modeling various types of objects (e.g., a document as the set of keywords in it and a customer as the set of the items that she has purchased). Measuring similarity (e.g., Jaccard Index) between sets has been a key building block of a wide range of applications, including plagiarism detection, recommendation, and graph compression. However, as sets have grown in number and size, the computational cost and storage required for set similarity computation have become substantial, and this has led to the development of hashing- and sketching-based solutions. In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box+, which yields more concise yet more accurate box representations of sets. Through extensive experiments on 8 real-world datasets, we show that, compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to 40.8X smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8X more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set.
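To make the key idea concrete, the sketch below shows how overlap-based similarity measures could be read off two axis-aligned boxes in constant time, using box volumes as stand-ins for set sizes. It is a minimal illustration under stated assumptions, not the authors' implementation: the `Box` class, the `intersect` and `similarities` helpers, the hand-picked coordinates, and the choice of the four measures shown (Jaccard, Dice, overlap coefficient, cosine) are all hypothetical. In Set2Box the boxes are learned from data, and that training step is not shown here.

```python
from dataclasses import dataclass
from math import prod, sqrt
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box given by per-dimension lower/upper corners."""
    lower: Tuple[float, ...]
    upper: Tuple[float, ...]

    def volume(self) -> float:
        # Side lengths are clamped at zero so an empty box has volume 0.
        return prod(max(u - l, 0.0) for l, u in zip(self.lower, self.upper))


def intersect(a: Box, b: Box) -> Box:
    """Intersection of two axis-aligned boxes (may be empty)."""
    lower = tuple(max(x, y) for x, y in zip(a.lower, b.lower))
    upper = tuple(min(x, y) for x, y in zip(a.upper, b.upper))
    return Box(lower, upper)


def similarities(a: Box, b: Box) -> Dict[str, float]:
    """Estimate several similarity measures from three volumes.

    Assumption for illustration: vol(a) plays the role of |A|,
    vol(b) of |B|, and vol(a ∩ b) of |A ∩ B|.
    """
    va, vb = a.volume(), b.volume()
    vi = intersect(a, b).volume()
    union = va + vb - vi
    return {
        "jaccard": vi / union if union > 0 else 0.0,
        "dice": 2 * vi / (va + vb) if va + vb > 0 else 0.0,
        "overlap": vi / min(va, vb) if min(va, vb) > 0 else 0.0,
        "cosine": vi / sqrt(va * vb) if va * vb > 0 else 0.0,
    }


# Two hand-picked 2-D boxes; learned boxes would come from training instead.
box_a = Box(lower=(0.0, 0.0), upper=(2.0, 2.0))
box_b = Box(lower=(1.0, 0.0), upper=(3.0, 2.0))
scores = similarities(box_a, box_b)
```

Every measure is derived from the same three volumes, which is one way a single representation per set could serve several similarity measures, with cost that depends only on the box dimension and not on the set sizes.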



Bipartite Graph Convolutional Hashing for Effective and Efficient Top-N Search in Hamming Space

Searching on bipartite graphs is basal and versatile to many real-world ...

LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety...

Learning Certified Individually Fair Representations

To effectively enforce fairness constraints one needs to define an appro...

Deep Set-to-Set Matching and Learning

Matching two sets of items, called set-to-set matching problem, is being...

Set-to-Set Hashing with Applications in Visual Recognition

Visual data, such as an image or a sequence of video frames, is often na...

A Review for Weighted MinHash Algorithms

Data similarity (or distance) computation is a fundamental research topi...

Symmetric Spaces for Graph Embeddings: A Finsler-Riemannian Approach

Learning faithful graph representations as sets of vertex embeddings has...
