Learning to Hash Robustly, with Guarantees

08/11/2021
by   Alexandr Andoni, et al.
The indexing algorithms for high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on randomized Locality Sensitive Hashing (LSH) and its derivatives. In practice, many heuristic approaches exist to "learn" the best indexing method in order to speed up NNS, crucially adapting to the structure of the given dataset. These heuristics often outperform LSH-based algorithms on real datasets, but almost always at a cost: they either lose the guarantees of correctness or of robust performance on adversarial queries, or they apply only to datasets with an assumed extra structure or model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching those of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the worst-performing query. We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments showing that our algorithm achieves 1.8x and 2.1x better recall on the worst-performing queries for the MNIST and ImageNet datasets.
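For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of classic bit-sampling LSH for the Hamming space. It illustrates the standard data-oblivious approach, not the algorithm proposed in this paper. The class name `BitSamplingLSH`, its parameters (`num_tables`, `bits_per_table`), and the choice to sample coordinates without replacement are illustrative assumptions.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of coordinates in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """Illustrative bit-sampling LSH index (hypothetical helper, not from the paper)."""

    def __init__(self, dim, num_tables, bits_per_table, seed=0):
        rng = random.Random(seed)
        # Each table hashes a point by projecting it onto a random subset of
        # coordinates, chosen independently of the data.
        self.coords = [
            rng.sample(range(dim), bits_per_table) for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    @staticmethod
    def _key(point, coords):
        return tuple(point[i] for i in coords)

    def add(self, point):
        """Store a point and register its index in one bucket per table."""
        idx = len(self.points)
        self.points.append(point)
        for coords, table in zip(self.coords, self.tables):
            table[self._key(point, coords)].append(idx)

    def query(self, q):
        """Return the index of the closest colliding point, or None if no collision."""
        candidates = set()
        for coords, table in zip(self.coords, self.tables):
            candidates.update(table.get(self._key(q, coords), ()))
        if not candidates:
            return None
        return min(candidates, key=lambda i: hamming(self.points[i], q))
```

Because the sampled coordinates here ignore the dataset, every query gets the same collision probabilities as a function of distance. The abstract describes adapting the hashing to the dataset while keeping guarantees of this kind for the worst-performing query.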

