Distributed Tera-Scale Similarity Search with MPI: Provably Efficient Similarity Search over billions without a Single Distance Computation

08/05/2020
by Nicholas Meisburger, et al.

We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte-scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how we can augment the LSH algorithm with heavy-hitter sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we mathematically show that, under realistic data assumptions, we can identify the near neighbor of a given query q using only a sub-linear (≪ O(n)) number of simple sketch aggregation operations. To make such a system practical, we offer a novel design and sketching solution that reduces the inter-machine communication overhead exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark. PySpark is a widely adopted distributed implementation of the LSH algorithm for large datasets and is deployed in commercial platforms. Finally, we show how our system scales to the tera-scale Criteo dataset, which has more than 4 billion samples. SLASH can index this 2.3-terabyte dataset over 20 nodes in under an hour, with query times of a fraction of a millisecond. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster.
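The core idea of answering a query by aggregating bucket contents rather than computing distances can be illustrated with a small single-machine sketch. This is not SLASH's implementation: the function names (`make_tables`, `bucket_key`, `build_index`, `query`) are hypothetical, the hash family is assumed to be signed random projections, and an exact `Counter` stands in for the heavy-hitter sketches the abstract describes. It only shows that the most frequently colliding item can be reported without evaluating any distance between the query and stored vectors.

```python
import random
from collections import Counter, defaultdict

def make_tables(dim, num_tables, bits, seed=0):
    """Draw random hyperplanes: `bits` planes for each of `num_tables` tables."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(num_tables)]

def bucket_key(planes, vec):
    """Signed random projection: one sign bit per hyperplane."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)

def build_index(data, tables):
    """Insert every item id into one bucket per hash table."""
    index = [defaultdict(list) for _ in tables]
    for item_id, vec in enumerate(data):
        for t, planes in enumerate(tables):
            index[t][bucket_key(planes, vec)].append(item_id)
    return index

def query(index, tables, q, top_k=1):
    """Return the ids that collide with q in the most tables.

    Only bucket lookups and count aggregation are used; no distance
    between q and any stored vector is ever computed.
    """
    counts = Counter()  # exact counts; a heavy-hitter sketch would approximate this
    for t, planes in enumerate(tables):
        counts.update(index[t].get(bucket_key(planes, q), ()))
    return [item_id for item_id, _ in counts.most_common(top_k)]

# Toy data: item 0 points away from the query, item 1 points the same way.
q = [1.0, 2.0, -1.0, 0.5]
data = [[-x for x in q], [2.0 * x for x in q], [0.3, -1.0, 0.2, 4.0]]
tables = make_tables(dim=4, num_tables=20, bits=8)
index = build_index(data, tables)
result = query(index, tables, q)
```

In this toy setup, item 1 is a scaled copy of the query, so it lands in the query's bucket in every table and accumulates the highest count. A distributed version would additionally have to merge these counts across machines, which is where the communication-reducing sketches mentioned in the abstract come in.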

research
06/22/2021

Practical Near Neighbor Search via Group Testing

We present a new algorithm for the approximate near neighbor problem tha...
research
06/25/2019

Pyramid: A General Framework for Distributed Similarity Search

Similarity search is a core component in various applications such as im...
research
06/05/2019

Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

Similarity search is a fundamental algorithmic primitive, widely used in...
research
01/26/2021

Sampling a Near Neighbor in High Dimensions – Who is the Fairest of Them All?

Similarity search is a fundamental algorithmic primitive, widely used in...
research
03/17/2021

IRLI: Iterative Re-partitioning for Learning to Index

Neural models have transformed the fundamental information retrieval pro...
research
10/13/2020

It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing

Recently, deep learning has become the most popular direction in machin...
research
02/18/2019

RACE: Sub-Linear Memory Sketches for Approximate Near-Neighbor Search on Streaming Data

We demonstrate the first possibility of a sub-linear memory sketch for s...
