Unconventional application of k-means for distributed approximate similarity search

08/04/2022
by Felipe Ortega, et al.

Similarity search based on a distance function in metric spaces is a fundamental problem in many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbour identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up queries for similar elements in this context. Moreover, since exact approaches to solving similarity queries can be complex and time-consuming, alternatives have appeared that reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with k-means), an unconventional application of the k-means algorithm as the foundation of a multilevel index structure for approximate similarity search, suitable for metric spaces. We show that inherent properties of k-means, like representing high-density data areas with fewer prototypes, can be leveraged for this purpose. An implementation of this new indexing method is evaluated using a synthetic dataset and a real-world dataset in a high-dimensional, high-sparsity space. The results are promising and support the applicability of this novel indexing method in multiple domains.
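The general idea of stacking k-means layers into a search index can be illustrated with a small sketch. This is not the MASK implementation described in the paper; it is a simplified two-level illustration that assumes Euclidean distance over numeric tuples, with contiguous data chunks standing in for distributed partitions. All names (`kmeans`, `build_index`, `search`, `n_groups`, `k_leaf`, `k_top`, `probe_top`, `probe_leaf`) are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples (an assumption of this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, groups), where groups[i] lists
    the indices of the points assigned to centroids[i]; empty groups are dropped."""
    rng = random.Random(seed)
    centroids = rng.sample(points, min(k, len(points)))
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for idx, p in enumerate(points):
            j = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[j].append(idx)
        groups = [g for g in groups if g]
        # Each centroid becomes the mean of the points currently in its group.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*(points[i] for i in g)))
            for g in groups
        ]
    return centroids, groups


def build_index(points, n_groups=4, k_leaf=5, k_top=3, seed=0):
    """Two-level index: k-means inside each data chunk, then k-means over the
    resulting centroids. Returns (top_centroids, children), where children[i]
    is a list of (leaf_centroid, member_points) pairs."""
    leaves = []
    size = math.ceil(len(points) / n_groups)
    for g in range(n_groups):
        chunk = points[g * size:(g + 1) * size]
        if not chunk:
            continue
        cents, groups = kmeans(chunk, k_leaf, seed=seed + g)
        leaves.extend((c, [chunk[i] for i in grp]) for c, grp in zip(cents, groups))
    top_cents, top_groups = kmeans([c for c, _ in leaves], k_top, seed=seed)
    children = [[leaves[i] for i in grp] for grp in top_groups]
    return top_cents, children


def search(index, q, k=3, probe_top=1, probe_leaf=2):
    """Approximate k-NN: descend to the closest top-level centroids, then scan
    only the members of the closest leaf clusters found beneath them."""
    top_cents, children = index
    order = sorted(range(len(top_cents)), key=lambda i: dist(q, top_cents[i]))
    leaves = [leaf for i in order[:probe_top] for leaf in children[i]]
    leaves.sort(key=lambda leaf: dist(q, leaf[0]))
    candidates = [p for _, members in leaves[:probe_leaf] for p in members]
    return sorted(candidates, key=lambda p: dist(q, p))[:k]
```

In this sketch, `probe_top` and `probe_leaf` control the trade-off between speed and accuracy: small values scan few points and may miss true neighbours, while values large enough to visit every cluster reduce the search to an exact linear scan. Computing centroids as coordinate means ties the sketch to vector data; how prototypes are handled in a general metric space is left to the full text.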

