Indexing Metric Spaces for Exact Similarity Search

by Lu Chen, et al.

With the continued digitalization of societal processes, we are seeing an explosion in available data, commonly referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity, and variety. Many studies address volume or velocity, while far fewer concern variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning, and validation techniques used for metric indexes, ii) providing a time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because complexity analysis rarely reveals differences in similarity query processing, and because query performance depends on the pruning and validation abilities of an index, which in turn depend on the data distribution. This article aims to reveal the strengths and weaknesses of the different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and to direct future research on metric indexes.
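
To make the role of the triangle inequality concrete, here is a minimal sketch of pruning and validation for a range query with a single pivot. It is an illustration under stated assumptions, not a reconstruction of any index covered by the survey: the names `build_pivot_table` and `range_query`, the use of edit distance on strings, and the single-pivot layout are all choices made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, which is a metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_pivot_table(objects: Sequence[str], pivot: str,
                      dist: Callable[[str, str], int]) -> List[int]:
    """Precompute d(o, pivot) for every object (hypothetical helper)."""
    return [dist(o, pivot) for o in objects]


def range_query(objects: Sequence[str], pivot: str, table: Sequence[int],
                dist: Callable[[str, str], int],
                q: str, r: int) -> Tuple[List[str], int]:
    """Return all objects within distance r of q, plus the number of
    distance computations performed at query time."""
    d_qp = dist(q, pivot)
    computed = 1
    results = []
    for o, d_op in zip(objects, table):
        # Pruning: |d(q,p) - d(o,p)| <= d(q,o), so if this lower bound
        # exceeds r the object cannot be a result.
        if abs(d_qp - d_op) > r:
            continue
        # Validation: d(q,o) <= d(q,p) + d(o,p), so if this upper bound
        # is within r the object is a result without computing d(q,o).
        if d_qp + d_op <= r:
            results.append(o)
            continue
        # Otherwise the bounds are inconclusive: compute the distance.
        computed += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, computed


words = ["metric", "metrics", "matrix", "index", "indexes", "pivot", "space"]
pivot = "metric"
table = build_pivot_table(words, pivot, edit_distance)
hits, n_dist = range_query(words, pivot, table, edit_distance, "metrik", 1)
```

The answer is exact because both bounds follow directly from the triangle inequality; the pivot only changes how many distances must be computed, which is why the effectiveness of such techniques depends on the data distribution.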




A Learned Index for Exact Similarity Search in Metric Spaces

Indexing is an effective way to support efficient query processing in la...

Unconventional application of k-means for distributed approximate similarity search

Similarity search based on a distance function in metric spaces is a fun...

Comparison-Based Indexing From First Principles

Basic assumptions about comparison-based indexing are laid down and a ge...

A Triangle Inequality for Cosine Similarity

Similarity search is a fundamental problem for many data analysis techni...

LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety...

Pruning Algorithms for Low-Dimensional Non-metric k-NN Search: A Case Study

We focus on low-dimensional non-metric search, where tree-based approach...

A Review for Weighted MinHash Algorithms

Data similarity (or distance) computation is a fundamental research topi...