
A Learned Index for Exact Similarity Search in Metric Spaces

Indexing is an effective way to support efficient query processing in large databases. Recently, the concept of the learned index, which replaces or complements traditional index structures with machine learning models, has been actively explored to reduce storage and search costs. However, accurate and efficient similarity query processing in high-dimensional metric spaces remains an open challenge. In this paper, we propose a novel indexing approach called LIMS that uses data clustering, pivot-based data transformation techniques and learned indexes to support efficient similarity query processing in metric spaces. In LIMS, the underlying data is partitioned into clusters such that each cluster follows a relatively uniform data distribution. Data redistribution is achieved by utilizing a small number of pivots for each cluster. Similar data are mapped into compact regions and the mapped values are totally ordered. Machine learning models are developed to approximate the position of each data record on disk. Efficient algorithms are designed for processing range queries and nearest neighbor queries based on LIMS, and for index maintenance with dynamic updates. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes and state-of-the-art learned indexes.


1 Introduction

Similarity search is one of the fundamental operations in the era of big data. It finds the objects in a large database that are within a distance threshold of a given query object (called range queries) or that are the top-$k$ most similar to the query object (called $k$ nearest neighbor queries, or $k$NN queries), based on certain similarity measures or distance functions. For example, in spatial databases, a similarity query can be used to find all restaurants within a given range in terms of the Euclidean distance. In image databases, a similarity query can be used to find the top 10 most similar images to a given image in terms of the Earth mover's distance [28]. To accommodate a wide range of data types and distance functions, we consider similarity search in the context of metric spaces in this paper. A metric space is a generic space that makes no requirement of any particular data representation, demanding only a distance function that satisfies four properties, namely non-negativity, identity, symmetry and the triangle inequality (Definition 1, Section 3). A number of metric-space indexing methods have been proposed in the literature to accelerate similarity query processing [4, 15, 27, 35, 8]. However, these indexing methods, which are based on tree-like structures, are increasingly challenged by the rapidly growing volume and complexity of data. On the one hand, query processing with such indexes requires traversing many index nodes (i.e., nodes on the path from the root to a leaf) in the tree structure, which can be time-consuming. On the other hand, tree-like indexes impose non-negligible storage pressure for datasets that store complex and large objects, such as image and audio feature data.

In recent years, the concept of the learned index [20] has been developed to provide a new perspective on indexing. By enhancing or even replacing traditional index structures with machine learning models that reflect the intrinsic patterns of the data, a learned index can look up a key quickly while saving much of the memory required by traditional index structures. The original idea is limited to the one-dimensional case, where data is sorted in an in-memory dense array. Directly adapting this idea to the multi-dimensional case is unattractive, since multi-dimensional data has no natural sort order. Several multi-dimensional learned index structures have been proposed to address this issue [32, 21, 26, 11, 19, 22, 12, 34] (see Section 2 for detailed discussions). Despite the significant success of these learned indexes compared with traditional indexing methods, they still have some limitations. First, the existing learned index structures do not support similarity search in metric spaces. A metric space has neither coordinate structure nor dimension information (Remark 1, Section 3), so the numbering rules (e.g., the z-order [24]) and specific pruning strategies designed for vector spaces are not applicable. The triangle inequality is the only property we can utilize to reduce the search space. The generality of metric spaces provides an opportunity to develop unified indexing methods, but it also presents a significant challenge in developing an efficient learned indexing method. Second, the existing learned multi-dimensional index structures suffer from the phenomenon called the curse of dimensionality. By integrating machine learning models into traditional multi-dimensional indexes, these learned indexes are restricted to certain types of data space partitioning (e.g., grid partitioning), which inevitably leads to rapid performance degradation as the number of dimensions grows. Third, the time to train a machine learning model that approximates a complex data distribution well is typically very long, which makes learned indexes hard to adapt to frequent insertion/deletion operations and query pattern changes. Finally, some existing learned indexes [26] can only return approximate query results, i.e., there may be false negatives in the result set because of errors introduced by the machine learning models.

To address the aforementioned limitations, we develop a novel disk-based learned index structure for metric spaces, called LIMS, to facilitate exact similarity queries (i.e., point, range and $k$NN queries). In contrast to coordinate-based data partitioning, LIMS adopts a distance-based clustering strategy to group the underlying data into a number of subsets, so as to decompose complex and potentially correlated data into clusters with simple and relatively uniform distributions. LIMS selects a small set of pivots for each cluster and utilizes the distances to the pivots to perform data redistribution; this reduces the dimensionality of the data to the number of pivots adopted. By using a proper pivot-based mapping, LIMS organizes similar objects into compact regions and imposes a total order over the data. Such an organization can significantly reduce the number of distance computations and page accesses during query processing. In order to further boost search performance, LIMS follows the idea of the learned index, using several simple polynomial regression models to quickly locate the data records that might match the query filtering conditions. Furthermore, LIMS can be partially reconstructed quickly thanks to its independent index structure for each cluster, which makes LIMS adaptable to changes. As we will show later, LIMS significantly outperforms other multi-dimensional learned indexes and traditional indexes in terms of average query time and the number of page accesses, especially when processing high-dimensional data.

The main contributions of this paper include:

  • We design LIMS, the first learned index structure for metric spaces, to facilitate exact similarity search.

  • Efficient algorithms for processing point, range and $k$NN queries are proposed, enabling a unified solution for searching complex data in a representation-agnostic way. An update strategy is also proposed for LIMS.

  • To the best of our knowledge, no experimental comparison across different multi-dimensional learned indexes has been performed before. In this paper, we compare four multi-dimensional learned indexes. Extensive experiments on real-world and synthetic data demonstrate the superiority of LIMS.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the basic concepts and formulates the research problem. Section 4 describes the details of LIMS. LIMS-based similarity query algorithms are discussed in Section 5. Section 6 reports the experimental results. Section 7 concludes the paper.

2 Related Work

We focus on reviewing learned multi-dimensional indexes here. Good surveys of various traditional metric-space indexing methods can be found in [4, 15, 27, 35, 8].

The idea of the learned index is that an index can be regarded as a model that takes a key as input and outputs the position of the corresponding record. If such a "black-box" model can be learned from the data, a query can be processed by a function invocation in $O(1)$ time instead of traversing a tree structure in $O(\log n)$ time. RMI is the first work to explore how to enhance or replace classic index structures with machine learning models [20]. It assumes that data is sorted and kept in an in-memory dense array; in light of this, the machine learning model essentially learns a cumulative distribution function (CDF). RMI consists of a hierarchy of models, where the internal nodes of the hierarchy are models responsible for predicting which child model to use, and a leaf model predicts the position of the record. Since RMI utilizes the distribution of the data and requires no comparisons within each node, it provides significant benefits in terms of storage consumption and query processing time. For the sake of quickly correcting the errors caused by the machine learning models and of supporting range queries, RMI is limited to indexing key-sorted datasets, which makes a direct application of RMI to multi-dimensional data infeasible because there is no obvious ordering of points. Even if the points are embedded into an ordered space, guaranteeing the correctness and efficiency of a query (e.g., a range query or a $k$NN query) remains a challenging task.

The ZM index is the first effort to apply the idea of the learned index to multi-dimensional spaces [32]. It adopts the z-order space-filling curve [24] to establish an ordering relationship over all points and then invokes RMI to support point and range queries. Correctness is guaranteed by a nice geometric property of the z-order curve, i.e., monotonic ordering. However, the ZM index needs to check many irrelevant points during the refinement phase, which gets worse in high-dimensional spaces. It supports neither $k$NN queries nor index updates.

The recursive spatial model index (RSMI) builds on RMI and ZM [26]. It develops a recursive partitioning strategy to partition the original space and then groups data according to predictions. This results in a learned point grouping, which differs from RMI, which fixes the data layout first and trains a model to estimate positions. For each partition, RSMI first maps points into the rank space and then invokes ZM to support point, range and $k$NN queries. However, the correctness of range and $k$NN queries cannot be guaranteed. In addition, since RSMI is still based on space-filling curves, its good performance is confined to low-dimensional spaces.

LISA [21], a learned index structure for spatial data, effectively reduces the number of false positives compared with ZM by 1) partitioning the original space into grid cells based on the data distribution; 2) ordering data with a partially monotonic mapping function and rearranging the data layout according to the mapped values; and 3) decomposing a large query range into multiple small ones. LISA has a dynamic data layout like RSMI, but the correctness of range and $k$NN queries can be guaranteed thanks to the monotonicity of the models. However, its advantage of low scan overhead comes with a costly checking procedure and high index construction time, and the grid-based partitioning strategy makes it unsuitable for high-dimensional spaces. Besides, LISA-based $k$NN query processing suffers from many repeated page accesses because it performs range queries with increasing radius from scratch.

Similar to LISA, Flood [22] also partitions the data space into grid cells along the dimensions such that, for each dimension, the number of points in each partition is approximately the same. Flood assumes a known query workload and utilizes sample queries to learn an optimal combination of indexing dimensions and the number of partitions. Once these are learned, Flood maintains a table recording the position of the first point in each cell. At query time, Flood invokes RMI for each dimension to identify the cells intersecting the query and looks up the cell table to locate the corresponding records. However, it cannot efficiently adapt to correlated data distributions or skewed query workloads. Tsunami [12] extends Flood by utilizing query skew to partition the data space into regions and then further dividing each region based on data correlations. However, simply choosing a subset of dimensions can degrade performance as dimensionality increases. These studies are not discussed further, as we do not assume a known query workload.

The multi-dimensional learned (ML) index [11] combines the ideas of iDistance [17] and RMI. It first partitions the data into clusters and then identifies each cluster center as a reference point. After all data points are represented in a one-dimensional space based on their distances to the reference points, RMI can be applied. Different from iDistance, ML uses a scaling value rather than a constant to stretch the data range. However, points along a fixed radius have the same value after the transformation, leading to many irrelevant points being checked. ML does not support data updates. Note that ML cannot be directly applied in metric spaces, since reference point selection is realized by the k-means algorithm [33]: a reference point, being the mean of the points in its cluster, may not itself be in the dataset, and it is not always possible to create "artificial" objects in metric datasets [18].

Different from finding a sort order over multi-dimensional data and then learning the CDF, the reinforcement learning based R-tree for spatial data (RLR-Tree) [14] uses machine learning techniques to improve the classic R-tree index. Instead of relying on hand-crafted heuristic rules, RLR-Tree models two basic operations of the R-tree, i.e., choosing a subtree for insertion and splitting a node, as Markov decision processes [13], so that reinforcement learning models can be applied. Because it needs to modify neither the basic structure of the R-tree nor the query processing algorithms, it is easier to deploy in current database systems than other learned indexes. However, due to the curse of dimensionality, the minimum bounding rectangle (MBR) of a leaf node (even in an optimal R-tree) can be nearly as large as the entire data space, so that the R-tree becomes ineffective. Similar to the RLR-Tree, the Qd-tree [34] uses reinforcement learning to optimize the data partitioning strategy of the kd-tree based on a given query workload, and it suffers from the same problem. These studies are not discussed further, since they are out of our scope.

Notation     Description
O            The dataset
(M, d), n    The data space, distance metric, dimensionality
o, q         A data point, a query point
r            Query radius
k            The number of nearest neighbors
c            The number of clusters
m            The number of pivots
v            The number of super rings
C_i          The i-th cluster
p_i^j        The j-th pivot in the i-th cluster
maxd_i^j     The distance of the furthest object in the i-th cluster from the j-th pivot
mind_i^j     The distance of the nearest object in the i-th cluster from the j-th pivot
TABLE I: List of key notations

3 Background

In this section, we first introduce basic concepts and then present the formal definition of the learned index for exact similarity search in metric spaces. Table I lists the key notations and acronyms used in this paper.

Definition 1 (Metric space).

A metric space is a pair $(M, d)$, where $M$ is a set of objects and $d: M \times M \to \mathbb{R}$ is a function such that, for all $o_1, o_2, o_3 \in M$, the following holds:

  • non-negativity: $d(o_1, o_2) \ge 0$;

  • identity: $d(o_1, o_2) = 0$ iff $o_1 = o_2$;

  • symmetry: $d(o_1, o_2) = d(o_2, o_1)$;

  • triangle inequality: $d(o_1, o_3) \le d(o_1, o_2) + d(o_2, o_3)$.

Remark 1.

A metric space is generic because it only requires a distance function satisfying the above properties. A vector space with the Euclidean distance is a special metric space in which additional properties, e.g., dimensions and coordinates, are specified, and these can be used to accelerate the search.

In this paper, we consider three types of exact similarity queries in metric spaces: the range query, the point query, and the $k$NN query.

Definition 2 (Range query).

Given a set $O \subseteq M$, a query object $q \in M$, and a query radius $r \ge 0$, a range query $RQ(q, r)$ returns all objects in $O$ within distance $r$ of $q$, i.e., $RQ(q, r) = \{o \in O \mid d(o, q) \le r\}$.

Remark 2.

A point query is a special case of the range query with $r = 0$ and an arbitrary metric. In this case, we write $PQ(q) = RQ(q, 0)$.

Definition 3 ($k$NN query).

Given a set $O \subseteq M$, a query object $q \in M$, and a positive integer $k$, a $k$NN query returns a set of $k$ objects, denoted as $kNN(q)$, such that $kNN(q) \subseteq O$, $|kNN(q)| = k$, and $d(o, q) \le d(o', q)$ holds for any $o \in kNN(q)$ and $o' \in O \setminus kNN(q)$.

Example 1.

Consider a word dataset $O$ associated with the edit distance [23]. A range query $RQ(q, 2)$ returns all words in $O$ within edit distance 2 of the query word $q$. The $1$NN query returns the word in $O$ nearest to $q$.
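To make the metric-space setting concrete, the following is a minimal, self-contained C++ sketch (ours, not the paper's implementation) of this example: the standard dynamic-programming edit distance, which satisfies all four metric properties, together with the brute-force linear-scan range query that any exact metric index, including LIMS, must agree with.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Standard dynamic-programming edit distance; it satisfies all four
// metric properties, so strings under this distance form a metric space.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = int(j);
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = int(i);
        for (size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Baseline range query RQ(q, r): a linear scan whose exact result set any
// correct metric index must reproduce (no false positives or negatives).
std::vector<std::string> range_query(const std::vector<std::string>& O,
                                     const std::string& q, int r) {
    std::vector<std::string> result;
    for (const auto& o : O)
        if (edit_distance(o, q) <= r) result.push_back(o);
    return result;
}

int main() {
    std::vector<std::string> O = {"cat", "cart", "card", "dog", "cast"};
    for (const auto& w : range_query(O, "cart", 1))
        std::cout << w << '\n';  // cat, cart, card, cast
}

The scan performs $|O|$ distance computations; Sections 4 and 5 are about avoiding most of them while returning exactly this result set.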

Fig. 1: LIMS index structure

Problem Statement. Let $(M, d)$ be a metric space and $O \subseteq M$ be a set of objects. The learned index for exact similarity search in metric spaces is to learn an index structure for $O$ so that point queries, range queries and $k$NN queries can be processed accurately and efficiently. In addition, the index structure is supposed to support insertion and deletion operations.

4 LIMS

In this section, we first give an overview of the index structure of LIMS and then present everything needed to build LIMS. LIMS-based query processing will be discussed in Section 5.

4.1 Overview

LIMS consists of three parts: data clustering and pivot selection; the pivot-based mapping function and its associated binary relation; and the rank prediction models. Fig. 1 gives an overview of the LIMS index structure in a metric space associated with the Euclidean distance, although LIMS applies to other metric spaces as well. LIMS first partitions the underlying data into a set of clusters, e.g., 2 clusters in Fig. 1(a), so that each of them follows a relatively uniform data distribution, and then a set of data-dependent pivots is picked for each cluster, e.g., 3 pivots $p_1^1$, $p_1^2$ and $p_1^3$ for cluster $C_1$ (see Section 4.3 for details). LIMS then maintains a learned index for each cluster separately. Since the index structure is the same for every cluster, we take cluster $C_1$ as an example. LIMS computes the distances from each object in the cluster to the well-chosen pivots, e.g., $d(o, p_1^1)$ in Fig. 1(b). The maximum and minimum distances from each pivot to the corresponding objects, e.g., $\mathit{maxd}_1^1$ and $\mathit{mind}_1^1$, are stored so as to support efficient queries. Since objects are sorted by their distance values, LIMS can learn a series of rank prediction models, e.g., $M_1^1$, $M_1^2$ and $M_1^3$ in Fig. 1(c), for quickly computing the rank of an object given its distance to a pivot. After that, a well-defined pivot-based mapping function is applied to transform each object into an element of an ordered set (Definitions 6 and 7, Section 4.2); we call the elements of this set LIMS values. Finally, we physically maintain all data objects sequentially on disk in ascending order of their LIMS values, and the relationship between LIMS values and the addresses of data objects in disk pages is learned by another rank prediction model, e.g., $M_1^*$ (see Section 4.2 for details).

In what follows, we first focus on the specific learned index structure of each cluster, and then return to the clustering and pivot selection methods. In other words, we assume for now that the data space has been partitioned and that the pivots in each cluster have been determined.

4.2 Index Structure

Suppose that $c$ clusters, say $C_1, \dots, C_c$, and $m$ pivots for each cluster, say $p_i^1, \dots, p_i^m$ for cluster $C_i$, have been determined. Then, for each cluster $C_i$ and pivot $p_i^j$, all data objects (unique identifiers) are sorted in ascending order of their distances to the pivot. Based on the $c \cdot m$ sorted lists, LIMS learns $c \cdot m$ rank prediction models. For model reuse, we define them formally as follows:

Definition 4 (Rank).

Let $S$ be a finite multiset drawn from an ordered set $(U, \le)$. For any element $x \in S$, we define the rank of $x$ as the number of elements smaller than $x$, i.e.,

$\mathit{rank}(x) = |\{y \in S \mid y < x\}|.$ (1)
Example 2.

Let $S = \{0.3, 0.8, 0.8, 1.4, 2.2\}$ be a multiset of distance values to a given pivot, sorted in ascending order. Then $\mathit{rank}(0.3) = 0$, $\mathit{rank}(0.8) = 1$ and $\mathit{rank}(2.2) = 4$.

Definition 5 (Rank Prediction Model).

Let $S$ be a finite multiset drawn from an ordered set $(U, \le)$. A rank prediction model $M$ is a function learned from $S$ so that it can predict the rank of any element $x \in U$, i.e.,

$M(x) \approx \mathit{rank}(x).$ (2)
Remark 3.

In the strict sense, rank is not defined for an element $x \in U \setminus S$. What we want to express in that case is the number of elements in $S$ smaller than $x$. Where no confusion arises, we still use rank for simplicity.

A series of rank prediction models (a.k.a. one-dimensional learned indexes) can be trained as follows: let $S = \{x_1, \dots, x_N\}$ be the sorted distance values; then the training set is $T = \{(x_t, \mathit{rank}(x_t))\}_{t=1}^{N}$. Let $M(x; \theta)$ be a polynomial function of $x$, and let the loss function be the squared error:

$L(\theta) = \sum_{t=1}^{N} \big( M(x_t; \theta) - \mathit{rank}(x_t) \big)^2.$ (3)

Then the rank prediction models can be determined by minimizing the loss using gradient descent. For example, if the degree of the polynomial is 2, then $M(x; \theta) = \theta_2 x^2 + \theta_1 x + \theta_0$, where $\theta_0, \theta_1, \theta_2$ are the parameters to be learned. After $M$ is trained, we can get the approximate rank of any object with distance $x \in S$, and the error can be easily corrected by exponential search in $O(\log \epsilon)$ time, where $\epsilon$ is the difference between the estimated rank and the correct rank. For an object with a distance $x \notin S$, we can also use $M$ and exponential search to find its corresponding rank, i.e., the rank of the first element larger than $x$ in $S$, with the same time complexity.
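As a concrete illustration, here is a minimal C++ sketch of this train-then-correct scheme (ours, under stated assumptions: a degree-2 polynomial, batch gradient descent whose illustrative learning rate assumes distances normalized to [0, 1], and duplicate distances trained toward their sorted positions):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Degree-2 polynomial rank prediction model M(x) = a*x^2 + b*x + c,
// fitted to (distance, rank) pairs by batch gradient descent.
struct RankModel {
    double a = 0, b = 0, c = 0;

    double predict(double x) const { return a * x * x + b * x + c; }

    void fit(const std::vector<double>& xs, int epochs = 5000, double lr = 0.1) {
        const double n = double(xs.size());
        for (int e = 0; e < epochs; ++e) {
            double ga = 0, gb = 0, gc = 0;
            for (std::size_t t = 0; t < xs.size(); ++t) {
                // Target: the sorted position t (approximate for duplicates;
                // the exponential search below corrects any residual error).
                double err = predict(xs[t]) - double(t);
                ga += 2 * err * xs[t] * xs[t];
                gb += 2 * err * xs[t];
                gc += 2 * err;
            }
            a -= lr * ga / n; b -= lr * gb / n; c -= lr * gc / n;
        }
    }
};

// Exponential search around the model's guess: returns the number of
// elements of xs (sorted ascending) that are strictly smaller than x.
std::size_t corrected_rank(const std::vector<double>& xs, const RankModel& m, double x) {
    if (xs.empty()) return 0;
    long guess = std::lround(m.predict(x));
    guess = std::max(0L, std::min(guess, long(xs.size()) - 1));
    long lo = guess, hi = guess, step = 1;
    while (lo > 0 && xs[lo - 1] >= x) { lo = std::max(0L, lo - step); step *= 2; }
    step = 1;
    while (hi < long(xs.size()) && xs[hi] < x) { hi = std::min(long(xs.size()), hi + step); step *= 2; }
    while (lo < hi) {  // binary search inside the bracket for the first element >= x
        long mid = (lo + hi) / 2;
        if (xs[mid] < x) lo = mid + 1; else hi = mid;
    }
    return std::size_t(lo);
}

corrected_rank() first brackets the true boundary by doubling steps away from the model's guess and then binary-searches inside the bracket, so the correction costs $O(\log \epsilon)$ comparisons, with $\epsilon$ the prediction error.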

In order to accelerate query processing (to be introduced in Section 5), LIMS divides the ranks of the data objects into $v$ equal parts, i.e., covers the data in the cluster with $v$ super rings as evenly as possible. Fig. 2 gives two examples in a metric space with the Euclidean distance. In Fig. 2(a), the numbers of pivots and rings are specified as 1 and 3, respectively, so all data objects in the cluster are partitioned into 3 rings w.r.t. the single pivot such that each ring includes about 5 data objects. In Fig. 2(b), the numbers of pivots and rings are specified as 2 and 3, respectively, so all data objects are partitioned into 6 rings, 3 w.r.t. each pivot. In this way, LIMS avoids the skewed situation where many objects share the same ring ID as defined in Equation (4) while only a few objects have distinct ring IDs.

Fig. 2: Examples of data partitioning

The ring ID of an object $o$ w.r.t. pivot $p_i^j$, i.e., which ring the data object is located in, denoted as $\mathit{rid}_i^j(o)$, can be computed from the object's rank:

$\mathit{rid}_i^j(o) = \big\lfloor \mathit{rank}(d(o, p_i^j)) \cdot v \,/\, |C_i| \big\rfloor,$ (4)

so that each of the $v$ super rings covers roughly the same number of objects.
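Assuming the reconstruction of Equation (4) above, the ring assignment is a one-line scaling of the corrected rank; a C++ sketch (names are ours):

#include <algorithm>
#include <cstddef>

// Equal-frequency ring assignment per Equation (4): an object's ring ID
// w.r.t. a pivot is its (error-corrected) rank scaled into v buckets, so
// each super ring holds about cluster_size / v objects no matter how
// skewed the raw distance values are. rank comes from corrected_rank().
int ring_id(std::size_t rank, std::size_t cluster_size, int v) {
    int rid = static_cast<int>(rank * static_cast<std::size_t>(v) / cluster_size);
    return std::min(rid, v - 1);  // clamp the maximal-rank object into ring v - 1
}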

When the above steps are completed, each data object is equipped with $m$ ring IDs. Then, LIMS designs a novel pivot-based mapping function to transform all data objects in the metric space into ordered sets with an associated binary relation $\preceq$. The formal definitions are as follows.

Definition 6 (Pivot-based mapping function).

Given a cluster $C_i$, $1 \le i \le c$, and its corresponding pivots $p_i^1, \dots, p_i^m$, let $\{\mathit{rid}_i^1, \dots, \mathit{rid}_i^m\}$ be the set of ring ID functions. Then, for any data object $o \in C_i$, we define a pivot-based mapping function $\phi_i$ as follows:

$\phi_i(o) = \big( \mathit{rid}_i^1(o), \mathit{rid}_i^2(o), \dots, \mathit{rid}_i^m(o) \big).$ (5)

We call $\phi_i(o)$ the LIMS value of $o$.

Example 3.

Consider the cluster in Fig. 2(b) with $m = 2$ pivots and $v = 3$ rings. Each data object $o$ receives two ring IDs, $\mathit{rid}^1(o)$ w.r.t. the first pivot and $\mathit{rid}^2(o)$ w.r.t. the second; for example, an object lying in the innermost ring of the first pivot and the middle ring of the second has ring IDs 0 and 1, and, according to the definition of the pivot-based mapping function, its LIMS value is $(0, 1)$. The ring IDs of the other data objects can be computed similarly.

In order to build a learned index structure on LIMS values, we need to impose a binary relation on LIMS values, as follows:

Definition 7 (Binary Relation $\preceq$).

Let $\Phi_i$ be the multiset of LIMS values in cluster $C_i$ with $m$ pivots, $1 \le i \le c$. Then, $\Phi_i$ can be ordered as follows:

$\phi_i(o) \preceq \phi_i(o')$ (6)

if and only if condition 1 or 2 is satisfied:

  1. $\mathit{rid}_i^j(o) = \mathit{rid}_i^j(o')$ for all $1 \le j \le m$; (7)
  2. there exists $1 \le j \le m$ such that $\mathit{rid}_i^j(o) < \mathit{rid}_i^j(o')$ and $\mathit{rid}_i^l(o) = \mathit{rid}_i^l(o')$ for all $1 \le l < j$, (8)

where $=$ and $<$ in the conditions refer to the order of natural numbers.

It is straightforward to prove that the binary relation $\preceq$ in Definition 7 is well-defined, i.e., it is reflexive, antisymmetric and transitive [29], so $(\Phi_i, \preceq)$ is an ordered set. In our implementation, we use the concatenation of the ring IDs as the LIMS value, which obviously satisfies the conditions in Definition 7.

Example 4.

Reconsider Example 3. LIMS values are compared component by component under $\preceq$; for instance, $(0, 1) \preceq (0, 2) \preceq (1, 0)$, since the ring IDs w.r.t. the first pivot are compared first and ties are broken by the ring IDs w.r.t. the second.
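Since the paper implements LIMS values as the concatenation of ring IDs, one compact realization is to pack the $m$ ring IDs into a single integer in base $v$; numeric order on the packed keys then coincides with the relation $\preceq$ of Definition 7. A minimal C++ sketch (the helper name lims_key is ours):

#include <cstdint>
#include <vector>

// Pack the LIMS value (rid^1, ..., rid^m) into one integer by
// concatenating the ring IDs in base v. Each digit lies in [0, v), so
// comparing packed keys numerically is exactly the lexicographic
// comparison of Definition 7; sorting objects by key realizes the order.
uint64_t lims_key(const std::vector<int>& rids, int v) {
    uint64_t key = 0;
    for (int rid : rids) key = key * uint64_t(v) + uint64_t(rid);
    return key;
}

For instance, with $v = 3$ the LIMS values $(0, 1)$, $(0, 2)$ and $(1, 0)$ from Example 4 are packed into the keys 1, 2 and 3, respectively, so sorting by key reproduces the order above.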

Now that all objects in the metric space are transformed into elements of the corresponding ordered sets, we can sort them sequentially in ascending order of their LIMS values and store the data in a number of disk pages with each page fully utilized. To quickly locate the addresses of data objects, LIMS learns one more rank prediction model per cluster. Specifically, let $\Phi_i$ be the multiset of LIMS values in cluster $C_i$ and $T = \{(\phi, \mathit{rank}(\phi)) \mid \phi \in \Phi_i\}$; then a rank prediction model $M_i^*$ can be learned based on $T$. Similar to the loss function in Equation (3), we still minimize the squared error. After $M_i^*$ is trained, we can get the approximate rank (address) of any object with LIMS value $\phi \in \Phi_i$; the error can be easily corrected by an exponential search that stops when the first occurrence of $\phi$ is found. For an object with a LIMS value $\phi \notin \Phi_i$, we can also use $M_i^*$ and exponential search to find its corresponding rank, i.e., the position of the first occurrence of the smallest element larger than $\phi$ in $\Phi_i$.

4.3 Data Clustering and Pivot Selection

As mentioned in [20], one challenge of replacing traditional tree-like indexes with learned indexes is that it is difficult to approximate complex data distributions with a single model. If we construct a learned index on the whole dataset directly, the function to be learned would be very steep in regions where data objects are dense and very flat in sparse regions. Such a complicated relationship can be fit by a neural network, but that incurs expensive query costs in practice. Based on the observation that real-life data are usually clustered and correlated [3], LIMS instead groups the underlying data into a number of clusters and maintains a learned index for each cluster. This strategy offers two advantages: first, the data distribution of each cluster becomes simpler, which simplifies the model to be learned (e.g., a polynomial function); second, simple models have lower query and (re-)construction costs. In this paper, we simply adopt the k-center algorithm [16], a simple yet effective algorithm that is guaranteed to return a 2-approximation of the optimal centroid set. We can follow the same steps to build LIMS on top of other clustering algorithms such as k-means [33], which can potentially further improve our approach. Different from a general clustering problem, where the number of clusters can be flexible, the number of clusters used for indexing affects the index structure and search performance. Therefore, we propose a statistic to determine the number of clusters $c$, to be discussed in detail in Section 5.4. For now, we assume that $c$ has been determined.

Once the clusters are obtained, LIMS picks a few data objects, named pivots [5], for each cluster and computes the distances from each data object in the cluster to the pivots. This strategy offers three advantages: first, we can use the triangle inequality on these pre-computed distances to prune the search space; second, learned indexes can be built naturally on these one-dimensional distance values; third, redistributing the data with reference to well-chosen pivots may effectively ease the curse of dimensionality, because metric search performance depends critically on the intrinsic dimension, a property of the data distribution itself, as opposed to the dimension in which the data is represented [25, 4, 10, 18]. For example, the intrinsic dimension of a plane is two no matter whether it is embedded in a higher-dimensional space. While LIMS does not depend on the underlying pivot selection method, the number and locations of the pivots influence retrieval performance. The more high-quality pivots there are, the more information they provide and the higher the pruning power for query processing; however, the time taken to check the pruning conditions also increases (for LIMS, this mainly refers to the cost of generating search intervals, to be discussed in detail in Section 5). Numerous methods for pre-defining an optimal set of pivots have been proposed; a good survey can be found in [36]. In our implementation, we adopt the farthest-first-traversal (FFT) algorithm [16] because of its linear time and space complexity.
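Below is a minimal C++ sketch of FFT over one cluster (ours; the callable distance parameter and the seeded random choice of the first object are illustrative assumptions). It performs exactly one pass over the cluster per pivot, matching the linear time and space complexity cited above.

#include <algorithm>
#include <limits>
#include <random>
#include <vector>

// Farthest-first traversal: greedily pick each next pivot as the object
// farthest from all pivots chosen so far, under an arbitrary metric dist.
// Costs O(m * |cluster|) distance computations and O(|cluster|) space.
template <typename Obj, typename Dist>
std::vector<int> fft_pivots(const std::vector<Obj>& cluster, int m, Dist dist) {
    std::vector<int> pivots;
    std::vector<double> d_min(cluster.size(), std::numeric_limits<double>::max());
    std::mt19937 gen(42);
    int cur = std::uniform_int_distribution<int>(0, int(cluster.size()) - 1)(gen);
    for (int t = 0; t < m; ++t) {
        pivots.push_back(cur);
        int far = 0;
        for (std::size_t i = 0; i < cluster.size(); ++i) {
            // Maintain each object's distance to its nearest chosen pivot.
            d_min[i] = std::min(d_min[i], dist(cluster[i], cluster[cur]));
            if (d_min[i] > d_min[far]) far = int(i);
        }
        cur = far;  // the object farthest from the current pivot set
    }
    return pivots;
}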

5 LIMS-based Query Processing

In this section, we proceed to present the query processing algorithms for point, range and $k$NN queries using LIMS. Section 5.3 explains dynamic updates. Section 5.4 discusses the choice of the number of clusters $c$. Without loss of generality, we assume that no two objects in $O$ are identical.

5.1 Range Query

Input: q: a query object; r: a query radius
Output: R: objects in O satisfying d(o, q) ≤ r
1   mark every cluster as relevant; P := ∅; R := ∅;
2   for each cluster C_i do /* TriPrune */
3       for each pivot p_i^j do
4           if d(q, p_i^j) - r > maxd_i^j OR d(q, p_i^j) + r < mind_i^j then
5               mark C_i as irrelevant; break;
6   for each relevant cluster C_i do /* AreaLocate */
7       for each pivot p_i^j do
8           l_i^j := max(mind_i^j, d(q, p_i^j) - r); u_i^j := min(maxd_i^j, d(q, p_i^j) + r);
9           predict the ranks of l_i^j and u_i^j with M_i^j and fix the errors by exponential search;
10          derive the affected ring ID interval [ridmin_i^j, ridmax_i^j] from the two ranks;
11  generate LIMS-value search ranges from the ring ID intervals by DFS on a DAG; /* IntervalGen */
12  for each LIMS-value search range do /* PosLocate */
13      predict the disk positions of its two endpoints with M_i^* and fix the errors by exponential search;
14      add all unvisited pages between the two positions to P;
15  add to R all objects saved in P satisfying d(o, q) ≤ r; /* refinement */
16  return R
Algorithm 1 Range Query

Given a query object $q$, a query radius $r$ and the dataset $O$, a range query retrieves all objects in $O$ within distance $r$ of $q$, i.e., $RQ(q, r) = \{o \in O \mid d(o, q) \le r\}$. Algorithm 1 outlines range query processing, which consists of four subprocedures. TriPrune (Lines 2–5): prune irrelevant clusters by the triangle inequality. AreaLocate (Lines 6–10): determine the affected areas of the relevant clusters with the help of the rank prediction models $M_i^j$. IntervalGen (Line 11): generate search intervals over LIMS values. PosLocate (Lines 12–14): locate the positions of the records on disk with the rank prediction models $M_i^*$.

TriPrune (Lines 2–5). The algorithm starts by computing the distances between the query and the pivots, and then utilizes the triangle inequality of the metric space to prune a number of irrelevant clusters and thus accelerate the search. Specifically, according to the triangle inequality, if an object $o$ in cluster $C_i$ falls into the query range, it must satisfy the following for all $1 \le j \le m$:

$d(q, p_i^j) - r \le d(o, p_i^j) \le d(q, p_i^j) + r.$ (9)

Recall that LIMS also maintains the maximum distance $\mathit{maxd}_i^j$ and the minimum distance $\mathit{mind}_i^j$ of the cluster; hence, for any object $o$ in the cluster, we must have, for all $1 \le j \le m$:

$\mathit{mind}_i^j \le d(o, p_i^j) \le \mathit{maxd}_i^j.$ (10)

Combining Equations (9) and (10), we can derive that a cluster needs to be further checked if and only if the following condition (Line 4) is satisfied for all $1 \le j \le m$:

$d(q, p_i^j) - r \le \mathit{maxd}_i^j \;\wedge\; d(q, p_i^j) + r \ge \mathit{mind}_i^j.$ (11)
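In code, the TriPrune test of Equation (11) amounts to one interval-intersection check per pivot; a minimal C++ sketch (function and parameter names are ours):

#include <cstddef>
#include <vector>

// Cluster C_i survives TriPrune only if, for every pivot j, the query
// ball interval [d(q, p) - r, d(q, p) + r] intersects the cluster's
// distance interval [mind, maxd] w.r.t. that pivot (Equation (11)).
bool cluster_relevant(const std::vector<double>& d_q_pivot,
                      const std::vector<double>& mind,
                      const std::vector<double>& maxd, double r) {
    for (std::size_t j = 0; j < d_q_pivot.size(); ++j)
        if (d_q_pivot[j] - r > maxd[j] || d_q_pivot[j] + r < mind[j])
            return false;  // one empty intersection suffices to prune
    return true;
}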

AreaLocate (Lines 6–10). For a cluster $C_i$ that needs to be searched, LIMS first determines the affected areas in the metric space according to the following equations:

$l_i^j = \max(\mathit{mind}_i^j,\ d(q, p_i^j) - r),$ (12)

$u_i^j = \min(\mathit{maxd}_i^j,\ d(q, p_i^j) + r),$ (13)

where $1 \le j \le m$. Then, LIMS invokes the corresponding rank prediction models $M_i^j$ to predict the min and max ranks of the affected areas, and the prediction errors are fixed via exponential search (Line 9). The min ring ID can be easily calculated by calling the ring ID function, while the max ring ID is divided into 2 cases in order to narrow down the search range as much as possible (Line 10).

IntervalGen (Line 11). Instead of intersecting several candidate sets in the metric space via costly distance computations, LIMS reduces the search space by intersecting LIMS-value intervals directly. Specifically, for a relevant cluster $C_i$, let $A_j = \{\mathit{rid}_{\min}^j, \mathit{rid}_{\min}^j + 1, \dots, \mathit{rid}_{\max}^j\}$ denote the affected ring IDs w.r.t. pivot $p_i^j$, $1 \le j \le m$. Depth-first search (DFS) is run on a directed acyclic graph (DAG) composed of the vertices in $A_1, \dots, A_m$ and fully connected edges from $A_j$ to $A_{j+1}$, $1 \le j < m$, to find all paths from $A_1$ to $A_m$. These paths form a total of $\prod_{j=1}^{m} |A_j|$ LIMS-value search ranges. It is through IntervalGen that the number of data objects to be accessed is significantly reduced, while the pruning cost remains low. Here is an example of this step.

Example 5.

Consider a cluster with $m = 3$ pivots and $v = 3$ rings. Suppose it is relevant to a range query such that the minimum and maximum ring IDs of the affected region w.r.t. the 1st, 2nd and 3rd pivots are 1 and 2, 0 and 1, and 1 and 2, respectively, i.e., $A_1 = \{1, 2\}$, $A_2 = \{0, 1\}$ and $A_3 = \{1, 2\}$. Then, $2 \times 2 \times 2 = 8$ LIMS-value search ranges can be computed by running DFS on the DAG shown in Fig. 3. The final search ranges are the union of the ranges for the LIMS values $(1,0,1)$, $(1,0,2)$, $(1,1,1)$, $(1,1,2)$, $(2,0,1)$, $(2,0,2)$, $(2,1,1)$ and $(2,1,2)$.
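The DFS of the IntervalGen step can be viewed as enumerating the cartesian product of the per-pivot ring ID intervals; a minimal C++ sketch (ours), reusing lims_key() from the Section 4.2 sketch. Each emitted key is subsequently mapped to its disk range in the PosLocate step.

#include <cstddef>
#include <cstdint>
#include <vector>

uint64_t lims_key(const std::vector<int>& rids, int v);  // from the Section 4.2 sketch

// Enumerate every LIMS value whose j-th ring ID lies in
// [rid_min[j], rid_max[j]] (i.e., all paths of the IntervalGen DAG)
// by depth-first search over the pivot levels.
void interval_gen(const std::vector<int>& rid_min, const std::vector<int>& rid_max,
                  int v, std::size_t j, std::vector<int>& cur,
                  std::vector<uint64_t>& out) {
    if (j == rid_min.size()) { out.push_back(lims_key(cur, v)); return; }
    for (int rid = rid_min[j]; rid <= rid_max[j]; ++rid) {
        cur.push_back(rid);
        interval_gen(rid_min, rid_max, v, j + 1, cur, out);  // descend to the next pivot
        cur.pop_back();
    }
}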

PosLocate (Lines 12–14). For each search range, the position of its lower bound on disk can be easily located via the rank prediction model $M_i^*$ and exponential search; the page ID follows by dividing the rank by $\Omega$, the maximum number of objects each page can hold. The upper bound is divided into 2 cases to make sure the results are correct. For example, it is possible that different objects have the same LIMS value, so we find the last occurrence of the LIMS value via another exponential search to guarantee that no objects are missed (i.e., no false negatives). Finally, all retrieved objects are further refined in the refinement step (Line 15), where exact distance computations are performed.

Fig. 3: An illustration of finding LIMS-value search ranges

Correctness. To prove that Algorithm 1 offers exact answers for a range query, we need to show that 1) all data objects in the result set satisfy $d(o, q) \le r$, i.e., no false positives; and 2) no objects satisfying $d(o, q) \le r$ are missed, i.e., no false negatives. Obviously, no false positives can be returned, because a final refinement step is applied in which exact distance computations guarantee that all data objects in the result set satisfy $d(o, q) \le r$. We prove that there are no false negatives by contradiction. Assume there exists an object $o^*$ in cluster $C_i$ that satisfies $d(o^*, q) \le r$ but is not returned. According to the triangle inequality, we know that $d(q, p_i^j) - r \le d(o^*, p_i^j) \le d(q, p_i^j) + r$ for every pivot $p_i^j$. Since $\mathit{mind}_i^j \le d(o^*, p_i^j) \le \mathit{maxd}_i^j$, it follows that $d(q, p_i^j) - r \le \mathit{maxd}_i^j$ and $d(q, p_i^j) + r \ge \mathit{mind}_i^j$, which indicates that the cluster (and thus $o^*$) will not be pruned in the TriPrune step (Equation (11)). Moreover, the two bounds together imply $l_i^j \le d(o^*, p_i^j) \le u_i^j$ (Equations (12) and (13)). Although the rank prediction models have errors, the errors are fixed by exponential search; thus, after the AreaLocate step, the ring ID of $o^*$ w.r.t. each pivot lies within the corresponding affected interval $[\mathit{rid}_{\min}^j, \mathit{rid}_{\max}^j]$. According to the procedure of generating LIMS-value search ranges in the IntervalGen step, there must exist a search range containing the LIMS value of $o^*$; based on the binary relation $\preceq$ in Definition 7, $o^*$ falls in that search range, and thus in the union of all search ranges after the IntervalGen step. The rank prediction model in the PosLocate step may also have an error, but it too is fixed by exponential search. Hence, exact distance computations between all objects in the search ranges and the query object are performed, and $o^*$ will be returned, which leads to a contradiction. Therefore, Algorithm 1 answers range queries correctly.

Fig. 4: An example of range query based on LIMS

Query Cost. TriPrune takes $O(c \cdot m \cdot \delta)$ time, where $\delta$ represents the cost of one distance computation. The cost of AreaLocate depends on the rank prediction models. We use $t_p$ and $t_e$ to denote the prediction cost of a model and the cost of fixing the model error via exponential search, respectively. Hence, we need $O(c' \cdot m \cdot (t_p + t_e))$ time to locate the affected areas of the $c'$ relevant clusters. The cost of IntervalGen comes from running DFS on the DAG, which takes $O(V + E)$ time, where $V$ is the number of vertices and $E$ is the number of edges. Similar to the AreaLocate subprocedure, PosLocate takes $O(s \cdot (t_p + t_e))$ time, where $s$ is the number of LIMS-value search ranges; generally, $s$ is far smaller than the dataset cardinality. In addition, we need to access the disk pages in $P$ to refine and retrieve the final result, which takes $O(|P| \cdot \Omega \cdot \delta)$ time. The overall query time is the sum of these terms.

Example 6.

Fig. 4 shows an example of a LIMS-based range query over two clusters $C_1$ and $C_2$. It is straightforward to see that cluster $C_2$ does not satisfy Equation (11) and can thus be discarded from further processing directly, while cluster $C_1$ cannot be discarded. As for $C_1$, LIMS first feeds the min and max boundaries of the affected areas (the purple dashed lines in Fig. 4(a)) into the corresponding rank prediction models and ring ID functions. Then, LIMS runs DFS on the DAG to transform the intersection of several candidate sets in the metric space (the grey region in Fig. 4(a)) into the intersection of LIMS-value intervals (the grey region in Fig. 4(b)), which significantly reduces the number of distance computations and page accesses. Next, the positions of objects on disk are estimated by the rank prediction model $M_1^*$. Finally, candidate objects are retrieved and a refinement step is applied (Fig. 4(c)).

5.2 kNN Query

In LIMS, a $k$NN query is processed by conducting a series of range queries with increasing search radius $r_0$, $r_0 + \Delta r$, $r_0 + 2\Delta r, \dots$, until the $k$ nearest neighbors are found, where $r_0$ is a given small initial radius. Algorithm 2 outlines the LIMS-based $k$NN query. Given a query $q$ and $k$, a range query with radius $r_0$ is issued at the beginning; to answer it, LIMS invokes Algorithm 1 (Line 3). $Q$ is a max priority queue that records the candidate nearest neighbors. If the distance from a retrieved object to the query object is smaller than the current furthest distance in $Q$, LIMS extracts the object with the maximum distance value from $Q$ and inserts the new object (Lines 10–11). The search stops if and only if the furthest object in the current priority queue falls within the current query range, since further expansion of the query radius cannot then change the answer set (Line 5). Otherwise, LIMS enlarges the query radius by $\Delta r$ (Line 6) and invokes Algorithm 1 again until the termination condition is satisfied. LIMS also maintains an array recording whether a page has been processed; processed pages are skipped during subsequent range queries to avoid repeated accesses (Line 4). The number of range query calls depends on $r_0$, $\Delta r$ and the distance $d_k$ between $q$ and the $k$-th NN in the dataset, and can be bounded by $\lceil (d_k - r_0) / \Delta r \rceil + 1$.

Input: q: a query object; k: a positive integer; r_0: a positive number
Output: Q: the top-k nearest neighbors of q
1   r := r_0; Q := a max priority queue of k entries, each initialized with distance +∞;
2   loop
3       call RangeQuery(q, r) (Algorithm 1) to get the set P of qualifying pages;
4       call Update(Q, P) and mark the pages in P as visited;
5       if the furthest distance in Q is at most r then return Q;
6       r := r + Δr;
7   procedure Update(Q, P):
8   for each unvisited page pg ∈ P do
9       for each object o in pg do
10          if d(o, q) < the furthest distance in Q then
11              Extract-Max(Q); Insert(Q, o);
Algorithm 2 kNN Query
Remark 4.

Intuitively, the initial radius affects the number of range query calls and thus the query efficiency. We observed that a small $r_0$ does not degrade query performance severely. The extra cost of answering a $k$NN query by range queries with increasing radius mainly comes from 1) traversing the index multiple times and 2) accessing the same pages multiple times. In LIMS, a query is processed by a function invocation in $O(1)$ time instead of the $O(\log n)$ time of tree-like indexes, which makes the cost of multiple traversals negligible; besides, LIMS does not access already visited pages. However, an overly large initial radius can degrade query performance. Therefore, we recommend a small initial search radius, which can be simply estimated from the distances between pairs of data objects sampled from the dataset.
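A hedged C++ sketch of such an estimate (the sample size and quantile below are our illustrative choices, not values prescribed by the paper):

#include <algorithm>
#include <random>
#include <vector>

// Estimate a small, data-driven initial kNN search radius r_0: sample
// random object pairs, compute their distances under the metric dist,
// and take a low quantile of the resulting empirical distribution.
template <typename Obj, typename Dist>
double estimate_initial_radius(const std::vector<Obj>& O, Dist dist,
                               int samples = 1000, double quantile = 0.01) {
    std::mt19937 gen(7);
    std::uniform_int_distribution<std::size_t> pick(0, O.size() - 1);
    std::vector<double> ds;
    for (int s = 0; s < samples; ++s) {
        std::size_t i = pick(gen), j = pick(gen);
        if (i != j) ds.push_back(dist(O[i], O[j]));
    }
    if (ds.empty()) return 0.0;
    std::sort(ds.begin(), ds.end());
    return ds[std::size_t(quantile * ds.size())];  // small r_0, rarely overshoots
}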

5.3 Updates

LIMS allows both insertions and deletions of data. To support efficient insertion, LIMS maintains a sorted array for each cluster in ascending order of the distance values to the centroid. Given an object $o$ to be inserted, LIMS first runs a point query to find whether there is a page already containing $o$. If so, the algorithm terminates immediately. Otherwise, LIMS finds the cluster closest to $o$ and inserts $o$ into the sorted array of this cluster. All newly inserted data are arranged in a number of pages with each page fully utilized. During query processing, LIMS uses the triangle inequality and exponential search to retrieve the inserted objects matching the query filters. Given an object $o$ to be deleted, LIMS first runs a point query for $o$. If $o$ is found, it is marked as 'deleted'. Then, LIMS updates the maximum and minimum distances to each pivot of the cluster that $o$ belongs to. Due to space limitations, we omit the pseudocode here. Such a simple update strategy is effective and efficient because 1) LIMS partitions the data space into many clusters, which amortizes the additional search time on the sorted arrays; 2) LIMS maintains an index for each cluster separately, which allows partially rebuilding the index, i.e., retraining the rank prediction models for certain clusters, especially if deletions invalidate a cluster or insertions result in too much overlap between clusters (the procedure of retraining rank prediction models is the same as that of training them); 3) the short index construction and reconstruction time of LIMS ensures the feasibility of this strategy in practice (Section 6).
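A hedged C++ sketch of the per-cluster insertion buffer described above (type and member names are ours; the actual system keeps inserted objects in disk pages rather than an in-memory vector):

#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

// Newly inserted object ids, kept sorted by distance to the cluster
// centroid, so no rank prediction model has to be retrained on insert.
struct InsertBuffer {
    std::vector<std::pair<double, int>> items;  // (distance to centroid, object id)

    void insert(double d_centroid, int id) {
        auto entry = std::make_pair(d_centroid, id);
        // O(n) insertion keeps the array sorted; acceptable for a sketch.
        items.insert(std::upper_bound(items.begin(), items.end(), entry), entry);
    }

    // Candidates for a range query: by the triangle inequality, only
    // objects with centroid distance in [dq - r, dq + r] can qualify,
    // where dq = d(q, centroid); they form one contiguous slice.
    std::vector<int> candidates(double dq, double r) const {
        std::vector<int> ids;
        auto lo = std::lower_bound(items.begin(), items.end(),
                                   std::make_pair(dq - r, INT_MIN));
        for (auto it = lo; it != items.end() && it->first <= dq + r; ++it)
            ids.push_back(it->second);  // exact distances checked in refinement
        return ids;
    }
};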

5.4 Last Piece

The last piece of LIMS is to determine the number of clusters. The optimal number is related not only to the data distribution but also to the query workload. However, the query workload is not available during data clustering, and we do not assume a known query workload in this paper. Therefore, an alternative should be developed to pre-define the clustering parameter $c$. Recall that the goal of clustering in LIMS is to decompose complex and potentially correlated data into a few clusters. Ideally, these clusters are independent and each can be accurately fit by a linear function. However, in most real-world use cases, the data do not follow such a perfect pattern. On the one hand, overlap between clusters may incur extra pruning and refinement costs. On the other hand, an uneven intra-cluster distribution incurs more arithmetic and comparison operations. In order to pick a $c$ that avoids or reduces such overhead, we introduce the overlap rate (OR) and the mean absolute error (MAE) to evaluate the goodness of a clustering. OR quantifies the extent of overlap among clusters. It can be computed as:

$OR = \frac{\sum_{1 \le i < j \le c} o_{ij}}{\sum_{i=1}^{c} \mathit{maxd}_i}.$ (14)

Without confusion, we use $\mathit{maxd}_i$ to represent the distance of the furthest object in the $i$-th cluster from the centroid $\mathit{cen}_i$, and $o_{ij}$ is the length of the overlapping area between clusters $i$ and $j$, computed as:

$o_{ij} = \max\big(0,\ \mathit{maxd}_i + \mathit{maxd}_j - d(\mathit{cen}_i, \mathit{cen}_j)\big).$ (15)

MAE quantifies the quality of the linear regression fit for each cluster. It can be computed as:

$MAE = \frac{1}{|O|} \sum_{i=1}^{c} \sum_{o \in C_i} \big| M_i(d(o, \mathit{cen}_i)) - \mathit{rank}(o) \big|,$ (16)

where $|O|$ is the cardinality of the dataset and $M_1, \dots, M_c$ are linear rank prediction models learned from the distances to the centroids.

Datasets    Cardinality              Dim.              Intr. dim.  Metric
Color       1,281,167                32                4.2         L2-norm
Forest      565,892                  6                 1.5         L2-norm
GaussMix    5, 10, 20, 40, 60, 80M   2, 4, 8, 12, 16   6.2         L2-norm
Skewed      10M                      2, 4, 8, 12, 16   5.6         L1-norm
Signature   100K                     65                36          Edit distance
TABLE II: Summary of datasets

We model the overhead as $OR + w \cdot MAE$, where $w$ is a user-defined weight. Inspired by the elbow method [31], we choose the elbow (or knee) of the overhead curve as the clustering number to use, i.e., a point where adding more clusters does not give a much better modeling of the data. The number estimated by this technique turns out to be very close to the optimal number of clusters observed in practice, as shown in Section 6.
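The paper does not spell out how the elbow is detected, so the following C++ sketch adopts one common knee heuristic as an assumption: evaluate the overhead $OR + w \cdot MAE$ at candidate values of $c$ and pick the point farthest from the chord joining the endpoints of the curve.

#include <cmath>
#include <cstddef>
#include <vector>

// Knee detection heuristic: given candidate cluster counts cs and their
// overhead values OR + w * MAE, return the index of the point with the
// largest perpendicular distance to the chord between the endpoints.
std::size_t pick_elbow(const std::vector<double>& cs,
                       const std::vector<double>& overhead) {
    double x1 = cs.front(), y1 = overhead.front();
    double x2 = cs.back(), y2 = overhead.back();
    double norm = std::hypot(x2 - x1, y2 - y1);
    if (norm == 0) return 0;  // degenerate curve: nothing to choose
    std::size_t best = 0;
    double best_d = -1;
    for (std::size_t i = 0; i < cs.size(); ++i) {
        // Distance from (cs[i], overhead[i]) to the line through the endpoints.
        double d = std::fabs((y2 - y1) * cs[i] - (x2 - x1) * overhead[i]
                             + x2 * y1 - y2 * x1) / norm;
        if (d > best_d) { best_d = d; best = i; }
    }
    return best;  // index of the recommended number of clusters
}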

6 Experiments

In this section, we present the results of an in-depth experimental study of LIMS. We implement LIMS (https://github.com/learned-index/LIMS) and the associated similarity search algorithms in C++. All experiments are conducted on a computer running 64-bit Ubuntu 20.04 with a 2.30 GHz Intel(R) Xeon(R) Gold 5218 CPU, 254 GB RAM, and an 8.2 TB hard disk.

(a) Estimated effect of $c$
(b) Effect of $c$ on range query
(c) Effect of $m$ on range query
(d) Effect of $v$ on range query
Fig. 5: Effect of parameters

6.1 Experimental Settings

Datasets. We employ two real-world datasets, namely Color Histogram (https://image-net.org/download-images) and Forest Cover Type (https://www.kaggle.com/c/forest-cover-type-prediction/data), following the experimental settings of the ML index [11]. Color Histogram contains 1,281,167 32-dimensional image features extracted from the ImageNet dataset. Forest Cover Type is collected by the US Geological Survey and the US Forest Service; it includes 565,892 records, each of which has 12 cartographic variables, of which we extract 6 quantitative variables as our data objects. Following the experimental settings of iDistance [17], we generate 2, 4, 8, 12, 16-dimensional GaussMix datasets. Every dataset contains up to 80 million points (5.36 GB in size) sampled from 150 normal distributions with a standard deviation of 0.05 and randomly determined means; we use 10 million points and 8 dimensions by default. The $L_2$-norm is used for the above datasets. Following the experimental settings of RSMI [26], we create 2, 4, 8, 12, 16-dimensional Skewed datasets. They are generated from uniform data by raising the values in each dimension to powers so that the distribution becomes skewed. The size of each dataset is 10 million and the $L_1$-norm is employed. Without loss of generality, all the data values of the above datasets are normalized to the range [0, 1]. Following the experimental settings in [30], we also generate a Signature dataset, where each object is a string of 65 English letters. We first obtain 25 'anchor signatures' whose letters are randomly chosen from the alphabet. Then, each anchor produces a cluster with 4,000 objects, each of which is obtained by randomly changing $t$ positions of the corresponding anchor signature to other random letters, where $t$ is uniformly distributed over a fixed range. The edit distance is used to compute the distance between two signatures. Table II summarizes the statistics of the datasets.

Competitors. We compare LIMS with three representative multi-dimensional learned indexes discussed in Section 2, i.e., ZM [32], ML [11] and LISA [21], and three traditional indexes, i.e., the R*-tree [1], the M-tree [9] and the SPB-tree [6, 7]. ZM and ML are in-memory indexes, so we adapt them to disk by storing data in ascending order of their z-order/mapped values in a number of pages with each page fully utilized. For LISA, no open-source C++ code is available, so we implement it following its Python implementation. For the R*-tree, M-tree and SPB-tree, we use the original implementations. In addition, to study the effectiveness of the learning components in LIMS, we design a method called the non-learned index for metric spaces (N-LIMS) by replacing the rank prediction models in LIMS with traditional B-trees [2]. All competitors are configured to use a fixed disk page size of 4 KB.

Evaluation Metrics. Four metrics are used to evaluate the performance of indexes: the average number of page accesses, the average query time, indexing time and index size. We randomly select 200 objects from each dataset and repeat each experiment 20 times to get average results.

6.2 Effect of Parameters

We first study the effect of the parameters, including the number of clusters $c$, the number of pivots $m$ and the number of rings $v$, to optimize LIMS-based similarity search, as summarized in Fig. 5. Only one parameter varies while the others are fixed to their default values in every experiment. By default, the selectivity of a range query, i.e., the fraction of objects within the query range among the total number of objects, is set to 0.01%, and the $k$ of the $k$NN query is 5. The polynomial degrees of the rank prediction models $M_i^j$ and $M_i^*$ are 20 and 1, respectively; $m$ and $v$ are 3 and 20. The number of clusters $c$ is determined according to the method described in Section 5.4 and is set accordingly for the 10M 8-dimensional GaussMix and 8-dimensional Skewed datasets as well as for Color Histogram, Forest Cover Type and Signature.

6.2.1 Effect of $c$

Fig. 5(a) plots the criterion $OR + w \cdot MAE$ versus $c$ on both real and synthetic data, with the weight $w$ fixed. We can see that different datasets have different elbow points. In order to show that the estimate is close to the actual optimal number of clusters, we also plot the actual average query time and the number of page accesses for range queries when varying $c$. Due to space limitations, we only report the performance on the 10M 8-dimensional GaussMix dataset in Fig. 5(b). It can be observed that the query time decreases only slowly beyond the elbow point, which is consistent with the choice of $c$ recommended by Fig. 5(a). Therefore, we set $c$ to the elbow value for this dataset.

6.2.2 Effect of $m$

Fig. 5(c) reports the query performance on the 10M 8-dimensional GaussMix dataset when varying the number of pivots $m$. We can see that increasing the number of pivots always reduces (or at least does not increase) the number of page accesses. This is expected, because the intersection of the metric regions defined by more pivots is always smaller than (or at most equal to) the intersection of the regions defined by fewer pivots. As discussed in Section 4.3, the more pivots, the stronger the pruning ability; this observation can also be derived from Equation (11). However, the average query time decreases only up to four pivots and then increases progressively, because the cost of filtering unqualified objects also grows with more pivots. The best number of pivots for a metric index is a trade-off between the filtering cost and the scanning cost. Therefore, the default value of $m$ is set to 3 unless otherwise stated.

6.2.3 Effect of $v$

Fig. 5(d) reports the query performance on the 10M 8-dimensional GaussMix dataset when varying the number of rings $v$. For the same reason as above, the average query time exhibits a down-and-up trend, with the lowest value at the default setting $v = 20$.

6.3 Range Query Performance

In this subsection, we study the performance of LIMS, ML, LISA, ZM, the R*-tree, the M-tree and the SPB-tree on range queries from different angles, as summarized in Figs. 6, 7 and 8.

(a) Query time on Skewed
(b) # Page accesses on Skewed
(c) Query time on GaussMix
(d) # Page accesses on GaussMix
Fig. 6: Range query performance with dimensionality
(a) Query time on Forest Cover Type
(b) # Page accesses on Forest
(c) Query time on Color
(d) # Page accesses on Color
Fig. 7: Range query performance with selectivity

6.3.1 Performance with dimensionality

The first set of experiments studies the average query time and the number of page accesses under different dimensionalities. Figs. 6(a)(b) and 6(c)(d) report the results on the Skewed and GaussMix datasets, respectively. From the figures, we have the following observations: 1) The average query time of all methods increases with dimensionality, but LIMS, ML and the SPB-tree grow much more slowly than the others, suggesting that data clustering and pivot-based data transformation are effective in alleviating the curse of dimensionality. The coordinate-based methods, i.e., LISA, ZM and the R*-tree, degrade rapidly with dimensionality and eventually fail to work, hence we do not report their results for the higher dimensionalities. 2) LIMS is slightly slower than LISA in the lowest dimensionalities on GaussMix, but LIMS offers the best performance in the higher range of dimensions on both datasets. The reason is that metric-space indexes can only use the distance properties to prune the search space, while LISA can easily locate the cells that overlap with the query range by coordinates; fewer assumptions about the data result in weaker pruning and slightly larger query costs in the low-dimensional case. However, the filtering cost of LISA increases exponentially with dimensionality, and it stops working beyond moderate dimensionalities. Note that on GaussMix, even though LIMS incurs more page accesses than LISA, it is still faster due to its low filtering cost. On Skewed, LIMS takes the lead from the start, because LIMS applies naturally to a metric space with any distance metric (the $L_1$-norm here) while guaranteeing few false positives and fast query responses. 3) LIMS is always better than ML in both query time and page accesses. This is intuitive, since ML transforms different objects equidistant from the pivot into the same one-dimensional value, while LIMS integrates the pruning abilities of multiple pivots. In addition, with the help of the well-defined pivot-based mapping, LIMS maps nearby data into compact regions, which further reduces the search region and thus the number of objects accessed in the refinement step. 4) LIMS outperforms the learned index ZM by over an order of magnitude. This is because LIMS uses clusters and LIMS values to organize objects into compact regions, while ZM uses the z-order curve, which incurs too many false positives and thus many page accesses and distance computations. 5) LIMS is always better than the traditional indexes. The main reason is that query processing with a traditional index requires traversing many tree nodes multiple times, which is time-consuming; the M-tree is omitted since it is considerably worse than the others. 6) The performance of all indexes on