Fast and memory-efficient ANN with a subset-search functionality
Existing approximate nearest neighbor (ANN) search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform well for the whole database, it is difficult to run a search over a subset of the database. Second, there has been no discussion of the performance degradation that occurs after many new items are added to a system. We develop a reconfigurable inverted index (Rii) to resolve these two issues. Based on the standard IVFADC system, we design a data layout in which items are stored linearly. This enables us to run a subset search efficiently by switching to a linear PQ scan if the size of the subset is small. Owing to the linear layout, the data structure can be dynamically adjusted after new items are added, maintaining the fast speed of the system. Extensive comparisons show that Rii achieves a performance comparable to state-of-the-art systems such as Faiss.
In recent years, approximate nearest neighbor (ANN) search has received increasing attention from various research communities (Gudmundsson et al., 2018). Typical ANN systems operate in two stages. In the offline indexing phase, database vectors are stored in the ANN system. These vectors may be converted to other forms, such as compact codes, for fast searching and efficient memory usage. In the online querying phase, the system receives a query vector, retrieves items similar to the query from the stored database vectors, and returns their identifiers (and optionally their distances to the query). To handle large datasets, this search should be not only fast and accurate, but also memory efficient.
Although many ANN methods have already been proposed, there are two critical problems of practical importance that have not received sufficient attention from the research community (Fig. 3).
Subset search (Fig. 3(a)): Once database vectors are stored, modern ANN systems can run a search efficiently over the whole database. Surprisingly, however, almost no systems can run a search over a subset of the database. For example, the state-of-the-art systems Faiss (Jégou et al., 2018) and Annoy (Bernhardsson, 2018) do not provide this functionality; see the discussion at https://github.com/facebookresearch/faiss/issues/322 and https://github.com/spotify/annoy/issues/263. As a concrete case, consider an image search problem, where the search is formulated as an ANN search over feature vectors, and each image also has a corresponding shooting date. Given a query image, an ANN system can easily find similar images in the whole dataset. However, it is not trivial to find similar images that were taken on a target date (say, May 28, 1987). Here, the search should be conducted not over the whole dataset, but over a subset of it, where the subset is specified by the identifiers of the target images. The straightforward solution is to run the search and then check whether each result was taken on May 28, but this post-checking can be drastically slow, especially if the size of the subset is small. Current ANN systems do not provide a clear solution to this problem.
Performance degradation via data addition (Fig. 3(b)): So far, the manner in which search performance degrades when items are newly added has not been discussed. The number of database items is typically assumed to be known when an ANN system is built, and the parameters of the system are optimized with this number in mind. In a practical scenario, however, new items are often added to the system later. Although the performance does not change while the number of new items is small, we can ask whether the system remains efficient after many items have been added. To put this another way, suppose that one would like to develop a search system that will eventually handle 1,000,000 vectors, but has only 1,000 vectors in the initial stage. In such a case, is the search fast even for 1,000 vectors?
We develop an ANN system that solves the above two problems, named the reconfigurable inverted index (Rii). The key idea is extremely simple: store the data linearly. Based on the well-known inverted file with product quantization (PQ) approach (IVFADC) (Jégou et al., 2011a), we design the data layout such that an item can be fetched by its identifier at a cost of O(1). This simple but critical modification enables us to search over a subset of the dataset efficiently by switching to a linear PQ scan if the size of the subset is small. Owing to this linear layout, the granularity of the coarse assignment step can easily be controlled by running the clustering again over the dataset whenever the user wishes. This means that the data structure can be adjusted dynamically after new items are added.
An extensive comparison with state-of-the-art systems, such as Faiss (Jégou et al., 2018), Annoy (Bernhardsson, 2018), Falconn (Razenshteyn and Schmidt, 2018), and NMSLIB (Naidan et al., 2018), shows that Rii achieves a comparable performance. For subset searches and data-addition problems for which the existing approaches do not perform well, we demonstrate that Rii remains fast in all cases.
Our contributions are summarized as follows.
Rii enables efficient searching over a subset of the whole database, regardless of the size of the subset.
Rii remains fast, even after many new items are added, because the data structure is dynamically adjusted for the current number of database items.
We review existing work that is closely related to our approach.
Locality-sensitive-hashing (LSH) (Datar et al., 2004)
can be considered one of the most popular branches of ANN. Hash functions are designed such that the probability of collision is higher for close points than for points that are far apart. Using these functions with hash tables, the nearest items can be found efficiently. Although LSH has been said to require a large amount of memory and to be less accurate than data-dependent methods, a recent well-tuned library (FALCONN (Andoni et al., 2015; Razenshteyn and Schmidt, 2018)) using the multi-probe technique (Lv et al., 2007) achieves a reasonable performance.
Space partitioning using projections or trees constitutes another significant branch of ANN. Especially in the computer vision community, one of the most widely employed methods is FLANN (Muja and Lowe, 2014). Recently, the random-projection-forest-based method Annoy (Bernhardsson, 2018) has achieved good performance for million-scale data.
Benchmark scores (Aumüller et al., 2017; Bernhardsson et al., 2018) show that graph-traversal-based methods (Malkov et al., 2014; Malkov and Yashunin, 2016) achieve the current best performance (the fastest search at a fixed recall) when the number of database items is around one million. These methods first create a graph, called a navigable small world, in which each node corresponds to a database item. Given a query, the algorithm starts from a random initial node and traverses the graph toward the node closest to the query. In particular, the hierarchical version HNSW (Malkov and Yashunin, 2016) with the highly optimized implementation NMSLIB (Boytsov and Naidan, 2013) represents the current state of the art. The drawback is that these methods tend to consume a large amount of memory and take a long time to build the data structure.
Product quantization (PQ) (Jégou et al., 2011a) and its extensions (Ge et al., 2014; Norouzi and Fleet, 2013; Babenko and Lempitsky, 2014; Martinez et al., 2016; Zhang et al., 2014, 2015; Babenko and Lempitsky, 2015b; Douze et al., 2016; Heo et al., 2014; Jain et al., 2016; Babenko and Lemitsky, 2017; Wang et al., 2015) are popular approaches to handling large-scale data. Our proposed Rii method also follows this line. PQ-based methods compress vectors into short memory-efficient codes. The Euclidean distance between an original vector and compressed code can be efficiently approximated using a lookup table. Current billion-scale search systems are usually based on PQ methods, especially combined with an inverted index-based architecture (Babenko and Lempitsky, 2015a; Kalantidis and Avrithis, 2014; Matsui et al., 2018b; Iwamura et al., 2013; Heo et al., 2016; Spyromitros-Xioufis et al., 2014; Xia et al., 2013). Hardware-based acceleration has also recently been discussed (André et al., 2015, 2017; Blalock and Guttag, 2017; Wieschollek et al., 2016; Johnson et al., 2017; Zhang et al., 2018; Liu et al., 2017). An efficient implementation proposed by the original authors is Faiss (Johnson et al., 2017; Jégou et al., 2018). An extensive survey is given in (Matsui et al., 2018a).
In this section, we review product quantization (PQ) (Jégou et al., 2011a). PQ compresses vectors into memory-efficient short codes. The squared Euclidean distance between an input vector and a compressed code can be approximated efficiently. Owing to this memory-efficient form, PQ has played a central role in large-scale ANN systems.
We first describe how to encode a vector. A D-dimensional input vector x ∈ R^D is split into M sub-vectors. Each (D/M)-dimensional sub-vector is compared to K pre-trained code words, and the identifier (an integer in {1, ..., K}) of the closest one is recorded. Using this, x is encoded as a tuple of M integers:

x̄ = (x̄_1, ..., x̄_M) ∈ {1, ..., K}^M,

where the m-th sub-vector of x is quantized into x̄_m. We refer to x̄ as the PQ-code for x. Note that x̄ is represented by M log2 K bits, and we set K to 256 so that each code is represented by M bytes.
Next, we show how to search over the PQ-codes given a query vector q ∈ R^D. First, a distance table A ∈ R^{M×K} is computed online by comparing the query to the code words. Here, A(m, k) is the squared Euclidean distance between the m-th part of q and the k-th code word of the m-th codebook. The squared Euclidean distance between the query and a database vector x can then be approximated using its PQ-code x̄ as follows:

d(q, x)^2 ≈ d_A(q, x̄)^2 = Σ_{m=1}^{M} A(m, x̄_m).
This is called an asymmetric distance computation (ADC) (Jégou et al., 2011a), and can be performed efficiently because only M fetches on A are required per code. A search over N PQ-codes therefore requires O(DK + NM).
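As a concrete illustration, the encoding and ADC steps above can be sketched as follows. This is a minimal NumPy sketch with toy, hypothetical sizes (the paper sets K = 256 and implements everything in C++); the codebooks here are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 8, 4, 16           # toy sizes; the paper sets K = 256
Ds = D // M                  # dimension of each sub-vector

# "Pre-trained" code words: one (K, Ds) codebook per sub-space (random here).
codebooks = rng.standard_normal((M, K, Ds))

def encode(x):
    """Quantize each sub-vector of x to the id of its nearest code word."""
    code = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * Ds:(m + 1) * Ds]
        code[m] = ((codebooks[m] - sub) ** 2).sum(axis=1).argmin()
    return code

def adc(q, code):
    """Approximate d(q, x)^2 with M lookups into the distance table A."""
    A = np.stack([((codebooks[m] - q[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
                  for m in range(M)])          # A[m, k] = ||q_m - c_{m,k}||^2
    return sum(A[m, code[m]] for m in range(M))

x, q = rng.standard_normal(D), rng.standard_normal(D)
code = encode(x)             # the M-byte PQ-code for x
approx = adc(q, code)        # approximates ||q - x||^2
```

Note that the ADC value is exactly the squared distance between q and the reconstruction of x from its code words, which is why the approximation quality depends only on the quantization error.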
Now, we introduce our proposed approach: the reconfigurable inverted index (Rii). Let us define a query vector q ∈ R^D, database vectors X = {x_n}_{n=1}^N, and target identifiers S ⊆ {1, ..., N}. The subset-search problem is to find, among the subset of X specified by S, the R items most similar to the query:

argmin-R_{n ∈ S} d(q, x_n)^2,

where the argmin-R operator finds the R arguments for which the objective function attains its (sorted) smallest values. The exact solution can be obtained by a time-consuming direct linear scan. Our goal is to approximately find the R nearest items in a fast and memory-efficient manner. Note that the problem reduces to the usual ANN search if the whole database is set as the subset: S = {1, ..., N}.
First, the input database vectors are encoded as PQ-codes {x̄_n}_{n=1}^N, where each x̄_n ∈ {1, ..., 256}^M. These PQ-codes are stored linearly, meaning that they are stored in a single long array. Given an identifier n, fetching x̄_n requires a computational cost of O(1).
The PQ-codes are then clustered into groups for inverted indexing. First, coarse centers are created by running the clustering algorithm of (Matsui et al., 2017) on the PQ-codes (or a subset of them). Note that each coarse center is itself a PQ-code. Using these coarse centers, the database PQ-codes are clustered into groups. The resulting assignments are stored as posting lists, where each posting list is the set of identifiers of the database vectors whose nearest coarse center is the corresponding one.
Note that the assignment function maps a PQ-code to its nearest coarse center, where the distance between two PQ-codes is measured by a symmetric distance function (Jégou et al., 2011a; Matsui et al., 2017). Finally, we store the PQ-codes, the posting lists, and the coarse centers as the data structure for Rii. The total theoretical memory usage is dominated by 8NM bits for the PQ-codes plus 32N bits for the identifiers in the posting lists, if an integer is represented by 32 bits. We will show in Sec. 5.5 that this theoretical value is almost the same as the measured value.
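The linear layout and the posting lists can be sketched as follows. This is a toy Python sketch with hypothetical sizes and random codes; the symmetric distance is the standard table-based one, computed from pre-tabulated code-word-to-code-word distances:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K, Ds = 1000, 4, 16, 2       # toy sizes (hypothetical)
ncoarse = 8                         # number of coarse centers / posting lists

codebooks = rng.standard_normal((M, K, Ds))
codes = rng.integers(0, K, size=(N, M), dtype=np.uint8)   # PQ-codes, one long array
centers = codes[rng.choice(N, ncoarse, replace=False)]    # coarse centers are PQ-codes too

# Symmetric distance between two PQ-codes: sum of pre-computed
# code-word-to-code-word squared distances in each sub-space.
dtable = ((codebooks[:, :, None, :] - codebooks[:, None, :, :]) ** 2).sum(-1)

def sym_dist(c1, c2):
    return sum(dtable[m, c1[m], c2[m]] for m in range(M))

posting_lists = [[] for _ in range(ncoarse)]
for n, c in enumerate(codes):
    k = min(range(ncoarse), key=lambda j: sym_dist(c, centers[j]))
    posting_lists[k].append(n)      # only the identifier is stored in the list

# O(1) fetch of any PQ-code by identifier, thanks to the linear layout:
assert codes[42].shape == (M,)
```

Because items are appended in identifier order, each posting list stays sorted, which the search algorithms below rely on.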
Note that in a typical implementation of the original IVFADC (Jégou et al., 2011a) system, PQ-codes are stored in posting lists (not a single array). That is, are chunked for each and then stored. This would enhance the locality of the data, and improve the cache efficiency when traversing a posting list. However, the experimental results (Sec. 5.5) showed that this difference is not serious.
We explain how to search for similar vectors using the data structure explained above. Our system provides two search methods: PQ-linear-scan and inverted-index. The former is fast when the size of a target subset is small, and the latter is fast when the size is large. Depending on the size, the faster method is automatically selected.
A search over a subset of the database is defined as a search over the PQ-codes denoted by the target identifiers S (a set such as S is implemented as a single array). Note that we assume the elements of S are sorted. This is a slightly strong but reasonable assumption. Because S is sorted, whether an item is contained in the set can be checked at a cost of O(log |S|) using a binary search, where |S| is the number of elements in S. Note again that a search over the whole dataset is available by setting S = {1, ..., N}.
PQ-linear-scan: Because the database PQ-codes are stored linearly, we can simply pick up the target PQ-codes and evaluate their distances to the query. We call this a PQ-linear-scan. It is fast if |S| is small, because only a fraction of the vectors are compared. The pseudocode is presented in Alg. 1.
As inputs, the system accepts a query vector q, the database PQ-codes, the number of returned items R, and the target identifiers S. First, a distance table is created by comparing the query to the code words (we intentionally omit the code words from the pseudocode for simplicity) (L1). This is an online pre-processing step required by all PQ-based methods. To store the results, an array of tuples is prepared (L2). Each tuple consists of (1) the identifier of an item and (2) the distance between the query and the item. For each target identifier n ∈ S, the asymmetric distance to the query is computed (L4). This distance is then stored in the result array together with its identifier, where the PushBack function appends an element to an array (L5). After all target items have been evaluated, the result array is sorted by distance (L6). As we require only the top R results, we use a partial sort algorithm. Finally, the top R elements are returned, where the Take function simply picks the first several elements (L7). Note that the posting lists and coarse centers are not required for this search.
Let us analyze the computational cost. The creation of the distance table requires O(DK), and the comparison to |S| items requires O(|S|M). Partial sorting requires O(|S| log R) on average (this cost comes from the heap-based implementation used in the partial_sort function of the C++ STL; another option is to pick the smallest R items and sort only these, leading to O(|S| + R log R), but we empirically found the former faster in practice, especially when |S| is small). Their sum leads to a final average cost of O(DK + |S|(M + log R)) (Table 1). Clearly, the computation is efficient if |S| is small. As the cost depends linearly on |S|, a PQ-linear-scan becomes inefficient if |S| is large. Note that if the search target is the whole dataset, |S| is replaced by N.
Inverted-index: The other search method is the inverted-index. Because the database items are clustered in advance, as explained in Sec. 4.1, we can evaluate only the items in the clusters closest to the query. This drastically boosts the performance if the number of target identifiers |S| is large.
We show the pseudocode in Alg. 2. The inverted-index takes three additional inputs: the posting lists, the coarse centers, and the number of candidates L. Note that L candidates will be selected and evaluated in the final step. This means that L is a runtime parameter that controls the trade-off between accuracy and runtime.
To search, a distance table is first created in the same manner as for the PQ-linear-scan (L1). The search then consists of two blocks. First, the clusters closest to the query are found (L2-6). Then, the items inside those clusters are evaluated (L7-16).
To find the closest clusters, an array of tuples is created (L2). For each coarse center, the distance from the query is computed (L4). The results are stored in the array (L5).
Next, we run a partial sort on the array to find the clusters closest to the query (L6). Here, the target number of the partial sort (the number of posting lists to be examined) is determined as follows. Because the target identifiers are of size |S| out of N identifiers in total, the probability of any item being a target identifier is |S|/N on average. Because our purpose here is to select L target items as search candidates, the required number of items to traverse is LN/|S|. To traverse this many items, we need to examine LP/|S| posting lists, where P is the number of posting lists, because the average number of items per posting list is N/P. This implies that we need to select the LP/|S| nearest posting lists. Note that if this value exceeds P, we simply replace it by P, which amounts to a full sort of the array.
The selected posting lists are then evaluated. A score array is prepared (L7). For each of the closest posting lists (L8), the identifiers in the posting list are traversed (L9). If an identifier is not included in the target identifiers S, the item is simply ignored (L10-11). Note that if the search is over the whole dataset (S = {1, ..., N}), every item is always included in S, so L10-11 can be removed.
For a selected identifier n, the identifier and the distance to the query are recorded in the same manner as for the PQ-linear-scan (L12-13). If the size of the score array reaches the parameter L, the top R results are selected and returned (L14-16).
The computational cost is summarized as follows. After the table creation with O(DK), the comparison to the coarse centers requires O(M) per center, followed by the partial sort over the centers. The number of items to be traversed is LN/|S|. Whether each item is included in S can be checked using a binary search, requiring O(log |S|) per item, which leads to O((LN/|S|) log |S|) in total. The number of items that are actually evaluated is L, so a cost of O(LM) is required. Finally, the top R items are selected using a partial sort, requiring O(L log R). Table 1 summarizes the computational cost. The inverted-index is fast when |S| is sufficiently large, but slow if |S| is small. This is highlighted by the term (LN/|S|) log |S|, which becomes dominant if |S| is small.
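The inverted-index search of Alg. 2 can be sketched as follows. This is a simplified Python sketch under our own naming; the candidate bookkeeping and the list-count heuristic follow the description above, with heapq and bisect standing in for the partial sort and binary search:

```python
import bisect
import heapq
import numpy as np

def inverted_index_search(A, codes, centers, posting_lists, S, R, L):
    """Sketch of Alg. 2. S must be a sorted list of target identifiers;
    A is the (M, K) query-to-code-word distance table."""
    M = A.shape[0]
    adc = lambda c: A[np.arange(M), c].sum()
    # (1) Pick the closest posting lists: to gather about L candidates,
    # examine roughly L * P / |S| of the P lists.
    P = len(posting_lists)
    W = max(1, int(np.ceil(L * P / len(S))))
    order = heapq.nsmallest(min(W, P), range(P),
                            key=lambda k: adc(centers[k]))
    # (2) Evaluate the members of S inside those lists, up to L candidates.
    cand = []
    for k in order:
        for n in posting_lists[k]:
            i = bisect.bisect_left(S, n)          # O(log |S|) membership check
            if i < len(S) and S[i] == n:
                cand.append((adc(codes[n]), n))
                if len(cand) >= L:
                    return heapq.nsmallest(R, cand)
    return heapq.nsmallest(R, cand)

# Toy example: M = 1, K = 2, four items in two posting lists.
A = np.array([[0.0, 5.0]])
codes = [[0], [1], [0], [1]]
centers = [[0], [1]]
posting_lists = [[0, 2], [1, 3]]
print(inverted_index_search(A, codes, centers, posting_lists,
                            S=[0, 1, 2, 3], R=2, L=4))
```

The (LN/|S|) log |S| term from the cost analysis corresponds to the bisect call executed once per traversed identifier.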
Note that although there appear to be several input parameters for the inverted-index, all of them except L are usually determined automatically. L is the only parameter the user needs to set. Our initial setting is the average length of a posting list, meaning that the system traverses one posting list on average. This is a fast setting, and users can increase L if they require more accuracy.
Automatic selection: The final query algorithm is described in Alg. 3. Given the inputs, the system automatically selects the query method, either PQ-linear-scan or inverted-index. This decision is based on a threshold value for the number of target identifiers (L1). Owing to this flexible switching, we can always achieve a fast search with a single Rii data structure (the PQ-codes, the posting lists, and the coarse centers), regardless of the size of the target identifier set. Fig. 4 highlights the relations among the three query algorithms.
Note that it is not trivial to set the threshold deterministically, because it depends on several parameters, such as the database size and the number of candidates. To find the best threshold, we simply run the search with several parameter combinations when the data structure is constructed. Based on the results, we fit a 1D line in the parameter space and thereby obtain the best threshold (see the supplementary material for details). This works almost perfectly, as shown in Fig. 4, and imposes no additional runtime cost in the search phase.
Here, we introduce a reconfigure function that enables us to search efficiently even after a large number of vectors have been newly added. As discussed in Sec. 1, typical ANN systems are first optimized to achieve fast searching for the initial number of items N. If new items are added later, such systems may become slow. For example, IVFADC requires an initial decision on the number of space partitions, whose selection is sensitive and critical to performance. A standard convention (see https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-cloning-and-hyper-parameter-tuning) is to set the number of partitions to about √N. However, this number cannot be changed later, and the system can become slower if N changes significantly. In other words, we must decide the number of partitions even if the final database size is not known, which sometimes frustrates users.
Unlike these existing methods, Rii provides a reconfigure function. If the search becomes slow because of newly added items, coarse centers and assignments are updated by simply running clustering again. The system is automatically optimized to achieve the fastest search for the current number of database items.
Let us first explain how to add a new item. Given a new PQ-code, the database PQ-code array is updated via PushBack, and the corresponding posting list is updated via PushBack as well. Searching can then be performed without any modification, but it may become slower after many items are added, because each posting list can become too long, making the traversal inefficient.
If the search becomes slow, the reconfigure function can be called (Alg. 4). This function takes the database PQ-codes and a new number of coarse space partitions as inputs; the latter is typically set to about √N for the new N. The outputs are updated posting lists and coarse centers. First, the updated coarse centers are computed by clustering the PQ-codes with PQk-means (Matsui et al., 2017) (L1). PQk-means efficiently groups the input PQ-codes into clusters without decoding the codes back to the original D-dimensional vectors. Note that the clustering can be run on a subset of the codes to make this step fast; we set an upper limit on the number of codes to be clustered. After the new coarse centers are obtained, the posting lists are created by simply finding the nearest center for each PQ-code (L2-4).
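A simplified version of the reconfigure step can be sketched as follows. Note the deliberate deviation: for brevity this sketch clusters the decoded vectors with plain k-means, whereas the actual system applies PQk-means directly to the PQ-codes without decoding; all names and sizes are illustrative:

```python
import numpy as np

def reconfigure(codes, codebooks, n_centers, n_iter=10, seed=0):
    """Sketch of the reconfigure step: re-cluster all stored codes and
    rebuild the posting lists. Plain k-means on decoded vectors stands in
    for PQk-means here."""
    rng = np.random.default_rng(seed)
    M = codebooks.shape[0]
    # Decode each PQ-code back to its approximate vector.
    decoded = np.concatenate([codebooks[m][codes[:, m]] for m in range(M)], axis=1)
    centers = decoded[rng.choice(len(decoded), n_centers, replace=False)].copy()
    for _ in range(n_iter):
        assign = ((decoded[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_centers):
            if (assign == k).any():
                centers[k] = decoded[assign == k].mean(0)
    # Posting lists: identifiers of the items assigned to each center.
    posting_lists = [np.flatnonzero(assign == k).tolist() for k in range(n_centers)]
    return centers, posting_lists
```

Because the PQ-codes themselves are untouched, this function can be called at any time with a new n_centers, which is exactly what makes the index "reconfigurable".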
The advantage of the reconfigure function is that it can be called whenever the user wishes. The results are deterministic for a given number of partitions, because the function simply runs the clustering over the codes. We will show in Sec. 5.4 that this function is especially useful when the database size changes drastically. Another way of looking at this is that we do not need to know the final number of database items when the index structure is built, which is a clear advantage over IVFADC-based methods. In a practical scenario, it often happens that the number of database items cannot be determined when the system is created. Even in such cases, IVFADC must fix its parameters, which can lead to suboptimal performance.
The data structure proposed above is similar to the original IVFADC (Jégou et al., 2011a), but has the following fundamental differences.
In Rii, each vector is encoded directly, whereas IVFADC encodes a residual between an input vector and a coarse center. This makes the accuracy of Rii slightly inferior to that of IVFADC (see Sec. 5.5), but enables us to store PQ-codes linearly.
In Rii, PQ-codes are stored linearly, and their identifiers are stored in posting lists. In IVFADC, both PQ-codes and identifiers are stored in posting lists. This simple modification enables us to run the PQ-linear scan without any additional operations.
In IVFADC, the coarse centers are a set of D-dimensional vectors, whereas the coarse centers in Rii are PQ-codes. The advantage is that the reconfigure step becomes considerably faster with PQk-means. The limitation is that this might decrease the accuracy, but the experimental results show that the degradation is not serious (Sec. 5.5).
There exist advanced encoding methods beyond PQ, such as optimized product quantization (OPQ) (Ge et al., 2014; Norouzi and Fleet, 2013), additive quantization (AQ) (Babenko and Lempitsky, 2014; Martinez et al., 2016), and composite quantization (CQ) (Zhang et al., 2014, 2015). Although state-of-the-art accuracy has been achieved by AQ and CQ, it is widely known that they are more complex and time-consuming. Therefore, we did not incorporate AQ or CQ in our system.
On the other hand, OPQ provides a reasonable trade-off (slightly slow but with a high accuracy). In OPQ, a rotation matrix is preliminarily trained to minimize the error. In the search phase, an input vector is first rotated with the matrix. The remaining process is exactly the same as PQ. We will show the results of OPQ in Sec. 5.5.
All experiments were performed on a server with a 3.6 GHz Intel Xeon CPU (six cores, 12 threads) and 128 GB of RAM. For a fair comparison, we employed a single-thread implementation for the search. Rii is implemented in C++ with a Python interface. All source code is publicly available at https://github.com/matsui528/rii
The various methods were evaluated using the following datasets:
SIFT1M (Jégou et al., 2011b) consists of 128D SIFT feature vectors extracted from several images. It provides 1,000,000 base, 10,000 query, and 100,000 training vectors.
GIST1M (Jégou et al., 2011b) consists of 960D GIST feature vectors extracted from several images. It provides 1,000,000 base, 1,000 query, and 500,000 training vectors.
The code words of Rii and Faiss were preliminarily trained using the training data. The search is conducted over the base vectors.
We compare our Rii method with the following existing methods:
Annoy (Bernhardsson, 2018): A random projection forest-based system. Because Annoy is easy to use (fewer parameters, intuitive interface, no training steps, and easy IO with a direct mmap design), it is the baseline for million-scale data.
NMSLIB (Naidan et al., 2018): Highly optimized ANN library with the support of non-metric spaces (Boytsov and Naidan, 2013). This library includes several algorithms, and we used Hierarchical Navigable Small World (HNSW) (Malkov et al., 2014; Malkov and Yashunin, 2016) in this study. NMSLIB with HNSW is the current state-of-the-art for million-scale data (Aumüller et al., 2017; Bernhardsson et al., 2018).
Faiss (Jégou et al., 2018): A collection of highly-optimized PQ-based methods. This library includes IVFADC (Jégou et al., 2011a), OPQ (Ge et al., 2014), inverted multi-index (Babenko and Lempitsky, 2015a), and polysemous codes (Douze et al., 2016). Some of these are implemented using the GPU as well (Johnson et al., 2017). In particular, we compared Rii with the basic IVFADC, which is one of the fastest options. Note that only Faiss and Rii can handle billion-scale data, because PQ-based methods are memory efficient.
We first present the results for searching over a subset of the whole database. This is the main function that the proposed Rii method provides. The conclusion is that Rii always remains fast, whereas existing methods become considerably slower, especially if the size of the target subset is small. We first explain the task, and then introduce a post-checking module through which existing methods can conduct a subset search. Finally, we present the results.
The task is defined as follows. We randomly select |S| integers from {1, ..., N}, sort them, and use them as the target identifiers S. For each query, we run the search and find the top-R results, all of which must be members of S. The runtime per query was reported for several combinations of |S| and R. The evaluation was conducted using the SIFT1M dataset (N = 10^6).
Because none of the existing methods provides a subset-search functionality, we implemented a straightforward post-checking module to enable them to perform one. Alg. 5 shows the pseudocode. The module takes a query function, a query vector, the target identifiers S, and the number of returned items R as inputs. The query function returns the identifiers of the closest items given a query and a requested number of results; this is an existing method such as Annoy. First, the output identifier set is prepared (L1), and the number of items to request per iteration, R', is initialized (L2). Then the search begins with an infinite loop. The top R' items are retrieved with the query function and stored in a temporary buffer (L4). For each identifier in the buffer, if it has already been checked, the loop continues (L6-7); in practice this is achieved by starting the for loop with an offset so that the already-checked prefix is not traversed again. If the identifier is included in S, it is stored in the output set (L8-9). The algorithm finishes once enough (R) items are found (L10-11). If an insufficient number of items are found, R' is enlarged by multiplying it by a constant (L12), and the search continues with the updated R' until R items are found.
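The post-checking module of Alg. 5 can be sketched as follows, where query_fn, growth, and the other names are our illustrative choices rather than the paper's:

```python
def post_checking_search(query_fn, q, S, R, growth=2):
    """Sketch of Alg. 5: keep querying an existing ANN system with a
    growing result size R' until R results belonging to S are found.
    query_fn(q, r) returns the identifiers of the r closest items."""
    S = set(S)
    out, r_prime, checked = [], R, 0
    while True:
        ids = query_fn(q, r_prime)
        for n in ids[checked:]:        # skip the already-checked prefix
            if n in S:
                out.append(n)
                if len(out) == R:
                    return out
        checked = len(ids)
        r_prime *= growth              # request more items next round

# Toy example: a "database" where item n is the n-th nearest neighbor.
fake_ann = lambda q, r: list(range(min(r, 100)))
print(post_checking_search(fake_ann, None, range(0, 100, 10), R=3))  # → [0, 10, 20]
```

Note that the loop never terminates if the underlying method stops returning additional items, which is exactly the limitation of FALCONN and NMSLIB discussed next.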
With this module, searching over a target subset becomes available for the existing methods. Note that a query function cannot always return R' items when R' is large; this depends on the design of the query function, and some methods limit the number of returned items in order not to make the search too slow. We found that FALCONN and NMSLIB do not return enough items when R' is large. Therefore, we compared Rii with Annoy equipped with the post-checking module (Annoy + PC).
Fig. 5 illustrates the results. We point out the following:
Rii was fast under all conditions (less than 2 ms/query). We can conclude that Rii was stable and effective for the subset-search.
As with IVFADC, Rii is robust against the choice of R.
Annoy + PC became drastically slower for small |S|, which is further highlighted when R is large. This is an expected result, because the while loop (L3 in Alg. 5) must be repeated several times for large R', which can even approach N. ANN systems are usually not designed to handle such values.
We describe the experiments for our other main function, reconfigure. The conclusion is that Rii becomes fast by using reconfigure, even after many new vectors are added. First, the task is explained, then the results are presented. Here, we used the Deep1B dataset to demonstrate the robustness against billion-scale data.
The index is first constructed with an initial number of vectors and the corresponding number of partitions, and the runtime is evaluated. Next, new items are added to the index, increasing the final database size N. The runtime is then evaluated in two ways: (1) a search is performed with the original data structure, and (2) the data structure is updated using the reconfigure function for the new N, after which the search is conducted. We ran this experiment with several final values of N.
Fig. 6 illustrates the result. It is clear that the search becomes dramatically faster after the reconfigure function is called. For example, if the user keeps the same data structure after new items are added, the search takes an average of 3.9 ms. This can be made faster after applying the reconfigure function.
Most importantly, because the data structure can always be adjusted for the new N, the user need not face the burden of selecting the number of partitions when the system is constructed. This is a clear advantage over the existing methods. Note that the runtime for adding the vectors was 109 s, and that of the reconfigure function was 111 s; these times can be considered moderate.
Finally, we compare Rii (and its variant Rii-OPQ) with Annoy, FALCONN, NMSLIB (HNSW), and Faiss (IVFADC), using SIFT1M and GIST1M. The conclusion is that our Rii method achieved a comparable performance to the state-of-the-art system Faiss. Note that the searches were conducted over the whole datasets.
The accuracy was measured using Recall@1, which is the fraction of queries for which the ground-truth nearest neighbor is returned as the top-1 result. The average Recall@1 over the query set is reported. We evaluated the methods with several parameter combinations, and report the results at a fixed Recall@1 (0.65 for SIFT1M and 0.5 for GIST1M) for a fair comparison. Because the ranges of some parameters are discrete, we cannot achieve an exact target Recall@1. Thus, the target Recall@1 was chosen as a value that all methods could reach as closely as possible.
The disk consumption of the index data structure is also reported. This was measured by storing the data structure on disk and checking its size in bytes. Note that the runtime (peak) memory consumption is the more important measure, but measuring peak memory usage is not always stable and can vary depending on the computer. Thus, we report the disk space instead, which is reproducible and strongly correlated with the memory consumption. The runtime for building the data structure is also reported.
| Dataset | Method | Parameters | Recall@1 (fixed) | Runtime/query | Disk space | Build time |
|---|---|---|---|---|---|---|
| SIFT1M | Annoy (Bernhardsson, 2018) | | 0.67 | 0.18 ms | 1703 MB | 899 s |
| | FALCONN (Razenshteyn and Schmidt, 2018; Andoni et al., 2015) | | 0.63 | 0.87 ms | - | 1.8 s |
| | NMSLIB (HNSW) (Naidan et al., 2018; Boytsov and Naidan, 2013; Malkov and Yashunin, 2016) | | 0.67 | 0.043 ms | 669 MB | 436 s |
| | Faiss (IVFADC) (Jégou et al., 2018, 2011a) | | 0.67 | 0.61 ms | 73 MB | 30 s |
| | Rii (proposed) | | 0.64 | 0.73 ms | 69 MB | 82 s |
| | Rii-OPQ (proposed) | | 0.65 | 0.82 ms | 69 MB | 85 s |
| GIST1M | Annoy (Bernhardsson, 2018) | | 0.49 | 1.2 ms | 5023 MB | 2088 s |
| | FALCONN (Razenshteyn and Schmidt, 2018; Andoni et al., 2015) | | 0.53 | 8.6 ms | - | 7.2 s |
| | NMSLIB (HNSW) (Naidan et al., 2018; Boytsov and Naidan, 2013; Malkov and Yashunin, 2016) | | 0.49 | 0.19 ms | 3997 MB | 1576 s |
| | Faiss (IVFADC) (Jégou et al., 2018, 2011a) | | 0.52 | 3.8 ms | 253 MB | 51 s |
| | Rii (proposed) | | 0.45 | 3.2 ms | 246 MB | 353 s |
| | Rii-OPQ (proposed) | | 0.50 | 3.8 ms | 249 MB | 388 s |
Table 2 presents the results. We summarize our findings:
Rii was comparable with the state-of-the-art system Faiss. In particular, although our method is basically an approximation of IVFADC, the decrease in the accuracy is not significant.
Rii was the most memory efficient among the methods. The measured values are almost the same as the theoretically predicted values (68 MB against 69 MB, and 244 MB against 249 MB).
If we compare Rii and Rii-OPQ, Rii-OPQ was slightly slower but a little more accurate with the same parameter settings.
Annoy achieved the second fastest search. Because Annoy supports direct memory mapping of its index, the construction required some time and consumed a relatively large disk space.
FALCONN achieved a performance comparable to (or slightly slower than) Faiss/Rii. We note that the building cost of FALCONN is considerably smaller than that of the other methods. As FALCONN does not provide I/O functions, we do not report its disk space.
The results for SIFT1M and GIST1M follow similar tendencies.
We present an application to highlight the subset-search function of Rii. For this demonstration, we leverage the data of The Metropolitan Museum of Art (MET) Open Access (https://github.com/metmuseum/openaccess). This dataset contains more than 420,000 items from the MET, with both an image and extensive metadata for each item (Table 3). From this data, we selected 201,998 items that are provided under a Creative Commons license. For each image, we extracted a 1,920-dimensional activation from the last average-pooling layer of the DenseNet-201 (Huang et al., 2017) architecture trained on ImageNet. The features are stored in Rii. The metadata are stored in a table using Pandas (https://pandas.pydata.org/), a popular in-memory data management system for Python.
Fig. 7 demonstrates the system, including the Python code and the search results. The metadata and DenseNet vectors are first read. Then, a search is conducted over the metadata; here, items that were created before A.D. 500 in Egypt are specified. Next, the target identifiers are prepared; this is simply the set of IDs of the selected items. The image-based search is then conducted over these identifiers. The query here is a Chinese tapestry, so we can find items similar to it among the museum's ancient Egyptian holdings.
As this demonstration reveals, searching over a target subset is a general problem setting, and Rii can solve it easily. As Sec. 5.3 shows, existing methods using the late-checking module do not perform well when |S| is small. Because the result of a metadata search can contain any number of items, it matters that Rii can handle a subset search for any size of |S|.
Table 3 (excerpt): examples of the MET metadata.

| ID | Title | Date | Country |
|---|---|---|---|
| 0 | Bust of Abraham Lincoln | 1876 | United States |
| 1 | Acorn Clock | 1847 | United States |
```python
import pickle as pkl

import numpy as np
import pandas as pd
import rii

# Read data
df = pd.read_csv('metadata.csv')
engine = pkl.load(open('rii_densenet.pkl', 'rb'))

# Metadata search (13.5 ms)
S = df[(df['date'] < 500) & (df['country'] == 'Egypt')]['ID']
S = np.sort(np.array(S))  # Target identifiers

# ANN for subset (2 ms)
q = ...  # Read the query feature (elided in the original)
result = engine.query(q=q, target_ids=S, topk=3)
```
Fig. 7 (results): the query image and its first, second, and third nearest neighbors (images omitted).
We developed an approximate nearest neighbor search method called Rii. Rii provides two functions: a search over a subset of the database, and a reconfigure function for newly added vectors. Extensive comparisons showed that Rii achieves performance comparable to state-of-the-art systems such as Faiss.
Note that the latest systems incorporate HNSW for the coarse assignment of IVFADC (Baranchuk et al., 2018; Douze et al., 2018). Our Rii architecture can be combined with them, but we leave that as future work.
Acknowledgments: This work was supported by JST ACT-I Grant Number JPMJPR16UO, Japan.