Given a query and a similarity function, similarity search finds the item that is most similar to the query in a dataset according to the similarity function. Popular similarity functions include Euclidean distance, angular distance and inner product. In practice, it is commonly required to return the top-$k$ most similar items to a query. Similarity search is a key component in a large number of applications, including large-scale image search, semi-supervised low-shot classification, recommendation based on user and item embeddings, sequence matching, entity resolution, and memory network training [11]. Similarity search that returns the exact top-$k$ neighbors is usually too costly, and approximate similarity search, which returns a good portion of the exact top-$k$ neighbors, suffices for most applications. Therefore, we focus on approximate similarity search in this paper.
Similarity search algorithms.
Due to the importance of similarity search, many algorithms have been proposed to solve it efficiently. Existing similarity search algorithms can be roughly classified into four categories, i.e., tree-based methods [12, 13, 14], locality sensitive hashing (LSH) based methods [15, 2, 16], vector quantization based methods [17, 18, 19] and proximity graph based methods [20, 21, 22]. Among them, the proximity graph based methods were shown to provide the best recall-time performance (recall-time performance measures the recall achieved within a given query processing time, and higher recall indicates better performance) in a number of empirical studies [22, 23, 24]. In a proximity graph, each item is connected to a small set of items that are most similar to it in the dataset, and the graph-based methods usually conduct similarity search by a walk on the proximity graph. With the proximity graph, the graph-based methods can model the fine-grained neighboring relation among items and avoid checking dissimilar items for query processing, which explains their good performance. Among the proximity graph based methods, the Hierarchical Navigable Small World graph (HNSW) represents the state-of-the-art method because of its fast index construction and good search performance. We will give a detailed introduction to HNSW in Section II.
There are also some distributed similarity search solutions designed to handle large datasets [25, 26, 27]. However, these solutions use either tree-based methods or LSH-based methods [26, 27], and scalable solutions for the more recently proposed proximity graph based methods are still lacking. Our work proposes a distributed solution based on HNSW, the state-of-the-art similarity search method on a single machine, to improve the performance of distributed similarity search.
Single machine solutions. The main challenge of similarity search on large-scale datasets is the memory required to store the raw data and index data structures, which could easily exceed the capacity of a single machine. For example, the SIFT1B dataset, which contains 1 billion 128-dimensional SIFT descriptors of images, takes up 512GB of memory for holding the raw data if each feature is stored as a floating point number. In addition, the proximity graph of HNSW also takes a large amount of memory (often comparable with the size of the raw data) as it needs to maintain a neighbor list for every item.
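The 512GB figure for SIFT1B can be checked with a short calculation. The sketch below assumes that "floating point number" means a 4-byte (32-bit) float and that GB is counted in decimal units; both are assumptions, not statements from the text.

```python
# Back-of-the-envelope memory estimate for the raw SIFT1B vectors.
NUM_ITEMS = 1_000_000_000   # 1 billion descriptors
DIM = 128                   # dimensions per descriptor
BYTES_PER_FLOAT = 4         # assumption: 32-bit floats

raw_bytes = NUM_ITEMS * DIM * BYTES_PER_FLOAT
raw_gb = raw_bytes / 10**9  # assumption: decimal gigabytes

print(f"raw data: {raw_gb:.0f} GB")  # prints "raw data: 512 GB"
```

This counts only the raw vectors; the neighbor lists of the proximity graph come on top of it.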
To reduce memory consumption, existing single machine solutions such as FAISS and Link&Code use vector quantization techniques (e.g., PQ and OPQ) to compress the items. However, the quantization error introduced by the compression process often harms the quality of the search results. For example, using OPQ with 8 codebooks, FAISS achieves a precision of only 25.15% when items are probed for top-10 Euclidean nearest neighbor search. (To calculate precision, the items are ranked according to their approximate similarity scores, which are computed on the compressed items; precision is the fraction of the ground-truth top Euclidean nearest neighbors that are identified among the top-ranked items.) This is because, given two items of which the first is truly more similar to a query than the second, a small quantization error can reverse their order when the compressed approximations of the two items are used to evaluate the similarity function.
Therefore, existing single machine solutions are inadequate for large-scale similarity search as they cannot provide high quality results, which are particularly critical in applications such as e-commerce and advertising. To provide high quality search results and scale to the even larger datasets (e.g., with trillions of items) that we may encounter in the future, it is necessary to develop distributed solutions that can store the uncompressed data and process queries with it.
Requirements. Apart from producing high quality results, a similarity search framework needs to fulfill three additional requirements for production use, i.e., high query processing throughput, low query processing latency, and good robustness. Query processing throughput is the number of queries the framework can handle in unit time. Query processing latency measures the time taken to process a query, and online applications typically require a query processing latency on the order of several milliseconds. As node failures and stragglers are common in distributed computing, the performance of the framework should also be robust under these adverse scenarios.
Our solution. Pyramid is a distributed solution based on HNSW and supports popular similarity functions including Euclidean distance, angular distance and inner product. A naive distributed solution with HNSW is to randomly partition the dataset over the machines and build an HNSW on each machine. However, its query processing throughput is low as partitions in every machine need to be searched for processing a query. Pyramid solves the deficiency of the naive solution with novel dataset partitioning and query assignment strategies. The key idea is to build a much smaller meta-HNSW that captures the structure of the entire dataset, which allows us to build two levels of indexes and pinpoint the neighbors of a query efficiently. Specifically, by partitioning the bottom layer of the meta-HNSW, Pyramid assigns dataset items to sub-datasets of roughly equal size and ensures that items in the same sub-dataset are similar to each other. By searching the meta-HNSW, Pyramid quickly identifies the sub-datasets that are likely to contain the neighbors of a query and involves only these sub-datasets in query processing without hurting the quality of the search results. For failure recovery and straggler mitigation, Pyramid replicates the sub-datasets and their HNSWs across the machines.
We implement the index building component of Pyramid with customized code for efficient distributed execution. For query processing, we employ Zookeeper (https://zookeeper.apache.org/) to monitor the system and perform automatic failure recovery. Kafka (https://kafka.apache.org/) is used to dispatch queries to the machines and automatically handle load balancing and fault tolerance for the message queues. At the top level, we provide a set of simple and expressive high-level APIs to hide the low-level execution details from users.
We tested Pyramid on three large-scale datasets, Deep500M, SIFT500M and Tiny10M. The results show that Pyramid provides high quality search results, and the precision can easily reach 90% for top-10 Euclidean nearest neighbor search. The throughput of Pyramid is over 2x that of a naive solution that randomly partitions a dataset among the machines and builds an HNSW for each machine. Compared with the well-known FLANN library, which uses a tree-based method for distributed similarity search, Pyramid provides a throughput that is over 100x higher and achieves better precision for the search results. Pyramid is able to keep the query processing latency within 2-3ms. Pyramid is also robust to stragglers and node failures thanks to its replication strategy.
Paper organization. The remainder of the paper is organized as follows. Section II introduces the basics of HNSW as background. The index building and query processing algorithms of Pyramid are discussed in Section III. Section IV introduces Pyramid's API and the designs for straggler mitigation and failure recovery. Section V presents the experimental results and Section VI surveys the related work. The concluding remarks are given in Section VII.
II Hierarchical Navigable Small World Graph
We first introduce the query processing and graph construction procedures of the Hierarchical Navigable Small World graph (HNSW), as Pyramid is a distributed solution based on HNSW. For simplicity, we omit some details in the algorithms, and readers may refer to the original HNSW paper for the complete algorithms. We also present the algorithms using a similarity function instead of a distance function, and a larger similarity value indicates that two items are more similar. To search for neighbors with small distances, the similarity function can be defined as the negative distance. For example, $s(q, x) = -\|q - x\|$ can be used for Euclidean distance nearest neighbor search (Euclidean NNS), where $q$ is the query and $x$ is an item.
The HNSW proximity graph has multiple layers as illustrated in Figure 1. The bottom layer (layer 0) contains all items in the dataset, while the items in each upper layer are sampled uniformly from the layer below it. Therefore, the number of items decreases as the layer index increases. Each layer of HNSW is an approximation of the $k$-nearest neighbor graph (KNN graph), and an item is connected to its approximate top-$k$ neighbors in that layer, where $k$ is a user-specified number that controls the size of the graph.
The query processing procedure of HNSW is shown in Algorithm 1, which maintains a candidate queue and a result set that keeps the best results encountered so far. The graph walk starts at a fixed entry vertex at the top layer. Before reaching the bottom layer, the graph walk is conducted with a search factor of 1 (the search factor bounds the size of the working set kept during the walk), which is also called greedy graph walk (without backtracking). Greedy graph walk moves to the best neighbor (the one most similar to the query) of the current vertex at every step and stops when the current vertex is more similar to the query than all its neighbors. The stopping vertex in an upper layer is used as the starting vertex for the graph walk in the layer below it. For the bottom layer, the graph walk is usually conducted with a large search factor, which is similar to beam search with backtracking and makes the walk less likely to be trapped at local optima.
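The layered walk described above can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the data layout (`layers` as a list of adjacency dictionaries with `layers[0]` at the bottom), the function names, and the choice to let `search_factor` bound the result set are all assumptions made for the example.

```python
import heapq
import math

def similarity(a, b):
    """Negative Euclidean distance, so a larger value means more similar."""
    return -math.dist(a, b)

def greedy_walk(graph, vectors, query, start):
    """Search factor 1: move to the best neighbor until none improves."""
    current, current_sim = start, similarity(vectors[start], query)
    while True:
        best, best_sim = current, current_sim
        for nb in graph.get(current, []):
            s = similarity(vectors[nb], query)
            if s > best_sim:
                best, best_sim = nb, s
        if best == current:
            return current  # more similar to the query than all its neighbors
        current, current_sim = best, best_sim

def beam_walk(graph, vectors, query, start, search_factor, k):
    """Bottom-layer walk that keeps up to `search_factor` best results."""
    start_sim = similarity(vectors[start], query)
    visited = {start}
    candidates = [(-start_sim, start)]  # max-heap on similarity via negation
    results = [(start_sim, start)]      # min-heap: worst kept result on top
    while candidates:
        neg_sim, v = heapq.heappop(candidates)
        if len(results) >= search_factor and -neg_sim < results[0][0]:
            break  # best remaining candidate is worse than the worst kept result
        for nb in graph.get(v, []):
            if nb in visited:
                continue
            visited.add(nb)
            s = similarity(vectors[nb], query)
            if len(results) < search_factor or s > results[0][0]:
                heapq.heappush(candidates, (-s, nb))
                heapq.heappush(results, (s, nb))
                if len(results) > search_factor:
                    heapq.heappop(results)
    return [v for _, v in heapq.nlargest(k, results)]

def hnsw_search(layers, vectors, query, entry, search_factor, k):
    """Greedy walk on the upper layers, then beam walk on layers[0]."""
    current = entry
    for graph in reversed(layers[1:]):
        current = greedy_walk(graph, vectors, query, current)
    return beam_walk(layers[0], vectors, query, current, search_factor, k)
```

On a toy chain of six one-dimensional items with one sparse upper layer, the upper layer jumps from the entry vertex to the middle of the chain and the bottom layer refines the result:

```python
vectors = {i: (float(i),) for i in range(6)}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
upper = {0: [3], 3: [0]}
print(hnsw_search([bottom, upper], vectors, (4.2,), entry=0, search_factor=3, k=2))
# prints [4, 5]
```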
HNSW is an improvement of the KNN graph, which contains a single layer and each item is connected to its top neighbors. In a KNN graph, graph walk can only take small steps as the connections are local. If the starting vertex is far from the neighborhood of the query, graph walk needs a large number of steps to reach the true neighbors, which harms performance. The upper layers of HNSW contain uniformly sampled items from the dataset and they allow graph walk to take large steps and quickly approach the neighborhood of the query. Using a large search factor, graph walk on the bottom layer allows good exploration of the neighborhood of the query to identify the true neighbors.
The graph construction procedure of HNSW is shown in Algorithm 2. The items in the dataset are inserted sequentially into the graph to build the HNSW. For each item, the highest layer in which it can appear is first drawn from an exponential distribution. Then graph search is conducted using the item as the query, and the only difference from Algorithm 1 is that a large search factor is used for all layers up to that highest layer. In each of these layers, graph search is used to find the top neighbors of the item, and the item is connected to them using directed edges.
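As an illustration of how the highest layer of an item can be drawn, the sketch below uses the rule from the original HNSW paper: the floor of an exponentially distributed value scaled by 1/ln(M), where M is the maximum out-degree. The text above does not spell out this formula, so the scaling constant and the function name are assumptions.

```python
import math
import random

def sample_top_layer(max_degree, rng):
    """Draw the highest layer of a new item from an exponential distribution."""
    level_scale = 1.0 / math.log(max_degree)  # assumption: normalization from the HNSW paper
    u = 1.0 - rng.random()                    # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * level_scale)    # floor of a non-negative value

rng = random.Random(0)
levels = [sample_top_layer(16, rng) for _ in range(10_000)]
share_bottom_only = levels.count(0) / len(levels)
```

With this scaling, the probability that an item reaches layer l or higher is M^(-l), so with M = 16 roughly 94% of the items appear only in the bottom layer, and each upper layer is about 16 times smaller than the one below it.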
In a number of empirical studies, HNSW is found to significantly outperform other similarity search algorithms, including tree-based methods, LSH-based methods and vector quantization based methods. It is reported that the search complexity of HNSW scales logarithmically with the cardinality of the dataset, which is favorable for large datasets. Index construction of HNSW is also efficient as it does not require building an exact KNN graph at each layer. Although HNSW was originally designed for metric similarity functions such as Euclidean distance and edit distance, it has been shown recently that HNSW also achieves state-of-the-art performance for maximum inner product search (MIPS). Therefore, its efficiency in index building, excellent similarity search performance and generality across similarity functions make HNSW an ideal basis for a distributed similarity search solution.
In this section, we introduce the algorithmic aspect of Pyramid, i.e., the index building and query processing procedures. We also discuss the special considerations in Pyramid for MIPS.
First, we motivate the design of Pyramid by analyzing the deficiency of a naive solution. For distributed similarity search with HNSW, a straightforward solution (denoted as HNSW-naive) is to randomly partition a dataset among the workers in a cluster and build an independent HNSW graph on each worker. A query is distributed to all workers, and each worker processes the query with its own HNSW graph using Algorithm 1. The final search results are obtained by merging and re-ranking the partial results reported by the workers. However, a query invokes computation on all workers in HNSW-naive, which results in low query processing throughput.
If a query is handled by only some rather than all of the workers, query processing throughput can be improved as each query invokes less workload. This is the main motivation of Pyramid’s design, which is achieved by dataset partitioning and query assignment. In the index building phase, Pyramid partitions the dataset into sub-datasets containing items similar to each other and assigns each sub-dataset to a worker. Due to the partitioning, some sub-datasets are likely to contain the neighbors for a query while others are not. Then for query processing, it is sufficient to handle the query by the workers holding these potential sub-datasets and the other workers do not need to be involved, which results in high query processing throughput. In Pyramid, both dataset partitioning and query assignment are conducted with a small meta-HNSW built on samples from the dataset.
III-A Index Building
The index building procedure of Pyramid is shown in Algorithm 3. As the original dataset may be very large, we first sample a small subset from it. Then, kmeans is conducted on the sample dataset, and we assign a weight to each kmeans center, which is set as the number of sample items assigned to it. The meta-HNSW is built on these kmeans centers using Algorithm 2. We partition the bottom layer of the meta-HNSW (which is a proximity graph) into balanced graph partitions (in the sense that each graph partition has a similar total vertex weight) while trying to minimize the number of edges across the graph partitions. Note that by minimizing the number of cross edges, we ensure that items in each graph partition are similar to each other. Currently, we use the Karlsruhe Fast Flow Partitioner algorithm for partitioning, which adopts an efficient multi-level local improvement strategy to search for the best partitioning. Finally, each item in the original dataset is assigned to a sub-dataset according to its most similar item in the meta-HNSW, i.e., it joins the sub-dataset whose graph partition contains that meta-HNSW vertex. For each sub-dataset, we build a sub-HNSW independently. Note that there is a one-to-one mapping between the sub-datasets and the graph partitions (of the bottom layer of the meta-HNSW), and the $i$-th partition corresponds to the $i$-th sub-dataset.
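The steps of the index building procedure (sample, cluster, weight, partition, assign) can be sketched as below. This is a simplified illustration under several assumptions: the nearest center is found by brute force instead of a meta-HNSW search, and `partition_centers` is a crude stand-in that cuts the centers along their first coordinate into chunks of similar weight. It is not the Karlsruhe Fast Flow Partitioner and does not minimize cross edges. All function and parameter names are hypothetical.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to `point` (brute force)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans(sample, m, iters, rng):
    """Plain Lloyd iterations; returns the centers and per-center item counts."""
    centers = rng.sample(sample, m)
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for p in sample:
            groups[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    weights = [0] * m
    for p in sample:
        weights[nearest(p, centers)] += 1  # weight = number of sample items
    return centers, weights

def partition_centers(centers, weights, w):
    """Stand-in partitioner (assumption): sort centers by first coordinate
    and cut them into `w` chunks of similar total weight."""
    order = sorted(range(len(centers)), key=lambda j: centers[j][0])
    total, running, part = sum(weights), 0, [0] * len(centers)
    for j in order:
        part[j] = min(w - 1, running * w // total)
        running += weights[j]
    return part

def build_sub_datasets(dataset, sample_size, m, w, seed=0):
    """Assign every item to the partition of its most similar center."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, sample_size)
    centers, weights = kmeans(sample, m, iters=5, rng=rng)
    part = partition_centers(centers, weights, w)
    subs = [[] for _ in range(w)]
    for item in dataset:
        subs[part[nearest(item, centers)]].append(item)
    return centers, part, subs
```

In a real deployment, a sub-HNSW would then be built on each entry of `subs`; that step is omitted here.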
The design of Algorithm 3 is a joint consideration of efficiency, load balancing and statistical stability. As the original dataset is large, we use a smaller sample dataset as its surrogate to speed up the index building process. Assuming that each item in the dataset is equally likely to be accessed by queries, the sub-datasets should have roughly equal size to balance their workloads. Therefore, we set the weight of a vertex in the meta-HNSW as the number of sample items it has and ensure that the partitions of the meta-HNSW have similar total vertex weights. There may be scenarios in which some items in the dataset are hot (more likely to be accessed by queries) and we are given a set of sample queries. In this case, we can set the weight of a vertex in the meta-HNSW as the frequency with which it appears in the top similarity search results of the queries for load balancing. We do not directly sample items from the dataset to build the meta-HNSW because the meta-HNSW may be small, and a small sample may not reflect the distribution of the entire dataset. Empirically, we observed that a small meta-HNSW is already sufficient for good performance, and a large meta-HNSW is not favorable as it prolongs the query processing time. Therefore, we conduct kmeans on a larger sample of the dataset to obtain the vertexes in the meta-HNSW for statistical stability.
Distributed workflow. The index construction procedure starts with each worker reading a part of the dataset from the distributed file system. Then each worker samples some items from its local dataset (in proportion to the cardinality of its local dataset relative to the entire dataset, given the total sample size), and the workers conduct distributed kmeans together. The centers are kept in one of the workers for meta-HNSW construction and graph partitioning. When graph partitioning finishes, this worker also builds a one-to-one mapping between the graph partitions (and the sub-dataset indexes) and the workers. After that, the meta-HNSW is broadcast to all workers along with the related data structures. Each worker uses the meta-HNSW to decide which sub-dataset each item in its local dataset belongs to, and the workers shuffle the items to their destination workers in parallel. When the data shuffle finishes, each worker builds an HNSW on its own sub-dataset.
III-B Query Processing
The query processing procedure of Pyramid is shown in Algorithm 4. Query processing starts by finding the top neighbors of the query in the meta-HNSW, which can be done very efficiently as the meta-HNSW is usually very small. For each graph partition of the bottom layer of the meta-HNSW, if it contains one or more of these neighbors, the query will be dispatched to its corresponding sub-dataset. We call the number of neighbors retrieved from the meta-HNSW the branching factor, as it controls how many sub-datasets a query will be forwarded to, and a larger branching factor means more sub-datasets will be involved. For each sub-dataset that receives the query, HNSW-based graph walk is conducted to find the top-$k$ neighbors of the query. The final search results are obtained by selecting the top-$k$ neighbors from the partial results returned by the sub-datasets. The query processing algorithm ensures that only some of the sub-datasets are activated for a query, which helps to achieve high query processing throughput.
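The routing and merging logic of the query processing procedure can be sketched as follows. This is a minimal single-process illustration, assuming brute-force search in place of both the meta-HNSW search and the sub-HNSW walks, and negative Euclidean distance as the similarity; the function names and the `sub_datasets` layout (a mapping from sub-dataset id to `(item_id, vector)` pairs) are hypothetical.

```python
import heapq
import math

def route_query(query, centers, part, branching_factor):
    """Ids of the sub-datasets that own the top meta-HNSW vertices."""
    top = heapq.nsmallest(
        branching_factor,
        range(len(centers)),
        key=lambda j: math.dist(query, centers[j]),
    )
    return sorted({part[j] for j in top})

def search_sub_dataset(query, items, k):
    """Brute-force stand-in for a sub-HNSW walk: (item_id, similarity) pairs."""
    scored = ((item_id, -math.dist(query, vec)) for item_id, vec in items)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

def process_query(query, centers, part, sub_datasets, branching_factor, k):
    """Dispatch to the selected sub-datasets and merge their partial results."""
    targets = route_query(query, centers, part, branching_factor)
    partial = []
    for sub_id in targets:
        partial.extend(search_sub_dataset(query, sub_datasets[sub_id], k))
    best = heapq.nlargest(k, partial, key=lambda pair: pair[1])
    return targets, [item_id for item_id, _ in best]
```

A query close to one group of centers touches a single sub-dataset, while a query between groups touches both:

```python
centers = [(0.0,), (1.0,), (10.0,), (11.0,)]
part = [0, 0, 1, 1]
sub_datasets = {
    0: [(0, (0.1,)), (1, (0.9,)), (2, (1.2,))],
    1: [(3, (9.8,)), (4, (10.5,))],
}
print(process_query((1.0,), centers, part, sub_datasets, 2, 2))  # prints ([0], [1, 2])
print(process_query((5.4,), centers, part, sub_datasets, 2, 2)[0])  # prints [0, 1]
```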
Distributed workflow. When a query arrives, it is assigned to a random worker, which acts as the coordinator. The coordinator searches the meta-HNSW with the query and decides which sub-datasets will be involved in processing the query according to the search results. Then, the query is dispatched to the corresponding workers, and the workers search their own HNSWs with the query. Each involved worker returns tuples of (item id, similarity score) to the coordinator. When all responses for a query are gathered, the coordinator selects the items with the top similarity scores as the final results.
We give an illustration of Pyramid in Figure 2. If we treat the sub-HNSWs on the workers as the bottom layer, the meta-HNSW is equivalent to some common upper layers of the sub-HNSWs. By searching the meta-HNSW, we can quickly identify the sub-HNSWs that are likely to contain the neighbors of a query. This is analogous to the upper layers of an HNSW, which help quickly approach the neighborhood of the query. By using the bottom layer of the meta-HNSW for dataset partitioning, we ensure that items in the same sub-dataset are similar to each other.
III-C The Generality of Pyramid
One important design goal of Pyramid is generality, which means that Pyramid should work for search with popular similarity functions including Euclidean distance, angular distance and inner product. As HNSW was originally designed for Euclidean distance, Algorithm 3 and Algorithm 4 naturally work for Euclidean NNS by using the negative Euclidean distance as the similarity function. Angular distance (i.e., the angle between the query and the item) is monotone in Euclidean distance when both the query and the item have unit norm, which suggests that we can transform angular similarity search into Euclidean NNS. By normalizing the items to unit norm before index construction and normalizing the query before query processing, Algorithm 3 and Algorithm 4 also work for angular distance. Although MIPS can also be transformed into Euclidean NNS, it has been shown that the transformation harms the performance of proximity graph based methods. In the following discussion, we analyze the problems of directly using Algorithm 3 and Algorithm 4 for MIPS and present our solutions.
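The monotone relation between angular and Euclidean distance for unit-norm vectors follows from expanding the squared distance. The symbols $q$ (query), $x$ (item) and $\theta$ (the angle between them) are notation chosen here for illustration:

```latex
\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\, q^\top x
            = 2 - 2\cos\theta
\qquad \text{when } \|q\| = \|x\| = 1 .
```

Since $\cos\theta$ decreases as $\theta$ grows on $[0, \pi]$, a smaller angle always corresponds to a smaller Euclidean distance, so both similarity functions rank the normalized items in the same order.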
Pyramid for MIPS. One interesting property of MIPS is that items with large norm are very likely to be the search results, which we show in Figure 3. For the ImageNet dataset, which contains about 2 million 150-dimensional descriptors of images, we found the exact top-10 MIPS results of 1,000 randomly selected queries with linear scan. This gives us a result set containing 10,000 items (without deduplication), and we calculated the percentage of this result set taken up by items in different norm percentiles. For example, the first bar (from left to right) in Figure 3 means that items ranking in the top 5% in norm take up 93.1% of the result set. This phenomenon can be explained by the fact that high dimensional vectors tend to have a large angle with each other. Thus, the influence of norm on inner product is decisive since $q^\top x = \|q\| \|x\| \cos\theta$, in which $\theta$ is the angle between the query $q$ and the item $x$. This is in contrast with Euclidean NNS, for which each item should have equal probability of being the search result if the query comes from the same distribution as the dataset items.
The bias towards items with large norm in MIPS causes problems for both dataset partitioning and query processing. Items with large norm tend to be strongly connected with each other in the inner product HNSW as they are likely to be the results of MIPS. In Algorithm 3, we partition the meta-HNSW by minimizing the number of cross edges, which means the large norm items will be put into the same graph partition (we denote this partition as the large norm partition). For dataset partitioning, most items will find their MIPS in the large norm partition, and this means that one of the sub-datasets will be much larger than the others. This can cause the worker holding the large sub-dataset to run out of memory. For query processing, the large norm partition is very likely to contain the top MIPS results of most queries in the meta-HNSW search, which makes the worker holding the large norm partition much more heavily loaded than the other workers, so that it may become a straggler in the system.
To solve the aforementioned problems, we use Algorithm 5 to build the index for MIPS. Note that Algorithm 5 uses inner product as the similarity function for all operations. There are two main changes in Algorithm 5 compared with Algorithm 3. Firstly, the sampled items are normalized to unit norm in line 4 and the kmeans is spherical kmeans in line 5, which ensures that the centers have unit norm. Therefore, all items in the meta-HNSW have unit norm and each graph partition contains vectors pointing in similar directions. In lines 8-11, each item in the original dataset is assigned to the graph partition that contains its MIPS. As all meta-HNSW vectors have unit norm, the MIPS of an item is also the vector most similar to it in direction. Thus, each sub-dataset covers items pointing in similar directions, which avoids the problem that the large norm partition attracts many more items than the others. The problem that the large norm partition is hot for query processing is also solved, as the query is assigned to sub-datasets that are similar to it in direction and no sub-dataset is more likely to attract queries than the others.
In lines 12-15, Algorithm 5 introduces an additional item assignment stage compared with Algorithm 3, and the motivation is to reduce the number of sub-datasets needed to process a query. We have already assigned items pointing in similar directions to the same sub-dataset. However, items with large norm are likely to be the results of MIPS, and it is possible that these items are scattered across many sub-datasets. This means that the query needs to access all these sub-datasets to achieve high recall, which hurts query processing throughput. In lines 12-15, we add the top MIPS neighbors of each meta-HNSW vector to its corresponding sub-dataset, which enables items with large norm to be assigned to multiple sub-datasets. Empirically, we found that setting the number of added neighbors to a relatively small fraction (e.g., several percent) of the dataset cardinality already leads to good performance, which means that the memory overhead of this allocation is small. Finding the top MIPS neighbors of a vector in the original dataset can be done approximately and efficiently using LSH-based methods [16, 4]. The query processing procedure for MIPS is exactly the same as Algorithm 4.
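The two assignment stages for MIPS can be sketched as below. This is a simplified illustration, not Algorithm 5 itself: it assumes the unit-norm centers and their partition ids are already given (i.e., spherical kmeans and graph partitioning are done), and it finds the top MIPS neighbors by brute force, whereas the text suggests LSH-based methods for that step. The function and parameter names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def inner(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def assign_for_mips(dataset, unit_centers, part, w, extra_per_center):
    """Assign items by direction, then replicate high-inner-product items."""
    subs = [set() for _ in range(w)]
    # Stage 1: each item joins the partition holding its MIPS among the centers.
    for item_id, vec in enumerate(dataset):
        best = max(range(len(unit_centers)), key=lambda j: inner(vec, unit_centers[j]))
        subs[part[best]].add(item_id)
    # Stage 2: each center adds its top MIPS neighbors to its own sub-dataset
    # (brute force here; an assumption made to keep the sketch self-contained).
    for j, center in enumerate(unit_centers):
        ranked = sorted(range(len(dataset)), key=lambda i: inner(dataset[i], center), reverse=True)
        subs[part[j]].update(ranked[:extra_per_center])
    return subs
```

In the toy example below, the large norm item (id 2) ends up in both sub-datasets, while the two small norm items stay with the center they point towards:

```python
dataset = [(1.0, 0.1), (0.1, 1.0), (5.0, 5.0)]
unit_centers = [normalize((1.0, 0.0)), normalize((0.0, 1.0))]
print(assign_for_mips(dataset, unit_centers, part=[0, 1], w=2, extra_per_center=1))
# prints [{0, 2}, {1, 2}]
```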
IV System Design and Implementation
In this section, we introduce the system architecture and API of Pyramid. The fault tolerance and straggler mitigation strategies of Pyramid are also discussed. We will make Pyramid open source.
IV-A System Architecture and API
As a distributed similarity search system, Pyramid consists of three kinds of major components, i.e., coordinators, executors and brokers. We plot the architecture of Pyramid in Figure 4. The coordinators receive queries from upstream applications (e.g., product recommendation and image search), search the meta-HNSW with the queries and send query processing requests to the executors with the relevant sub-HNSWs. The executors conduct search on their own sub-HNSWs with the received query processing requests and return the partial results to the coordinators. The coordinators merge these partial results to obtain the final results. The brokers handle the delivery of the query processing requests from the coordinators to the executors and use Kafka for reliable message passing over an unreliable network. A Zookeeper cluster is used to monitor the workers in the system to detect failures. The Kafka brokers also internally rely on Zookeeper.
Pyramid provides a set of high-level APIs for query processing and index building, which hide the details of distributed processing from users. The Coordinator and Executor classes are central to query processing, while the GraphConstructor class is used for index building. We introduce the three classes as follows.
Coordinator. Listing 1 shows a summary of the Coordinator class. To construct a Coordinator object, a user passes the broker list, the path to the meta-HNSW, the dataset name and the similarity metric definition to the constructor. After that, the execute(query, para) method can be called to process a query. Alternatively, the execute_async(query, para, callback) method can be used to execute a query asynchronously. It returns immediately without waiting for the final results, and the callback is invoked when the final results are available. para provides the parameters for query processing, including the branching factor and the number of required neighbors. Typically, a user writes a program that receives queries from upstream applications using custom protocols (e.g., a RESTful API), injects the queries into the system through the coordinator API, and returns the results to the upstream applications.
Executor. Listing 2 shows a summary of the Executor class. To construct an executor, the programmer passes the broker list, the path and ID of a sub-HNSW, the dataset name and the similarity function definition to the constructor. After that, the start(para) method can be called to start handling the query processing requests, in which para provides the parameters for query processing (e.g., the search factor for the bottom layer graph search and the maximum number of similarity function computations for a query). Unlike the coordinators, which receive queries from upstream applications through custom protocols, the executors typically do not involve any custom logic. Therefore, a standalone program is provided to directly run an executor without any programming effort.
GraphConstructor. Listing 3 summarizes the GraphConstructor class, which is used to build the index for Pyramid. It takes the dataset path and the similarity metric as input and outputs the meta-HNSW and the sub-HNSWs. para provides the parameters for index construction, such as the size of the meta-HNSW and the number of sub-HNSWs. The refresh() method reads the dataset again, reconstructs the graphs and notifies the coordinators and executors, which is useful for updating the index when there are changes in the dataset.
IV-B Straggler Mitigation and Fault Tolerance
Stragglers and failures are common in large-scale distributed systems. Pyramid relies on replication (which means that the same sub-HNSW is replicated on multiple workers) and Kafka to achieve robustness against stragglers and failures.
Straggler Mitigation. The coordinators push query processing requests to the executors through the Kafka brokers, and each sub-HNSW forms a Kafka topic. The executors serving the same sub-HNSW form a group, and all of them subscribe to the corresponding topic. Stragglers are handled automatically by the message distribution mechanism of Kafka, which periodically re-balances the message queues of the executors. Therefore, the workload of a slow executor is reduced because it receives fewer query processing requests, and the requests are offloaded to other executors serving the same sub-HNSW. Besides straggler mitigation, this design also supports elastic scalability, which means that executors serving the same sub-HNSW can be dynamically added to or removed from the system. In this case, Kafka simply redistributes the message queues among the executors. Currently, Pyramid does not handle straggling coordinators. We assume that the upstream applications use a mechanism (e.g., hashing) to evenly distribute the workload among the coordinators. Moreover, the workload of the coordinators is much lighter compared with that of the executors, so the influence of a straggling coordinator may not be significant.
Failure Recovery. There are two kinds of failures in Pyramid, i.e., coordinator failure and executor failure. Pyramid does not handle the on-going queries when a coordinator fails and relies on the upstream applications to retry with another coordinator upon timeout. Note that executors return the partial results to the respective coordinators through a bare network connection instead of using the Kafka brokers. Therefore, a coordinator serving a retried query simply redoes the whole job (generating the query processing requests and sending them to the executors). If the partial results were sent through Kafka, the new coordinator would need to handle partial states and the retry procedure would be complicated. If there are multiple executors serving the same sub-HNSW and one of them fails, the query processing requests for this sub-HNSW will be handled by the other executors due to the message dispatching mechanism of Kafka.
Pyramid uses Zookeeper to track the states of the system, and a Master monitors the states kept on Zookeeper. On Zookeeper, each running instance (coordinator or executor) should lock a corresponding file. Once the Master finds that a file is unlocked, it restarts the corresponding instance on an available machine. The new instance locks the corresponding file and starts serving. As the failed instance may recover by itself, the new instance exits immediately when it finds that the corresponding file is locked. If the failed instance recovers after the new instance takes up its responsibility (by locking the file), it simply exits. To avoid the Master becoming a single point of failure, there are several hot backups of the Master. A Master serves only if it can lock a file on Zookeeper. The hot backups monitor the file and take up the responsibility of the original Master if the file is unlocked.
V Experimental Results
In this section, we evaluate the performance of Pyramid with extensive experiments. First, we explore the influence of the parameters on the performance of Pyramid. Second, we compare Pyramid with two distributed similarity search solutions, HNSW-naive and FLANN. Finally, we evaluate the performance of Pyramid in the presence of stragglers and failures.
V-A Performance Metrics and Experimental Setting
We mainly use three performance metrics in our evaluation, i.e., precision, throughput and latency. Their definitions are given as follows.
Precision. For top-$k$ similarity search, an algorithm is allowed to return $k$ items. If $k'$ of these $k$ items belong to the ground-truth top-$k$ neighbors of a query, the precision is said to be $k'/k$. Precision measures the quality of similarity search results and higher precision means better quality. Note that precision is a more stringent metric than the Recall@k used in [29, 30], which is 1 if the items returned by an algorithm contain the top-1 neighbor of a query.
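The difference between the two metrics can be made concrete with a short sketch. The function names and the example item ids below are illustrative assumptions, not part of the evaluation code used in the paper.

```python
def precision(returned, ground_truth):
    """Fraction of the ground-truth top-k neighbors among the k returned items."""
    k = len(ground_truth)
    if k == 0:
        raise ValueError("ground truth must contain at least one item")
    hits = len(set(returned[:k]) & set(ground_truth))
    return hits / k


def top1_recall(returned, ground_truth_ranked):
    """1.0 if the true nearest neighbor is among the returned items, else 0.0."""
    return 1.0 if ground_truth_ranked[0] in returned else 0.0


if __name__ == "__main__":
    truth = [7, 3, 9, 1]      # hypothetical ground-truth top-4, best first
    result = [7, 8, 9, 2]     # hypothetical search result
    print(precision(result, truth), top1_recall(result, truth))
```

A result that contains the true nearest neighbor but misses half of the remaining neighbors scores 1.0 on the top-1 recall criterion yet only 0.5 on precision, which is why precision is the stricter measure.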
Throughput and Latency. Query processing throughput is the number of queries that can be processed by a system per second. Query processing latency is the interval between the time a system receives a query and the time the system returns the similarity search results.
Unless otherwise stated, we searched for the top-10 neighbors of a query in the experiments. The reported precision is the average over 10,000 randomly selected queries. For latency, we report the 90th percentile instead of the average, as it better reflects the worst-case performance of the system and is more critical to online applications. For the meta-HNSW and the sub-HNSWs in Pyramid, we set their parameters according to the recommendation in the HNSW paper and did not fine-tune them. Specifically, the maximum out-degree of an item was set to 32 for the bottom layer graph and to 16 for the other layers. The search factor for graph walk on the bottom layer was set to 100.
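As a minimal sketch of the latency metric, the 90th percentile can be computed with the nearest-rank method. The choice of the nearest-rank definition is an assumption made here; the text does not state which percentile definition was used, and the function name is hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Unlike the average, this value is not pulled down by the many fast queries, so a few slow queries show up directly in the reported number.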
We used the three datasets listed in Table I for our experiments. Deep500M was sampled from the Deep1B dataset and contains descriptors of images generated by the last fully connected layer of a deep neural network. The SIFT500M dataset was sampled from the SIFT1B dataset and contains 128-dimension SIFT descriptors of images. The Tiny10M dataset was sampled from the Tiny80M dataset (https://groups.csail.mit.edu/vision/TinyImages/) and contains the GIST descriptors of the Tiny images. Both Deep500M and SIFT500M contain items with very similar Euclidean norms, so conducting MIPS on them is not interesting as it is equivalent to Euclidean NNS. In contrast, Tiny10M has a wide spread in its Euclidean norm distribution and is more suitable for testing the performance of MIPS. Therefore, we conducted MIPS on Tiny10M and Euclidean NNS on Deep500M and SIFT500M.
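The equivalence between MIPS and Euclidean NNS for equal-norm items follows from expanding the squared Euclidean distance. The symbols $q$ (query), $x$ (dataset item) and $c$ (the shared norm) are notation introduced here for illustration:

```latex
\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\, q^\top x
\quad\Longrightarrow\quad
\arg\min_{x} \|q - x\| = \arg\max_{x} q^\top x
\quad \text{when } \|x\| = c \text{ for all } x .
```

Since $\|q\|^2$ is fixed for a given query and $\|x\|^2 = c^2$ is the same for every item, only the inner product term affects the ranking. When the norms are only approximately equal, the two rankings agree approximately.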
All experiments were conducted on a cluster of 10 machines connected with 10 Gbps Ethernet. Each machine is equipped with two 8-core Intel Xeon E5-2620v4 2.1GHz processors and 48GB RAM, running on CentOS 7.2. As the cluster contains 10 machines, we use 10 sub-HNSWs in the experiments.
Table I. Columns: Name, # items, # dimensions, size (GB).
V-B Influence of the Parameters
In this set of experiments, we evaluated the influence of the meta-HNSW related parameters on the performance of Pyramid. Recall that the meta-HNSW has two main parameters, i.e., size (the number of vectors in the bottom layer graph) and branching factor (the number of top neighbors in the meta-HNSW that are used to choose the sub-HNSWs for a query). On Deep500M and SIFT500M and for Euclidean NNS, we experimented with meta-HNSW sizes of 1,000, 10,000 and 100,000 and branching factor values of 1, 5, 10, 20, 50 and 100. Different aspects of Pyramid’s performance under these configurations are reported in Figure 5 to Figure 8.
Figure 5 reports the average access rate of the queries under different branching factors, where the access rate is the fraction of sub-HNSWs that are accessed for processing a query. Under the same meta-HNSW size, the access rate increases with the branching factor, as a larger branching factor means that more neighbors in the meta-HNSW are used to choose the sub-HNSWs according to Algorithm 4. For the same branching factor, a larger meta-HNSW size results in a lower access rate. This is because a larger meta-HNSW offers a more fine-grained partitioning of the dataset such that the top neighbors of a query in the meta-HNSW are contained in a smaller number of sub-datasets.
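The access rate can be sketched as follows. The sketch assumes that every meta-HNSW vertex is assigned to exactly one sub-dataset and that a query is sent to the sub-HNSWs owning its top neighbors in the meta-HNSW; Algorithm 4 is not reproduced here, and the names `select_sub_hnsws`, `access_rate` and `partition_of` are hypothetical.

```python
def select_sub_hnsws(top_neighbors, partition_of):
    """Sub-HNSWs to search: the partitions owning the query's top meta-HNSW neighbors."""
    return {partition_of[vertex] for vertex in top_neighbors}


def access_rate(top_neighbors, partition_of, num_sub_hnsws):
    """Fraction of sub-HNSWs accessed for one query."""
    return len(select_sub_hnsws(top_neighbors, partition_of)) / num_sub_hnsws


if __name__ == "__main__":
    # Hypothetical assignment of four meta-HNSW vertices to sub-datasets.
    partition_of = {0: 0, 1: 0, 2: 1, 3: 2}
    print(access_rate([0, 1], partition_of, 10))     # neighbors share one partition
    print(access_rate([0, 2, 3], partition_of, 10))  # neighbors span three partitions
```

Using more top neighbors can only add partitions to the selected set, so the access rate never decreases as the branching factor grows.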
Figure 6 reports the precision of the search results under different configurations. The precision first increases rapidly with the branching factor and then stabilizes. Moreover, the precision is higher for a smaller meta-HNSW size under the same branching factor. This is because more sub-HNSWs are accessed to process a query for a smaller meta-HNSW according to Figure 5. Combining Figure 5 and Figure 6, we can also conclude that Pyramid provides high quality search results with a low sub-HNSW access rate. For example, the precisions on both Deep500M and SIFT500M are above 65% with a branching factor of 1, under which only one sub-HNSW is accessed. This result verifies the effectiveness of the meta-HNSW based dataset partitioning and query assignment strategies in Pyramid.
Figure 7 reports the query processing throughput under different branching factors. The results show that the throughput consistently drops when the branching factor increases. This is because a larger branching factor results in a higher access rate, which means that more sub-HNSWs are searched to answer a query and thus the per-query workload is heavier. Although the access rate is lower for a larger meta-HNSW size under the same branching factor according to Figure 5, meta-HNSW 100,000 does not always achieve higher throughput than meta-HNSW 10,000. This is because searching a larger meta-HNSW has higher complexity: we observed that meta-HNSW search takes 0.06ms and 0.18ms per query for meta-HNSW 10,000 and meta-HNSW 100,000, respectively. As the difference between the access rates of meta-HNSW 10,000 and meta-HNSW 100,000 is not large, the higher meta-HNSW search complexity could outweigh the benefits of a lower access rate. This phenomenon also suggests that further increasing the meta-HNSW size beyond 100,000 may degrade the performance due to even higher meta-HNSW search complexity. Therefore, Pyramid does not require a large meta-HNSW to achieve the optimal performance, which is favorable as the meta-HNSW needs to be replicated on every coordinator and a large meta-HNSW takes up more memory.
Figure 8 reports the 90th percentile of the query processing latency under different configurations. The results show that the latency increases with the branching factor. This is because a coordinator needs to aggregate the results from more executors due to the higher access rate under a large branching factor, and the latency depends on the maximum latency among these executors. The influence of the meta-HNSW size on latency is more complicated, as a large meta-HNSW reduces the access rate but requires a longer time for searching the meta-HNSW.
In conclusion, this set of experiments shows that distributed similarity search with Pyramid can provide high quality results (achieving a precision above 90%), support high throughput (over 100,000 queries per second) and achieve low latency (2 to 3 milliseconds). As Pyramid performs well with a meta-HNSW size of 10,000 on the two datasets we experimented with, we use a meta-HNSW size of 10,000 in all the subsequent experiments.
V-C Comparison with Other Methods
In this set of experiments, we compared Pyramid with two other distributed similarity search solutions, i.e., HNSW-naive and FLANN. We are aware that there are other distributed similarity search solutions, such as PLSH and SPTAG. We did not compare with them because PLSH is not open-source and SPTAG replicates the entire dataset on every machine and cannot support large datasets. As introduced in Section III, HNSW-naive randomly partitions a dataset among the workers and builds a sub-HNSW on each worker. Therefore, a query needs to be handled by all the workers. FLANN is a widely used library for similarity search and supports distributed similarity search with KD trees. Similar to HNSW-naive, FLANN randomly partitions the data among workers and builds an index on each worker. HNSW-naive and Pyramid used the same maximum out-degree configuration for the sub-HNSWs. To enable a fair throughput comparison between Pyramid and HNSW-naive, we adjusted their query processing parameters (branching factor and search factor for Pyramid and search factor for HNSW-naive) to achieve a precision of approximately 90%. As it is difficult for FLANN to achieve a 90% precision, we report both its precision and throughput under the setting recommended in .
The throughput and precision results of the systems on Deep500M and SIFT500M are reported in Figure 9. The results show that Pyramid achieves a throughput that is more than 2x that of HNSW-naive at a similar precision. This is because Pyramid uses a dataset partitioning and query assignment strategy to avoid searching the sub-HNSWs in all workers for processing a query. As each query generates less workload in Pyramid, its throughput is much higher than that of HNSW-naive. Moreover, Pyramid and HNSW-naive outperform FLANN in both throughput and precision due to the algorithmic advantage of HNSW over the tree-based method adopted in FLANN. In particular, the throughput of Pyramid is two orders of magnitude higher than that of FLANN, which shows the benefits of building distributed similarity search solutions based on the state-of-the-art similarity search algorithm.
It is worth noting that the better performance of Pyramid comes at the cost of more expensive index building. For the Deep500M dataset, Pyramid took about 162 minutes for index building using 10 machines, with 31 minutes for meta-HNSW construction, 87 minutes for dataset partitioning and 44 minutes for sub-HNSW construction. As a comparison, HNSW-naive took 53 minutes to build the index, with 14 minutes for dataset partitioning and 39 minutes for sub-HNSW construction. The longer index building time of Pyramid is mainly caused by dataset partitioning using the meta-HNSW, which needs to search the meta-HNSW with every item. For FLANN, index building is extremely fast and took only 38 seconds. As index building is conducted off-line, Pyramid is suitable for applications where dataset updates are not frequent and online search performance is critical.
We report the performance of Pyramid on MIPS for the Tiny10M dataset in Figure 10. As there are no distributed similarity search solutions that support MIPS, we report the performance of HNSW-naive as the baseline. HNSW-naive achieves a precision of 99.7% and a throughput of 12,732 queries per second. In comparison, the throughput of Pyramid is much higher under similar precision thanks to its low access rate. Recall that we allow an item to appear in multiple sub-datasets for MIPS in Algorithm 5. Due to this design, Pyramid achieves a precision of 96.98% under a branching factor of 1, which means that the sub-HNSW in only one of the 10 machines is accessed for processing each query. Note that we did not replicate a lot of items across the machines, which would take up a lot of memory. For this experiment, we set the replication factor to 300 and all the sub-HNSWs end up storing a total of 10,060,599 items, which is only 0.6% larger than the original Tiny10M dataset. We believe allowing some items to appear in multiple sub-datasets can be an effective measure to reduce the access rate for distributed similarity search, and adopting this idea for Euclidean NNS will be an interesting direction for future study.
V-D Scalability, Straggler Mitigation and Fault Tolerance
In this set of experiments, we tested the scalability and robustness of Pyramid. The experiments were conducted on the SIFT100M dataset, which was sampled from the SIFT500M dataset.
To check the scalability of Pyramid, we tested its throughput on the SIFT100M dataset with 10 machines and 5 machines. For a fair comparison of throughput, we adjusted the query processing parameters (e.g., branching factor and search factor) to ensure that both configurations achieved the same precision (80% and 90%). We report the results in Figure 11. The throughput with 10 machines is 1.78 and 1.59 times that with 5 machines under a precision of 80% and 90%, respectively. The reason that Pyramid does not achieve linear scaling could be due to the characteristics of HNSW. As introduced in Section II, the search complexity of HNSW scales as $O(\log N)$, with $N$ being the cardinality of the dataset. To achieve the same precision, the 5 machine configuration needs to access a smaller number of sub-HNSWs but the number of items in each sub-HNSW is larger compared with the 10 machine configuration. As the search complexity of HNSW increases slowly with the dataset cardinality, the 5 machine configuration has a smaller per-query complexity than its 10 machine counterpart because the influence of using a smaller number of sub-HNSWs is dominant.
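The scaling argument can be checked with a small calculation. It assumes that SIFT100M contains 100 million items (as its name suggests), that items are split evenly among the sub-HNSWs, and that the per-sub-HNSW search cost is proportional to the logarithm of its size; the function names are hypothetical and constant factors are ignored.

```python
import math


def scaling_efficiency(speedup, machine_ratio):
    """Measured speedup divided by the ideal (linear) speedup."""
    return speedup / machine_ratio


def per_sub_hnsw_cost(num_items, num_sub_hnsws):
    """Relative search cost of one sub-HNSW under a log-cardinality cost model."""
    return math.log2(num_items / num_sub_hnsws)


if __name__ == "__main__":
    # Speedups reported in the text when going from 5 to 10 machines.
    print(scaling_efficiency(1.78, 10 / 5))  # at 80% precision
    print(scaling_efficiency(1.59, 10 / 5))  # at 90% precision

    n = 100_000_000  # assumed size of SIFT100M
    cost_5 = per_sub_hnsw_cost(n, 5)
    cost_10 = per_sub_hnsw_cost(n, 10)
    print(cost_5 / cost_10)  # only slightly above 1
```

Under this model, halving the size of each sub-HNSW reduces its search cost by only a few percent, so the cost of a query is governed mainly by how many sub-HNSWs it touches rather than by how large each one is.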
To test the performance of Pyramid in the presence of a straggler, we used the CPU-limit tool to constrain the CPU usage of one of the machines. In this experiment, we created two copies of each sub-HNSW and assigned them to two different machines. Moreover, we also ensured that each machine hosted two different sub-HNSWs. We configured the system to run at 70% of its peak throughput and measured the throughput of the queries that would access the sub-HNSWs hosted on the CPU-limited machine. The results in Figure 12 show that there is no significant change in the throughput when the CPU share is above 30%. This is because the two sub-HNSWs hosted on the CPU-limited machine had replicas on other machines and Kafka offloaded the queries to these machines through its message dispatching mechanism. As the system ran at only 70% of its peak throughput, the other machines had idle resources to serve the offloaded queries. When the CPU share was too low (e.g., 10%), the throughput decreased significantly as too many queries were offloaded and the other machines did not have sufficient resources to process them. However, this extreme case is not common in practice and Pyramid can provide a steady throughput in the presence of a straggler using replication.
We report the performance of Pyramid under failure in Figure 13. The results show that the query processing throughput dropped when one machine was killed (at around the 300th second). At around the 500th second, when the failed machine rejoined query processing, the throughput dropped again because Kafka needed to re-balance the message queues for the machines. When the re-balancing finished at approximately the 600th second, the throughput returned to the level before the failure. The results thus show that Pyramid can effectively handle failure.
VI Related Work
Tree-based methods were first proposed for similarity search. However, it was found that these algorithms are prone to the curse of dimensionality. Locality sensitive hashing (LSH) based methods [15, 2, 41] were then proposed to use random hash functions to map the items to buckets and provide theoretical guarantees on the quality of the search results. Recently, vector quantization (VQ) techniques, such as PQ, OPQ and CQ, were proposed, which learn vector codebooks from the data and support both fast similarity function computation and data compression. As the VQ methods are data-dependent, they usually provide better performance than the data-independent LSH-based methods. More recently, the proximity graph based methods [20, 21, 22] were shown to significantly outperform other methods. Although similarity search research traditionally focuses on Euclidean NNS, there is an increasing interest in MIPS due to its many applications [16, 43]. It was found that proximity graphs also provide the best performance for MIPS. Pyramid builds its distributed solution based on HNSW, the state-of-the-art proximity graph method, and supports popular similarity functions including Euclidean distance and inner product.
Large-scale solutions. There are a number of works that address similarity search on large-scale datasets for applications such as image search. FAISS uses VQ techniques to compress billion-scale datasets to fit in GPU memory and leverages the computation power of GPUs to reduce query processing latency. Link&Code uses VQ techniques and a customized interpretation algorithm to compress large datasets to fit in the main memory and builds an HNSW on the compressed vectors. VQ techniques are used for data compression and the IVFADC index structure is used for candidate generation in . These methods are all based on a single machine and the data compression affects the quality of the search results. Moreover, they also have problems in scaling to the very large datasets (e.g., with trillions of items) that we may encounter in the future.
VII Conclusions
In this paper, we presented Pyramid, a general, efficient and robust framework for distributed similarity search on large datasets. Pyramid was developed based on HNSW, the state-of-the-art algorithm for similarity search. We devised effective data partitioning and query assignment strategies to improve query processing throughput and latency. Pyramid is general and can work with popular similarity functions including Euclidean distance, angular distance and inner product. Pyramid is also robust to stragglers and failures due to its system design. Experimental results show that Pyramid provides high quality search results on large datasets and achieves high query processing throughput and low latency.
-  Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence, 40(4):769–790, 2017.
-  Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM, 2004.
-  Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.
-  Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pages 2321–2329, 2014.
-  James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In , pages 1–8. IEEE, 2007.
-  Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. Low-shot learning with large-scale diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3349–3358, 2018.
-  Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
-  Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology, 33(6):623, 2015.
-  Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. Kore: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 545–554. ACM, 2012.
-  Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. Hierarchical memory networks. arXiv preprint arXiv:1605.07427, 2016.
-  Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109, 2017.
-  K Fukunaga and Patrenahalli M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE transactions on computers, (7):750–753, 1975.
-  Hosagrahar V Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 30(2):364–397, 2005.
-  Parikshit Ram and Alexander G Gray. Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 931–939. ACM, 2012.
-  Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
-  Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product search. arXiv preprint arXiv:1410.5518, 2014.
-  Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
-  Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2946–2953, 2013.
-  Artem Babenko and Victor Lempitsky. The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1247–1260, 2014.
-  Kiana Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi, and Hong Zhang. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
-  Ben Harwood and Tom Drummond. Fanng: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5713–5722, 2016.
-  Yury A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 2018.
-  Jing Wang, Jingdong Wang, Gang Zeng, Rui Gan, Shipeng Li, and Baining Guo. Fast neighborhood graph search using cartesian concatenation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2128–2135, 2013.
-  Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment, 12(5):461–474, 2019.
-  Marius Muja and David G Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence, 36(11):2227–2240, 2014.
-  Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14):1930–1941, 2013.
-  Wanxin Zhang, Dongsheng Li, Ying Xu, and Yiming Zhang. Shuffle-efficient distributed locality sensitive hashing on spark. In 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 766–767. IEEE, 2016.
-  Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 861–864. IEEE, 2011.
-  Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
-  Matthijs Douze, Alexandre Sablayrolles, and Hervé Jégou. Link and code: Fast indexing with graphs and compact regression codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3646–3654, 2018.
-  Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. To index or not to index: Optimizing exact maximum inner product search.
-  Dmitry Baranchuk, Artem Babenko, and Yury Malkov. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), pages 202–216, 2018.
-  Stanislav Morozov and Artem Babenko. Non-metric similarity graphs for maximum inner product search. In Advances in Neural Information Processing Systems, pages 4721–4730, 2018.
-  Peter Sanders and Christian Schulz. Think locally, act globally: Perfectly balanced graph partitioning. arXiv preprint arXiv:1210.0477, 2012.
-  Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, and Yoshua Bengio. Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910, 2015.
-  Roy T Fielding and Richard N Taylor. Architectural styles and the design of network-based software architectures, volume 7. University of California, Irvine Doctoral dissertation, 2000.
-  Artem Babenko and Victor Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2055–2063, 2016.
-  Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 861–864. IEEE, 2011.
-  Qi Chen, Haidong Wang, Mingqin Li, Gang Ren, Scarlett Li, Jeffery Zhu, Jason Li, Chuanjie Liu, Lintao Zhang, and Jingdong Wang. SPTAG: A library for fast approximate nearest neighbor search, 2018.
-  Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, volume 98, pages 194–205, 1998.
-  Jinfeng Li, Xiao Yan, Jian Zhang, An Xu, James Cheng, Jie Liu, Kelvin KW Ng, and Ti-chung Cheng. A general and efficient querying method for learning to hash. In Proceedings of the 2018 International Conference on Management of Data, pages 1333–1347. ACM, 2018.
-  Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest neighbor search. In ICML, volume 2, page 3, 2014.
-  Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. Norm-ranging lsh for maximum inner product search. In Advances in Neural Information Processing Systems, pages 2952–2961, 2018.