A library for efficient similarity search and clustering of dense vectors.
Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55 peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.READ FULL TEXT VIEW PDF
We explore the use of GPU for accelerating large scale nearest neighbor
Billion-scale high-dimensional approximate nearest neighbour (ANN) searc...
We present FLASH ( Fast LSH Algorithm for Similarity
k-nearest neighbor graph is a key data structure in many disciplines suc...
We study an indexing architecture to store and search in a database of
Making top-K predictions for state-of-the-art Matrix Factorization model...
Data series are a special type of multidimensional data present in numer...
A library for efficient similarity search and clustering of dense vectors.
Images and videos constitute a new massive source of data for indexing and search. Extensive metadata for this content is often not available. Search and interpretation of this and other human-generated content, like text, is difficult and important. A variety of machine learning and deep learning algorithms are being used to interpret and classify these complex, real-world entities. Popular examples include the text representation known as word2vec
, representations of images by convolutional neural networks[39, 19], and image descriptors for instance search . Such representations or embeddings are usually real-valued, high-dimensional vectors of 50 to 1000+ dimensions. Many of these vector representations can only effectively be produced on GPU systems, as the underlying processes either have high arithmetic complexity and/or high data bandwidth demands , or cannot be effectively partitioned without failing due to communication overhead or representation quality . Once produced, their manipulation is itself arithmetically intensive. However, how to utilize GPU assets is not straightforward. More generally, how to exploit new heterogeneous architectures is a key subject for the database community .
In this context, searching by numerical similarity rather than via structured relations is more suitable. This could be to find the most similar content to a picture, or to find the vectors that have the highest response to a linear classifier on all vectors of a collection.
One of the most expensive operations to be performed on large collections is to compute a -NN graph. It is a directed graph where each vector of the database is a node and each edge connects a node to its nearest neighbors. This is our flagship application. Note, state of the art methods like NN-Descent  have a large memory overhead on top of the dataset itself and cannot readily scale to the billion-sized databases we consider.
Such applications must deal with the curse of dimensionality , rendering both exhaustive search or exact indexing for non-exhaustive search impractical on billion-scale databases. This is why there is a large body of work on approximate search and/or graph construction. To handle huge datasets that do not fit in RAM, several approaches employ an internal compressed representation of the vectors using an encoding. This is especially convenient for memory-limited devices like GPUs. It turns out that accepting a minimal accuracy loss results in orders of magnitude of compression . The most popular vector compression methods can be classified into either binary codes [18, 22], or quantization methods [25, 37]. Both have the desirable property that searching neighbors does not require reconstructing the vectors.
Our paper focuses on methods based on product quantization (PQ) codes, as these were shown to be more effective than binary codes . In addition, binary codes incur important overheads for non-exhaustive search methods . Several improvements were proposed after the original product quantization proposal known as IVFADC ; most are difficult to implement efficiently on GPU. For instance, the inverted multi-index , useful for high-speed/low-quality operating points, depends on a complicated “multi-sequence” algorithm. The optimized product quantization or OPQ 
is a linear transformation on the input vectors that improves the accuracy of the product quantization; it can be applied as a pre-processing. The SIMD-optimized IVFADC implementation from operates only with sub-optimal parameters (few coarse quantization centroids). Many other methods, like LOPQ and the Polysemous codes [27, 16] are too complex to be implemented efficiently on GPUs.
There are many implementations of similarity search on GPUs, but mostly with binary codes , small datasets , or exhaustive search [14, 40, 41]. To the best of our knowledge, only the work by Wieschollek et al.  appears suitable for billion-scale datasets with quantization codes. This is the prior state of the art on GPUs, which we compare against in Section 6.4.
This paper makes the following contributions:
a GPU -selection algorithm, operating in fast register memory and flexible enough to be fusable with other kernels, for which we provide a complexity analysis;
a near-optimal algorithmic layout for exact and approximate -nearest neighbor search on GPU;
a range of experiments that show that these improvements outperform previous art by a large margin on mid- to large-scale nearest-neighbor search tasks, in single or multi-GPU configurations.
The paper is organized as follows. Section 2 introduces the context and notation. Section 3 reviews GPU architecture and discusses problems appearing when using it for similarity search. Section 4 introduces one of our main contributions, i.e., our k-selection method for GPUs, while Section 5 provides details regarding the algorithm computation layout. Finally, Section 6 provides extensive experiments for our approach, compares it to the state of the art, and shows concrete use cases for image collections.
We are concerned with similarity search in vector collections. Given the query vector and the collection222To avoid clutter in 0-based indexing, we use the array notation to denote the range inclusive. , we search:
i.e., we search the nearest neighbors of in terms of L2 distance. The L2 distance is used most often, as it is optimized by design when learning several embeddings (e.g., ), due to its attractive linear algebra properties.
The lowest distances are collected by -selection. For an array , -selection finds the lowest valued elements , , along with the indices , , of those elements from the input array. The will be 32-bit floating point values; the are 32- or 64-bit integers. Other comparators are sometimes desired; e.g.
, for cosine similarity we search forhighest values. The order between equivalent keys is not specified.
Typically, searches are performed in batches of query vectors in parallel, which allows for more flexibility when executing on multiple CPU threads or on GPU. Batching for -selection entails selecting elements and indices from separate arrays, where each array is of a potentially different length .
The exact solution computes the full pairwise distance matrix . In practice, we use the decomposition
The two first terms can be precomputed in one pass over the matrices and whose rows are the and . The bottleneck is to evaluate , equivalent to the matrix multiplication . The -nearest neighbors for each of the queries are -selected along each row of .
From now on, we focus on approximate nearest-neighbor search. We consider, in particular, the IVFADC indexing structure . The IVFADC index relies on two levels of quantization, and the database vectors are encoded. The database vector is approximated as:
where and are quantizers; i.e., functions that output an element from a finite set. Since the sets are finite, is encoded as the index of and that of . The first-level quantizer is a coarse quantizer and the second level fine quantizer encodes the residual vector after the first level.
The Asymmetric Distance Computation (ADC) search method returns an approximate result:
For IVFADC the search is not exhaustive. Vectors for which the distance is computed are pre-selected depending on the first-level quantizer :
The multi-probe parameter is the number of coarse-level centroids we consider. The quantizer operates a nearest-neighbor search with exact distances, in the set of reproduction values. Then, the IVFADC search computes
Hence, IVFADC relies on the same distance estimations as the two-step quantization of ADC, but computes them only on a subset of vectors.
The corresponding data structure, the inverted file, groups the vectors into inverted lists with homogeneous . Therefore, the most memory-intensive operation is computing , and boils down to linearly scanning inverted lists.
The quantizers and have different properties. needs to have a relatively low number of reproduction values so that the number of inverted lists does not explode. We typically use , trained via -means. For , we can afford to spend more memory for a more extensive representation. The ID of the vector (a 4- or 8-byte integer) is also stored in the inverted lists, so it makes no sense to have shorter codes than that; i.e., .
We use a product quantizer  for , which provides a large number of reproduction values without increasing the processing cost. It interprets the vector as sub-vectors , where is an even divisor of the dimension . Each sub-vector is quantized with its own quantizer, yielding the tuple …, . The sub-quantizers typically have 256 reproduction values, to fit in one byte. The quantization value of the product quantizer is then , which from a storage point of view is just the concatenation of the bytes produced by each sub-quantizer. Thus, the product quantizer generates -byte codes with reproduction values. The -means dictionaries of the quantizers are small and quantization is computationally cheap.
This section reviews salient details of Nvidia’s general-purpose GPU architecture and programming model . We then focus on one of the less GPU-compliant parts involved in similarity search, namely the -selection, and discuss the literature and challenges.
The Nvidia GPU is a general-purpose computer that executes instruction streams using a 32-wide vector of CUDA threads (the warp); individual threads in the warp are referred to as lanes, with a lane ID from 0 – 31. Despite the “thread” terminology, the best analogy to modern vectorized multicore CPUs is that each warp is a separate CPU hardware thread, as the warp shares an instruction counter. Warp lanes taking different execution paths results in warp divergence, reducing performance. Each lane has up to 255 32-bit registers in a shared register file. The CPU analogy is that there are up to 255 vector registers of width 32, with warp lanes as SIMD vector lanes.
A user-configurable collection of 1 to 32 warps comprises a block or a co-operative thread array (CTA). Each block has a high speed shared memory, up to 48 KiB in size. Individual CUDA threads have a block-relative ID, called a thread id, which can be used to partition and assign work. Each block is run on a single core of the GPU called a streaming multiprocessor (SM). Each SM has functional units, including ALUs, memory load/store units, and various special instruction units. A GPU hides execution latencies by having many operations in flight on warps across all SMs. Each individual warp lane instruction throughput is low and latency is high, but the aggregate arithmetic throughput of all SMs together is 5 – 10 higher than typical CPUs.
Blocks are organized in a grid of blocks in a kernel. Each block is assigned a grid relative ID. The kernel is the unit of work (instruction stream with arguments) scheduled by the host CPU for the GPU to execute. After a block runs through to completion, new blocks can be scheduled. Blocks from different kernels can run concurrently. Ordering between kernels is controllable via ordering primitives such as streams and events.
The number of blocks executing concurrently depends upon shared memory and register resources used by each block. Per-CUDA thread register usage is determined at compilation time, while shared memory usage can be chosen at runtime. This usage affects occupancy on the GPU. If a block demands all 48 KiB of shared memory for its private usage, or 128 registers per thread as opposed to 32, then only 1 – 2 other blocks can run concurrently on the same SM, resulting in low occupancy. Under high occupancy more blocks will be present across all SMs, allowing more work to be in flight at once.
Different blocks and kernels communicate through global memory, typically 4 – 32 GB in size, with 5 – 10 higher bandwidth than CPU main memory. Shared memory is analogous to CPU L1 cache in terms of speed. GPU register file memory is the highest bandwidth memory. In order to maintain the high number of instructions in flight on a GPU, a vast register file is also required: 14 MB in the latest Pascal P100, in contrast with a few tens of KB on CPU. A ratio of 250 : 6.25 : 1 for register to shared to global memory aggregate cross-sectional bandwidth is typical on GPU, yielding 10 – 100s of TB/s for the register file .
Shared and register memory usage involves efficiency tradeoffs; they lower occupancy but can increase overall performance by retaining a larger working set in a faster memory. Making heavy use of register-resident data at the expense of occupancy or instead of shared memory is often profitable .
As the GPU register file is very large, storing structured data (not just temporary operands) is useful. A single lane can use its (scalar) registers to solve a local task, but with limited parallelism and storage. Instead, lanes in a GPU warp can instead exchange register data using the warp shuffle instruction, enabling warp-wide parallelism and storage.
A common pattern to achieve this is a lane-stride register array
lane-stride register array. That is, given elements , each successive value is held in a register by neighboring lanes. The array is stored in registers per lane, with a multiple of 32. Lane stores , while register holds .
For manipulating the , the register in which is stored (i.e., ) and must be known at assembly time, while the lane (i.e., ) can be runtime knowledge. A wide variety of access patterns (shift, any-to-any) are provided; we use the butterfly permutation  extensively.
-selection algorithms, often for arbitrarily large and , can be translated to a GPU, including radix selection and bucket selection , probabilistic selection , quickselect , and truncated sorts . Their performance is dominated by multiple passes over the input in global memory. Sometimes for similarity search, the input distances are computed on-the-fly or stored only in small blocks, not in their entirety. The full, explicit array might be too large to fit into any memory, and its size could be unknown at the start of the processing, rendering algorithms that require multiple passes impractical. They suffer from other issues as well. Quickselect requires partitioning on a storage of size , a data-dependent memory movement. This can result in excessive memory transactions, or requiring parallel prefix sums to determine write offsets, with synchronization overhead. Radix selection has no partitioning but multiple passes are still required.
In similarity search applications, one is usually interested only in a small number of results, or so. In this regime, selection via max-heap is a typical choice on the CPU, but heaps do not expose much data parallelism (due to serial tree update) and cannot saturate SIMD execution units. The ad-heap  takes better advantage of parallelism available in heterogeneous systems, but still attempts to partition serial and parallel work between appropriate execution units. Despite the serial nature of heap update, for small the CPU can maintain all of its state in the L1 cache with little effort, and L1 cache latency and bandwidth remains a limiting factor. Other similarity search components, like PQ code manipulation, tend to have greater impact on CPU performance .
Heaps can be similarly implemented on a GPU . However, a straightforward GPU heap implementation suffers from high warp divergence and irregular, data-dependent memory movement, since the path taken for each inserted element depends upon other values in the heap.
GPU parallel priority queues  improve over the serial heap update by allowing multiple concurrent updates, but they require a potential number of small sorts for each insert and data-dependent memory movement. Moreover, it uses multiple synchronization barriers through kernel launches in different streams, plus the additional latency of successive kernel launches and coordination with the CPU host.
Other more novel GPU algorithms are available for small , namely the selection algorithm in the fgknn library . This is a complex algorithm that may suffer from too many synchronization points, greater kernel launch overhead, usage of slower memories, excessive use of hierarchy, partitioning and buffering. However, we take inspiration from this particular algorithm through the use of parallel merges as seen in their merge queue structure.
For any CPU or GPU algorithm, either memory or arithmetic throughput should be the limiting factor as per the roofline performance model . For input from global memory, -selection cannot run faster than the time required to scan the input once at peak memory bandwidth. We aim to get as close to this limit as possible. Thus, we wish to perform a single pass over the input data (from global memory or produced on-the-fly, perhaps fused with a kernel that is generating the data).
We want to keep intermediate state in the fastest memory: the register file. The major disadvantage of register memory is that the indexing into the register file must be known at assembly time, which is a strong constraint on the algorithm.
We use an in-register sorting primitive as a building block. Sorting networks are commonly used on SIMD architectures , as they exploit vector parallelism. They are easily implemented on the GPU, and we build sorting networks with lane-stride register arrays.
We use a variant of Batcher’s bitonic sorting network , which is a set of parallel merges on an array of size . Each merge takes arrays of length ( and a power of 2) to arrays of length , using parallel steps. A bitonic sort applies this merge recursively: to sort an array of length , merge arrays of length to arrays of length , to arrays of length , successively to sorted array of length , leading to parallel merge steps.
If some input data is already sorted, we can modify the network to avoid merging steps. We may also not have a full power-of-2 set of data, in which case we can efficiently shortcut to deal with the smaller size.
Algorithm 1 is an odd-sized merging network that merges already sorted left and right arrays, each of arbitrary length. While the bitonic network merges bitonic sequences, we start with monotonic sequences: sequences sorted monotonically. A bitonic merge is made monotonic by reversing the first comparator stage.
The odd size algorithm is derived by considering arrays to be padded to the next highest power-of-2 size with dummy elements that are never swapped (the merge is monotonic) and are already properly positioned; any comparisons with dummy elements are elided. A left array is considered to be padded with dummy elements at the start; a right array has them at the end. A merge of two sorted arrays of lengthand to a sorted array of requires parallel steps. Figure 1 shows Algorithm 1’s merging network for arrays of size 5 and 3, with 4 parallel steps.
The compare-swap is implemented using warp shuffles on a lane-stride register array. Swaps with a stride a multiple of 32 occur directly within a lane as the lane holds both elements locally. Swaps of stride or a non-multiple of 32 occur with warp shuffles. In practice, used array lengths are multiples of 32 as they are held in lane-stride arrays.
Algorithm 2 extends the merge to a full sort. Assuming no structure present in the input data, parallel steps are required for sorting data of length .
Our -selection implementation, WarpSelect, maintains state entirely in registers, requires only a single pass over data and avoids cross-warp synchronization. It uses merge-odd and sort-odd as primitives. Since the register file provides much more storage than shared memory, it supports . Each warp is dedicated to -selection to a single one of the arrays . If is large enough, a single warp per each will result in full GPU occupancy. Large per warp is handled by recursive decomposition, if is known in advance.
Our approach (Algorithm 3 and Figure 2) operates on values, with associated indices carried along (omitted from the description for simplicity). It selects the least values that come from global memory, or from intermediate value registers if fused into another kernel providing the values. Let be the sequence provided for selection.
The elements (on the left of Figure 2) are processed in groups of 32, the warp size. Lane is responsible for processing ; thus, if the elements come from global memory, the reads are contiguous and coalesced into a minimal number of memory transactions.
Each lane maintains a small queue of elements in registers, called the thread queues , ordered from largest to smallest (). The choice of is made relative to , see Section 4.3. The thread queue is a first-level filter for new values coming in. If a new is greater than the largest key currently in the queue, , it is guaranteed that it won’t be in the smallest final results.
The warp shares a lane-stride register array of smallest seen elements, , called the warp queue. It is ordered from smallest to largest (); if the requested is not a multiple of 32, we round it up. This is a second level data structure that will be used to maintain all of the smallest warp-wide seen values. The thread and warp queues are initialized to maximum sentinel values, e.g., .
The three invariants maintained are:
all per-lane are not in the min-
all per-lane are greater than all warp queue keys
all seen so far in the min- are contained in either some lane’s thread queue (), or in the warp queue.
Lane receives a new and attempts to insert it into its thread queue. If , then the new pair is by definition not in the minimum, and can be rejected.
Otherwise, it is inserted into its proper sorted position in the thread queue, thus ejecting the old . All lanes complete doing this with their new received pair and their thread queue, but it is now possible that the second invariant have been violated. Using the warp ballot instruction, we determine if any lane has violated the second invariant. If not, we are free to continue processing new elements.
If any lane has its invariant violated, then the warp uses odd-merge to merge and sort the thread and warp queues together. The new warp queue will be the min- elements across the merged, sorted queues, and the new thread queues will be the remainder, from min- to min-. This restores the invariants and we are free to continue processing subsequent elements.
Since the thread and warp queues are already sorted, we merge the sorted warp queue of length with 32 sorted arrays of length . Supporting odd-sized merges is important because Batcher’s formulation would require that and is a power-of-2; thus if , must be 32. We found that the optimal is way smaller (see below).
Using odd-merge to merge the 32 already sorted thread queues would require a struct-of-arrays to array-of-structs transposition in registers across the warp, since the successive sorted values are held in different registers in the same lane rather than a lane-stride array. This is possible , but would use a comparable number of warp shuffles, so we just reinterpret the thread queue registers as an (unsorted) lane-stride array and sort from scratch. Significant speedup is realizable by using odd-merge for the merge of the aggregate sorted thread queues with the warp queue.
If there are remainder elements because is not a multiple of 32, those are inserted into the thread queues for the lanes that have them, after which we proceed to the output stage.
A final sort and merge is made of the thread and warp queues, after which the warp queue holds all min- values.
For each incoming group of 32 elements, WarpSelect can perform 1, 2 or 3 constant-time operations, all happening in warp-wide parallel time:
read 32 elements, compare to all thread queue heads , cost , happens times;
if , , perform insertion sort on those specific thread queues, cost , happens times;
if , sort and merge queues, cost , happens times.
Thus, the total cost is . , and on random data drawn independently, and , see the Appendix for a full derivation. Hence, the trade-off is to balance a cost in and one in . The practical choice for given and was made by experiment on a variety of -NN data. For , we use , uses , uses , and uses , all irrespective of .
This section explains how IVFADC, one of the indexing methods originally built upon product quantization , is implemented efficiently. Details on distance computations and articulation with -selection are the key to understanding why this method can outperform more recent GPU-compliant approximate nearest neighbor strategies .
We briefly come back to the exhaustive search method, often referred to as exact brute-force. It is interesting on its own for exact nearest neighbor search in small datasets. It is also a component of many indexes in the literature. In our case, we use it for the IVFADC coarse quantizer .
As stated in Section 2, the distance computation boils down to a matrix multiplication. We use optimized GEMM routines in the cuBLAS library to calculate the term for L2 distance, resulting in a partial distance matrix . To complete the distance calculation, we use a fused -selection kernel that adds the term to each entry of the distance matrix and immediately submits the value to -selection in registers. The term need not be taken into account before -selection. Kernel fusion thus allows for only 2 passes (GEMM write, -select read) over , compared to other implementations that may require 3 or more. Row-wise -selection is likely not fusable with a well-tuned GEMM kernel, or would result in lower overall efficiency.
As does not fit in GPU memory for realistic problem sizes, the problem is tiled over the batch of queries, with queries being run in a single tile. Each of the tiles are independent problems, but we run two in parallel on different streams to better occupy the GPU, so the effective memory requirement of is . The computation can similarly be tiled over . For very large input coming from the CPU, we support buffering with pinned memory to overlap CPU to GPU copy with GPU compute.
At its core, the IVFADC requires computing the distance from a vector to a set of product quantization reproduction values. By developing Equation (6) for a database vector , we obtain:
If we decompose the residual vectors left after as:
then the distance is rewritten as:
Each quantizer has 256 reproduction values, so when and are known all distances can be precomputed and stored in tables each of size 256 . Computing the sum (10) consists of look-ups and additions. Comparing the cost to compute distances:
Explicit computation: mutiply-adds;
With lookup tables: multiply-adds and lookup-adds.
This is the key to the efficiency of the product quantizer. In our GPU implementation, is any multiple of 4 up to 64. The codes are stored as sequential groups of bytes per vector within lists.
When scanning over the elements of the inverted list (where by definition is constant), the look-up table method can be applied, as the query and are known.
The objective is to minimize inner loop computations. The computations we can do in advance and store in lookup tables are as follows:
Term 1 is independent of the query. It can be precomputed from the quantizers, and stored in a table of size ;
Term 2 is the distance to ’s reproduction value. It is thus a by-product of the first-level quantizer ;
Term 3 can be computed independently of the inverted list. Its computation costs multiply-adds.
This decomposition is used to produce the lookup tables used during the scan of the inverted list. For a single query, computing the tables from scratch costs multiply-adds, while this decomposition costs multiply-adds and additions. On the GPU, the memory usage of can be prohibitive, so we enable the decomposition only when memory is a not a concern.
Algorithm 4 summarizes the process as one would implement it on a CPU. The inverted lists are stored as two separate arrays, for PQ codes and associated IDs. IDs are resolved only if -selection determines -nearest membership. This lookup yields a few sparse memory reads in a large array, thus the IDs can optionally be stored on CPU for tiny performance cost.
A kernel is responsible for scanning the closest inverted lists for each query, and calculating the per-vector pair distances using the lookup tables . The are stored in shared memory: up to lookups are required for a query set (trillions of accesses in practice), and are random access. This limits to at most 48 (32-bit floating point) or 96 (16-bit floating point) with current architectures. In case we do not use the decomposition of Equation (11), the are calculated by a separate kernel before scanning.
Each pairs of query against inverted list can be processed independently. At one extreme, a block is dedicated to each of these, resulting in up to partial results being written back to global memory, which is then -selected to final results. This yields high parallelism but can exceed available GPU global memory; as with exact search, we choose a tile size to reduce memory consumption, bounding its complexity by with multi-streaming.
A single warp could be dedicated to -selection of each set of lists, which could result in low parallelism. We introduce a two-pass -selection, reducing to partial results for some subdivision factor . This is reduced again via -selection to the final results.
As with exact search, we experimented with a kernel that dedicates a single block to scanning all lists for a single query, with -selection fused with distance computation. This is possible as WarpSelect does not fight for the shared memory resource which is severely limited. This reduces global memory write-back, since almost all intermediate results can be eliminated. However, unlike -selection overhead for exact computation, a significant portion of the runtime is the gather from the in shared memory and linear scanning of the from global memory; the write-back is not a dominant contributor. Timing for the fused kernel is improved by at most 15%, and for some problem sizes would be subject to lower parallelism and worse performance without subsequent decomposition. Therefore, and for reasons of implementation simplicity, we do not use this layout.
Modern servers can support several GPUs. We employ this capability for both compute power and memory.
If an index instance fits in the memory of a single GPU, it can be replicated across different GPUs. To query vectors, each replica handles a fraction of the queries, joining the results back together on a single GPU or in CPU memory. Replication has near linear speedup, except for a potential loss in efficiency for small .
If an index instance does not fit in the memory of a single GPU, an index can be sharded across different GPUs. For adding vectors, each shard receives of the vectors, and for query, each shard handles the full query set , joining the partial results (an additional round of -selection is still required) on a single GPU or in CPU memory. For a given index size , sharding will yield a speedup (sharding has a query of against versus replication with a query of against ), but is usually less than pure replication due to fixed overhead and cost of subsequent -selection.
Replication and sharding can be used together ( shards, each with replicas for GPUs in total). Sharding or replication are both fairly trivial, and the same principle can be used to distribute an index across multiple machines.
This section compares our GPU -selection and nearest-neighbor approach to existing libraries. Unless stated otherwise, experiments are carried out on a 22.8GHz Intel Xeon E5-2680v2 with 4 Maxwell Titan X GPUs on CUDA 8.0.
We compare against two other GPU small -selection implementations: the row-based Merge Queue with Buffered Search and Hierarchical Partition extracted from the fgknn library of Tang et al.  and Truncated Bitonic Sort (TBiS) from Sismanis et al. . Both were extracted from their respective exact search libraries.
We evaluate -selection for and 1000 of each row from a row-major matrix of random 32-bit floating point values on a single Titan X. The batch size is fixed at 10000, and the array lengths vary from 1000 to 128000. Inputs and outputs to the problem remain resident in GPU memory, with the output being of size , with corresponding indices. Thus, the input problem sizes range from 40 MB ( = ) to 5.12 GB ( = k). TBiS requires large auxiliary storage, and is limited to in our tests.
Figure 3 shows our relative performance against TBiS and fgknn. It also includes the peak possible performance given by the memory bandwidth limit of the Titan X. The relative performance of WarpSelect over fgknn increases for larger ; even TBiS starts to outperform fgknn for larger at . We look especially at the largest . WarpSelect is faster at , at . Performance against peak possible drops off for all implementations at larger . WarpSelect operates at 55% of peak at but only 16% of peak at . This is due to additional overhead assocated with bigger thread queues and merge/sort networks for large .
WarpSelect is influenced by fgknn, but has several improvements: all state is maintained in registers (no shared memory), no inter-warp synchronization or buffering is used, no “hierarchical partition”, the -selection can be fused into other kernels, and it uses odd-size networks for efficient merging and sorting.
The exact search method with can be used by a -means clustering method in the assignment stage, to assign training vectors to centroids. Despite the fact that it does not use the IVFADC and selection is trivial (a parallel reduction is used for the case, not WarpSelect), -means is a good benchmark for the clustering used to train the quantizer .
We apply the algorithm on MNIST8m images. The 8.1M images are graylevel digits in 28x28 pixels, linearized to vectors of 784-d. We compare this -means implementation to the GPU -means of BIDMach , which was shown to be more efficient than several distributed -means implementations that require dozens of machines333BIDMach numbers from https://github.com/BIDData/BIDMach/wiki/Benchmarks#KMeans. Both algorithms were run for 20 iterations. Table 1 shows that our implementation is more than 2 faster, although both are built upon cuBLAS. Our implementation receives some benefit from the -selection fusion into L2 distance computation. For multi-GPU execution via replicas, the speedup is close to linear for large enough problems (3.16 for 4 GPUs with 4096 centroids). Note that this benchmark is somewhat unrealistic, as one would typically sub-sample the dataset randomly when so few centroids are requested.
|BIDMach ||1||320 s||735 s|
|Ours||1||140 s||316 s|
|Ours||4||84 s||100 s|
We can also compare to , an approximate CPU method that clusters 128-d vectors to 85k centroids. Their clustering method runs in 46 minutes, but requires 56 minutes (at least) of pre-processing to encode the vectors. Our method performs exactk-means on 4 GPUs in 52 minutes without any pre-processing.
We consider a classical dataset used to evaluate nearest neighbor search: Sift1M . Its characteristic sizes are , , . Computing the partial distance matrix costs Tflop, which runs in less than one second on current GPUs. Figure 4 shows the cost of the distance computations against the cost of our tiling of the GEMM for the term of Equation 2 and the peak possible -selection performance on the distance matrix of size , which additionally accounts for reading the tiled result matrix at peak memory bandwidth.
for -selection, the naive algorithm that sorts the full result array for each query using
thrust::sort_by_key is more than slower than the comparison methods;
L2 distance and -selection cost is dominant for all but our method, which has 85 % of the peak possible performance, assuming GEMM usage and our tiling of the partial distance matrix on top of GEMM is close to optimal. The cuBLAS GEMM itself has low efficiency for small reduction sizes ();
Our fused L2/-selection kernel is important. Our same exact algorithm without fusion (requiring an additional pass through ) is at least 25% slower.
Efficient -selection is even more important in situations where approximate methods are used to compute distances, because the relative cost of -selection with respect to distance computation increases.
There are few studies on GPU-based approximate nearest-neighbor search on large datasets (). We report a few comparison points here on index search, using standard datasets and evaluation protocol in this field.
For the sake of completeness, we first compare our GPU search speed on Sift1M with the implementation of Wieschollek et al. . They obtain a nearest neighbor recall at 1 (fraction of queries where the true nearest neighbor is in the top 1 result) of R@1 = 0.51, and R@100 = 0.86 in 0.02 ms per query on a Titan X. For the same time budget, our implementation obtains R@1 = 0.80 and R@100 = 0.95.
We compare again with Wieschollek et al., on the Sift1B dataset  of 1 billion SIFT image features at . We compare the search performance in terms of same memory usage for similar accuracy (more accurate methods may involve greater search time or memory usage). On a single GPU, with bytes per vector, R@10 = 0.376 in 17.7 s per query vector, versus their reported R@10 = 0.35 in 150 s per query vector. Thus, our implementation is more accurate at a speed 8.5 faster.
We also experimented on the Deep1B dataset  of =1 billion CNN representations for images at . The paper that introduces the dataset reports CPU results (1 thread): R@1 = 0.45 in 20 ms search time per vector. We use a PQ encoding of , with via OPQ , and , which uses a comparable dataset storage as the original paper (20 GB). This requires multiple GPUs as it is too large for a single GPU’s global memory, so we consider 4 GPUs with , . We obtain a R@1 = 0.4517 in 0.0133 ms per vector. While the hardware platforms are different, it shows that making searches on GPUs is a game-changer in terms of speed achievable on a single machine.
An example usage of our similarity search method is to construct a -nearest neighbor graph of a dataset via brute force (all vectors queried against the entire index).
We evaluate the trade-off between speed, precision and memory on two datasets: 95 million images from the Yfcc100M dataset  and Deep1B. For Yfcc100M, we compute CNN descriptors as the one-before-last layer of a ResNet , reduced to = 128 with PCA.
The evaluation measures the trade-off between:
Speed: How much time it takes to build the IVFADC index from scratch and construct the whole -NN graph () by searching nearest neighbors for all vectors in the dataset. Thus, this is an end-to-end test that includes indexing as well as search time;
Quality: We sample 10,000 images for which we compute the exact nearest neighbors. Our accuracy measure is the fraction of 10 found nearest neighbors that are within the ground-truth 10 nearest neighbors.
For Yfcc100M, we use a coarse quantizer ( centroids), and consider 16, 32 and 64 byte PQ encodings for each vector. For Deep1B, we pre-process the vectors to via OPQ, use and consider 20, 40. For a given encoding, we vary from 1 to 256, to obtain trade-offs between efficiency and quality, as seen in Figure 5.
For Yfcc100M we used , . An accuracy of more than 0.8 is obtained in 35 minutes. For Deep1B, a lower-quality graph can be built in 6 hours, with higher quality in about half a day. We also experimented with more GPUs by doubling the replica set, using 8 Maxwell M40s (the M40 is roughly equivalent in performance to the Titan X). Performance is improved sub-linearly ( for , for ).
For comparison, the largest -NN graph construction we are aware of used a dataset comprising 36.5 million 384-d vectors, which took a cluster of 128 CPU servers 108.7 hours of compute , using NN-Descent . Note that NN-Descent could also build or refine the -NN graph for the datasets we consider, but it has a large memory overhead over the graph storage, which is already 80 GB for Deep1B. Moreover it requires random access across all vectors (384 GB for Deep1B).
The largest GPU -NN graph construction we found is a brute-force construction using exact search with GEMM, of a dataset of 20 million 15,000-d vectors, which took a cluster of 32 Tesla C2050 GPUs 10 days . Assuming computation scales with GEMM cost for the distance matrix, this approach for Deep1B would take an impractical 200 days of computation time on their cluster.
When a -NN graph has been constructed for an image dataset, we can find paths in the graph between any two images, provided there is a single connected component (this is the case). For example, we can search the shortest path between two images of flowers, by propagating neighbors from a starting image to a destination image. Denoting by and the source and destination images, and the distance between nodes, we search the path with and such that
i.e., we want to favor smooth transitions. An example result is shown in Figure 6 from Yfcc100M444The mapping from vectors to images is not available for Deep1B. It was obtained after 20 seconds of propagation in a -NN graph with neighbors. Since there are many flower images in the dataset, the transitions are smooth.
The arithmetic throughput and memory bandwidth of GPUs are well into the teraflops and hundreds of gigabytes per second. However, implementing algorithms that approach these performance levels is complex and counter-intuitive. In this paper, we presented the algorithmic structure of similarity search methods that achieves near-optimal performance on GPUs.
This work enables applications that needed complex approximate algorithms before. For example, the approaches presented here make it possible to do exact -means clustering or to compute the -NN graph with simple brute-force approaches in less time than a CPU (or a cluster of them) would take to do this approximately.
GPU hardware is now very common on scientific workstations, due to their popularity for machine learning algorithms. We believe that our work further demonstrates their interest for database applications. Along with this work, we are publishing a carefully engineered implementation of this paper’s algorithms, so that these GPUs can now also be used for efficient similarity search.
Proc. International Conference on Computer Vision, pages 1502–1510, 2015.
Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 3069–3076, June 2012.
Proc. Empirical Methods on Natural Language Processing, pages 1898–1907. ACL, 2013.
Efficient computation of k-nearest neighbor graphs for large high-dimensional data sets on gpu clusters.Master’s thesis, University of Wisconsin Milwaukee, August 2013.
Deep image retrieval: Learning global representations for image search.In Proc. European Conference on Computer Vision, pages 241–257, 2016.
We derive the average number of times updates are triggered in WarpSelect, for use in Section 4.3.
Let the input to -selection be a sequence (1-based indexing), a randomly chosen permutation of a set of distinct elements. Elements are read sequentially in groups of size (the warp; in our case, ); assume is a multiple of , so . Recall that is the thread queue length. We call elements prior to or at position in the min- seen so far the successive min- (at ). The likelihood that is in the successive min- at is:
as each , has a chance as all permutations are equally likely, and all elements in the first qualify.
In a given lane, an insertion sort is triggered if the incoming value is in the successive min- values, but the lane has “seen” only values, where
is the previous won warp ballot. The probability of this happening is:
The approximation considers that the thread queue has seen all the values, not just those assigned to its lane. The probability of any lane triggering an insertion sort is then:
Here the approximation is a first-order Taylor expansion. Summing up the probabilities over gives an expected number of insertions of .
We seek , the expected number of full sorts required for WarpSelect.
For now, we assume , so . Let be the probability that in an sequence , exactly of the elements as encountered by a sequential scanner () are in the successive min-. Given , there are places where these successive min- elements can occur. It is given by a recurrence relation:
The last case is the probability of: there is a sequence with successive min- elements preceding us, and the current element is in the successive min-, or the current element is not in the successive min-, ones are before us. We can then develop a recurrence relationship for . Note that
for where is the fraction of all sequences of length that will force sorts of data by winning the thread queue ballot, as there have to be to elements in the successive min- for these sorts to happen (as the min- elements will overflow the thread queues). There are at most won ballots that can occur, as it takes separate sequential current min- seen elements to win the ballot. is thus the expectation of this over all possible :
This can be computed by dynamic programming.
Analytically, note that for , , is the harmonic number , which converges to (the Euler-Mascheroni constant ) as .
For , or , as the first elements are in the successive min-, and the expectation for the rest is .
For , note that there are some number , of successive min- determinations made for each possible . The number of won ballots for each case is by definition , as the thread queue must fill up times. Thus, .
The case is complicated by the fact that there are joint probabilities to consider (if more than one of the workers triggers a sort for a given group, only one sort takes place). However, the likelihood can be bounded. Let be the expected won ballots assuming no mutual interference between the workers for winning ballots (i.e., we win ballots if there are workers that independently win a ballot at a single step), but with the shared min- set after each sort from the joint sequence. Assume that . Then:
where the likelihood of the workers seeing a successive min- element has an upper bound of that of the first worker at each step. As before, the number of won ballots is scaled by , so . Mutual interference can only reduce the number of ballots, so we obtain the same upper bound for .