Log In Sign Up

Fast top-K Cosine Similarity Search through XOR-Friendly Binary Quantization on GPUs

by   Xiaozheng Jian, et al.

We explore the use of GPU for accelerating large scale nearest neighbor search and we propose a fast vector-quantization-based exhaustive nearest neighbor search algorithm that can achieve high accuracy without any indexing construction specifically designed for cosine similarity. This algorithm uses a novel XOR-friendly binary quantization method to encode floating-point numbers such that high-complexity multiplications can be optimized as low-complexity bitwise operations. Experiments show that, our quantization method takes short preprocessing time, and helps make the search speed of our exhaustive search method much more faster than that of popular approximate nearest neighbor algorithms when high accuracy is needed.


page 1

page 2

page 3

page 4


Exact and/or Fast Nearest Neighbors

Prior methods for retrieval of nearest neighbors in high dimensions are ...

Transformed Residual Quantization for Approximate Nearest Neighbor Search

The success of product quantization (PQ) for fast nearest neighbor searc...

Billion-scale similarity search with GPUs

Similarity search finds application in specialized database systems hand...

Low-Precision Quantization for Efficient Nearest Neighbor Search

Fast k-Nearest Neighbor search over real-valued vector spaces (KNN) is a...

Bolt: Accelerated Data Mining with Fast Vector Compression

Vectors of data are at the heart of machine learning and data mining. Re...

Comparing apples to apples in the evaluation of binary coding methods

We discuss methodological issues related to the evaluation of unsupervis...

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

We introduce a novel dictionary optimization method for high-dimensional...

1. Introduction

It is hard to find specific content in massive resource library. Generally, these contents can be transformed into vectors of different lengths using proper embedding algorithms. The state-of-the-art examples include, for text data, word2vec (Mikolov et al., 2013)

, and for image data, convolutional neural network

(Sharif Razavian et al., 2014; Gong et al., 2014). Using these embedding vectors of dozens to thousands of dimensions, the distance from queries to every entry in the database can be calculated, and the nearest ones can be found. The true problem is how to find the most similar contents from an arbitrary query in a large database when users request it with low delay. This is also the most computationally expensive part of many algorithms diversified in biology (gene classification (Pan et al., 2004)

), computer vision (local image feature matching

(Lowe, 2004)), speech recognition (content-based music search (Li and Ogihara, 2004)) and many other fields.

We hope that the both speed and accuracy of the search algorithms can be high even employed on large datasets. When working with high-dimensional features, which is often the case in computer vision applications, there is no known exact nearest-neighbor search algorithm that has acceptable performance. To obtain a speed improvement, researchers developed approximate search algorithms (Aumüller et al., 2017). Generally these algorithms can provide 80 percent or more of the correct neighbors, and be much faster than exact search. When a even higher proportion of correctness is required, for example 90 percent or 98 percent, the speed of most approximate search algorithms drops quickly. Furthermore, Some of these algorithms need to train a codebook for indexing before searching, which is also a time-consuming part.

Since the data size is becoming extremely large and the embedding vectors can be long to keep more information, the distance calculation now requires a great number of basic arithmetic calculations and comparisons, but the calculation for pairs of embedding vectors does not affect each other. In this decade, great development on GPU computation has been achieved. GPUs are much skilled at parallel computing for simple calculation than CPUs, which meets the case of nearest-neighbor search problem. Exact search algorithm (Garcia et al., 2010) and some approximate search algorithms (Pan and Manocha, 2011; Johnson et al., 2019) have already been implemented on GPU and obtained great improvements on performance.

In this paper, we propose a fast and accurate algorithm implemented on GPU for approximate nearest-neighbor search problem. A binary quantization method is proposed to compress floating-point numbers into 3- or 4-bit binary codes without training. Cosine similarity calculations of vectors are simplified to exclusive OR (XOR) operations of binary codes and is further optimized based on the parallel characteristics of GPU. Based on the quantization method and optimized calculations, cosine similarities of normalized data points can then be fast calculated on GPU when both data size and vector length are large. As training is not needed for quantization, this method will be useful for the situations where dataset distribution changes rapidly or only few queries are asked in large datasets.

This paper makes the following contributions:

  • We propose a new quantization method to encode number within into arbitrary bits (Section 3).

  • We provide a novel view that transforming multiplication into bitwise XOR operation, and use this transformation to accelerate multiplications in limited scope (Section 3).

  • We propose a train-free algorithm to implement GPU exhaustive kNN-Selection on large datasets, which based on cosine similarity and has a series of parameters controlling the accuracy and speed (Section 3 & 4).

  • We conduct real-data experiments that show that the proposed algorithm has a state-of-the-art searching efficiency and high accuracy on large-scale nearest-neighbor search tasks. The algorithm is also extensible on multi-GPU configurations (Section 5).

2. Related Work

Generally, K-nearest neighbor search is to find the top most similar vectors in vectors for each vector query given the distance metric. Here each vector has components, and in this paper we specify the metric with descending cosine similarity, defined by the inner product of two normalized vectors. This section presents a review of previous work in this area.

The brute force way to solve the problem is to calculate pairwise distance between the query vector and each alternative vector and use a minimum heap to store the top nearest vector. This way costs great computing resources (with time complexity ). Garcia et al. (Garcia et al., 2010)

implemented parallel brute force algorithm on NVIDIA GPU using CUDA and CUBLAS, showing that the speed can be 25x and 62x faster than highly optimized ANN C++ library implemented on CPU. As this migration cannot really reduce the resource consumption on computation, researchers have been providing solutions to calculate the approximate nearest neighbors with high precision but much lower time complexity. Most of these techniques are target on reduce the search space. We revise the most widely used K-NN search techniques, classified in three categories: hashing based techniques, partitioning trees and nearest neighbor graph techniques.

The best-known hashing-based techniques might be local sensitive hashing (LSH) (Andoni and Indyk, 2006), in which many hash functions, with property that elements with similar hashes are more likely to be similar, are used. Variants of LSH such as multi-probe LSH (Lv et al., 2007) and LSH Forest (Bawa et al., 2005) help improve the performance of these techniques. As the performance of LSH is highly related to the quality of hashing functions, huge work focuses on improving hashing methods (Shakhnarovich et al., 2003; Wang et al., 2010; Xu et al., 2011). Pan implemented LSH based nearest-neighbor search on GPU (Pan and Manocha, 2011), making searching much faster. As LSH is highly sensitive to the hashing function we choose, we are not going to compare it with our method in the experiment section. However, combining LSH with our method to further accelerate the search speed is possible, as we will mention later.

Partitioning trees is also a popular technique for approximate nearest-neighbor search. KD tree (Silpa-Anan and Hartley, 2008) is one of the best known nearest-neighbor algorithms. It is effective on datasets with low dimensionality but gets poor performance on datasets with high dimensionality. Gong encoded image matrix into binary values to find similar results and achieved good search recalls for image datasets (Gong et al., 2012, 2013). This method does not provide good results for non-classification problems. Other methods like annoy (Bernhardsson and others, 2018), ball tree (Liu et al., 2006; Omohundro, 1989)

use decision tree to make searching an

level job.

Herve Jegou proposed product quantization (PQ) to provide a short code representation of vectors (Jegou et al., 2010) and improved the searching efficiency by IVFADC algorithm (Jégou et al., 2011). In product quantization, space is decomposed into a Cartesian product of low dimensional subspaces, and data points are represented by compact codes computed as quantization indices in these subspaces. A codebook needs to be trained by a training dataset with distribution similar to the population before indexing. The training phase can take a long time when the training set is large. The compact codes are then efficiently compared to the query points using an asymmetric approximate distance in the search phase. Using an inverted file system, PQ can help efficiently search nearest neighbors on high-dimensional datasets. Inverted multi-index (IMI) proposed by Babenko and Lempitsky (Babenko and Lempitsky, 2014), which replaces the standard quantization in an inverted index with product quantization, obtains a denser subdivision of the search space. These methods are efficient at searching on large datasets with high dimensionally and are now used and accelerated by GPU in Facebook’s Faiss library (Johnson et al., 2019). As the new approach proposed in this paper is also a vector-quantization based technique, we will compare our results with the IVFADC version of PQ based nearest-neighbor searching.

Nearest neighbor graph methods is based on the thought that, when there comes a query, we start to calculate the distance from a random point to the query, and try to search along the ”steepest descent direction” of distance between points on the direction and the query. In practice, a graph structure in which points are vertices and edges connect each point to its friend points is built. For each point in the graph, the friend points are likely to be close to it. The query points are used to explore this graph using various strategies in order to get closer to their nearest neighbors. As an optimized graph must be built, these graph methods also have train phases to build graph and take long time on training when the dataset is huge. Malkov raised an efficient and robust searching algorithm using Hierarchical Navigable Small World (HNSW) graphs (Malkov and Yashunin, 2018). HNSW is one of the best practices of nearest neighbor graph techniques so far. However, it is not a good idea to implement HNSW on GPU, which takes advantage on parallel computing, since we need to access the points based on a strict hierarchical order. We will compare the performance of our approach and HNSW with comparable computing resources.

Faiss (Johnson et al., 2019) is a good solution that works on GPU verified by ANN-benchmark (Aumüller et al., 2017). The performance of PQ methods on GPU has been optimized by Faiss. The library also integrates a CPU version of HNSW search. For both algorithms here, the training and indexing step for large dataset takes a long time, which is a crucial drawback for some real-time online data. They suffer from a painful trade-off among training time, recall/precision rate and search speed. In the following section we will provide a novel approach with short preprocessing time, high recall/precision and fast search speed.

3. Compress Vectors with XOR-friendly Binary Quantization

Given two floating-point vectors and with , the cosine similarity of them is

It requires N multiplications and (N-1) additions, resulting in intensive computational complexity though these operations can be parallelized by SIMD instructions. Besides, for floating-point vectors, the memory bandwidth can also limit the throughput of computations when processing large scale similarity computation. For example, the memory bandwidth of DDR4 2666 is 21.3 GB/s. To solve these two problems, we propose a fast similarity search mechanism that quantizes 32-bit floating-point numbers to low-bit binary numbers and replaces high-complexity multiplications by XOR computations, which has low complexity on computation. We first introduce the relationship between multiplication and XOR operation.

3.1. Multiplication and XOR on simple sets

Consider two sets and . Define multiplication on is simple multiplication (), and multiplication on is XOR operation (). The following proposition reveals the relationship between these two different operations.

Proposition 3.1 ().

and are two isomorphism groups under mapping , where


The proposition can be directly verified by checking enumerate all possible operations. In these two groups, (+1) and 0 are identity elements while (-1) and 1 are inverse elements respectively.

Using 3.1, the multiplication on can be replaced by the XOR computation on with the following formula.


Therefore, if all components in vectors can be represented as a combination of , then the high-complexity multiplication can be replaced by the low-complexity XOR computation.

3.2. XOR-Friendly Binary Quantization

To take advantage of the XOR computation, a special strategy is used to quantize floating-point numbers to binary numbers.

For any real number , we use a mapping to approximate into a subspace of . For simplification we write . has the following representation:

where . Specially we define . Then the value of is decided by the following formula:

Similar with binary representation, this approximation has some good properties. Define as the set of all positive integer, then we have

Proposition 3.2 ().

. And uniformly convergent to :

The proof of this proposition is in Appendix A.

When , can be exactly represented. Instead, if is fixed to a finite number, will be approximately represented. In this case, if we take a fixed (indicating the number of encoding bits), with the vector as a fixed codebook, denote as , then all can then be encoded as


where .

Next we will show that, based on this quantization scheme, the multiplication between floating-point numbers can then be replaced by XOR. Given the encoding bit , the product of can be calculated as


where . By Eq. 2, Eq. 4 can be transformed as

where are the corresponding element in described by proposition 3.1, and is shift logical left instruction (). Then, the multiplications are replaced by fast bitwise operations - XOR and bit shifts. For convenience of description, a new operator is introduced. We define and . Here and are integers represented by binary code. We name this kind of quantization as XOR-Friendly Binary Quantization (XFBQ). We introduce a new operation to denote

Then the result of can be equivalently transformed to the product of


Therefore, the multiplication of floating-point numbers can be optimized by following steps:

  • Given the encoding bit , based on Equation 3 and 1, quantize two operands as and .

  • Calculate the result of using XOR operations.

  • Use Equation 5 to calculate the product of .

3.3. Inner Product of Vectors with XFBQ

Based the above quantization scheme of floating-point numbers, floating-point vectors can be quantized in a XOR-Friendly way using the following scheme. Given two N-dimension vectors and , , with the encoding bit , we have


where and .

We denote with . Using Eq. 5, the result of can be transformed to the inner product by


Furthermore, population count (POPCNT) operations are introduced to improve the performance of the inner product. Since


notice that

In order to utilize POPCNT operations and deduce the total number of instructions for a faster calculation performance, the N-dimension quantized vectors are considered to be reconstructed as -dimension vectors,

Figure 1. Preprocessing of dataset in our method. The work will be done in time of O(BN). Encoding, transform and reconstruct can all be well done parallelly.

The process of the reconstruction is shown in Figure 1. Then Eq. 8 can be represented as


The benefit from this storage scheme can be shown by a example. Suppose and are both 32-d vectors, and we set . Without encoding 32 multiplications are needed. IN XFBQ, both vectors are encoded into 96 bits. Without this scheme we need to take XOR operations separately, which takes long time on sending instructions. If we store the data as and , then all and can be saved in 4-byte integers respectively as all bits in the same integer share the same shifting bits in calculation. Now both and are encoded into 3 integers, and only XOR and POPCNT operations with corresponding shifts are needed. As POPCNT is faster than multiplication, and we decrease the number of instructions, we can see great speedup. Modern processing units also support POPCNT64, which takes the same operation on two 8-byte integers. This instruction will further improve the performance. Based on this storage scheme, XFBQ works better on higher-dimensional vectors.

To sum all above up, calculating the inner product of two floating-point vectors can be optimized by following steps:

  • Given the encoding bit , based on Equation 6 and 9, quantize and reconstruct the vectors as and .

  • Based on Equation 10, calculate the result of .

  • Based on Equation 7 and 10, obtain the inner product of .

3.4. Complexity Analysis

Given two N-dimension vectors and , here N is multiple of 64 for convenience of description, based on the above quantization scheme, these vectors can be quantized as and and then perform bitwise operations instead of multiplications for the inner product. As shown in Table 1, the calculation of the inner product can be significantly optimized by the proposed scheme.

Calculation Manipulations (times) Memory (bytes)

N multiplications and (N-1) additions

2N * size of(float)

XOR and POPCNT, bitwise shifts and additions

* size of(uint64)

Table 1. Analysis of computational complexity

3.5. Error Analysis of Inner Product Calculation

Using the quantization method we provided above, a 4-byte floating-point number can be easily transformed into and stored in 3 or 4 bits, saying that we can store 8-11 times data as storing the original data. We admit there is a huge loss on precision, but using the following tricks, the loss on the final precision of similarities will be acceptable.

Suppose are two vectors with same dimension. and are approximations of them. Then Similarities is calculated by


XFBQ can be seen as an approximate method. Using this formula, the similarity error introduced by quantization can be split into two parts: the length error and the angle error.

3.5.1. Length Error

For each component of a vector, it has been approximated to the nearest point that can be represented in a few bits, which introduce a length error in each dimension. As the error accumulates, the length of the quantized vector we stored should be different from the real value. That is the length error. Fixing the number of bits we use for one component, this error can be large when most of the components are small, which is just the case when vector has a lot of components. Good news is, in many fields like speech recognition and NLP, the data vectors are in high dimension, and we can expect that the vectors are usually small in every dimension with high probability. On the popular dataset of embedding vectors like GloVe

(Pennington et al., 2014)

, all tokens are no larger than 0.5 after vector normalization. Tokens are even smaller for many fields needing nearest neighbor search such as fingerprints recognition and facial recognition. These findings make the following assumption reasonable:

Assumption 3.1 ().

, ,

Even for the cases that some vectors have large value in specific dimensions, we can take a linear transformation using an orthogonal matrix

(Gong et al., 2012) to make the assumption effective. Generally, will shrink when goes high. We can expect that (But it should be specially calculated for every dataset). Under this assumption, the vectors can be scaled up by before quantizing them. When there is no loss of precision, the inner product will be scaled up by and the similarity results keep their order. Scaling up makes the data more scattered in and can deduce the length error introduced by the quantization. We test our quantization approach on 128-d and 1000-d random data and SIFT1M dataset. The results show that after the quantization, the new vector length is about 5-10% larger than the scaled vector, and the expansion of length on different quantized vectors is about 5% of their average length, making the error tolerable.

3.5.2. Angle Error

As the quantized vectors are no longer in the direction of the original ones, when we use the new vectors to calculate the cosine value in 11, the value is, actually no way the same with the real value. This is the angle error. We may accept it when the change of angle is small as searching for the top nearest neighbors: cosine function has small derivative value when function value is around 1, so the cosine value will not be influenced much when the two vectors are intrinsically point to similar directions. For example, we find that when , the angle between primitive and quantized vectors can be as much as 30 degrees, seems to be unacceptable. Using the scaling up trick proposed for length error, we can decrease the angle error from the quantization as well. The common case for the difference between the angle of and the angle of after scaling up is just 5-10 degrees and less. Experiments on a higher dimension enhance this find. This angle error will finally add about 5% relative error, on the top nearest vectors, when , according to our trials.

In the discussion above, we used a conservative parameter that is no larger than any components of all vectors. In most real-world problem, we can relax this restriction, and make it no larger than 98% of all components, or even less. Experimental results show that a properly expanded can accelerate calculation, holding the error acceptable.

3.6. Control Selection Error by Extra Distance

Methods proposed in Section 3.5 can already help find similar vectors. Next we hope to further reduce the impact of these errors and make the result precise.

First, we show how errors influences the distance we get. 7 tells the relationship between the similarity () and the distance () we calculate. Especially when we quantize query with 4 bits and quantize document with 3 bits, the relationship is

There is a linear relationship between the distance and the similarity. If the error is considered, then the scaled similarity (SS) can be written as

When we fix the query and find the top similar documents, the error on query length is fixed, and only the length error on documents and the angle error is floating. The length error round within 10% of the scale, and the angle error can be assumed to be no larger than for top documents. In total, the fluctuation range of the distance is no larger than 15% of the whole range of the distance.

3.6.1. Extra distance method

Our strategy of improving search accuracy is as following: Given a query, distances from the query to candidate documents will be calculated by the quantization algorithm and then the minimum distance can be found to retrieve top similar documents (Gray part in Figure 2). Besides, the minimum distance can optionally be added with an ”extra distance” to prevent missing good results caused by the error. The documents whose distances are not greater than the minimum distance are recalled (Gray and green part in Figure 2), and then the floating-point similarities between these documents and the query are calculated. Based on these similarities, the top results are finally returned.

Figure 2. Sketch plot about the usage of extra distance. The gray area includes top similar vectors to the query, and the green area includes the following vectors with distances no larger than top distance plus extra distance. Vectors in both areas are sent to refine selection.

3.6.2. Choose the extra distance

In numerical experiments, we fixed one vector (as the query) and randomly generate massive vectors to calculate the pair-wise similarity by both real value and our quantization method. We normalize both similarities respectively so that results for both methods have 0 mean and 1 variance. A sample of the distributions after normalization is shown in Figure

3. Notice that they are close to each other in distribution. This suggests that the boundaries of true top and the calculated top are close after normalization.

Figure 3. A sample of the distributions of exact similarity and quantized similarity after normalization. Practically they have the same distribution, but the components are arranged in totally different ways.

Assume that the entire similarity space is divided into several buckets, then a vector with high similarity may drop down to no larger than a few buckets at a high probability. If we find the calculated top bounds and then consider the vectors in those lower buckets, and sort the original similarities of these vectors, the resulting top will have a high accuracy/recall rate.

As analyzed above, the fluctuation range of the distance is no larger than 15% of the whole range of the distance. This theoretical upper bound can help to determine the upper bound for extra distance. In application with 4 bits * 3 bits quantization, we believe considering 5% to 10% of SS range is enough for high recall rates ().

4. XFBQ based K-NN Search Algorithm

1:function K-Select()
2:      Quantize all document vectors and the query vector (with a scale factor)
5:      Calculate distances between documents and the query
6:      initialize distances
7:     parallel in CUDA for do
9:     end for
11:      Histogram distances and select candidates by the th minimum distance
12:      initialize bins of the histogram
13:     parallel in CUDA for do
14:         Add(, 1)
15:     end for
16:      fix the boundary of top candidates
17:      initialize index array of candidates
18:     parallel in CUDA for do
19:         if  then
20:              Append to
21:         end if
22:     end for
24:      Refine candidates
26:     parallel in CUDA for do
27:          CalcSimilarity()
28:         Append to
29:     end for
30:      SortBySimilarity() sort candidates by similarities
31:     return
32:end function
Algorithm 1 Fast K-NN Search Algorithm in CUDA

As shown in Algorithm 1 , we perform a -NN Search approach as following steps:

  • Quantization. Quantize all document vectors and the query vector.

  • Distance Calculation. Calculate distances between the quantized query vector and all quantized document vectors.

  • Histogram and select. Histogram the distances and find the th minimum distance as mentioned in Section 3.6. Based on this distance, select the candidate documents.

  • Refine. Sort candidate documents based on exact similarities calculated by original floating-point vectors. This step can also be done on CPU with little influence on overall performance , if GPU memory is limited.

4.1. Details of the Selection Algorithm

The details of these steps are described as follows.

Quantization. The quantization of an N-dimension floating-point vector is optimized by the Advanced Vector Extensions (AVX) intrinsic and some ingenious bit operations in CPU. The entire quantization process has two following steps. The first step is to quantize the floating-point vectors by Single Instruction Multiple Data (SIMD) fashion since the quantization process for each dimension is independent. In the second step, the required bits are extracted from the quantized vectors and stored into uint64_t array for efficient access. By bitwise AND with elaborate bit masks and some other bitwise operations, 8-bit quantization can simultaneously be performed.

Distance Calculation. In this step distances between the quantized query vector and all quantized document vectors are calculated on GPU. Before this, all quantized document vectors are reorganized as bundles of size 32 for warps - units of execution in GPU, and rearranged to column-based access pattern. With this structure of quantized document vectors, each warp can access the global memory in a high-efficient way. The calculation of the distance is implemented by the bitwise XOR, CUDA integer intrinsic __popcll and bitwise shift operations as shown in Equation 10. Distances are stored as integers of , as they are equivalent to real inner products under a linear transformation.

Histogram and Select. The top documents that are most similar to the query need to be selected with the information of obtained unordered distances. Traditional sequential method is to maintain a minimum heap and get top K results, but it is not totally friendly for parallel computing. Instead, a two-step parallel approach is used here. First, a histogram of the distances is parallelly built to find the th minimum distance of all vectors. Since the integers stored by previous step take limited unique values, each unique value can take a bin in the histogram, and the process can be efficiently done. In the second step, with the th minimum distance (often we add an extra distance as mentioned in Section 3.6), the candidate top documents are selected from all documents.

Refine. To improve the recall rate, the selected candidate results will be refined by exact similarities calculated from floating-point vectors. This process can be done on either CPU or GPU. For the GPU implementation, considering the feature of warp execution and 128-byte alignment of the L1 cache, vectors are grouped for calculation in a manner similar to loop unrolling. After calculation, candidate documents are sorted by exact similarities to get the top nearest neighbors.

4.2. Complexity Analysis

4.2.1. Time Complexity

In preprocessing step, we quantize all the document vectors from floating numbers into a space of bits on CPU and copy the quantized vectors to GPU. In practice we choose as 3. Time complexity here is .

In searching step, we first calculate the distance between quantized document vectors and the query vector. Benefit from the new representation, we can take the special XOR operation between 64 pairs of tokens and add them up at the same time with bit operations and additions. Here and represent how many bits the document/query vectors are quantized into. Therefore, the time complexity here is still . As and are fixed after preprocessing, this step can be seen as with a much smaller constant compared to brute force approach.

Next in histogram and select step, every distance is contributing to the corresponding bin in histogram on GPU parallelly, and the time cost does not exceed the size of the bin with the most distances – In the worst case , but in average . Here Distance Scale is a number related to scale introduced in Section 3.5. Generally speaking, time here can be ignored compared to distance calculation. Finding the boundary of top candidates takes time and finding top candidate vectors on GPU takes another time. Finally in refine step as candidates are taken, time are needed for calculating the exact similarity and sort them.

In general, searching takes a time complexity of . It can be at least 5 - 6 times faster than brute force way using the same computing resource.

4.2.2. Space Complexity

Beside storing the vector information with space on CPU, we need an extra space to store the quantized vectors on GPU. Distance result can be ignored compared to the vector storage. Detail of the storage has been shown in 3.4. Therefore, with the same space resources on GPU, our approach can deal with 10x data points compared to brute force way when data are all stored on GPU from the very beginning.

5. Experiments Results

In this section, we compare our approach with other popular methods on different public datasets. The baseline approach is a brute-force way of the similarity calculation implemented by ourselves that only taking advantage of the parallelization of GPUs. We also compared with the state-of-the-art method implemented by Faiss, including HNSW, IVF-HNSW on CPU, and IVF Flat, IVF Product Quantization approach on GPU. All programs are run under Ubuntu 16.04 LTS with 20 Intel Core i7-6950X CPU @ 3.00GHz and 1 - 4 Nvidia Titan V GPU. As we did not optimize our program for multi-query requests, we represent results of single-query requests here. The metric we use for comparison is query per second (QPS) and precision at top 1/10/100/1,000. Here we define

We first take experiments on synthetic datasets to choose parameters and that can make the search fast and accurate. When and are no larger than 2, the precision of the result will be poor. takes a balance between speed and accuracy. When , assume sizeof(uint64) = 8 bytes and sizeof(float) = 4 bytes, by Table 1 we know XFBQ based inner product calculation only costs 28% instructions and 9% space compared to the brute force algorithm. When and , 37% instructions and 11% space are cost compared to the brute force algorithm.

We choose two different open datasets for testing. Test purposes are different on these datasets.

Tencent AILab word embedding dataset (Song et al., 2018) has 8,824,330 records for Chinese and English words. Each record is a 200-d vector. On this dataset we aim to test the speed of XFBQ based inner product calculation and the performance of our approach on single GPU. We randomly take 24,330 records in the dataset as queries and keep the others as dictionary. In our approach we choose as a common setting, and based on the range of records. Precision are controlled by setting different extra distance (from 0 to 20). Figure 4 show that our approach are on average 10 times faster than common parallel computing on inner product calculation. Results of the precision of k-NN search are shown in Figure 5. IVF-PQ (Jegou et al., 2010) provides a fast but low-precision search for nearest neighbors. It also costs great time on training. HNSW(Malkov and Yashunin, 2018) also takes thousands of CPU seconds on training, and its efficiency drops down quickly when required increases.

Figure 4. Time cost on calculating inner product with and without XFBQ. XFBQ can save 90% of the time cost.
(a) Results on high precision part
(b) Results of IVF-PQ approach. Different trends for same label shows results of different PQ code length, 8/20/40 bytes from left to right.
Figure 5. Queries per Second as functions of Precision at different top levels on Tencent AI Lab dataset. The IVF-PQ approach run by Faiss is on a separate figure due to low precision it reaches. Our approach based on binary quantization outperform all other state-of-the-art systems when a high precision is required. The efficiency is not influenced heavily when changes.
Method Resource Usage

Preprocessing Time(s)

Our approach 1 CPU + 1 GPU 17
IVF3072FLAT 1 CPU + 1 GPU 18
IVF65536PQ8/20/40 1 CPU + 1 GPU 440 - 800
HNSW20/40/50 20 CPU 480 - 600
Table 2. Preprocessing Time on Tencent AI Lab Dataset

When high accuracy becomes a must, our approach has the highest QPS when using only one GPU. On the whole search process, we have 6 times faster than brute force way with optimized parallel implementation on GPU, and a 1.5x - 2x speed (measuring on same precision) compared to IVF3072 Flat search implemented by Faiss on GPU. Our approach also takes the least preprocessing time, which is similar to IVF Flat approach, as shown in Table 2. The space used by our approach is similar to HNSW. The result above shows that our approach can replace other state-of-the-art methods for k nearest neighbor search asking for high precision and a quick start.

Deep100M is the first 100 million vectors of dataset Deep1B (Babenko and Lempitsky, 2016), which has 1 billion CNN representations for images with dimension 96. We design a test on only 100M records to avoid extreme long training time on those approaches for comparison. The query size is 10,000. We tested HNSW approach with 20 CPU cores. Other approaches are tested with 1 CPU and 1 or 4 GPUs in order to show the performance of our algorithm on multiple devices. In our approach we also choose and , but set to fit the data.

Figure 6. Queries per Second as functions of Precision at different top levels on Deep100M dataset. Precision axis is rescaled with a lower limit of 0.9 to show the line for IVF-PQ method on some of the plot, as they cannot reach a high precision on all tasks. No green points are shown in the plot as that approach cannot reach such a high precision.

The result is shown in Figure 6. The scale of precision axis is different from that in Figure 5(a), as we try to show the performance of the best among all IVF-PQ methods. The improvement on search efficiency shows that our approach fits well for multiple GPUs. When more devices are provided, the QPS increases linearly on our approach. In other words, our approach can be deployed on a distributed system. Our approach can deal with 2.5 billion tokens per gigabyte video memory, so a Nvidia P40 with 24GB memory can deal with 60 billion tokens, which exceeds the size of many real time searching problems. Our approach keeps a stable performance on all requests and is the fastest when precision is over 99%, using 4 GPUs. Compared to HNSW, we also have much lower memory usage.

Method Resource Usage

Preprocessing Time(s)

Our approach 1 CPU + 1 GPU 57
Our approach 1 CPU + 4 GPU 20
FaissIVF65536PQ48 1 CPU + 1 GPU 1424
HNSW32 20 CPU 6025
Table 3. Preprocessing Time on Deep100M Dataset. For our approach, the time is used for encoding and copying to GPU. For others, here only record their time for training and adding index.

Results on 1 GPU in Figure 6 reflects that our approach can still be dragged down by eager needs on computing power. This is the main drawback of it. As our approach only focus on how to calculate the similarity, it can be further improved by combining with other mature nearest neighbor search techniques focusing on reducing the searching area. For example, Locality Sensitive Hashing can be used in preprocessing and the search will then become non-exhaustive, taking away most of the calculation. In this way the searching will be n times faster than current speed when only one over n of the samples are chosen to be considered. Also, our techniques can be deployed on an inverted file system. Compared to IVF-PQ structure, we can provide higher precision with lower preprocessing time. Therefore, our approach has great potential on accelerating search speed.

6. Conclusion

In this work, we have presented a new approach for performing efficient k nearest-neighbor search using cosine similarity metric on GPU. We propose a new binary quantization method (XFBQ) for compressing calculation. This technique is combined with a special -selection method using the calculated distance. Overall our techniques provide significant reductions on pre-searching time cost compared to other popular approximate nearest-neighbor search algorithms and keep a state-of-the-art searching efficiency, especially when high accuracy is needed. Since most of our work is on accelerating similarity calculation, our approach can be further combined with other popular techniques focusing on reducing search space, such as locality sensitive hashing, and inverted file system, to get a even faster speed. As a single high-performance GPU can handle the calculation work for hundreds of millions of vector productions, CPU resources can be liberated for works with higher complexity.


  • A. Andoni and P. Indyk (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pp. 459–468. Cited by: §2.
  • M. Aumüller, E. Bernhardsson, and A. Faithfull (2017) ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications, pp. 34–49. Cited by: §1, §2.
  • A. Babenko and V. Lempitsky (2014) The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence 37 (6), pp. 1247–1260. Cited by: §2.
  • A. Babenko and V. Lempitsky (2016) Efficient indexing of billion-scale datasets of deep descriptors. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 2055–2063. Cited by: §5.
  • M. Bawa, T. Condie, and P. Ganesan (2005) LSH forest: self-tuning indexes for similarity search. In Proceedings of the 14th international conference on World Wide Web, pp. 651–660. Cited by: §2.
  • E. Bernhardsson et al. (2018) Annoy (approximate nearest neighbors oh yeah). Cited by: §2.
  • V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud (2010) K-nearest neighbor search: fast gpu-based implementations and application to high-dimensional feature matching. In 2010 IEEE International Conference on Image Processing, pp. 3757–3760. Cited by: §1, §2.
  • Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik (2013) Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 484–491. Cited by: §2.
  • Y. Gong, S. Kumar, V. Verma, and S. Lazebnik (2012) Angular quantization-based binary codes for fast similarity search. In Advances in neural information processing systems, pp. 1196–1204. Cited by: §2, §3.5.1.
  • Y. Gong, L. Wang, R. Guo, and S. Lazebnik (2014) Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision, pp. 392–407. Cited by: §1.
  • H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §2, §5.
  • H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg (2011) Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861–864. Cited by: §2.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §1, §2, §2.
  • T. Li and M. Ogihara (2004) Content-based music similarity search and emotion detection. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. V–705. Cited by: §1.
  • T. Liu, A. W. Moore, and A. Gray (2006) New algorithms for efficient high-dimensional nonparametric classification.

    Journal of Machine Learning Research

    7 (Jun), pp. 1135–1158.
    Cited by: §2.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §1.
  • Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li (2007) Multi-probe lsh: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases, pp. 950–961. Cited by: §2.
  • Y. A. Malkov and D. A. Yashunin (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §5.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
  • S. M. Omohundro (1989) Five balltree construction algorithms. International Computer Science Institute Berkeley. Cited by: §2.
  • F. Pan, B. Wang, X. Hu, and W. Perrizo (2004) Comprehensive vertical sample-based knn/lsvm classification for gene expression analysis. Journal of Biomedical Informatics 37 (4), pp. 240–248. Cited by: §1.
  • J. Pan and D. Manocha (2011) Fast gpu-based locality sensitive hashing for k-nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems, pp. 211–220. Cited by: §1, §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In

    Empirical Methods in Natural Language Processing (EMNLP)

    pp. 1532–1543. External Links: Link Cited by: §3.5.1.
  • G. Shakhnarovich, P. Viola, and T. Darrell (2003)

    Fast pose estimation with parameter-sensitive hashing

    In null, pp. 750. Cited by: §2.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §1.
  • C. Silpa-Anan and R. Hartley (2008) Optimised kd-trees for fast image descriptor matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
  • Y. Song, S. Shi, J. Li, and H. Zhang (2018) Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 175–180. Cited by: §5.
  • J. Wang, S. Kumar, and S. Chang (2010)

    Semi-supervised hashing for scalable image retrieval

    In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3424–3431. Cited by: §2.
  • H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu (2011) Complementary hashing for approximate nearest neighbor search. In 2011 International Conference on Computer Vision, pp. 1631–1638. Cited by: §2.

Appendix A Proof of Proposition 3.2


First we prove . When , by definition if . Otherwise . As , in either situation . The statement holds.

Now suppose the statement holds when . We have . Then when :

If , by definition we know in , and

Thus . The statement also holds.

If , we can prove the statement holds using the same technique.

Therefore by deduction, for any .

The statement of convergence can be directly verified by the definition of uniform convergence using the result above. ∎