1. Introduction
It is hard to find specific content in a massive resource library. Generally, such content can be transformed into vectors of different lengths using proper embedding algorithms. State-of-the-art examples include, for text data, word2vec (Mikolov et al., 2013), and for image data, convolutional neural networks (Sharif Razavian et al., 2014; Gong et al., 2014). Using these embedding vectors of dozens to thousands of dimensions, the distance from a query to every entry in the database can be calculated and the nearest entries found. The real problem is how to find the most similar content for an arbitrary query in a large database with low delay. This is also the most computationally expensive part of many algorithms across biology (gene classification (Pan et al., 2004)), computer vision (local image feature matching (Lowe, 2004)), speech recognition (content-based music search (Li and Ogihara, 2004)), and many other fields. We hope that both the speed and accuracy of search algorithms can remain high even on large datasets. When working with high-dimensional features, which is often the case in computer vision applications, there is no known exact nearest-neighbor search algorithm with acceptable performance. To obtain a speed improvement, researchers have developed approximate search algorithms (Aumüller et al., 2017). Generally these algorithms can return 80 percent or more of the correct neighbors while being much faster than exact search. When an even higher proportion of correct results is required, for example 90 percent or 98 percent, the speed of most approximate search algorithms drops quickly. Furthermore, some of these algorithms need to train a codebook for indexing before searching, which is also time-consuming.
Since data sizes are becoming extremely large and embedding vectors can be long to retain more information, distance calculation now requires a great number of basic arithmetic operations and comparisons; however, the calculations for different pairs of embedding vectors do not affect each other. In this decade, great advances in GPU computation have been achieved. GPUs are much better suited than CPUs to parallel execution of simple calculations, which matches the nearest-neighbor search problem. An exact search algorithm (Garcia et al., 2010) and some approximate search algorithms (Pan and Manocha, 2011; Johnson et al., 2019) have already been implemented on GPU and obtained great improvements in performance.
In this paper, we propose a fast and accurate algorithm implemented on GPU for the approximate nearest-neighbor search problem. A binary quantization method is proposed to compress floating-point numbers into 3- or 4-bit binary codes without training. Cosine similarity calculations between vectors are simplified to exclusive-OR (XOR) operations on binary codes and further optimized based on the parallel characteristics of GPUs. Based on the quantization method and optimized calculations, cosine similarities of normalized data points can be calculated quickly on GPU even when both the data size and the vector length are large. As no training is needed for quantization, this method is useful in situations where the dataset distribution changes rapidly or only a few queries are made against large datasets.
This paper makes the following contributions:

We propose a new quantization method to encode numbers within a limited range into an arbitrary number of bits (Section 3).

We provide a novel view that transforms multiplication into bitwise XOR operations, and use this transformation to accelerate multiplications over a limited range (Section 3).

We propose a training-free algorithm implementing exhaustive k-NN selection on GPU for large datasets, which is based on cosine similarity and has a series of parameters controlling accuracy and speed (Sections 3 & 4).

We conduct real-data experiments showing that the proposed algorithm achieves state-of-the-art search efficiency and high accuracy on large-scale nearest-neighbor search tasks. The algorithm also scales to multi-GPU configurations (Section 5).
2. Related Work
Generally, k-nearest-neighbor search is to find the top-k most similar vectors among n database vectors for each query vector, given a distance metric. Here each vector has N components, and in this paper we specify the metric as descending cosine similarity, defined by the inner product of two normalized vectors. This section presents a review of previous work in this area.
The brute-force way to solve the problem is to calculate the pairwise distance between the query vector and each candidate vector and use a minimum heap to store the top-k nearest vectors. This costs great computing resources (with time complexity $O(nN)$ per query). Garcia et al. (Garcia et al., 2010) implemented a parallel brute-force algorithm on NVIDIA GPUs using CUDA and cuBLAS, showing that it can be 25x to 62x faster than a highly optimized ANN C++ library on CPU. As this migration cannot really reduce the computational resource consumption, researchers have been providing solutions that calculate approximate nearest neighbors with high precision but much lower time complexity. Most of these techniques aim to reduce the search space. We review the most widely used kNN search techniques, classified into three categories: hashing-based techniques, partitioning trees, and nearest-neighbor graph techniques.
The best-known hashing-based technique might be locality-sensitive hashing (LSH) (Andoni and Indyk, 2006), which uses many hash functions with the property that elements with similar hashes are more likely to be similar. Variants of LSH such as multi-probe LSH (Lv et al., 2007) and LSH Forest (Bawa et al., 2005) help improve the performance of these techniques. As the performance of LSH is highly related to the quality of the hash functions, much work focuses on improving hashing methods (Shakhnarovich et al., 2003; Wang et al., 2010; Xu et al., 2011). Pan and Manocha implemented LSH-based nearest-neighbor search on GPU (Pan and Manocha, 2011), making search much faster. As LSH is highly sensitive to the chosen hash function, we do not compare it with our method in the experiment section. However, combining LSH with our method to further accelerate search is possible, as we mention later.
Partitioning trees are another popular technique for approximate nearest-neighbor search. The KD-tree (Silpa-Anan and Hartley, 2008) is one of the best-known nearest-neighbor algorithms. It is effective on datasets with low dimensionality but performs poorly on datasets with high dimensionality. Gong et al. encoded image matrices into binary values to find similar results and achieved good search recall on image datasets (Gong et al., 2012, 2013); this method does not provide good results for non-classification problems. Other methods such as Annoy (Bernhardsson and others, 2018) and the ball tree (Liu et al., 2006; Omohundro, 1989) use decision trees to reduce searching to a logarithmic-level job. Jégou et al. proposed product quantization (PQ) to provide a short-code representation of vectors (Jegou et al., 2010) and improved search efficiency with the IVFADC algorithm (Jégou et al., 2011). In product quantization, the space is decomposed into a Cartesian product of low-dimensional subspaces, and data points are represented by compact codes computed as quantization indices in these subspaces. A codebook needs to be trained on a training dataset with a distribution similar to the population before indexing; the training phase can take a long time when the training set is large. The compact codes are then efficiently compared to the query points using an asymmetric approximate distance in the search phase. Using an inverted file system, PQ can efficiently search nearest neighbors on high-dimensional datasets. The inverted multi-index (IMI) proposed by Babenko and Lempitsky (Babenko and Lempitsky, 2014), which replaces the standard quantizer in an inverted index with product quantization, obtains a denser subdivision of the search space. These methods are efficient for searching large datasets with high dimensionality and are now used and accelerated by GPU in Facebook's Faiss library (Johnson et al., 2019). As the new approach proposed in this paper is also a vector-quantization-based technique, we will compare our results with the IVFADC version of PQ-based nearest-neighbor search.
Nearest-neighbor graph methods are based on the idea that, when a query arrives, we start by calculating the distance from a random point to the query, then search along the "steepest descent direction" of distance between points along the path and the query. In practice, a graph structure is built in which points are vertices and edges connect each point to its friend points; for each point in the graph, the friend points are likely to be close to it. The query points are used to explore this graph with various strategies in order to get closer to their nearest neighbors. As an optimized graph must be built, these graph methods also have training phases to build the graph and take a long time when the dataset is huge. Malkov and Yashunin proposed an efficient and robust search algorithm using Hierarchical Navigable Small World (HNSW) graphs (Malkov and Yashunin, 2018). HNSW is one of the best practices of nearest-neighbor graph techniques so far. However, it is not a good fit for GPU, whose advantage is parallel computing, since points must be accessed in a strict hierarchical order. We will compare the performance of our approach and HNSW given comparable computing resources.
Faiss (Johnson et al., 2019) is a good GPU solution verified by ANN-Benchmarks (Aumüller et al., 2017). The performance of PQ methods on GPU has been optimized by Faiss; the library also integrates a CPU version of HNSW search. For both algorithms, the training and indexing step for a large dataset takes a long time, which is a crucial drawback for some real-time online data. They suffer from a painful trade-off among training time, recall/precision rate, and search speed. In the following sections we provide a novel approach with short preprocessing time, high recall/precision, and fast search speed.
3. Compress Vectors with XOR-friendly Binary Quantization
Given two floating-point vectors $x$ and $y$ with $\|x\| = \|y\| = 1$, their cosine similarity is

$\cos(x, y) = \sum_{i=1}^{N} x_i y_i.$
It requires N multiplications and (N−1) additions, resulting in intensive computation even though these operations can be parallelized by SIMD instructions. Besides, for floating-point vectors, memory bandwidth can also limit the throughput when processing large-scale similarity computation; for example, the memory bandwidth of DDR4-2666 is 21.3 GB/s. To solve these two problems, we propose a fast similarity search mechanism that quantizes 32-bit floating-point numbers to low-bit binary numbers and replaces high-complexity multiplications with XOR computations, which are computationally cheap. We first introduce the relationship between multiplication and the XOR operation.
3.1. Multiplication and XOR on simple sets
Consider two sets $G_1 = \{+1, -1\}$ and $G_2 = \{0, 1\}$. Define the operation on $G_1$ as ordinary multiplication ($\times$) and the operation on $G_2$ as the XOR operation ($\oplus$). The following proposition reveals the relationship between these two operations.
Proposition 3.1.
$(G_1, \times)$ and $(G_2, \oplus)$ are isomorphic groups under the mapping $f$, where

(1)  $f(a) = \dfrac{1 - a}{2}$, i.e., $f(+1) = 0$ and $f(-1) = 1$.
The proposition can be directly verified by enumerating all possible operations. In these two groups, +1 and 0 are the identity elements, while −1 and 1 are their own inverses, respectively.
Using Proposition 3.1, multiplication on $G_1$ can be replaced by the XOR computation on $G_2$ with the following formula:

(2)  $a \times b = f^{-1}\big(f(a) \oplus f(b)\big) = 1 - 2\big(f(a) \oplus f(b)\big).$
Therefore, if all components of the vectors can be represented as combinations of $\{+1, -1\}$, the high-complexity multiplication can be replaced by the low-complexity XOR computation.
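As a concrete check of Proposition 3.1 and Eq. 2, the following minimal Python sketch (illustrative names of our own) enumerates all four products and verifies that multiplication on {+1, −1} corresponds to XOR on {0, 1} under the mapping f:

```python
def f(a):
    """Map {+1, -1} -> {0, 1}: f(+1) = 0, f(-1) = 1."""
    return (1 - a) // 2

def f_inv(beta):
    """Inverse map {0, 1} -> {+1, -1}."""
    return 1 - 2 * beta

# Enumerate all four products and check Eq. 2: a*b = f_inv(f(a) XOR f(b)).
for a in (+1, -1):
    for b in (+1, -1):
        assert a * b == f_inv(f(a) ^ f(b))
```

Since both groups have only two elements, this enumeration is the whole proof of the isomorphism.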
3.2. XOR-Friendly Binary Quantization
To take advantage of the XOR computation, a special strategy is used to quantize floating-point numbers to binary numbers.
For any real number $x$ in a bounded range, we use a mapping $Q_s$ to approximate $x$ within a subspace of $\mathbb{R}$. For simplicity we write $Q_s(x)$ as $\hat{x}$. $\hat{x}$ has the following representation:

$\hat{x} = \sum_{i=1}^{s} b_i 2^{-i},$

where $b_i \in \{+1, -1\}$. Specially, we define $\operatorname{sign}(0) = +1$. Then the value of $b_i$ is decided by the following formula:

$b_i = \operatorname{sign}\Big(x - \sum_{j=1}^{i-1} b_j 2^{-j}\Big).$
Similar to binary representation, this approximation has some good properties. Define $\mathbb{N}^+$ as the set of all positive integers; then we have

Proposition 3.2.
For every $s \in \mathbb{N}^+$, $|x - Q_s(x)| \le 2^{-s}$, and $Q_s$ converges uniformly to the identity as $s \to \infty$:

$\sup_{x} |x - Q_s(x)| \to 0.$
The proof of this proposition is in Appendix A.
When $s \to \infty$, $x$ can be represented exactly. Instead, if $s$ is fixed to a finite number, $x$ is represented approximately. In this case, if we take a fixed $s$ (indicating the number of encoding bits), with the vector $c = (2^{-1}, 2^{-2}, \ldots, 2^{-s})^\top$ as a fixed codebook, and denote $f(b_i)$ as $\beta_i$, then every $x$ can be encoded as

(3)  $\hat{x} = c^\top b,$

where $b = (b_1, \ldots, b_s)^\top \in \{+1, -1\}^s$.
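To make the encoding concrete, here is a minimal Python sketch of the greedy quantizer described above (the function names are ours, not from the paper); it stores the XOR-ready bits $\beta_i = f(b_i)$ and can reconstruct $\hat{x}$:

```python
def xfbq_encode(x, s):
    """Quantize x (|x| < 1) into s bits beta_i in {0, 1}, where the sign
    bit b_i = 1 - 2*beta_i is chosen greedily on the running remainder."""
    betas, r = [], x
    for i in range(1, s + 1):
        b = 1 if r >= 0 else -1      # sign(0) treated as +1
        betas.append((1 - b) // 2)   # store f(b) in {0, 1}
        r -= b * 2.0 ** -i
    return betas

def xfbq_decode(betas):
    """Reconstruct x_hat = sum_i b_i * 2^-i from the stored bits."""
    return sum((1 - 2 * beta) * 2.0 ** -(i + 1) for i, beta in enumerate(betas))

# Error bound of Proposition 3.2: |x - Q_s(x)| <= 2^-s.
x = 0.3
assert abs(x - xfbq_decode(xfbq_encode(x, 8))) <= 2.0 ** -8
```

Each greedy step halves the reachable interval, which is exactly why the error bound of Proposition 3.2 holds.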
Next we show that, based on this quantization scheme, multiplication between floating-point numbers can be replaced by XOR. Given the encoding bits $s$, the product $\hat{x}\hat{y}$ can be calculated as

(4)  $\hat{x}\hat{y} = \sum_{i=1}^{s} \sum_{j=1}^{s} b_i b'_j \, 2^{-(i+j)},$

where $b_i, b'_j \in \{+1, -1\}$ are the sign bits of $\hat{x}$ and $\hat{y}$. By Eq. 2, Eq. 4 can be transformed into

$\hat{x}\hat{y} = \sum_{i=1}^{s} \sum_{j=1}^{s} \big(1 - 2(\beta_i \oplus \beta'_j)\big)\, 2^{-(i+j)},$

where $\beta_i = f(b_i)$ and $\beta'_j = f(b'_j)$ are the corresponding elements in $G_2$ described by Proposition 3.1, and the weights $2^{-(i+j)}$ can be applied by the shift-logical-left instruction (SHL). Then the multiplications are replaced by fast bitwise operations: XOR and bit shifts. We name this kind of quantization XOR-Friendly Binary Quantization (XFBQ). For convenience of description, a new operator $\otimes$ is introduced to denote

$\hat{x} \otimes \hat{y} = \sum_{i=1}^{s} \sum_{j=1}^{s} (\beta_i \oplus \beta'_j)\, 2^{-(i+j)}.$

Then the product $\hat{x}\hat{y}$ can be equivalently recovered from $\hat{x} \otimes \hat{y}$ by

(5)  $\hat{x}\hat{y} = (1 - 2^{-s})^2 - 2\,(\hat{x} \otimes \hat{y}),$

since $\sum_{i=1}^{s}\sum_{j=1}^{s} 2^{-(i+j)} = (1 - 2^{-s})^2$.
Therefore, the multiplication of floating-point numbers can be optimized by the following steps:

Calculate the result of $\hat{x} \otimes \hat{y}$ using XOR operations.

Use Equation 5 to calculate the product $\hat{x}\hat{y}$.
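The two steps above can be sketched in Python (illustrative code; bit codes are stored as {0, 1} lists as in Eq. 3). Because the transformation is an algebraic identity, the XOR-based result matches the direct product of the quantized values exactly:

```python
def xfbq_product(betas_x, betas_y):
    """x_hat * y_hat via Eq. 5: only XORs on stored bits plus fixed weights."""
    s = len(betas_x)
    # Step 1: x_hat (x) y_hat = sum_ij (beta_i XOR beta'_j) * 2^-(i+j)
    xor_term = sum((bx ^ by) * 2.0 ** -(i + j)
                   for i, bx in enumerate(betas_x, 1)
                   for j, by in enumerate(betas_y, 1))
    # Step 2: recover the product, since sum_ij 2^-(i+j) = (1 - 2^-s)^2
    return (1.0 - 2.0 ** -s) ** 2 - 2.0 * xor_term

def decode(betas):
    """Reconstruct the quantized value for checking."""
    return sum((1 - 2 * b) * 2.0 ** -(i + 1) for i, b in enumerate(betas))

bx, by = [0, 1, 0], [1, 0, 0]          # two 3-bit codes
assert abs(xfbq_product(bx, by) - decode(bx) * decode(by)) < 1e-12
```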
3.3. Inner Product of Vectors with XFBQ
Based on the above quantization scheme for floating-point numbers, floating-point vectors can be quantized in an XOR-friendly way as follows. Given two N-dimension vectors $x$ and $y$, quantized component-wise as $\hat{x}_k = Q_s(x_k)$ and $\hat{y}_k = Q_s(y_k)$ with encoding bits $s$, we have

(6)  $\langle \hat{x}, \hat{y} \rangle = \sum_{k=1}^{N} \hat{x}_k \hat{y}_k,$

where $k = 1, \ldots, N$.
We denote $D(\hat{x}, \hat{y}) = \sum_{k=1}^{N} \hat{x}_k \otimes \hat{y}_k$ and call it the distance. Using Eq. 5, the result of $D(\hat{x}, \hat{y})$ can be transformed into the inner product by

(7)  $\langle \hat{x}, \hat{y} \rangle = N (1 - 2^{-s})^2 - 2\, D(\hat{x}, \hat{y}).$
Furthermore, population-count (POPCNT) operations are introduced to improve the performance of the inner product calculation. Since

(8)  $D(\hat{x}, \hat{y}) = \sum_{i=1}^{s} \sum_{j=1}^{s} 2^{-(i+j)} \sum_{k=1}^{N} (\beta_{k,i} \oplus \beta'_{k,j}),$

notice that the inner sum $\sum_{k=1}^{N} (\beta_{k,i} \oplus \beta'_{k,j})$ is exactly the population count of the XOR of two N-bit strings. In order to utilize POPCNT operations and reduce the total number of instructions for faster calculation, the N-dimension quantized vectors are reorganized as $s$ bit-plane vectors of $N$ bits each,

(9)  $B_i = (\beta_{1,i}, \beta_{2,i}, \ldots, \beta_{N,i}), \quad i = 1, \ldots, s,$

so that

(10)  $D(\hat{x}, \hat{y}) = \sum_{i=1}^{s} \sum_{j=1}^{s} 2^{-(i+j)}\, \mathrm{POPCNT}(B_i \oplus B'_j).$
The benefit of this storage scheme can be shown by an example. Suppose $x$ and $y$ are both 32-d vectors, and we set $s = 3$. Without encoding, 32 multiplications are needed. In XFBQ, each vector is encoded into 96 bits. Without this scheme, we would need to issue the bit-level XOR operations separately, which takes a long time on sending instructions. If we store the data as bit planes $B_1, B_2, B_3$ and $B'_1, B'_2, B'_3$, then each plane can be saved in a 4-byte integer, as all bits in the same integer share the same shifting bits in calculation. Now both $x$ and $y$ are encoded into 3 integers, and only 9 XOR and POPCNT operations with corresponding shifts are needed. As POPCNT is faster than multiplication and we decrease the number of instructions, we see a great speedup. Modern processing units also support POPCNT64, which performs the same operation on 8-byte integers; this instruction further improves performance. Based on this storage scheme, XFBQ works better on higher-dimensional vectors.
To sum all of the above up, calculating the inner product of two floating-point vectors can be optimized by the following steps:

Based on Equation 10, calculate the result of $D(\hat{x}, \hat{y})$ with XOR and POPCNT operations.

Use Equation 7 to recover the inner product from $D(\hat{x}, \hat{y})$.
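A plain-Python sketch of this bit-plane layout (names are ours, and machine words are modeled by Python integers): each plane packs the i-th quantization bit of every component into one integer, so a single XOR plus popcount covers all N components of one plane pair at once:

```python
def pack_planes(component_bits):
    """component_bits[k][i] is bit beta_{k,i} of component k; returns s
    integers, plane i holding bit i of every component in position k."""
    s = len(component_bits[0])
    planes = [0] * s
    for k, bits in enumerate(component_bits):
        for i, beta in enumerate(bits):
            planes[i] |= beta << k
    return planes

def xfbq_inner(planes_x, planes_y, n):
    """Eq. 10 style: sum_k x_hat_k * y_hat_k via XOR + popcount per plane pair."""
    s = len(planes_x)
    total = 0.0
    for i in range(s):
        for j in range(s):
            mismatches = bin(planes_x[i] ^ planes_y[j]).count("1")
            # Each of the n component pairs contributes (1 - 2*(beta XOR beta'))
            # with weight 2^-(i+j+2), since the planes are 0-indexed here.
            total += (n - 2 * mismatches) * 2.0 ** -(i + j + 2)
    return total
```

On real hardware the `bin(...).count("1")` step is a single POPCNT (or POPCNT64) instruction, which is where the speedup over per-bit XORs comes from.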
3.4. Complexity Analysis
Given two N-dimension vectors $x$ and $y$ (here N is a multiple of 64 for convenience of description), based on the above quantization scheme these vectors can be quantized as bit planes, and bitwise operations can then be performed instead of multiplications for the inner product. As shown in Table 1, the calculation of the inner product can be significantly optimized by the proposed scheme.
Calculation | Manipulations (times) | Memory (bytes)
Floating-point inner product | N multiplications and (N−1) additions | 2N × sizeof(float)
XFBQ inner product | $s^2 N/64$ XOR and POPCNT, with bitwise shifts and additions | $2sN/64$ × sizeof(uint64)
3.5. Error Analysis of Inner Product Calculation
Using the quantization method provided above, a 4-byte floating-point number can easily be transformed into and stored in 3 or 4 bits, meaning we can store about 8 to 11 times as much data as when storing the original values. We admit there is a large loss of precision, but with the following tricks the loss in the final precision of similarities is acceptable.
Suppose $x, y$ are two vectors with the same dimension, and $\hat{x}, \hat{y}$ are approximations of them. The similarity is then calculated by

(11)  $\cos(\hat{x}, \hat{y}) = \dfrac{\langle \hat{x}, \hat{y} \rangle}{\|\hat{x}\| \, \|\hat{y}\|}.$
XFBQ can be seen as an approximate method. Using this formula, the similarity error introduced by quantization can be split into two parts: the length error and the angle error.
3.5.1. Length Error
For each component of a vector, the value is approximated to the nearest point representable in a few bits, which introduces an error in each dimension. As these errors accumulate, the length of the stored quantized vector differs from the real value: that is the length error. With the number of bits per component fixed, this error can be relatively large when most components are small, which is exactly the case when a vector has many components. The good news is that in many fields, such as speech recognition and NLP, the data vectors are high-dimensional, and we can expect the vectors to be small in every dimension with high probability. On popular embedding datasets like GloVe (Pennington et al., 2014), all components are no larger than 0.5 after vector normalization; components are even smaller in many fields needing nearest-neighbor search, such as fingerprint recognition and facial recognition. These findings make the following assumption reasonable:
Assumption 3.1.
There exists a constant $t \in (0, 1)$ such that, for every normalized vector $x$ and every dimension $k$,
$|x_k| \le t.$
Even for cases where some vectors have large values in specific dimensions, we can apply a linear transformation using an orthogonal matrix (Gong et al., 2012) to make the assumption effective. Generally, $t$ shrinks as the dimensionality grows; we can expect this, but $t$ should be specially calculated for every dataset. Under this assumption, the vectors can be scaled up by $1/t$ before quantizing them. With no loss of precision, the inner product would be scaled up by $1/t^2$ and the similarity results would keep their order. Scaling up makes the data more scattered over the representable range and reduces the length error introduced by quantization. We tested our quantization approach on 128-d and 1000-d random data and the SIFT1M dataset. The results show that after quantization, the new vector length is about 5–10% larger than the scaled vector, and the spread of length across different quantized vectors is about 5% of their average length, making the error tolerable.

3.5.2. Angle Error
As the quantized vectors no longer point exactly in the direction of the original ones, the cosine value we calculate in Eq. 11 differs from the real value: this is the angle error. We can accept it when the change of angle is small while searching for the top-k nearest neighbors: the cosine function has a small derivative where its value is near 1, so the cosine value is not influenced much when the two vectors intrinsically point in similar directions. For example, we found that without the scaling trick, the angle between the primitive and quantized vectors can be as much as 30 degrees, which seems unacceptable. Using the scaling-up trick proposed for the length error, we can decrease the angle error from quantization as well: the common difference between the angle of the original pair and that of the quantized pair after scaling up is just 5–10 degrees or less. Experiments in higher dimensions reinforce this finding. According to our trials, this angle error finally adds about 5% relative error on the top-k nearest vectors.
In the discussion above, we used a conservative scale parameter $t$ that bounds every component of all vectors. In most real-world problems we can relax this restriction and let $t$ bound only 98% of the components, or even fewer. Experimental results show that a properly relaxed $t$ can accelerate calculation while keeping the error acceptable.
3.6. Control Selection Error by Extra Distance
Methods proposed in Section 3.5 can already help find similar vectors. Next we hope to further reduce the impact of these errors and make the result precise.
First, we show how the errors influence the distance we get. Eq. 7 gives the relationship between the similarity ($S$) and the distance ($D$) we calculate. In particular, when we quantize the query with 4 bits and the documents with 3 bits, the relationship is

$S = N\,(1 - 2^{-4})(1 - 2^{-3}) - 2D.$

There is a linear relationship between the distance and the similarity. If the error is considered, the scaled similarity (SS) can be written as the true similarity plus a length-error term and an angle-error term. When we fix the query and find the top-k similar documents, the error in the query length is fixed, and only the length error of the documents and the angle error fluctuate. The length error stays within about 10% of the scale, and the angle error can be assumed small for the top documents. In total, the fluctuation range of the distance is no larger than 15% of the whole range of the distance.
3.6.1. Extra distance method
Our strategy for improving search accuracy is as follows. Given a query, distances from the query to candidate documents are calculated by the quantization algorithm, and the k-th minimum distance is found to retrieve the top-k similar documents (gray part in Figure 2). Optionally, an "extra distance" can be added to this threshold to prevent missing good results because of the error. The documents whose distances do not exceed the threshold are recalled (gray and green parts in Figure 2), and the floating-point similarities between these documents and the query are calculated. Based on these similarities, the top-k results are finally returned.
3.6.2. Choose the extra distance
In numerical experiments, we fixed one vector (as the query) and randomly generated massive numbers of vectors, calculating the pairwise similarity both with real values and with our quantization method. We normalized both sets of similarities so that the results of both methods have zero mean and unit variance. A sample of the distributions after normalization is shown in Figure 3; notice that they are close to each other in distribution. This suggests that the boundaries of the true top-k and the calculated top-k are close after normalization. Assume that the entire similarity space is divided into several buckets; then a vector with high similarity drops by no more than a few buckets with high probability. If we find the calculated top-k bound, also consider the vectors in those lower buckets, and sort the original similarities of these vectors, the resulting top-k will have a high accuracy/recall rate.
As analyzed above, the fluctuation range of the distance is no larger than 15% of the whole range of the distance. This theoretical upper bound helps determine the upper bound for the extra distance. In applications with 4-bit × 3-bit quantization, we believe considering 5% to 10% of the SS range is enough for high recall rates.
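The recall strategy can be sketched in a few lines of Python (illustrative names of our own): take the k-th smallest quantized distance, widen it by the extra distance, and recall everything under the threshold for exact re-ranking:

```python
def recall_candidates(quant_dists, k, extra):
    """Return indices whose quantized distance is within the k-th minimum
    plus an extra distance; these go on to exact float re-ranking."""
    kth = sorted(quant_dists)[k - 1]
    return [i for i, d in enumerate(quant_dists) if d <= kth + extra]

# With extra = 0, exactly the quantized top-k survive; a larger extra
# trades more exact-distance computations for a higher recall rate.
assert recall_candidates([5, 1, 3, 2, 4], k=2, extra=1) == [1, 2, 3]
```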
4. XFBQ-based kNN Search Algorithm
As shown in Algorithm 1, we perform a k-NN search approach with the following steps:

Quantization. Quantize all document vectors and the query vector.

Distance Calculation. Calculate distances between the quantized query vector and all quantized document vectors.

Histogram and select. Build a histogram of the distances and find the k-th minimum distance as mentioned in Section 3.6. Based on this distance, select the candidate documents.

Refine. Sort candidate documents based on exact similarities calculated from the original floating-point vectors. This step can also be done on CPU with little influence on overall performance, if GPU memory is limited.
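The four steps can be sketched end to end in Python (an illustrative, CPU-only toy: `quantize` and `quant_dist` are placeholders standing in for XFBQ encoding and the XOR/POPCNT distance, and the toy run below uses identity quantization):

```python
import heapq

def knn_search(docs, query, k, extra, quantize, quant_dist):
    # 1) Quantization: compress all documents and the query.
    qdocs = [quantize(v) for v in docs]
    qquery = quantize(query)
    # 2) Distance calculation: cheap quantized distances to every document.
    dists = [quant_dist(qquery, qd) for qd in qdocs]
    # 3) Histogram and select: k-th minimum distance plus an extra distance.
    threshold = sorted(dists)[k - 1] + extra
    candidates = [i for i, d in enumerate(dists) if d <= threshold]
    # 4) Refine: exact float inner product on the candidates only.
    exact = lambda i: sum(a * b for a, b in zip(docs[i], query))
    return heapq.nlargest(k, candidates, key=exact)

# Toy run: identity "quantization", negated inner product as the distance.
docs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
ident = lambda v: v
ndist = lambda q, d: -sum(a * b for a, b in zip(q, d))
assert knn_search(docs, [1.0, 0.0], 2, 0.5, ident, ndist) == [0, 2]
```

The point of the structure is that only the small candidate set ever touches the expensive floating-point similarity.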
4.1. Details of the Selection Algorithm
The details of these steps are described as follows.
Quantization. The quantization of an N-dimension floating-point vector is optimized with Advanced Vector Extensions (AVX) intrinsics and some ingenious bit operations on CPU. The entire quantization process has the following two steps. The first step is to quantize the floating-point vectors in Single Instruction Multiple Data (SIMD) fashion, since the quantization process for each dimension is independent. In the second step, the required bits are extracted from the quantized vectors and stored into a uint64_t array for efficient access. By bitwise AND with elaborate bit masks and some other bitwise operations, the quantization bits of eight components can be extracted simultaneously.
Distance Calculation. In this step, distances between the quantized query vector and all quantized document vectors are calculated on GPU. Beforehand, all quantized document vectors are reorganized into bundles of size 32 for warps (the units of execution on GPU) and rearranged into a column-based access pattern. With this structure, each warp can access global memory in a highly efficient way. The distance calculation is implemented with bitwise XOR, the CUDA integer intrinsic __popcll, and bitwise shift operations, as shown in Equation 10. Distances are stored as scaled integers, as they are equivalent to real inner products under a linear transformation.
Histogram and Select. The top-k documents most similar to the query need to be selected from the unordered distances. The traditional sequential method is to maintain a minimum heap and collect the top-k results, but this is not friendly to parallel computing. Instead, a two-step parallel approach is used here. First, a histogram of the distances is built in parallel to find the k-th minimum distance over all vectors. Since the integers stored in the previous step take a limited number of unique values, each unique value can take a bin in the histogram, and the process can be done efficiently. In the second step, with the k-th minimum distance (often augmented with an extra distance as mentioned in Section 3.6), the candidate top-k documents are selected from all documents.
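Because the quantized distances are small integers, the k-th minimum can be found with a single histogram pass instead of a heap. A serial Python sketch of the idea (the GPU version builds the histogram with parallel atomic adds):

```python
def kth_min_distance(distances, k, num_bins):
    """Bucket integer distances, then scan the prefix counts until k items
    are covered; returns the value of the k-th smallest distance."""
    hist = [0] * num_bins
    for d in distances:
        hist[d] += 1               # on GPU: one atomic add per distance
    covered = 0
    for value, count in enumerate(hist):
        covered += count
        if covered >= k:
            return value
    raise ValueError("k exceeds the number of distances")

assert kth_min_distance([7, 2, 2, 9, 3], k=3, num_bins=16) == 3
```

The scan touches only as many bins as there are distinct distance values, which is why this step is negligible next to the distance calculation.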
Refine. To improve the recall rate, the selected candidate results are refined by exact similarities calculated from the floating-point vectors. This process can be done on either CPU or GPU. For the GPU implementation, considering warp execution and the 128-byte alignment of the L1 cache, vectors are grouped for calculation in a manner similar to loop unrolling. After calculation, candidate documents are sorted by exact similarity to get the top-k nearest neighbors.
4.2. Complexity Analysis
4.2.1. Time Complexity
In the preprocessing step, we quantize all $n$ document vectors from floating-point numbers into $s_d$-bit codes on CPU and copy the quantized vectors to GPU. In practice we choose $s_d = 3$. The time complexity here is $O(nN)$.
In the searching step, we first calculate the distance between the quantized document vectors and the query vector. Benefiting from the new representation, we can take the special XOR operation between 64 pairs of components and add them up at the same time with bit operations and additions. Here $s_q$ and $s_d$ represent how many bits the query and document vectors are quantized into, so the time complexity here is $O(n N s_q s_d / 64)$. As $s_q$ and $s_d$ are fixed after preprocessing, this step can be seen as $O(nN)$ with a much smaller constant compared to the brute-force approach.
Next, in the histogram-and-select step, every distance contributes to its corresponding bin of the histogram on GPU in parallel, and the time cost does not exceed the size of the largest bin: $O(n)$ in the worst case, but much smaller on average. Here the distance scale is a number related to the scale introduced in Section 3.5. Generally speaking, the time here can be ignored compared to the distance calculation. Finding the boundary of the top-k candidates takes time linear in the number of histogram bins, and selecting the candidate vectors on GPU takes another pass over the distances. Finally, in the refine step, if $m$ candidates are taken, $O(mN + m \log m)$ time is needed to calculate the exact similarities and sort them.
In general, searching takes $O(n N s_q s_d / 64 + mN + m \log m)$ time. It can be at least 5–6 times faster than the brute-force way using the same computing resources.
4.2.2. Space Complexity
Besides storing the vector information in floating-point form on CPU, we need extra space of about $n N s_d / 8$ bytes to store the quantized vectors on GPU; the distance results can be ignored compared to the vector storage. Details of the storage were given in Section 3.4. Therefore, with the same space resources on GPU, our approach can deal with roughly 10x the data points compared to the brute-force way when all data are stored on GPU from the very beginning.
5. Experimental Results
In this section, we compare our approach with other popular methods on different public datasets. The baseline approach is a brute-force similarity calculation implemented by ourselves that only takes advantage of GPU parallelization. We also compare with state-of-the-art methods implemented in Faiss, including HNSW and IVF-HNSW on CPU, and the IVF Flat and IVF Product Quantization approaches on GPU. All programs are run under Ubuntu 16.04 LTS with 20 Intel Core i7-6950X CPU cores @ 3.00 GHz and 1–4 Nvidia Titan V GPUs. As we did not optimize our program for multi-query requests, we report results for single-query requests. The metrics we use for comparison are queries per second (QPS) and precision at top 1/10/100/1,000. Here we define precision at top-k as the fraction of the returned top-k results that belong to the true top-k nearest neighbors.
We first run experiments on synthetic datasets to choose parameters $s_q$ and $s_d$ that make the search fast and accurate. When $s_q$ and $s_d$ are no larger than 2, the precision of the result is poor; $s_d = 3$ strikes a balance between speed and accuracy. When $s_q = s_d = 3$, assuming sizeof(uint64) = 8 bytes and sizeof(float) = 4 bytes, by Table 1 XFBQ-based inner product calculation costs only 28% of the instructions and 9% of the space of the brute-force algorithm. When $s_q = 4$ and $s_d = 3$, 37% of the instructions and 11% of the space are needed compared to the brute-force algorithm.
We choose two different open datasets for testing, with different test purposes on each.
The Tencent AI Lab word embedding dataset (Song et al., 2018) has 8,824,330 records for Chinese and English words; each record is a 200-d vector. On this dataset we aim to test the speed of XFBQ-based inner product calculation and the performance of our approach on a single GPU. We randomly take 24,330 records as queries and keep the others as the dictionary. In our approach we choose $s_q = 4$ and $s_d = 3$ as a common setting, and set the scale based on the range of the records. Precision is controlled by setting different extra distances (from 0 to 20). Figure 4 shows that our approach is on average 10 times faster than common parallel computing for inner product calculation. Results for the precision of kNN search are shown in Figure 5. IVFPQ (Jegou et al., 2010) provides a fast but low-precision search for nearest neighbors, and it also costs great time on training. HNSW (Malkov and Yashunin, 2018) also takes thousands of CPU-seconds on training, and its efficiency drops quickly when the required precision increases.
Method | Resource Usage | Preprocessing Time (s)
Our approach | 1 CPU + 1 GPU | 17
IVF3072,Flat | 1 CPU + 1 GPU | 18
IVF65536,PQ8/20/40 | 1 CPU + 1 GPU | 440–800
HNSW20/40/50 | 20 CPU | 480–600
When high accuracy becomes a must, our approach has the highest QPS when using only one GPU. On the whole search process, we are 6 times faster than the brute-force way with an optimized parallel implementation on GPU, and 1.5x–2x the speed (measured at the same precision) of the IVF3072 Flat search implemented by Faiss on GPU. Our approach also takes the least preprocessing time, similar to the IVF Flat approach, as shown in Table 2. The space used by our approach is similar to HNSW. These results show that our approach can replace other state-of-the-art methods for k-nearest-neighbor search when high precision and a quick start are required.
Deep100M is the first 100 million vectors of dataset Deep1B (Babenko and Lempitsky, 2016), which has 1 billion CNN representations for images with dimension 96.
We design a test on only 100M records to avoid extreme long training time on those approaches for comparison.
The query size is 10,000. We tested the HNSW approach with 20 CPU cores; the other approaches are tested with 1 CPU and 1 or 4 GPUs in order to show the performance of our algorithm on multiple devices. In our approach we keep the same common settings, but adjust the parameters to fit the data.
The result is shown in Figure 6. The scale of the precision axis differs from that in Figure 5(a), as we focus on the best-performing of all IVFPQ methods. The improvement in search efficiency shows that our approach fits multiple GPUs well: when more devices are provided, the QPS of our approach increases linearly, so it can also be deployed on a distributed system. Our approach can handle 2.5 billion tokens per gigabyte of video memory, so an Nvidia P40 with 24 GB of memory can handle 60 billion tokens, which exceeds the size of many real-time search problems. Our approach keeps a stable performance on all requests and is the fastest when precision is over 99%, using 4 GPUs. Compared to HNSW, it also has much lower memory usage.
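The near-linear scaling comes from sharding the dictionary across devices and merging the per-shard results on the host. A minimal CPU sketch of this scheme (NumPy standing in for the per-GPU kernels; function names are ours):

```python
import numpy as np

def topk_on_shard(queries, shard, k):
    """Per-device work: inner products against one shard, then local top-k."""
    sims = queries @ shard.T                              # (n_queries, n_shard)
    idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]    # unsorted local top-k
    return idx, np.take_along_axis(sims, idx, axis=1)

def sharded_knn(queries, db, k, n_shards):
    """Split the dictionary, search each shard independently, merge candidates."""
    shards = np.array_split(db, n_shards)
    offsets = np.cumsum([0] + [len(s) for s in shards[:-1]])
    all_idx, all_sims = [], []
    for shard, off in zip(shards, offsets):
        idx, sims = topk_on_shard(queries, shard, k)
        all_idx.append(idx + off)                         # map back to global ids
        all_sims.append(sims)
    idx = np.concatenate(all_idx, axis=1)                 # (n_queries, k * n_shards)
    sims = np.concatenate(all_sims, axis=1)
    best = np.argsort(-sims, axis=1)[:, :k]               # cheap merge on the host
    return np.take_along_axis(idx, best, axis=1)

rng = np.random.default_rng(1)
db = rng.standard_normal((500, 32)).astype(np.float32)
queries = rng.standard_normal((4, 32)).astype(np.float32)
neighbors = sharded_knn(queries, db, k=5, n_shards=3)
```

Each shard's top-k is exact for that shard, so the merged result equals an exhaustive search; only the final k * n_shards-wide merge runs outside the devices, which is why adding GPUs scales QPS nearly linearly.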
Table 3. Resource usage and preprocessing time on Deep100M.

Method               | Resource Usage | Preprocessing Time (s)
---------------------|----------------|-----------------------
Our approach         | 1 CPU + 1 GPU  | 57
Our approach         | 1 CPU + 4 GPU  | 20
Faiss IVF65536-PQ48  | 1 CPU + 1 GPU  | 1424
HNSW-32              | 20 CPU         | 6025
The results on 1 GPU in Figure 6 reflect that our approach can still be held back by its heavy demand for computing power; this is its main drawback. Since our approach focuses only on how to calculate the similarity, it can be further improved by combining it with mature nearest neighbor search techniques that reduce the search area. For example, Locality Sensitive Hashing can be used in preprocessing so that the search becomes non-exhaustive, removing most of the calculation: the search becomes n times faster than the current speed when only one in n of the samples is considered. Our techniques can also be deployed on an inverted file system; compared to the IVFPQ structure, we can provide higher precision with lower preprocessing time. Therefore, our approach has great potential for accelerating search speed.
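The proposed LSH combination is not implemented in this work, but the idea can be sketched as follows: hash every dictionary vector once during preprocessing, then score only the bucket the query falls into. The hyperplane count and names below are our own choices for illustration.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, N_PLANES = 96, 8
planes = rng.standard_normal((N_PLANES, DIM))    # random hyperplanes

def lsh_key(v):
    """Sign pattern of v against the hyperplanes, packed into one integer key."""
    bits = (planes @ v) >= 0
    return int(np.packbits(bits)[0])

# Preprocessing: bucket the whole dictionary once.
db = rng.standard_normal((10000, DIM)).astype(np.float32)
buckets = defaultdict(list)
for i, v in enumerate(db):
    buckets[lsh_key(v)].append(i)

def search(q, k):
    """Non-exhaustive search: score only the query's own bucket."""
    cand = buckets.get(lsh_key(q), [])
    if not cand:
        return []
    sims = db[cand] @ q                          # exact scoring on candidates only
    order = np.argsort(-sims)[:min(k, len(cand))]
    return [cand[j] for j in order]
```

With 8 hyperplanes there are up to 256 buckets, so each query scores roughly 1/256 of the dictionary, at the cost of missing neighbors that land in adjacent buckets (multi-probe LSH, cited above, mitigates exactly this).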
6. Conclusion
In this work, we have presented a new approach for performing efficient k nearest-neighbor search with the cosine similarity metric on GPU. We propose a new binary quantization method (XFBQ) for compressing the distance calculation and combine it with a selection method based on the calculated distances. Overall, our techniques significantly reduce the pre-search time cost compared to other popular approximate nearest-neighbor search algorithms while keeping state-of-the-art search efficiency, especially when high accuracy is needed. Since most of our work accelerates the similarity calculation itself, our approach can be further combined with popular techniques that reduce the search space, such as locality sensitive hashing and inverted file systems, to reach an even faster speed. As a single high-performance GPU can handle the calculation work for hundreds of millions of vector products, CPU resources can be freed for work of higher complexity.
References
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 459–468.
- ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications, pp. 34–49.
- The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (6), pp. 1247–1260.
- Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063.
- LSH forest: self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web, pp. 651–660.
- Annoy (approximate nearest neighbors oh yeah).
- K-nearest neighbor search: fast GPU-based implementations and application to high-dimensional feature matching. In 2010 IEEE International Conference on Image Processing, pp. 3757–3760.
- Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 484–491.
- Angular quantization-based binary codes for fast similarity search. In Advances in Neural Information Processing Systems, pp. 1196–1204.
- Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pp. 392–407.
- Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
- Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861–864.
- Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
- Content-based music similarity search and emotion detection. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. V–705.
- New algorithms for efficient high-dimensional non-parametric classification. Journal of Machine Learning Research 7 (Jun), pp. 1135–1158.
- Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
- Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 950–961.
- Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Five ball-tree construction algorithms. International Computer Science Institute, Berkeley.
- Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. Journal of Biomedical Informatics 37 (4), pp. 240–248.
- Fast GPU-based locality sensitive hashing for k-nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 211–220.
- GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Fast pose estimation with parameter-sensitive hashing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 750.
- CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813.
- Optimised kd-trees for fast image descriptor matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 175–180.
- Semi-supervised hashing for scalable image retrieval. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3424–3431.
- Complementary hashing for approximate nearest neighbor search. In 2011 International Conference on Computer Vision, pp. 1631–1638.
Appendix A Proof of Proposition 3.2
Proof.
First we prove . When , by definition if . Otherwise . As , in either situation . The statement holds.
Now suppose the statement holds when . We have . Then when :
If , by definition we know in , and
Thus . The statement also holds.
If , we can prove the statement holds using the same technique.
Therefore, by induction, the statement holds for any .
The statement of convergence can be directly verified by the definition of uniform convergence using the result above. ∎