It is hard to find specific content in a massive resource library. Generally, such content can be transformed into vectors of various lengths using proper embedding algorithms. State-of-the-art examples include, for text data, word2vec (Mikolov et al., 2013), and for image data, convolutional neural networks (Sharif Razavian et al., 2014; Gong et al., 2014). Using these embedding vectors of dozens to thousands of dimensions, the distance from a query to every entry in the database can be calculated, and the nearest entries can be found. The real problem is how to find the contents most similar to an arbitrary query in a large database with low delay when users request them. This is also the most computationally expensive part of many algorithms in diverse fields such as biology (gene classification (Pan et al., 2004)), computer vision (local image feature matching (Lowe, 2004)), speech recognition (content-based music search (Li and Ogihara, 2004)), and many others.
We hope that both the speed and accuracy of search algorithms can remain high even on large datasets. When working with high-dimensional features, which is often the case in computer vision applications, there is no known exact nearest-neighbor search algorithm with acceptable performance. To obtain a speed improvement, researchers have developed approximate search algorithms (Aumüller et al., 2017). Generally, these algorithms can return 80 percent or more of the correct neighbors while being much faster than exact search. When an even higher proportion of correct results is required, for example 90 percent or 98 percent, the speed of most approximate search algorithms drops quickly. Furthermore, some of these algorithms need to train a codebook for indexing before searching, which is also time-consuming.
Since data sizes are becoming extremely large and embedding vectors can be long in order to keep more information, distance calculation now requires a great number of basic arithmetic operations and comparisons, but the calculations for different pairs of embedding vectors do not affect each other. In the past decade, GPU computation has developed greatly. GPUs are much better than CPUs at parallel execution of simple calculations, which matches the nature of the nearest-neighbor search problem. An exact search algorithm (Garcia et al., 2010) and some approximate search algorithms (Pan and Manocha, 2011; Johnson et al., 2019) have already been implemented on GPU and achieved great performance improvements.
In this paper, we propose a fast and accurate GPU algorithm for the approximate nearest-neighbor search problem. A binary quantization method is proposed to compress floating-point numbers into 3- or 4-bit binary codes without training. Cosine similarity calculations of vectors are simplified to exclusive OR (XOR) operations on binary codes and further optimized based on the parallel characteristics of GPUs. Based on the quantization method and the optimized calculations, cosine similarities of normalized data points can be computed quickly on GPU even when both the data size and the vector length are large. As no training is needed for quantization, this method is useful in situations where the dataset distribution changes rapidly or only a few queries are issued against a large dataset.
This paper makes the following contributions:
We propose a new quantization method to encode numbers within a bounded range into an arbitrary number of bits (Section 3).
We provide a novel view that transforms multiplication into bitwise XOR operations, and use this transformation to accelerate multiplications within a limited range (Section 3).
We propose a training-free algorithm to perform exhaustive kNN selection on GPU for large datasets, which is based on cosine similarity and has a series of parameters controlling accuracy and speed (Section 3 & 4).
We conduct real-data experiments showing that the proposed algorithm achieves state-of-the-art search efficiency and high accuracy on large-scale nearest-neighbor search tasks. The algorithm also scales to multi-GPU configurations (Section 5).
2. Related Work
Generally, K-nearest neighbor search is to find the top $k$ most similar vectors among the $n$ vectors in the database for each query vector, given a distance metric. Here each vector has $N$ components, and in this paper we specify the metric as descending cosine similarity, defined by the inner product of two normalized vectors. This section presents a review of previous work in this area.
The brute-force way to solve the problem is to calculate the pairwise distance between the query vector and each candidate vector and use a minimum heap to store the top $k$ nearest vectors. This costs great computing resources (with time complexity $O(nN + n\log k)$). Garcia et al. (Garcia et al., 2010) implemented a parallel brute-force algorithm on NVIDIA GPUs using CUDA and CUBLAS, showing that it can be 25x to 62x faster than the highly optimized ANN C++ library on CPU. As this migration cannot truly reduce the computational cost, researchers have been providing solutions that calculate approximate nearest neighbors with high precision but much lower time complexity. Most of these techniques target reducing the search space. We review the most widely used K-NN search techniques, classified into three categories: hashing-based techniques, partitioning trees, and nearest-neighbor graph techniques.
The best-known hashing-based technique might be locality-sensitive hashing (LSH) (Andoni and Indyk, 2006), which uses many hash functions with the property that similar elements are more likely to receive similar hashes. Variants of LSH such as multi-probe LSH (Lv et al., 2007) and LSH Forest (Bawa et al., 2005) help improve the performance of these techniques. As the performance of LSH is highly related to the quality of the hash functions, much work focuses on improving hashing methods (Shakhnarovich et al., 2003; Wang et al., 2010; Xu et al., 2011). Pan implemented LSH-based nearest-neighbor search on GPU (Pan and Manocha, 2011), making searching much faster. As LSH is highly sensitive to the chosen hash function, we do not compare it with our method in the experiment section. However, combining LSH with our method to further accelerate the search is possible, as we will mention later.
Partitioning trees are also a popular technique for approximate nearest-neighbor search. The KD tree (Silpa-Anan and Hartley, 2008) is one of the best-known nearest-neighbor algorithms. It is effective on datasets with low dimensionality but performs poorly on datasets with high dimensionality. Gong encoded image matrices into binary values to find similar results and achieved good search recall on image datasets (Gong et al., 2012, 2013). This method does not provide good results for non-classification problems. Other methods like annoy (Bernhardsson and others, 2018) and ball tree (Liu et al., 2006; Omohundro, 1989) use tree structures to reduce searching to an $O(\log n)$-level job.
Herve Jegou proposed product quantization (PQ) to provide a short-code representation of vectors (Jegou et al., 2010) and improved search efficiency with the IVFADC algorithm (Jégou et al., 2011). In product quantization, the space is decomposed into a Cartesian product of low-dimensional subspaces, and data points are represented by compact codes computed as quantization indices in these subspaces. A codebook needs to be trained before indexing, on a training dataset whose distribution is similar to the population. The training phase can take a long time when the training set is large. The compact codes are then efficiently compared to the query points using an asymmetric approximate distance in the search phase. Using an inverted file system, PQ can efficiently search nearest neighbors on high-dimensional datasets. The inverted multi-index (IMI) proposed by Babenko and Lempitsky (Babenko and Lempitsky, 2014), which replaces the standard quantizer in an inverted index with product quantization, obtains a denser subdivision of the search space. These methods are efficient on large datasets with high dimensionality and are now used and accelerated by GPU in Facebook's Faiss library (Johnson et al., 2019). As the new approach proposed in this paper is also a vector-quantization-based technique, we will compare our results with the IVFADC version of PQ-based nearest-neighbor search.
Nearest-neighbor graph methods are based on the idea that, when a query arrives, we start by calculating the distance from a random point to the query and then search along the "steepest descent direction" of the distance between points along the path and the query. In practice, a graph structure is built in which points are vertices and edges connect each point to its friend points. For each point in the graph, the friend points are likely to be close to it. The query points are used to explore this graph with various strategies in order to get closer to their nearest neighbors. As an optimized graph must be built, these graph methods also have a training phase to build the graph, which takes a long time when the dataset is huge. Malkov proposed an efficient and robust search algorithm using Hierarchical Navigable Small World (HNSW) graphs (Malkov and Yashunin, 2018). HNSW is one of the best practices of nearest-neighbor graph techniques so far. However, it is not a good idea to implement HNSW on GPU, which excels at parallel computing, since the points must be accessed in a strict hierarchical order. We will compare the performance of our approach and HNSW under comparable computing resources.
Faiss (Johnson et al., 2019) is a good GPU solution, as verified by ANN-benchmark (Aumüller et al., 2017). The performance of PQ methods on GPU has been optimized by Faiss. The library also integrates a CPU version of HNSW search. For both algorithms, the training and indexing step for a large dataset takes a long time, which is a crucial drawback for some real-time online data. They suffer from a painful trade-off among training time, recall/precision rate, and search speed. In the following sections we provide a novel approach with short preprocessing time, high recall/precision, and fast search speed.
3. Compress Vectors with XOR-friendly Binary Quantization
Given two floating-point vectors $x$ and $y$ with $\|x\| = \|y\| = 1$, their cosine similarity is
$$\cos(x, y) = x \cdot y = \sum_{k=1}^{N} x_k y_k.$$
It requires N multiplications and (N-1) additions, resulting in intensive computational complexity even though these operations can be parallelized by SIMD instructions. Besides, for floating-point vectors, memory bandwidth can also limit the throughput when processing large-scale similarity computation. For example, the memory bandwidth of DDR4 2666 is 21.3 GB/s. To solve these two problems, we propose a fast similarity search mechanism that quantizes 32-bit floating-point numbers to low-bit binary numbers and replaces high-complexity multiplications with XOR computations, which have low computational complexity. We first introduce the relationship between multiplication and the XOR operation.
3.1. Multiplication and XOR on simple sets
Consider two sets $A = \{+1, -1\}$ and $B = \{0, 1\}$. Define multiplication on $A$ as ordinary multiplication ($\times$) and multiplication on $B$ as the XOR operation ($\oplus$). The following proposition reveals the relationship between these two different operations.
Proposition 3.1.
$(A, \times)$ and $(B, \oplus)$ are isomorphic groups under the mapping $\varphi$, where
$$\varphi(+1) = 0, \qquad \varphi(-1) = 1, \qquad \varphi^{-1}(b) = 1 - 2b.$$
The proposition can be directly verified by enumerating all possible operations. In these two groups, (+1) and 0 are the identity elements, while (-1) and 1 are each their own inverse, respectively.
Using Proposition 3.1, multiplication on $\{+1, -1\}$ can be replaced by the XOR computation on $\{0, 1\}$ with the following formula:
$$a \times b = \varphi^{-1}\big(\varphi(a) \oplus \varphi(b)\big), \qquad a, b \in \{+1, -1\}.$$
Therefore, if all components of the vectors can be represented as combinations of $\pm 1$ terms, then the high-complexity multiplication can be replaced by the low-complexity XOR computation.
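As a quick illustration, the isomorphism can be checked exhaustively in a few lines (a Python sketch; the function names are ours, not from the paper):

```python
def phi(s: int) -> int:
    """Map a sign in {+1, -1} to a bit in {0, 1}."""
    return (1 - s) // 2

def phi_inv(b: int) -> int:
    """Map a bit in {0, 1} back to a sign in {+1, -1}."""
    return 1 - 2 * b

def sign_mul_via_xor(s1: int, s2: int) -> int:
    """Multiply two signs by XOR-ing their bit images."""
    return phi_inv(phi(s1) ^ phi(s2))

# Exhaustive check over all four sign pairs.
for a in (+1, -1):
    for b in (+1, -1):
        assert sign_mul_via_xor(a, b) == a * b
```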
3.2. XOR-Friendly Binary Quantization
To take advantage of the XOR computation, a special strategy is used to quantize floating-point numbers to binary numbers.
For any real number $x \in [-1, 1]$, we use a mapping $Q_n$ to approximate $x$ into a subspace of $[-1, 1]$. For simplicity we write $Q_n(x) = \hat{x}$. $\hat{x}$ has the following representation:
$$\hat{x} = \sum_{i=1}^{n} \frac{d_i}{2^i},$$
where $d_i \in \{+1, -1\}$. Specially we define $\operatorname{sign}(0) = +1$. Then the value of $d_i$ is decided by the following formula:
$$d_i = \operatorname{sign}\!\left(x - \sum_{j=1}^{i-1} \frac{d_j}{2^j}\right).$$
Similar to binary representation, this approximation has some good properties. Define $\mathbb{N}^+$ as the set of all positive integers; then we have
Proposition 3.2.
For all $x \in [-1, 1]$ and $n \in \mathbb{N}^+$, $|x - Q_n(x)| \le 2^{-n}$. Hence $Q_n$ converges uniformly to the identity on $[-1, 1]$:
$$\lim_{n \to \infty} \sup_{x \in [-1, 1]} |x - Q_n(x)| = 0.$$
The proof of this proposition is in Appendix A.
When $n \to \infty$, $x$ can be exactly represented. Instead, if $n$ is fixed to a finite number, $x$ will be approximately represented. In this case, if we take a fixed $n$ (indicating the number of encoding bits), with the vector $(1/2, 1/4, \ldots, 1/2^n)$ as a fixed codebook, and denote $\varphi(d_i)$ as $b_i$, then any $x$ can be encoded as the $n$-bit string $(b_1, b_2, \ldots, b_n)$.
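A minimal sketch of one quantizer consistent with these properties (the greedy signed-binary expansion stated above; helper names are ours):

```python
import random

def quantize(x: float, n: int) -> list:
    """Greedy signed-binary expansion: x ~ sum(d_i / 2**i) with d_i in {+1, -1}.
    Assumes x in [-1, 1]; returns the sign digits d_1..d_n."""
    digits, residual = [], x
    for i in range(1, n + 1):
        d = 1 if residual >= 0 else -1   # sign(0) = +1 by convention
        digits.append(d)
        residual -= d / 2 ** i
    return digits

def dequantize(digits: list) -> float:
    """Reconstruct the approximation sum(d_i / 2**i)."""
    return sum(d / 2 ** i for i, d in enumerate(digits, start=1))

# Proposition 3.2 in miniature: the error never exceeds 2^-n.
random.seed(0)
for _ in range(1000):
    x, n = random.uniform(-1, 1), random.randint(1, 10)
    assert abs(x - dequantize(quantize(x, n))) <= 2 ** -n + 1e-12
```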
Next we will show that, based on this quantization scheme, the multiplication of floating-point numbers can be replaced by XOR. Given the encoding bit count $n$, the product $\hat{x}\hat{y}$ can be calculated as
$$\hat{x}\hat{y} = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{d_i e_j}{2^{i+j}} = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\varphi^{-1}(b_i \oplus c_j)}{2^{i+j}},$$
where $b_i, c_j$ are the elements of $\{0, 1\}$ corresponding to $d_i, e_j$ as described in Proposition 3.1, and $\ll$ is the shift-logical-left instruction. Then the multiplications are replaced by fast bitwise operations - XOR and bit shifts. For convenience of description, a new operator $\circledast$ is introduced. We define
$$\hat{x} \circledast \hat{y} = \sum_{i=1}^{n}\sum_{j=1}^{n} (b_i \oplus c_j) \ll \big((n-i) + (n-j)\big),$$
where the XOR results are integers represented in binary code. We name this kind of quantization XOR-Friendly Binary Quantization (XFBQ). Since $\varphi^{-1}(b) = 1 - 2b$, the result of $\hat{x} \circledast \hat{y}$ can be equivalently transformed to the product $\hat{x}\hat{y}$ by
$$\hat{x}\hat{y} = \left(1 - 2^{-n}\right)^2 - \frac{\hat{x} \circledast \hat{y}}{2^{2n-1}}.$$
Therefore, the multiplication of two floating-point numbers can be optimized by the following steps:
Calculate the result of $\hat{x} \circledast \hat{y}$ using XOR operations.
Use Equation 5 to recover the product $\hat{x}\hat{y}$.
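These two steps can be sketched for scalars as follows (a hedged Python illustration of the linear relation above; bit lists stand in for hardware registers, and all names are ours):

```python
def quantize_bits(x, n):
    """Greedy signed-binary digits of x in [-1, 1], stored as bits via
    phi: +1 -> 0, -1 -> 1."""
    bits, r = [], x
    for i in range(1, n + 1):
        d = 1 if r >= 0 else -1
        r -= d / 2 ** i
        bits.append((1 - d) // 2)
    return bits

def dequant(bits):
    """Value of the quantized number: sum of phi^-1(b_i) / 2^i."""
    return sum((1 - 2 * b) / 2 ** i for i, b in enumerate(bits, start=1))

def xfbq_scalar_product(bits_x, bits_y):
    """Exact product of two quantized scalars using only XOR and shifts:
    S = sum (b_i XOR c_j) << ((n-i) + (n-j)); product = (1-2^-n)^2 - S/2^(2n-1)."""
    n = len(bits_x)
    S = 0
    for i, b in enumerate(bits_x, start=1):
        for j, c in enumerate(bits_y, start=1):
            S += (b ^ c) << ((n - i) + (n - j))
    return (1 - 2 ** -n) ** 2 - S / 2 ** (2 * n - 1)

# The XOR route reproduces the product of the quantized values exactly.
bx, by = quantize_bits(0.7, 4), quantize_bits(-0.3, 4)
assert abs(xfbq_scalar_product(bx, by) - dequant(bx) * dequant(by)) < 1e-12
```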
3.3. Inner Product of Vectors with XFBQ
Based on the above quantization scheme for floating-point numbers, floating-point vectors can be quantized in an XOR-friendly way using the following scheme. Given two N-dimension vectors $x$ and $y$ with $\|x\| = \|y\| = 1$ and the encoding bit count $n$, we have
$$x \cdot y \approx \hat{x} \cdot \hat{y} = \sum_{k=1}^{N} \hat{x}_k \hat{y}_k,$$
where $\hat{x}_k = Q_n(x_k)$ and $\hat{y}_k = Q_n(y_k)$.
We denote $\hat{x} \circledast \hat{y} = \sum_{k=1}^{N} \hat{x}_k \circledast \hat{y}_k$. Using Eq. 5, the result of $\hat{x} \circledast \hat{y}$ can be transformed to the inner product by
$$\hat{x} \cdot \hat{y} = N\left(1 - 2^{-n}\right)^2 - \frac{\hat{x} \circledast \hat{y}}{2^{2n-1}}.$$
Furthermore, population count (POPCNT) operations are introduced to improve the performance of the inner product. Since all terms sharing the same shift amount $(n-i)+(n-j)$ can be grouped across components, the XOR results between bit position $i$ of $\hat{x}$ and bit position $j$ of $\hat{y}$ over all $N$ components can be packed into machine words and summed with POPCNT instructions.
In order to utilize POPCNT operations and reduce the total number of instructions for faster calculation, the N-dimension quantized vectors are reorganized as $n$ bit-planes of $N$ bits each, i.e., stored as $\lceil nN/64 \rceil$ 64-bit integers.
The benefit of this storage scheme can be shown by an example. Suppose $\hat{x}$ and $\hat{y}$ are both 32-d vectors, and we set $n = 3$. Without encoding, 32 multiplications are needed. In XFBQ, each vector is encoded into 96 bits. Without this scheme we would need to take $32 \times 3 \times 3 = 288$ XOR operations separately, which takes a long time on issuing instructions. If we store the data as bit-planes $B_1, B_2, B_3$ and $C_1, C_2, C_3$, then each bit-plane can be saved in a 4-byte integer, as all bits in the same integer share the same shift amount in the calculation. Now both $\hat{x}$ and $\hat{y}$ are encoded into 3 integers, and only $3 \times 3 = 9$ XOR and POPCNT operations with corresponding shifts are needed. As POPCNT is faster than multiplication, and we decrease the number of instructions, we see a great speedup. Modern processing units also support POPCNT64, which performs the same operation on 8-byte integers. This instruction further improves the performance. Based on this storage scheme, XFBQ works better on higher-dimensional vectors.
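The bit-plane trick can be sketched in Python (arbitrary-precision ints stand in for uint32/uint64 words, and `bin(x).count("1")` stands in for POPCNT; all helper names are ours):

```python
import random

def quantize_bits(x, n):
    """Greedy signed-binary digits of x in [-1, 1], as bits (phi: +1 -> 0, -1 -> 1)."""
    bits, r = [], x
    for i in range(1, n + 1):
        d = 1 if r >= 0 else -1
        r -= d / 2 ** i
        bits.append((1 - d) // 2)
    return bits

def dequant(bits):
    return sum((1 - 2 * b) / 2 ** i for i, b in enumerate(bits, start=1))

def pack_planes(vec, n):
    """Quantize each component of vec (values in [-1, 1]) to n bits and
    pack bit p of every component into integer plane p."""
    planes = [0] * n
    for k, x in enumerate(vec):
        for p, b in enumerate(quantize_bits(x, n)):
            planes[p] |= b << k
    return planes

def xfbq_inner_product(planes_x, planes_y, N):
    """Inner product of two quantized N-d vectors from their bit-planes:
    one XOR + POPCNT per plane pair, then the linear transform of Eq. 5."""
    n = len(planes_x)
    S = 0
    for i in range(n):
        for j in range(n):
            S += bin(planes_x[i] ^ planes_y[j]).count("1") << (2 * n - 2 - i - j)
    return N * (1 - 2 ** -n) ** 2 - S / 2 ** (2 * n - 1)

# Cross-check against the component-wise product of the quantized values.
random.seed(1)
N, n = 32, 3
x = [random.uniform(-0.5, 0.5) for _ in range(N)]
y = [random.uniform(-0.5, 0.5) for _ in range(N)]
exact = sum(dequant(quantize_bits(a, n)) * dequant(quantize_bits(b, n))
            for a, b in zip(x, y))
assert abs(xfbq_inner_product(pack_planes(x, n), pack_planes(y, n), N) - exact) < 1e-9
```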
To sum up, calculating the inner product of two floating-point vectors can be optimized by the following steps:
Quantize both vectors and pack them into bit-planes, then calculate $\hat{x} \circledast \hat{y}$ with XOR, shift, and POPCNT operations.
Based on Equation 10, recover the result of the inner product $\hat{x} \cdot \hat{y}$.
3.4. Complexity Analysis
Given two N-dimension vectors $x$ and $y$ (here N is a multiple of 64 for convenience of description), based on the above quantization scheme, these vectors can be quantized as $\hat{x}$ and $\hat{y}$, and bitwise operations can then be performed instead of multiplications for the inner product. As shown in Table 1, the calculation of the inner product is significantly optimized by the proposed scheme.
|Calculation|Manipulations (times)|Memory (bytes)|
|Floating-point inner product|N multiplications and (N-1) additions|2N * sizeof(float)|
|XFBQ inner product|n^2 * N/64 XOR and POPCNT, bitwise shifts and additions|2 * n * N/64 * sizeof(uint64)|
3.5. Error Analysis of Inner Product Calculation
Using the quantization method provided above, a 4-byte floating-point number can be easily transformed into and stored in 3 or 4 bits, meaning that we can store 8-11 times as much data as with the original representation. We admit there is a large loss of precision, but with the following tricks, the loss in the final precision of similarities is acceptable.
Suppose $x, y$ are two vectors of the same dimension, and $\hat{x}, \hat{y}$ are approximations of them. Then the similarity is calculated by
$$\cos(\hat{x}, \hat{y}) = \frac{\hat{x} \cdot \hat{y}}{\|\hat{x}\|\,\|\hat{y}\|}.$$
XFBQ can be seen as an approximate method. Using this formula, the similarity error introduced by quantization can be split into two parts: the length error and the angle error.
3.5.1. Length Error
For each component of a vector, the value is approximated to the nearest point representable in a few bits, which introduces an error in each dimension. As these errors accumulate, the length of the stored quantized vector differs from the real value. That is the length error. With the number of bits per component fixed, this error can be relatively large when most of the components are small, which is exactly the case when a vector has many components. The good news is that in many fields like speech recognition and NLP, the data vectors are high-dimensional, and we can expect the components to be small in every dimension with high probability. On popular embedding-vector datasets like GloVe (Pennington et al., 2014), all components are no larger than 0.5 after vector normalization. Components are even smaller in many fields needing nearest-neighbor search, such as fingerprint recognition and facial recognition. These findings make the following assumption reasonable:
Assumption 3.1.
3.5.2. Angle Error
As the quantized vectors no longer point exactly in the direction of the original ones, when we use the new vectors to calculate the cosine value in Equation 11, the computed value differs from the real one. This is the angle error. We can accept it when the change of angle is small, as we are searching for the top nearest neighbors: the cosine function has a small derivative where its value is around 1, so the cosine value is not influenced much when the two vectors intrinsically point in similar directions. For example, we find that without scaling, the angle between the primitive and quantized vectors can be as much as 30 degrees, which seems unacceptable. Using the scaling-up trick proposed for the length error, we can decrease the angle error from quantization as well. After scaling up, the angle between the original and quantized vectors typically differs by only 5-10 degrees or less. Experiments in higher dimensions confirm this finding. According to our trials, this angle error finally adds about 5% relative error on the top nearest vectors.
In the discussion above, we used a conservative scale parameter that is no larger than any component of all vectors. In most real-world problems, we can relax this restriction and make it no larger than 98% of all components, or even fewer. Experimental results show that a properly expanded scale can accelerate the calculation while keeping the error acceptable.
3.6. Control Selection Error by Extra Distance
Methods proposed in Section 3.5 can already help find similar vectors. Next we hope to further reduce the impact of these errors and make the result precise.
First, we show how the errors influence the distance we obtain. Equation 7 gives the relationship between the similarity ($\hat{x} \cdot \hat{y}$) and the distance ($D = \hat{x} \circledast \hat{y}$) we calculate. In particular, when we quantize the query with 4 bits and the documents with 3 bits, the relationship is
$$\hat{x} \cdot \hat{y} = N\left(1 - 2^{-4}\right)\left(1 - 2^{-3}\right) - \frac{D}{2^{6}}.$$
There is a linear relationship between the distance and the similarity. If the error is considered, then the scaled similarity (SS) can be written as the sum of the true similarity and the length and angle error terms.
When we fix the query and find the top similar documents, the error in the query length is fixed, and only the length error of the documents and the angle error vary. The length error lies within roughly 10% of the scale, and the angle error can be assumed to be small for the top documents. In total, the fluctuation range of the distance is no larger than 15% of the whole range of the distance.
3.6.1. Extra distance method
Our strategy for improving search accuracy is as follows: given a query, the distances from the query to candidate documents are calculated by the quantization algorithm, and the k-th minimum distance is found to retrieve the top similar documents (gray part in Figure 2). Optionally, this minimum distance can be augmented with an "extra distance" to avoid missing good results because of the error. The documents whose distances are not greater than this threshold are recalled (gray and green parts in Figure 2), and then the floating-point similarities between these documents and the query are calculated. Based on these similarities, the top results are finally returned.
3.6.2. Choose the extra distance
In numerical experiments, we fixed one vector (as the query) and randomly generated massive numbers of vectors to calculate the pairwise similarity both with real values and with our quantization method. We normalized both sets of similarities so that the results of both methods have zero mean and unit variance. A sample of the distributions after normalization is shown in Figure 3. Notice that they are close to each other in distribution. This suggests that the boundaries of the true top results and the calculated top results are close after normalization.
Assume the entire similarity space is divided into several buckets; then a vector with high similarity is unlikely to drop by more than a few buckets. If we find the calculated top bound, additionally consider the vectors in those few lower buckets, and sort the original similarities of these vectors, the resulting top list will have a high accuracy/recall rate.
As analyzed above, the fluctuation range of the distance is no larger than 15% of the whole range of the distance. This theoretical upper bound helps determine the upper bound for the extra distance. In applications with 4-bit * 3-bit quantization, we believe considering 5% to 10% of the SS range is enough for high recall rates.
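As a sketch, the recall step amounts to widening the k-th-smallest cut-off by a slack (pure Python; the function name and toy distances are ours):

```python
import heapq

def candidates_with_extra(distances, k, extra):
    """distances: integer quantized distances (smaller = more similar).
    Returns the indices whose distance is within the k-th smallest plus
    the extra-distance slack; these are then re-ranked exactly."""
    kth = heapq.nsmallest(k, distances)[-1]
    cutoff = kth + extra
    return [i for i, d in enumerate(distances) if d <= cutoff]

dists = [5, 9, 3, 14, 7, 8, 2, 11]
cand = candidates_with_extra(dists, k=3, extra=2)
# Every true top-3 index (2, 6, 0) is guaranteed to be in the candidate set.
assert set(cand) >= {6, 2, 0}
```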
4. XFBQ based K-NN Search Algorithm
As shown in Algorithm 1, our k-NN search approach performs the following steps:
Quantization. Quantize all document vectors and the query vector.
Distance Calculation. Calculate distances between the quantized query vector and all quantized document vectors.
Histogram and select. Build a histogram of the distances and find the k-th minimum distance as mentioned in Section 3.6. Based on this distance, select the candidate documents.
Refine. Sort candidate documents based on exact similarities calculated from the original floating-point vectors. This step can also be done on CPU with little influence on overall performance, if GPU memory is limited.
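The four steps above can be sketched end-to-end on toy data (a pure-Python stand-in for the CPU/GPU pipeline; all names are ours, and the packed bit-plane layout of Section 3.3 is omitted for clarity):

```python
import math, random

def quantize_bits(x, n):
    """Greedy signed-binary digits of x in [-1, 1], as bits (0 for +1, 1 for -1)."""
    bits, r = [], x
    for i in range(1, n + 1):
        d = 1 if r >= 0 else -1
        r -= d / 2 ** i
        bits.append((1 - d) // 2)
    return bits

def qdistance(qbits, dbits):
    """Integer XOR distance between two quantized scalars (larger = less similar)."""
    n = len(qbits)
    return sum((b ^ c) << (2 * n - 2 - i - j)
               for i, b in enumerate(qbits) for j, c in enumerate(dbits))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_search(query, docs, k, n=3, extra=0):
    quantize_vec = lambda v: [quantize_bits(x, n) for x in v]
    qv = quantize_vec(query)                       # 1. quantization
    dist = [sum(qdistance(a, b) for a, b in zip(qv, quantize_vec(d)))
            for d in docs]                         # 2. quantized distances
    cutoff = sorted(dist)[k - 1] + extra           # 3. k-th smallest + extra distance
    cand = [i for i, d in enumerate(dist) if d <= cutoff]
    cand.sort(key=lambda i: -cosine(query, docs[i]))  # 4. exact refine
    return cand[:k]

random.seed(2)
docs = [[random.uniform(-0.5, 0.5) for _ in range(16)] for _ in range(50)]
query = [random.uniform(-0.5, 0.5) for _ in range(16)]
exact_top3 = sorted(range(50), key=lambda i: -cosine(query, docs[i]))[:3]
# With a large enough extra distance, every true neighbor is recalled and
# the refine step returns the exact answer.
assert knn_search(query, docs, 3, extra=10 ** 9) == exact_top3
```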
4.1. Details of the Selection Algorithm
The details of these steps are described as follows.
Quantization. The quantization of an N-dimension floating-point vector is optimized by the Advanced Vector Extensions (AVX) intrinsic and some ingenious bit operations in CPU. The entire quantization process has two following steps. The first step is to quantize the floating-point vectors by Single Instruction Multiple Data (SIMD) fashion since the quantization process for each dimension is independent. In the second step, the required bits are extracted from the quantized vectors and stored into uint64_t array for efficient access. By bitwise AND with elaborate bit masks and some other bitwise operations, 8-bit quantization can simultaneously be performed.
Distance Calculation. In this step, distances between the quantized query vector and all quantized document vectors are calculated on GPU. Beforehand, all quantized document vectors are reorganized into bundles of size 32 for warps - the units of execution on GPU - and rearranged into a column-based access pattern. With this structure of quantized document vectors, each warp can access global memory efficiently. The distance calculation is implemented with bitwise XOR, the CUDA integer intrinsic __popcll, and bitwise shift operations, as shown in Equation 10. Distances are stored as integers, as they are equivalent to the real inner products under a linear transformation.
Histogram and Select. The top documents most similar to the query need to be selected from the unordered distances. The traditional sequential method is to maintain a minimum heap to get the top K results, but it is not friendly to parallel computing. Instead, a two-step parallel approach is used here. First, a histogram of the distances is built in parallel to find the k-th minimum distance over all vectors. Since the integers stored in the previous step take a limited number of unique values, each unique value can take a bin in the histogram, and the process can be done efficiently. In the second step, with the k-th minimum distance (often augmented by an extra distance as mentioned in Section 3.6), the candidate top documents are selected from all documents.
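A serial sketch of the histogram step (on GPU each distance increments its bin atomically in parallel; the function name is ours):

```python
def kth_smallest_by_histogram(distances, k, max_distance):
    """Counting-sort style selection over integer distances in [0, max_distance]:
    one counting pass replaces a heap because the distance values are small ints."""
    bins = [0] * (max_distance + 1)
    for d in distances:          # on GPU, each distance updates its bin in parallel
        bins[d] += 1
    seen = 0
    for value, count in enumerate(bins):  # prefix-sum scan over the bins
        seen += count
        if seen >= k:
            return value
    raise ValueError("k larger than number of distances")

dists = [5, 9, 3, 14, 7, 8, 2, 11]
assert kth_smallest_by_histogram(dists, 3, 15) == 5
```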
Refine. To improve the recall rate, the selected candidate results will be refined by exact similarities calculated from floating-point vectors. This process can be done on either CPU or GPU. For the GPU implementation, considering the feature of warp execution and 128-byte alignment of the L1 cache, vectors are grouped for calculation in a manner similar to loop unrolling. After calculation, candidate documents are sorted by exact similarities to get the top nearest neighbors.
4.2. Complexity Analysis
4.2.1. Time Complexity
In the preprocessing step, we quantize all document vectors from floating-point numbers into $n$-bit codes per component on CPU and copy the quantized vectors to GPU. In practice we choose $n$ as 3. The time complexity here is linear in the total number of stored components.

In the searching step, we first calculate the distances between the quantized document vectors and the query vector. Benefiting from the new representation, we can take the special XOR operation between 64 pairs of components and add them up at the same time with bit operations and additions; here $n_d$ and $n_q$ represent how many bits the document and query vectors are quantized into. Therefore, the time complexity here is still linear in the dataset size times the dimension, but as $n_d$ and $n_q$ are fixed after preprocessing, this step has a much smaller constant than the brute-force approach.

Next, in the histogram-and-select step, every distance contributes to its corresponding bin of the histogram on GPU in parallel, and the time cost does not exceed the size of the bin with the most distances; the number of bins is related to the scale introduced in Section 3.5. Generally speaking, the time here can be ignored compared to the distance calculation. Finding the boundary of the top candidates and selecting the candidate vectors on GPU take additional fast passes over the histogram and the distances. Finally, in the refine step, with the candidates taken, time is needed to calculate the exact similarities and sort them.

In general, searching remains linear in the dataset size times the vector dimension, but it is at least 5-6 times faster than the brute-force way using the same computing resources.
4.2.2. Space Complexity
Besides storing the vector information on CPU, we need extra space on GPU to store the quantized vectors, which take $n$ bits per component instead of 32. The distance results can be ignored compared to the vector storage. Details of the storage have been shown in Section 3.4. Therefore, with the same space resources on GPU, our approach can deal with about 10x as many data points as the brute-force way when all data are stored on GPU from the very beginning.
5. Experiments Results
In this section, we compare our approach with other popular methods on different public datasets. The baseline approach is a brute-force similarity calculation implemented by ourselves that only takes advantage of GPU parallelization. We also compare with state-of-the-art methods implemented in Faiss, including HNSW and IVF-HNSW on CPU, and IVF Flat and IVF Product Quantization on GPU. All programs are run under Ubuntu 16.04 LTS with 20 Intel Core i7-6950X CPU @ 3.00GHz and 1 - 4 Nvidia Titan V GPUs. As we did not optimize our program for multi-query requests, we present results for single-query requests. The metrics we use for comparison are queries per second (QPS) and precision at top 1/10/100/1,000. Here we define precision at top k as the fraction of the k returned results that belong to the true top k nearest neighbors.
We first run experiments on synthetic datasets to choose parameters $n_q$ and $n_d$ that make the search fast and accurate. When $n_q$ and $n_d$ are no larger than 2, the precision of the result is poor. Setting them around 3 or 4 strikes a balance between speed and accuracy. When $n_q = n_d = 3$, assuming sizeof(uint64) = 8 bytes and sizeof(float) = 4 bytes, by Table 1 we know XFBQ-based inner product calculation only costs 28% of the instructions and 9% of the space compared to the brute-force algorithm. When $n_q = 4$ and $n_d = 3$, 37% of the instructions and 11% of the space are needed.
We choose two different open datasets for testing; the test purposes differ between them.
The Tencent AILab word embedding dataset (Song et al., 2018) has 8,824,330 records for Chinese and English words. Each record is a 200-d vector. On this dataset we aim to test the speed of XFBQ-based inner product calculation and the performance of our approach on a single GPU. We randomly take 24,330 records from the dataset as queries and keep the others as the dictionary. In our approach we choose $n_q = 4$ and $n_d = 3$ as a common setting, and set the scale based on the range of the records. Precision is controlled by setting different extra distances (from 0 to 20). Figure 4 shows that our approach is on average 10 times faster than common parallel computing on inner product calculation. Results for the precision of k-NN search are shown in Figure 5. IVF-PQ (Jegou et al., 2010) provides a fast but low-precision search for nearest neighbors; it also costs great time on training. HNSW (Malkov and Yashunin, 2018) also takes thousands of CPU seconds on training, and its efficiency drops quickly when the required precision increases.
|Method|Hardware|Preprocessing time (s)|
|Our approach|1 CPU + 1 GPU|17|
|IVF3072FLAT|1 CPU + 1 GPU|18|
|IVF65536PQ8/20/40|1 CPU + 1 GPU|440 - 800|
|HNSW20/40/50|20 CPU|480 - 600|
When high accuracy becomes a must, our approach has the highest QPS using only one GPU. On the whole search process, we are 6 times faster than the brute-force way with an optimized parallel implementation on GPU, and 1.5x - 2x faster (measured at the same precision) than the IVF3072 Flat search implemented by Faiss on GPU. Our approach also takes the least preprocessing time, similar to the IVF Flat approach, as shown in Table 2. The space used by our approach is similar to HNSW. These results show that our approach can replace other state-of-the-art methods for k-nearest-neighbor search requiring high precision and a quick start.
Deep100M is the first 100 million vectors of the dataset Deep1B (Babenko and Lempitsky, 2016), which has 1 billion CNN representations of images with dimension 96. We design a test on only 100M records to avoid extremely long training times for the compared approaches. The query size is 10,000. We tested the HNSW approach with 20 CPU cores. Other approaches are tested with 1 CPU and 1 or 4 GPUs in order to show the performance of our algorithm on multiple devices. In our approach we again choose $n_q = 4$ and $n_d = 3$, but set the scale to fit the data.
The result is shown in Figure 6. The scale of the precision axis differs from that in Figure 5(a), as we try to show the performance of the best among all IVF-PQ methods. The improvement in search efficiency shows that our approach fits multiple GPUs well. When more devices are provided, the QPS of our approach increases linearly. In other words, our approach can be deployed on a distributed system. Our approach can deal with 2.5 billion tokens per gigabyte of video memory, so an Nvidia P40 with 24GB memory can handle 60 billion tokens, which exceeds the size of many real-time search problems. Our approach keeps a stable performance on all requests and is the fastest when the precision requirement is over 99%, using 4 GPUs. Compared to HNSW, we also have much lower memory usage.
|Method|Devices|Preprocessing time|
|---|---|---|
|Our approach|1 CPU + 1 GPU|57|
|Our approach|1 CPU + 4 GPU|20|
|FaissIVF65536PQ48|1 CPU + 1 GPU|1424|
The results on 1 GPU in Figure 6 reflect that our approach can still be limited by its heavy demand for computing power, which is its main drawback. Since our approach focuses only on how the similarity is calculated, it can be further improved by combining it with mature nearest-neighbor search techniques that reduce the search area. For example, Locality-Sensitive Hashing can be used in preprocessing so that the search becomes non-exhaustive, eliminating most of the calculation: if only one in n samples is considered, the search becomes n times faster than the current speed. Our techniques can also be deployed on an inverted file system; compared to the IVF-PQ structure, we can provide higher precision with lower preprocessing time. Therefore, our approach has great potential for accelerating search.
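The LSH pruning idea mentioned above can be sketched with generic random-hyperplane hashing (this is not the paper's implementation, only an illustration of how bucketing makes the scan non-exhaustive):

```python
import numpy as np

def build_lsh_index(data, n_bits=8, seed=0):
    """Random-hyperplane LSH: hash every vector to an n_bits signature;
    vectors with the same signature share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, data.shape[1]))
    powers = 1 << np.arange(n_bits)
    sigs = (data @ planes.T > 0) @ powers  # pack sign bits into an int
    buckets = {}
    for i, s in enumerate(sigs):
        buckets.setdefault(int(s), []).append(i)
    return planes, buckets

def lsh_search(query, data, planes, buckets, k=5):
    """Exhaustive cosine scoring restricted to the query's bucket,
    so only a fraction of the dataset is ever touched."""
    powers = 1 << np.arange(planes.shape[0])
    sig = int((query @ planes.T > 0) @ powers)
    cand = np.asarray(buckets.get(sig, []), dtype=int)
    if cand.size == 0:
        return cand
    norms = np.linalg.norm(data[cand], axis=1) * np.linalg.norm(query)
    sims = (data[cand] @ query) / (norms + 1e-12)
    return cand[np.argsort(-sims)[:k]]
```

With n_bits hash bits the dataset splits into up to 2^n_bits buckets, so each query scores only the bucket it falls into — the "one over n of the samples" regime described above, at the cost of possibly missing neighbors that hash elsewhere.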
In this work, we have presented a new approach for efficient k-nearest-neighbor search with the cosine similarity metric on GPU. We propose a new binary quantization method (XFBQ) to compress the calculation, combined with a special k-selection method that operates on the calculated distances. Overall, our techniques significantly reduce pre-search time compared to other popular approximate nearest-neighbor search algorithms while maintaining state-of-the-art search efficiency, especially when high accuracy is needed. Since most of our work accelerates the similarity calculation itself, our approach can be further combined with popular techniques that reduce the search space, such as locality-sensitive hashing and inverted file systems, for even greater speed. As a single high-performance GPU can handle the calculation for hundreds of millions of vector products, CPU resources can be freed for work of higher complexity.
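XFBQ itself is defined in the earlier sections of the paper; as a generic illustration of the underlying principle — binary codes whose Hamming distance tracks the angle between vectors, so cosine similarity can be estimated from cheap bit operations — here is a standard sign-quantization sketch (not the paper's method):

```python
import numpy as np

def binarize(x, planes):
    """Sign quantization: one bit per random hyperplane."""
    return x @ planes.T > 0

def approx_cosine(code_a, code_b):
    """Estimate cos(angle) from Hamming distance: for random
    hyperplanes, P(bit differs) = angle / pi."""
    n_bits = code_a.size
    hamming = np.count_nonzero(code_a != code_b)
    return float(np.cos(np.pi * hamming / n_bits))

rng = np.random.default_rng(0)
planes = rng.standard_normal((1024, 96))  # 1024 bits per 96-d vector
a, b = rng.standard_normal((2, 96))
est = approx_cosine(binarize(a, planes), binarize(b, planes))
true = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The estimate sharpens as the number of bits grows, and the Hamming distances themselves reduce to popcount instructions — exactly the kind of simple, massively parallel arithmetic that GPUs execute well.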
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pp. 459–468. Cited by: §2.
- ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications, pp. 34–49. Cited by: §1, §2.
- The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence 37 (6), pp. 1247–1260. Cited by: §2.
- Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063. Cited by: §5.
- LSH forest: self-tuning indexes for similarity search. In Proceedings of the 14th international conference on World Wide Web, pp. 651–660. Cited by: §2.
- Annoy (approximate nearest neighbors oh yeah). Cited by: §2.
- K-nearest neighbor search: fast gpu-based implementations and application to high-dimensional feature matching. In 2010 IEEE International Conference on Image Processing, pp. 3757–3760. Cited by: §1, §2.
- Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 484–491. Cited by: §2.
- Angular quantization-based binary codes for fast similarity search. In Advances in neural information processing systems, pp. 1196–1204. Cited by: §2, §3.5.1.
- Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision, pp. 392–407. Cited by: §1.
- Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §2, §5.
- Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861–864. Cited by: §2.
- Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §1, §2, §2.
- Content-based music similarity search and emotion detection. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. V–705. Cited by: §1.
- New algorithms for efficient high-dimensional nonparametric classification. Journal of Machine Learning Research 7 (Jun), pp. 1135–1158. Cited by: §2.
- Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §1.
- Multi-probe lsh: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases, pp. 950–961. Cited by: §2.
- Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §5.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1.
- Five balltree construction algorithms. International Computer Science Institute Berkeley. Cited by: §2.
- Comprehensive vertical sample-based knn/lsvm classification for gene expression analysis. Journal of Biomedical Informatics 37 (4), pp. 240–248. Cited by: §1.
- Fast gpu-based locality sensitive hashing for k-nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems, pp. 211–220. Cited by: §1, §2.
- GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §3.5.1.
- Fast pose estimation with parameter-sensitive hashing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 750. Cited by: §2.
- CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §1.
- Optimised kd-trees for fast image descriptor matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
- Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 175–180. Cited by: §5.
- Semi-supervised hashing for scalable image retrieval. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3424–3431. Cited by: §2.
- Complementary hashing for approximate nearest neighbor search. In 2011 International Conference on Computer Vision, pp. 1631–1638. Cited by: §2.
Appendix A Proof of Proposition 3.2
First we prove . When , by definition if . Otherwise . As , in either situation . The statement holds.
Now suppose the statement holds when . We have . Then when :
If , by definition we know in , and
Thus . The statement also holds.
If , we can prove the statement holds using the same technique.
Therefore by induction, the statement holds for any .
The statement of convergence can be directly verified by the definition of uniform convergence using the result above. ∎