Fast Neighborhood Graph Search using Cartesian Concatenation

12/11/2013 ∙ by Jingdong Wang, et al. ∙ Microsoft ∙ Peking University

In this paper, we propose a new data structure for approximate nearest neighbor search. This structure augments the neighborhood graph with a bridge graph. We propose to exploit Cartesian concatenation to produce a large set of vectors, called bridge vectors, from several small sets of subvectors. Each bridge vector is connected to a few nearby reference vectors, forming a bridge graph. Our approach finds nearest neighbors by simultaneously traversing the neighborhood graph and the bridge graph in a best-first strategy. The success of our approach stems from two factors: the exact nearest neighbor search over a large number of bridge vectors can be done quickly, and the reference vectors connected to a bridge (reference) vector near the query are also likely to be near the query. Experimental results on searching over large-scale datasets (SIFT, GIST and HOG) show that our approach outperforms state-of-the-art ANN search algorithms in terms of efficiency and accuracy. The combination of our approach with the IVFADC system also shows superior performance over the BIGANN dataset of 1 billion SIFT features compared with the best previously published result.


1 Introduction

Nearest neighbor (NN) search is a fundamental problem in machine learning, information retrieval and computational geometry. It is also a crucial step in many vision and graphics problems, such as shape matching FromeSSM07, object retrieval PhilbinCISZ07, feature matching BrownL03; SnavelySS06, texture synthesis LiangLXGS01, image completion HaysE07, and so on. Recently, the nearest neighbor search problem has attracted more attention in computer vision because of the popularity of large-scale and high-dimensional multimedia data.

The simplest solution to NN search is linear scan, comparing each reference vector to the query vector. The search complexity is linear with respect to both the number of reference vectors and the data dimensionality. Apparently, it is too time-consuming and does not scale well to large-scale and high-dimensional problems. Algorithms, including KD trees AryaMNSW98; BeisL97; Bentley75; FriedmanBF77, BD trees AryaMNSW98, cover trees BeygelzimerKL06, nonlinear embedding HwangHA12 and so on, have been proposed to improve the search efficiency. However, for high-dimensional cases it turns out that such approaches are not much more efficient than linear scan and cannot satisfy practical requirements. Therefore, a lot of effort has been devoted to approximate nearest neighbor (ANN) search, such as KD trees and their variants, hashing algorithms, neighborhood graph search, and inverted indices.

In this paper, we propose a new data structure for approximate nearest neighbor search (a conference version appeared in WangWZGLG13). This structure augments the neighborhood graph with a bridge graph that is able to boost approximate nearest neighbor search performance. Inspired by the product quantization technique BabenkoL12; JegouDS11, we adopt Cartesian concatenation (or Cartesian product) to generate a large set of vectors, which we call bridge vectors, from several small sets of subvectors, so as to approximate the reference vectors. Each bridge vector is then connected to a few reference vectors that are near enough to it, forming a bridge graph. Combining the bridge graph with the neighborhood graph built over the reference data vectors yields an augmented neighborhood graph. The ANN search procedure starts by finding the nearest bridge vector to the query vector, and discovers the first set of reference vectors connected to such a bridge vector. Then the search simultaneously traverses the bridge graph and the neighborhood graph in a best-first manner using a shared priority queue.

The advantages of adopting the bridge graph are two-fold. First, computing the distances from the bridge vectors to the query is very efficient: because the bridge vectors are formed from several small sets of subvectors, the distances to all bridge vectors can be derived from the distances to those subvectors, which takes almost the same time as computing distances to a small number of ordinary vectors. Second, the best bridge vector is most likely to be very close to true NNs, allowing the ANN search to quickly reach true NNs through bridge vectors.

We evaluate the proposed approach by the feature matching performance on SIFT and HOG features, and by the performance of searching similar images over the tiny images TorralbaFF08 with GIST features. We show that our approach achieves significant improvements compared with the state-of-the-art in terms of accuracy and search time. We also demonstrate that our approach in combination with the IVFADC system JegouDS11 outperforms the state-of-the-art over the BIGANN dataset of 1 billion SIFT vectors JegouTDA11.

2 Literature review

Nearest neighbor search in a $d$-dimensional metric space $\mathcal{M}$ is defined as follows: given a query $\mathbf{q}$, the goal is to find an element $\mathrm{NN}(\mathbf{q})$ from the database $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ so that $\mathrm{NN}(\mathbf{q}) = \arg\min_{\mathbf{x} \in \mathcal{X}} \mathrm{dist}(\mathbf{q}, \mathbf{x})$. In this paper, we assume that $\mathcal{M}$ is a Euclidean space and $\mathrm{dist}(\mathbf{q}, \mathbf{x}) = \|\mathbf{q} - \mathbf{x}\|_2$, which is appropriate for most problems in multimedia search and computer vision.

There are two types of ANN search problems. One is error-constrained ANN search, which terminates the search when the minimum distance found so far lies within some scope around the true minimum (or desired) distance. The other is time-constrained ANN search, which terminates the search when it reaches some prefixed time (or, equivalently, examines a fixed number of data points). The latter category has been shown to be more practical and to give better performance. Our proposed approach belongs to the latter category.

ANN search algorithms can be roughly divided into four categories: partition trees, neighborhood graphs, compact codes (hashing and source coding), and inverted indices. The following presents a short review of the four categories.

2.1 Partition trees

The partition tree based approaches recursively split the space into subspaces, and organize the subspaces via a tree structure. Most approaches select hyperplanes or hyperspheres according to the distribution of data points to divide the space, and accordingly data points are partitioned into subsets.

The KD trees Bentley75; FriedmanBF77, using axis-aligned hyperplanes to partition the space, have been modified to find ANNs. Other trees using different partition schemes, such as BD trees AryaMNSW98, metric trees DasguptaF08; LiuMGY04; Moore00; Yianilos93, the hierarchical k-means tree NisterS06, and randomized KD trees JiaWZZH10; Silpa-AnanH08; WangWJLZZH13, have been proposed. FLANN MujaL09 aims to find the best configuration of the hierarchical k-means trees and randomized KD trees, and has been shown to work well in practice.

In the query stage, the branch-and-bound methodology Bentley75 is usually adopted to search for (approximate) nearest neighbors. This scheme traverses the tree in a depth-first manner from the root to a leaf by evaluating the query at each internal node, and prunes some subtrees according to the evaluation and the currently found nearest neighbors. The current state-of-the-art search strategy, priority search AryaMNSW98 or best-first search BeisL97, maintains a priority queue to access subtrees in order, so that the data points with large probabilities of being true nearest neighbors are accessed first. It has been shown that best-first search (priority search) achieves the best performance for ANN search, while for exact NN search it may perform worse than algorithms that do not use best-first search.

2.2 Neighborhood graph search

The data structure of the neighborhood graph is a directed graph connecting each vector and its nearest neighbors. Usually a k-NN graph, which connects each vector to its k nearest neighbors, is used. Various algorithms based on the neighborhood graph AoyamaSSU11; AryaM93b; BeisL97; HajebiASZ11; SameFoun2006; SebastianK02; WangL12 have been developed for ANN search.

The basic procedure of neighborhood graph search starts from one or several seeding vectors and puts them into a priority queue with the distance to the query as the key. Then the process proceeds by popping the top one in the queue, i.e., the nearest one to the query, and expanding its neighborhood vectors (from the neighborhood graph), among which the vectors that have not been visited are pushed into the priority queue. This process iterates until a fixed number of vectors have been accessed.

Using the neighborhood vectors of a vector as candidates has two advantages. One is that extracting the candidates is very cheap and only takes constant time per candidate. The other is that if one vector is close to the query, its neighborhood vectors are also likely to be close to the query. The main research efforts consist of two aspects. One is to build an effective neighborhood graph AoyamaSSU11; SameFoun2006. The other is to design efficient and effective ways to guide the search in the neighborhood graph, including presetting the seeds created via clustering SameFoun2006; SebastianK02, picking the candidates from KD trees AryaM93b, and iteratively searching between KD trees and the neighborhood graph WangL12. In this paper, we present a more effective way, combining the neighborhood graph with a bridge graph, to search for approximate nearest neighbors.

2.3 Compact codes

The compact code approaches transform each data vector into a small code using hashing or source coding techniques. Usually the small code takes much less storage than the original vector, and in particular the distance in the small code space, e.g., the Hamming distance or a distance computed via lookup tables, can be evaluated much more efficiently than in the original space.

Locality sensitive hashing (LSH) DatarIIM04, originally used in a manner similar to an inverted index, has been shown to achieve good theoretical guarantees of finding near neighbors with high probability, but it is reported to be inferior to KD trees in practice MujaL09. Multi-probe LSH LvJWCL07 adopts a search algorithm similar to priority search, achieving a significant improvement. Nowadays, the popular usage of hashing is to approximate the distance in the original space by the Hamming distance between hash codes and then adopt linear scan to conduct the search. To make the best of the data, various data-dependent hashing algorithms have recently been proposed that learn hash functions using metric learning-like techniques, including optimized kernel hashing HeLC10, learned metrics JainKG08, binary reconstructive embeddings KulisD09, kernelized LSH KulisG09, shift-invariant kernel hashing RaginskyL09, semi-supervised hashing WangKC10A, (multidimensional) spectral hashing WeissFT12; WeissTF08, iterative quantization GongL11, complementary hashing XuWLZLY11 and order preserving hashing WangWYL13.

The source coding approach, product quantization JegouDS11, divides each vector into several bands and quantizes the reference vectors of each band separately. Then each reference vector is approximated by the nearest center in each band, and the indices of the centers are used to represent the reference vector. Accordingly, the distance in the original space is approximated by the distance over the assigned centers in all bands, which can be quickly computed using precomputed lookup tables storing the distances between the quantization centers of each band.

2.4 Inverted index

An inverted index is composed of a set of inverted lists, each of which contains a subset of the reference vectors. The query stage selects a small number of inverted lists, regards the vectors contained in the selected inverted lists as the NN candidates, and reranks the candidates to find the best ones, using either the distance computed from the original vectors, or the distance computed from the small codes followed by a second reranking step using the distance computed from the original vectors.

Inverted index algorithms are widely used for very large datasets of vectors (hundreds of millions to billions) due to their small memory cost. Such algorithms usually load the inverted index (and possibly extra codes) into memory and store the raw features on disk. A typical inverted index is built by clustering algorithms, e.g., BabenkoL12; JegouDS11; NisterS06; SivicZ09; WangWHL12, and is composed of a set of inverted lists, each of which corresponds to a cluster of reference vectors. Other inverted indices include hash tables DatarIIM04, tree codebooks Bentley75 and complementary tree codebooks TuPW12.

3 Preliminaries

This section gives short introductions to several algorithms our approach depends on: neighborhood graph search, product quantization, and the multi-sequence search algorithm.

3.1 Neighborhood graph search

A neighborhood graph of a set of vectors is a directed graph that organizes data vectors by connecting each data point with its neighboring vectors. The neighborhood graph is denoted as $G = \{(\mathbf{x}_i, \mathrm{Adj}[\mathbf{x}_i])\}_{i=1}^{n}$, where $\mathbf{x}_i$ corresponds to a vector and $\mathrm{Adj}[\mathbf{x}_i]$ is a list of nodes that correspond to its neighbors.

The ANN search algorithm proposed in AryaM93b, which we call local neighborhood graph search, is a procedure that starts from a set of seeding points as initial NN candidates and propagates the search by continuously accessing the neighbors of previously-discovered NN candidates to discover more NN candidates. The best-first strategy AryaM93b is usually adopted for local neighborhood expansion (the depth-first search strategy can also be used; our experiments show that its performance is much worse than that of best-first search). To this end, a priority queue is used to maintain the previously-discovered NN candidates whose neighborhoods have not yet been expanded; initially it contains only the seeds. The best candidate in the priority queue is extracted, and the points in its neighborhood are discovered as new NN candidates and then pushed into the priority queue. The resulting search path, discovering NN candidates, may not be monotone, but always attempts to move closer to the query point without repeating points. As a local search that finds better solutions only in the neighborhood of the current solution, local neighborhood graph search can get stuck at a locally optimal point and has to conduct exhaustive neighborhood expansions to find better solutions. Both the proposed approach and the iterated approach WangL12 aim to efficiently find solutions beyond local optima.
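To make the best-first procedure concrete, the following is a minimal Python sketch of local neighborhood graph search; the array layout, function name and parameters are illustrative assumptions rather than the authors' implementation.

```python
import heapq
import numpy as np

def graph_search(q, refs, adj, seeds, max_discovered, k=10):
    """Best-first (priority) search over a neighborhood graph (illustrative sketch).

    refs: (n, d) array of reference vectors; adj[i]: list of neighbor ids of
    vector i; seeds: ids used to initialize the search. Visits at most
    `max_discovered` vectors and returns the k closest (distance, id) pairs.
    """
    queue, discovered, results = [], set(), []   # min-heap keyed by squared distance to q
    for s in seeds:
        d = float(np.sum((refs[s] - q) ** 2))
        heapq.heappush(queue, (d, s))
        discovered.add(s)
        results.append((d, s))
    while queue and len(discovered) < max_discovered:
        _, best = heapq.heappop(queue)           # expand the current best candidate
        for v in adj[best]:
            if v not in discovered:              # each vector is visited at most once
                discovered.add(v)
                d = float(np.sum((refs[v] - q) ** 2))
                heapq.heappush(queue, (d, v))
                results.append((d, v))
    return sorted(results)[:k]
```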

3.2 Product quantization

The idea of product quantization is to decompose the space into a Cartesian product of $m$ low-dimensional subspaces and to quantize each subspace separately. A vector $\mathbf{x}$ is then decomposed into $m$ subvectors, $\mathbf{x}^1, \dots, \mathbf{x}^m$, such that $\mathbf{x} = [(\mathbf{x}^1)^T, \dots, (\mathbf{x}^m)^T]^T$. Let the quantization dictionaries over the $m$ subspaces be $\mathcal{C}_1, \dots, \mathcal{C}_m$, with $\mathcal{C}_j$ being a set of centers $\{\mathbf{c}_{j1}, \dots, \mathbf{c}_{jk_j}\}$. A vector $\mathbf{x}$ is represented by a short code composed of its subspace quantization indices, $(t_1, \dots, t_m)$. Equivalently,

$$\mathbf{x} \approx [\mathbf{c}_{1t_1}^T, \dots, \mathbf{c}_{mt_m}^T]^T = [(\mathbf{C}_1 \mathbf{b}_{1t_1})^T, \dots, (\mathbf{C}_m \mathbf{b}_{mt_m})^T]^T, \qquad (1)$$

where $\mathbf{b}_{jt_j}$ is a vector in which the $t_j$-th entry is $1$ and all others are $0$, and $\mathbf{C}_j$ is the matrix whose columns are the centers of $\mathcal{C}_j$.

Given a query $\mathbf{q}$, the asymmetric scheme divides $\mathbf{q}$ into $m$ subvectors $\mathbf{q}^1, \dots, \mathbf{q}^m$, and computes $m$ distance arrays (for computational efficiency, storing the squares of the Euclidean distances) with respect to the centers of the $m$ subspaces. For a database point encoded as $(t_1, \dots, t_m)$, the square of the Euclidean distance is approximated as $\sum_{j=1}^{m} \|\mathbf{q}^j - \mathbf{c}_{jt_j}\|_2^2$, which is called the asymmetric distance.
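As an illustration of the asymmetric scheme, here is a small Python sketch that precomputes the per-subspace distance tables and evaluates the asymmetric distance for encoded database points; the layout of codebooks and codes is an assumption for this sketch, not the authors' implementation.

```python
import numpy as np

def asymmetric_distances(q, codebooks, codes):
    """Approximate squared distances from a query to PQ-encoded database points (sketch).

    codebooks: list of m arrays, the j-th of shape (k_j, d_j) holding the centers
    of subspace j; codes: (N, m) integer array, codes[i, j] being the center index
    chosen for point i in subspace j. Returns an array of N asymmetric distances.
    """
    m, offset, tables = len(codebooks), 0, []
    for j in range(m):
        d_j = codebooks[j].shape[1]
        q_j = q[offset:offset + d_j]
        # distance table of the query subvector to all centers of subspace j
        tables.append(np.sum((codebooks[j] - q_j) ** 2, axis=1))
        offset += d_j
    # asymmetric distance: sum of the table entries selected by each point's code
    return sum(tables[j][codes[:, j]] for j in range(m))
```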

The application of product quantization in our approach is different from its applications to fast distance computation JegouDS11 and codebook construction BabenkoL12: the goal of the Cartesian product in this paper is to build a bridge that connects the query and the reference vectors through bridge vectors.

3.3 Multi-sequence search

Given $m$ monotonically increasing sequences, where the $j$-th sequence is $s_{j1}, s_{j2}, \dots$ with $s_{j1} \leq s_{j2} \leq \cdots$, the multi-sequence search algorithm BabenkoL12 is able to efficiently traverse the set of $m$-tuples $(i_1, \dots, i_m)$ in order of increasing sum $\sum_{j=1}^{m} s_{j i_j}$.

The algorithm uses a min-priority queue of tuples with the key being the sum $\sum_{j=1}^{m} s_{j i_j}$. It starts by initializing the queue with the tuple $(1, 1, \dots, 1)$. At step $t$, the tuple with top priority (the minimum sum), $(i_1, \dots, i_m)$, is popped from the queue and regarded as the $t$-th best tuple, whose sum is the $t$-th smallest. At the same time, each tuple obtained by incrementing a single index, $(i_1, \dots, i_j + 1, \dots, i_m)$, is pushed into the queue if all of its preceding tuples have already been pushed into the queue. As a result, the multi-sequence algorithm produces a sequence of $m$-tuples in order of increasing sum, and can stop at step $T$ if only the best $T$ tuples are required. It is shown in BabenkoL12 that the time cost of extracting the best $T$ tuples is $O(T \log T)$.
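A minimal sketch of the multi-sequence traversal follows; a visited set replaces the "all preceding tuples pushed" bookkeeping of the original algorithm (the output order is the same), and the names are illustrative.

```python
import heapq
from typing import Iterator, Sequence, Tuple

def multi_sequence(seqs: Sequence[Sequence[float]]) -> Iterator[Tuple[Tuple[int, ...], float]]:
    """Yield (index tuple, sum) pairs in order of increasing sum (sketch).

    Each sequence in `seqs` must be sorted in increasing order; one index is
    taken from each sequence.
    """
    m = len(seqs)
    start = tuple(0 for _ in range(m))
    heap = [(sum(seqs[j][0] for j in range(m)), start)]
    visited = {start}
    while heap:
        s, idx = heapq.heappop(heap)
        yield idx, s
        for j in range(m):                       # push the successors of the popped tuple
            if idx[j] + 1 < len(seqs[j]):
                nxt = idx[:j] + (idx[j] + 1,) + idx[j + 1:]
                if nxt not in visited:
                    visited.add(nxt)
                    heapq.heappush(heap, (s - seqs[j][idx[j]] + seqs[j][idx[j] + 1], nxt))
```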

4 Approach

The database $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ contains $n$ $d$-dimensional reference vectors, $\mathbf{x}_i \in \mathbb{R}^d$. Our goal is to build an index structure using the bridge graph such that, given a query vector $\mathbf{q}$, its nearest neighbors can be quickly discovered. In this section, we first describe the index structure and then show the search algorithm.

4.1 Data structure

Our index structure consists of two components: a bridge graph that connects bridge vectors and their nearest reference vectors, and a neighborhood graph that connects each reference vector to its nearest reference vectors.

Bridge vectors.  Cartesian concatenation is an operation that builds a new set out of a number of given sets. Given $m$ sets, $\mathcal{S}_1, \dots, \mathcal{S}_m$, where each set $\mathcal{S}_j$, in our case, contains a set of $d_j$-dimensional subvectors such that $\sum_{j=1}^{m} d_j = d$, the Cartesian concatenation of those sets is defined as follows,

$$\mathcal{Y} = \{\mathbf{y} = [(\mathbf{y}^1)^T, \dots, (\mathbf{y}^m)^T]^T \mid \mathbf{y}^j \in \mathcal{S}_j\}.$$

Here $\mathbf{y}$ is a $d$-dimensional vector, and there exist $\prod_{j=1}^{m} n_j$ vectors ($n_j$ is the number of elements in $\mathcal{S}_j$) in the Cartesian concatenation $\mathcal{Y}$. Without loss of generality, we assume that $n_1 = n_2 = \cdots = n_m = n$ for convenience. There is a nice property that identifying the nearest one from $\mathcal{Y}$ to a query only takes $O(dn)$ time rather than $O(dn^m)$, despite that the number of elements in $\mathcal{Y}$ is $n^m$. Inspired by this property, we use the Cartesian concatenation $\mathcal{Y}$, called bridge vectors, as bridges to connect the query vector with the reference vectors.
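A small sketch of this property, assuming the subvector sets are stored as NumPy arrays (the names are illustrative): the nearest bridge vector is simply the concatenation of the nearest subvector in each set.

```python
import numpy as np

def nearest_bridge_vector(q, subvector_sets):
    """Return the bridge vector nearest to q without enumerating all of them (sketch).

    subvector_sets: list of m arrays, the j-th of shape (n_j, d_j) with
    sum_j d_j == len(q). The cost is O(dn) instead of O(dn^m).
    """
    parts, offset = [], 0
    for S in subvector_sets:
        d_j = S.shape[1]
        q_j = q[offset:offset + d_j]
        dists = np.sum((S - q_j) ** 2, axis=1)   # squared distances to the subvectors
        parts.append(S[np.argmin(dists)])
        offset += d_j
    return np.concatenate(parts)
```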

Computing bridge vectors.  We propose to use product quantization JegouDS11 , which aims to minimize the distance of each vector to the nearest concatenated center derived from subquantizers, to compute bridge vectors. This ensures that the reference vectors discovered through one bridge vector are not far away from the query and hence the probability that those reference vectors are true NNs is high.

It is also expected that the number of reference vectors that are close enough to at least one bridge vector should be as large as possible (to make sure that enough good reference vectors can be discovered merely through bridge vectors) and that the average number of reference vectors discovered through each bridge vector should be small (to make sure that the time cost to access them is low). To this end, we generate a large number of bridge vectors. Such a requirement is similar to JegouDS11 for source coding and different from BabenkoL12 for inverted indices.

Augmented neighborhood graph.  The augmented neighborhood graph is a combination of the neighborhood graph $G$ over the reference database $\mathcal{X}$ and the bridge graph $B$ between the bridge vectors $\mathcal{Y}$ and the reference vectors $\mathcal{X}$. The neighborhood graph $G$ is a directed graph. Each node corresponds to a point $\mathbf{x}_i$, and is also denoted as $\mathbf{x}_i$ for convenience. Each node $\mathbf{x}_i$ is connected with a list of nodes that correspond to its neighbors, denoted by $\mathrm{Adj}[\mathbf{x}_i]$.

The bridge graph $B$ is constructed by connecting each bridge vector in $\mathcal{Y}$ to its nearest vectors in $\mathcal{X}$. To avoid an expensive computation cost, we build the bridge graph approximately by finding the top nearest bridge vectors for each reference vector and then keeping the top nearest reference vectors for each bridge vector, which is efficient.
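A minimal sketch of this approximate construction, reusing the multi_sequence sketch from Section 3.3; the budgets, parameter names and data layout are assumptions for illustration.

```python
from collections import defaultdict
from itertools import islice
import numpy as np

def build_bridge_graph(refs, subvector_sets, bridges_per_ref=5, refs_per_bridge=50):
    """Approximate bridge-graph construction (illustrative sketch).

    For each reference vector, find its nearest bridge vectors with the
    multi-sequence traversal, then keep only the closest reference vectors
    for each bridge vector. refs: (n, d) array; subvector_sets: list of
    (n_j, d_j) arrays.
    """
    bridge_to_refs = defaultdict(list)           # bridge index tuple -> [(dist, ref id), ...]
    for i, x in enumerate(refs):
        seqs, orders, offset = [], [], 0
        for S in subvector_sets:                 # per-partition sorted distance lists
            d_j = S.shape[1]
            dists = np.sum((S - x[offset:offset + d_j]) ** 2, axis=1)
            order = np.argsort(dists)
            seqs.append(dists[order].tolist())
            orders.append(order)
            offset += d_j
        for idx, s in islice(multi_sequence(seqs), bridges_per_ref):
            bridge = tuple(int(orders[j][idx[j]]) for j in range(len(seqs)))
            bridge_to_refs[bridge].append((s, i))
    # keep only the closest reference vectors for each bridge vector
    return {b: [i for _, i in sorted(lst)[:refs_per_bridge]]
            for b, lst in bridge_to_refs.items()}
```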

The bridge graph is different from the inverted multi-index BabenkoL12. In the inverted multi-index, each bridge vector contains a list of reference vectors that are closer to it than to all other bridge vectors, while in our approach each bridge vector is associated with a list of reference vectors that are closer to it than all other reference data points are.

4.2 Query the augmented neighborhood graph

To make the description clear, without loss of generality, we assume there are two sets of subvectors, $\mathcal{S}_1$ and $\mathcal{S}_2$. Given a query $\mathbf{q}$ consisting of two subvectors $\mathbf{q}^1$ and $\mathbf{q}^2$, the goal is to generate a list of $T$ ($T \ll n$) candidate reference points from $\mathcal{X}$ where the true NNs of $\mathbf{q}$ are most likely to lie. This is achieved by traversing the augmented neighborhood graph in a best-first strategy.

Figure 1: An example illustrating the search process over four iterations, (a) Iteration 1 to (d) Iteration 4, showing the bridge graph and the neighborhood graph. The white numbers are the distances to the query. Magenta denotes the vectors in the main queue, green represents the vector being popped out from the main queue, and black indicates the vectors whose neighborhoods have already been expanded

We give a brief overview of the ANN search procedure over a neighborhood graph before describing how to make use of bridge vectors. The algorithm begins with a set of (one or several) seed vectors that are contained in the neighborhood graph. It maintains a set of nearest neighbor candidates (whose neighborhoods have not been expanded) using a min-priority queue, which we call the main queue, with the distance to the query as the key. The main queue initially contains the seed vectors. The algorithm proceeds by iteratively expanding neighborhoods in a best-first strategy. At each step, the vector with top priority (the nearest one to the query) is popped from the queue. Then each of its neighborhood vectors is inserted into the queue if it has not been visited, and at the same time it is added to the result set (maintained by a max-priority queue with a fixed length depending on how many nearest neighbors are expected).

To exploit the bridge vectors, we present an extraction-on-demand strategy, instead of fetching all the bridge vectors into the main queue, which would lead to an expensive cost in sorting them and maintaining the main queue. Our strategy is to maintain the main queue so that it contains only one bridge vector if available. To be specific, if the top vector in the main queue is a reference vector, the algorithm proceeds as usual, the same as the above procedure without using bridge vectors. If the top vector is a bridge vector, we first insert its neighbors into the main queue and the result set, and in addition we find the next nearest bridge vector (to the query) and insert it into the main queue. The pseudo code of the search algorithm is given in Algorithm 1 and an example process is illustrated in Figure 1.

Before traversing the augmented neighborhood graph, we first process the bridge vectors: we compute the distances (the squares of the Euclidean distances) from $\mathbf{q}^1$ to the subvectors in $\mathcal{S}_1$ and from $\mathbf{q}^2$ to the subvectors in $\mathcal{S}_2$, and then sort the subvectors of each set in order of increasing distance. As the sizes of $\mathcal{S}_1$ and $\mathcal{S}_2$ are typically not large, the computation cost is very small (see details in Section 6).

The extraction-on-demand strategy needs to visit the bridge vectors one by one in order of increasing distance from $\mathbf{q}$. It is easily shown that $\mathrm{dist}^2(\mathbf{q}, \mathbf{y}) = \mathrm{dist}^2(\mathbf{q}^1, \mathbf{y}^1) + \mathrm{dist}^2(\mathbf{q}^2, \mathbf{y}^2)$, where the bridge vector $\mathbf{y}$ consists of $\mathbf{y}^1$ and $\mathbf{y}^2$. Naturally, the bridge vector composed of the first subvectors in the two sorted lists is the nearest one to $\mathbf{q}$. The multi-sequence algorithm (corresponding to ExtractNextNearestBridgeVector() in Algorithm 1) is able to quickly produce a sequence of subvector pairs such that the corresponding bridge vectors are visited in order of increasing distance to the query $\mathbf{q}$. The algorithm is very efficient and producing the $t$-th bridge vector only takes $O(\log t)$ time. Slightly different from extracting a fixed number of nearest bridge vectors at once BabenkoL12, our algorithm automatically determines when to extract the next one, that is, when there is no bridge vector left in the main queue.

0.    /* $\mathbf{q}$: the query; $\mathcal{X}$: the reference data vectors; $\mathcal{Y}$: the set of bridge vectors; $G$: the augmented neighborhood graph; $Q$: the main queue; $R$: the result set; $T$: the maximum number of discovered vectors; */
0.    ANNSearch($\mathbf{q}$, $\mathcal{X}$, $\mathcal{Y}$, $G$, $Q$, $R$, $T$)
1.   /* Mark each reference vector undiscovered */
2.   for each $\mathbf{x} \in \mathcal{X}$ do
3.      Color[$\mathbf{x}$] $\leftarrow$ white;
4.   end for
5.   /* Extract the nearest bridge vector */
6.   $\mathbf{y} \leftarrow$ ExtractNextNearestBridgeVector($\mathcal{Y}$);
7.   $Q \leftarrow \{(\mathbf{y}, \mathrm{dist}(\mathbf{q}, \mathbf{y}))\}$;
8.   $t \leftarrow 0$;
9.   /* Start the search */
10.   while ($Q \neq \emptyset$ && $t < T$) do
11.      /* Pop out the best candidate vector and expand its neighbors */
12.      $(\mathbf{v}, \mathrm{dist}(\mathbf{q}, \mathbf{v})) \leftarrow Q$.pop();
13.      for each $\mathbf{u} \in \mathrm{Adj}[\mathbf{v}]$ do
14.          if Color[$\mathbf{u}$] = white then
15.             compute $\mathrm{dist}(\mathbf{q}, \mathbf{u})$;
16.             $Q$.push($(\mathbf{u}, \mathrm{dist}(\mathbf{q}, \mathbf{u}))$);
17.             Color[$\mathbf{u}$] $\leftarrow$ black; /* Mark it discovered */
18.             $R \leftarrow$ Update($R$, $(\mathbf{u}, \mathrm{dist}(\mathbf{q}, \mathbf{u}))$); /* Update the result set */
19.             $t \leftarrow t + 1$;
20.          end if
21.      end for
22.      /* Extract the next nearest bridge vector if $\mathbf{v}$ is a bridge vector */
23.      if $\mathbf{v} \in \mathcal{Y}$ then
24.          $\mathbf{y} \leftarrow$ ExtractNextNearestBridgeVector($\mathcal{Y}$);
25.          $Q$.push($(\mathbf{y}, \mathrm{dist}(\mathbf{q}, \mathbf{y}))$);
26.      end if
27.   end while
28.   return $R$;
Algorithm 1 ANN search over the augmented neighborhood graph
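The following is a minimal, illustrative Python sketch of the extraction-on-demand search, reusing the multi_sequence sketch from Section 3.3; the data layouts, names and result-set handling are assumptions rather than the authors' implementation.

```python
import heapq
import numpy as np

def ann_search(q, refs, subvector_sets, adj, bridge_to_refs, max_discovered, k=10):
    """Extraction-on-demand search over the augmented neighborhood graph (sketch).

    adj[i]: neighbor ids of reference vector i; bridge_to_refs: bridge index
    tuple -> attached reference ids (e.g., built as sketched in Section 4.1).
    At most one bridge vector is kept in the main queue at any time.
    """
    # sorted per-partition distances of the query to the subvectors
    seqs, orders, offset = [], [], 0
    for S in subvector_sets:
        d_j = S.shape[1]
        dists = np.sum((S - q[offset:offset + d_j]) ** 2, axis=1)
        order = np.argsort(dists)
        seqs.append(dists[order].tolist())
        orders.append(order)
        offset += d_j
    bridges = multi_sequence(seqs)               # bridge vectors in increasing distance order

    def next_bridge():
        idx, s = next(bridges)
        return s, ('bridge', tuple(int(orders[j][idx[j]]) for j in range(len(seqs))))

    main_queue = [next_bridge()]                 # min-heap keyed by squared distance to q
    discovered, results = set(), []
    while main_queue and len(discovered) < max_discovered:
        dist, (kind, node) = heapq.heappop(main_queue)
        neighbors = bridge_to_refs.get(node, []) if kind == 'bridge' else adj[node]
        for i in neighbors:
            if i not in discovered:
                discovered.add(i)
                d = float(np.sum((refs[i] - q) ** 2))
                heapq.heappush(main_queue, (d, ('ref', i)))
                results.append((d, i))
        if kind == 'bridge':                     # keep exactly one bridge vector in the queue
            try:
                heapq.heappush(main_queue, next_bridge())
            except StopIteration:
                pass
    return sorted(results)[:k]
```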

5 Experiments

5.1 Setup

We perform our experiments on three large datasets: the first with local SIFT features, the second with global GIST features, and the third with HOG features, as well as a very large dataset, the BIGANN dataset of 1 billion SIFT features JegouTDA11.

The SIFT features are collected from the Caltech dataset FeiFP04. We extract maximally stable extremal regions (MSERs) from each image and compute a 128-dimensional byte-valued SIFT feature for each MSER. We randomly sample subsets of the SIFT features as the reference set and the query set, respectively. The GIST features are extracted on the tiny image set TorralbaFF08; the GIST descriptor is a byte-valued vector. We sample a subset of the images as the reference set and a separate subset as the queries. The HOG descriptors are extracted from Flickr images; each HOG descriptor is a byte-valued vector. We sample a subset of the HOG descriptors as the reference set and a separate subset as the queries. The BIGANN dataset JegouTDA11 consists of 1 billion 128-dimensional byte-valued SIFT vectors as the reference set and a separate set of query vectors.

We use the accuracy score to evaluate the search quality. For K-ANN search, the accuracy is computed as $r/K$, where $r$ is the number of retrieved vectors contained among the true $K$ nearest neighbors. The true nearest neighbors are computed by comparing each query with all the reference vectors in the dataset. We compare different algorithms by calculating the search accuracy given the same search time, where the search time is recorded by varying the number of accessed vectors. We report the performance in terms of search time vs. search accuracy for the first three datasets. Those results are obtained on a quad-core Intel PC.

size #partitions #clusters #reference
  SIFT
  GIST
  HOG
Table 1: The parameters of our approach and the statistics. #reference means the number of reference vectors associated with the bridge vectors; the average number of unique reference vectors associated with each bridge vector is also reported

5.2 Empirical analysis

The index structure construction in our approach includes partitioning the vector into subvectors and grouping the vectors of each partition into clusters. We conduct experiments to study how these choices influence the search performance. The results over the SIFT and GIST datasets are shown in Figure 2. Considering two partitions, it can be observed that the performance becomes better with more clusters for each partition. This is because more clusters produce more bridge vectors, so more reference vectors are associated with bridge vectors and their distances to the associated bridge vectors are much smaller. The best result is obtained with the combination of partitions and clusters per partition for which the properties desired for bridge vectors, described in Section 4.1, are most likely to be satisfied.

5.3 Comparisons

We compare our approach with state-of-the-art algorithms, including iterative neighborhood graph search WangL12, original neighborhood graph search (AryaM93) AryaM93b, trinary projection (TP) trees JiaWZZH10, the vantage point (VP) tree Yianilos93, spill trees LiuMGY04, FLANN MujaL09, and the inverted multi-index BabenkoL12. The results of all other methods are obtained with well-tuned parameters. We do not report results from hashing algorithms as they are much worse than tree-based approaches, which is also reported in MujaL09; WangWJLZZH13. The neighborhood graphs of the different algorithms are the same, and each vector is connected with its nearest vectors. We construct approximate neighborhood graphs using the algorithm of WangWZTGL12. Table 1 shows the parameters for our approach, together with some statistics.

The experimental comparisons are shown in Figure 3. The horizontal axis corresponds to search time (milliseconds), and the vertical axis corresponds to search accuracy. From the results over the SIFT dataset shown in the first row of Figure 3, our approach performs the best. We can see that, for the same target accuracy, our approach takes only a fraction of the time of the second best algorithm, iterative neighborhood graph search.

Figure 2: Search performance with different numbers of partitions and clusters over (a) SIFT and (b) GIST. The legend gives the number of partitions and the number of clusters per partition
Figure 3: Performance comparison on (a) SIFT features, (b) GIST features, and (c) HOG features, for different numbers of target nearest neighbors

The second row of Figure 3 shows the results over the GIST dataset. Compared with the SIFT feature (a 128-dimensional vector), the dimension of the GIST feature is larger and the search is hence more challenging. It can be observed that our approach is still consistently better than the other approaches. In particular, the improvement is more significant: for the same target precision, our approach takes only about half the time of the second best approach. The third row of Figure 3 shows the results over the HOG dataset. This dataset is the most difficult because it contains more descriptors and its dimension is the largest. Again, our approach achieves the best results, and for the same target accuracy the search time is only a fraction of that of the second best algorithm.

All the neighborhood graph search algorithms outperform the other algorithms, which shows that the neighborhood graph structure is well suited to indexing vectors. The superiority of our approach to previous neighborhood graph algorithms stems from the fact that our approach exploits the bridge graph to help the search. The inverted multi-index does not produce competitive results because its advantage lies in its small index structure size, while its search performance is limited by an unfavorable trade-off between search accuracy and the time overhead in quantization. It is shown in BabenkoL12 that the inverted multi-index works best when using a second-order multi-index and a large codebook, but this results in a high quantization cost. In contrast, our approach benefits from the neighborhood graph structure so that we can use a higher-order product quantizer to save the quantization cost.

In addition, we also conduct experiments comparing with the source coding based ANN search algorithm JegouDS11. This algorithm compresses each data vector into a short code using product quantization, resulting in fast approximate distance computation between vectors. We report the results from the IVFADC system, which performs the best as pointed out in JegouDS11, over the SIFT and GIST features. To compare IVFADC with our approach, we follow the scheme in JegouDS11 and add a verification stage to the IVFADC system. We cluster the data points into inverted lists and use a short code to represent each vector, as done in JegouDS11. Given a query, we first find its nearest inverted lists, then compute the approximate distance from the query to each of the candidates in the retrieved inverted lists. Finally we re-rank the top candidates using the Euclidean distance and compute the recall JegouDS11 of the nearest neighbor (the same as the definition of the search accuracy for 1-NN). Figure 4 shows the results with respect to the parameters of IVFADC. One can see that our approach gets superior performance.

Figure 4: Search performance comparison with IVFADC JegouDS11 over (a) SIFT and (b) GIST. The parameters (the number of inverted lists visited and the number of candidates for re-ranking) are given beside each marker of IVFADC

5.4 Experiments over the BIGANN dataset

We evaluate the performance of our approach when combining it with the IVFADC system JegouDS11 for searching very large scale datasets. The IVFADC system organizes the data using inverted indices built via a coarse quantizer and represents each vector by a short code produced by product quantization. During the search stage, the system visits the inverted lists in ascending order of the distances to the query and re-ranks the candidates according to the short codes. The original implementation only uses a small number of inverted lists to avoid the expensive time cost of finding the exact nearest inverted indices. The inverted multi-index BabenkoL12 has been used to replace the inverted indices in the IVFADC system, which is shown to be better than the original IVFADC implementation JegouDS11.

We propose to replace the nearest inverted list identification with our approach. The good search quality of our approach in terms of both accuracy and efficiency makes it feasible to handle a large number of inverted lists. We quantize the features into millions of groups using a fast approximate k-means clustering algorithm WangWKZL12, and the centers of all the groups form the vocabulary. Then we use our approach to assign each vector to the inverted list corresponding to the nearest center, producing the inverted indices. The residual displacement between each vector and its center is quantized using product quantization to obtain extra bytes for re-ranking. During the search stage, we find the nearest inverted lists to the query using our approach and then perform the same reranking procedure as in BabenkoL12; JegouDS11.

Following BabenkoL12; JegouDS11, we calculate the recall@R scores of the nearest neighbor with respect to different lengths of the visited candidate list and different numbers of extra bytes used for re-ranking. The recall@R score is equivalent to the accuracy for the nearest neighbor if a short list of R vectors is verified using exact Euclidean distances JegouTDA11. The performance is summarized in Table 2. It can be seen that our approach consistently outperforms Multi-D-ADC BabenkoL12 and IVFADC JegouDS11 in terms of both recall and time cost when retrieving the same number of candidates. The superiority over IVFADC stems from the fact that our approach significantly increases the number of inverted indices and produces space partitions with smaller (coarse) quantization errors, and that our system accesses only a few coarse centers while still guaranteeing relatively accurate inverted lists. For the inverted multi-index approach, although the total number of centers is quite large, the data vectors are not evenly divided into the inverted lists; as reported in the supplementary material of BabenkoL12, a large fraction of the inverted lists are empty. Thus its quantization quality is not as good as ours, and consequently it performs worse than our approach.

System | List len. | R@ | R@ | R@ | Time
-- BIGANN, 1 billion SIFTs, bytes per vector --
IVFADC (list len. in the millions)
Multi-D-ADC
Multi-D-ADC
Multi-D-ADC
Graph-D-ADC
Graph-D-ADC
Graph-D-ADC
-- BIGANN, 1 billion SIFTs, bytes per vector --
IVFADC (list len. in the millions)
Multi-D-ADC
Multi-D-ADC
Multi-D-ADC
Graph-D-ADC
Graph-D-ADC
Graph-D-ADC

Table 2: The performance (recall for the top candidates at several cutoffs after reranking, and average search time in milliseconds) comparison between IVFADC JegouTDA11, Multi-D-ADC BabenkoL12 and our approach (Graph-D-ADC). IVFADC uses inverted lists built from a coarse quantizer, Multi-D-ADC uses the second-order multi-index, and our approach uses a large number of inverted lists

6 Analysis and discussion

Index structure size.  In addition to the neighborhood graph and the reference vectors, the index structure of our approach includes a bridge graph and the bridge vectors. The number of bridge vectors in our implementation is of the same order as the number of reference vectors, and the storage costs of the bridge vectors (only the subvector sets need to be stored) and of the bridge graph are correspondingly modest. In the case of the byte-valued GIST features, without optimization, the storage cost of the bridge graph is smaller than that of the reference vectors and that of the neighborhood graph, and is of the same order as the costs of KD trees, VP trees, and TP trees. In summary, the storage cost of our index structure is comparable with those of neighborhood graph and tree-based structures.

In comparison with source coding JegouDS11; JegouTDA11 and hashing without using the original features, and with inverted indices (e.g., BabenkoL12), our approach takes a larger storage cost. However, the search quality of our approach in terms of accuracy and time is much better, which leaves the algorithm selection to users according to whether they prefer less memory or less time. Moreover, the storage costs for the GIST and SIFT features, and even the HOG features, are acceptable on most of today's machines. When applying our approach to the BIGANN dataset of 1 billion SIFT features, the index structure size of our approach is similar to those of Multi-D-ADC BabenkoL12 and IVFADC JegouDS11.

Construction complexity.  The most time-consuming process in constructing the index structure in our approach is the construction of the neighborhood graph. Recent research WangWZTGL12 shows that an approximate neighborhood graph can be built efficiently, with a cost comparable to that of constructing the bridge graph. In our experiments, using a quad-core Intel PC, the index structures of the SIFT data, the GIST data, and the HOG data can be built within half an hour, an hour, and a few hours, respectively. These time costs are relatively large but acceptable as they are incurred in offline processes.

The algorithm combining our approach with the IVFADC system JegouDS11 over the BIGANN dataset of 1 billion vectors requires a construction cost similar to that of the state-of-the-art algorithm BabenkoL12. Because the number of data vectors is very large, the most time-consuming stage is assigning each vector to the inverted lists, which takes a few days for both approaches. Building the structure that our approach uses to organize the centers takes only a few hours, which is relatively small. These construction stages are all run with multiple threads on a server with AMD Opteron quad-core processors.

Search complexity.  The search procedure of our approach consists of the distance computation over the subvectors and the traversal over the bridge graph and the neighborhood graph. The distance computation over the subvectors is very cheap and takes a small constant amount of time, comparable to the distance computation cost over a small number of full vectors in our experiments. Compared with the number of reference vectors that must be visited to reach an acceptable accuracy (for example, on the GIST feature dataset), such a time cost is negligible.

Besides the computation of the distances between the query vector and the visited reference vectors, the additional time overhead comes from maintaining the priority queue and querying the bridge vectors using the multi-sequence algorithm. Given that $T$ reference vectors have been discovered, it can easily be shown that the main queue is no longer than $T + 1$ (it contains at most one bridge vector at any time). Consider the worst case in which all the reference vectors come from the bridge graph, where each bridge vector is associated with $\bar{t}$ unique reference vectors on average (the statistics for $\bar{t}$ in our experiments are presented in Table 1); then $T / \bar{t}$ bridge vectors are visited. Thus, the maintenance of the main queue takes $O(T \log T)$ time. Extracting $T / \bar{t}$ bridge vectors using the multi-sequence algorithm BabenkoL12 takes $O((T / \bar{t}) \log (T / \bar{t}))$ time. Consequently, the time overhead on average is $O(T \log T)$.
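As a compact summary of the above accounting (a sketch under the worst-case assumption that all $T$ discovered vectors come through bridge vectors, each contributing $\bar{t}$ unique references on average):

$$\underbrace{O(T \log T)}_{\text{main queue}} \;+\; \underbrace{O\!\left(\frac{T}{\bar{t}} \log \frac{T}{\bar{t}}\right)}_{\text{multi-sequence extraction}} \;=\; O(T \log T).$$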

Figure 5 shows the time cost of visiting a fixed number of reference vectors for different algorithms on two datasets. Linear scan represents the time cost of computing the distances between a query and all reference vectors. The overhead of a method is the difference between the time cost of this method and that of linear scan. We can see that the inverted multi-index takes the smallest overhead and our approach takes the second smallest. This is because our approach includes extra operations over the main queue.

Relations to source coding JegouDS11 and the inverted multi-index BabenkoL12.  Product quantization (or, generally, Cartesian concatenation) has two attractive properties. One property is that it is able to produce a large set of concatenated vectors from several small sets of subvectors. The other property is that the exact nearest vectors to a query from such a large set of concatenated vectors can be quickly found using the multi-sequence algorithm. The application to source coding JegouDS11 exploits the first property, which results in fast distance approximation. The application to the inverted multi-index BabenkoL12 makes use of the second property to quickly retrieve concatenated quantizers. In contrast, our approach exploits both properties: the first property guarantees that the approximation error of the concatenated vectors to the reference vectors is small even with small sets of subvectors, and the second property guarantees that the retrieval from the concatenated vectors is very efficient and hence the time overhead is small.

Figure 5: Average time cost of visiting a fixed number of reference vectors. The time overhead (the difference between the average time cost and the cost of linear scan) of our approach is comparably small

7 Conclusions

The key factors contributing to the superior performance of our proposed approach are: (1) discovering NN candidates from the neighborhoods of both bridge vectors and reference vectors is very cheap; (2) the NN candidates from the neighborhood of a bridge vector have a high probability of being true NNs because there is a large number of effective bridge vectors generated by Cartesian concatenation; (3) retrieving the nearest bridge vectors is very efficient. The algorithm is very simple and easily implemented. The power of our algorithm is demonstrated by the superior ANN search performance over large scale SIFT, HOG, and GIST datasets, as well as over a very large scale dataset, the BIGANN dataset of 1 billion SIFT features, through the combination of our approach with the IVFADC system.

Bibliography

  • (1) Aoyama, K., Saito, K., Sawada, H., Ueda, N.: Fast approximate similarity search based on degree-reduced neighborhood graphs. In: KDD, pp. 1055–1063 (2011)
  • (2) Arya, S., Mount, D.M.: Approximate nearest neighbor queries in fixed dimensions. In: SODA, pp. 271–280 (1993)
  • (3) Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM 45(6), 891–923 (1998)
  • (4) Babenko, A., Lempitsky, V.S.: The inverted multi-index. In: CVPR, pp. 3069–3076 (2012)
  • (5) Beis, J.S., Lowe, D.G.: Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In: CVPR, pp. 1000–1006 (1997)
  • (6) Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
  • (7) Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: ICML, pp. 97–104 (2006)
  • (8) Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV, pp. 1218–1227 (2003)
  • (9) Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: STOC, pp. 537–546 (2008)
  • (10) Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on Computational Geometry, pp. 253–262 (2004)
  • (11) Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: CVPR 2004 Workshop on Generative-Model Based Vision (2004)
  • (12) Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
  • (13) Frome, A., Singer, Y., Sha, F., Malik, J.: Learning globally-consistent local distance functions for shape-based image retrieval and classification. In: ICCV, pp. 1–8 (2007)
  • (14) Gong, Y., Lazebnik, S.: Iterative quantization: A procrustean approach to learning binary codes. In: CVPR, pp. 817–824 (2011)
  • (15) Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., Zhang, H.: Fast approximate nearest-neighbor search with k-nearest neighbor graph. In: IJCAI, pp. 1312–1317 (2011)
  • (16) Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graph. 26(3), 4 (2007)
  • (17) He, J., Liu, W., Chang, S.F.: Scalable similarity search with optimized kernel hashing. In: KDD, pp. 1129–1138 (2010)
  • (18) Hwang, Y., Han, B., Ahn, H.K.: A fast nearest neighbor search algorithm by nonlinear embedding. In: CVPR, pp. 3053–3060 (2012)
  • (19) Jain, P., Kulis, B., Grauman, K.: Fast image search for learned metrics. In: CVPR (2008)
  • (20) Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)
  • (21) Jégou, H., Tavenard, R., Douze, M., Amsaleg, L.: Searching in one billion vectors: Re-rank with source coding. In: ICASSP, pp. 861–864 (2011)
  • (22) Jia, Y., Wang, J., Zeng, G., Zha, H., Hua, X.S.: Optimizing kd-trees for scalable visual descriptor indexing. In: CVPR, pp. 3392–3399 (2010)
  • (23) Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In: NIPS, pp. 577–584 (2009)
  • (24) Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV (2009)
  • (25) Liang, L., Liu, C., Xu, Y.Q., Guo, B., Shum, H.Y.: Real-time texture synthesis by patch-based sampling. ACM Trans. Graph. 20(3), 127–150 (2001)
  • (26) Liu, T., Moore, A.W., Gray, A.G., Yang, K.: An investigation of practical approximate nearest neighbor algorithms. In: NIPS (2004)
  • (27) Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe lsh: Efficient indexing for high-dimensional similarity search. In: VLDB, pp. 950–961 (2007)
  • (28) Moore, A.W.: The anchors hierarchy: Using the triangle inequality to survive high dimensional data. In: UAI, pp. 397–405 (2000)
  • (29) Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISSAPP (1), pp. 331–340 (2009)
  • (30) Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: CVPR (2), pp. 2161–2168 (2006)
  • (31) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
  • (32) Raginsky, M., Lazebnik, S.: Locality sensitive binary codes from shift-invariant kernels. In: NIPS (2009)
  • (33) Samet, H.: Foundations of multidimensional and metric data structures. Elsevier, Amsterdam (2006)
  • (34) Sebastian, T.B., Kimia, B.B.: Metric-based shape retrieval in large databases. In: ICPR (3), pp. 291–296 (2002)
  • (35) Silpa-Anan, C., Hartley, R.: Optimised kd-trees for fast image descriptor matching. In: CVPR (2008)
  • (36) Sivic, J., Zisserman, A.: Efficient visual search of videos cast as text retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 31(4), 591–606 (2009)
  • (37) Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM Trans. Graph. 25(3), 835–846 (2006)
  • (38) Torralba, A.B., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)
  • (39) Tu, W., Pan, R., Wang, J.: Similar image search with a tiny bag-of-delegates representation. In: ACM Multimedia, pp. 885–888 (2012)
  • (40) Wang, J., Kumar, S., Chang, S.F.: Semi-supervised hashing for scalable image retrieval. In: CVPR (2010)
  • (41) Wang, J., Li, S.: Query-driven iterated neighborhood graph search for large scale indexing. In: ACM Multimedia, pp. 179–188 (2012)
  • (42) Wang, J., Wang, J., Hua, X.S., Li, S.: Scalable similar image search by joint indices. In: ACM Multimedia, pp. 1325–1326 (2012)
  • (43) Wang, J., Wang, J., Ke, Q., Zeng, G., Li, S.: Fast approximate k-means via cluster closures. In: CVPR, pp. 3037–3044 (2012)
  • (44) Wang, J., Wang, J., Yu, N., Li, S.: Order preserving hashing for approximate nearest neighbor search. In: ACM Multimedia (2013)
  • (45) Wang, J., Wang, J., Zeng, G., Gan, R., Li, S., Guo, B.: Fast neighborhood graph search using cartesian concatenation. In: ICCV, pp. 2128–2135 (2013)
  • (46) Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., Li, S.: Scalable k-nn graph construction for visual descriptors. In: CVPR, pp. 1106–1113 (2012)
  • (47) Wang, J., Wang, N., Jia, Y., Li, J., Zeng, G., Zha, H., Hua, X.S.: Trinary-projection trees for approximate nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • (48) Weiss, Y., Fergus, R., Torralba, A.: Multidimensional spectral hashing. In: ECCV (5), pp. 340–353 (2012)
  • (49) Weiss, Y., Torralba, A.B., Fergus, R.: Spectral hashing. In: NIPS, pp. 1753–1760 (2008)
  • (50) Xu, H., Wang, J., Li, Z., Zeng, G., Li, S., Yu, N.: Complementary hashing for approximate nearest neighbor search. In: ICCV, pp. 1631–1638 (2011)
  • (51) Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp. 311–321 (1993)