A Java library implementing a practical nearest-neighbour search algorithm for multidimensional vectors that operates in sublinear time. It implements locality-sensitive hashing (LSH) and multi-index hashing for Hamming space.
There is growing interest in representing image data and feature descriptors using compact binary codes for fast near neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than 32 bits are not being used as such, as this was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straightforward to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes. Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.
There has been growing interest in representing image data and feature descriptors in terms of compact binary codes, often to facilitate fast near neighbor search and feature matching in vision applications (e.g., [2, 7, 32, 33, 35, 19]). Binary codes are storage efficient and comparisons require just a small number of machine instructions. Millions of binary codes can be compared to a query in less than a second. But the most compelling reason for binary codes, and discrete codes in general, is their use as direct indices (addresses) into a hash table, yielding a dramatic increase in search speed compared to an exhaustive linear scan (e.g., [38, 31, 25]).
Nevertheless, using binary codes as direct hash indices is not necessarily efficient. To find near neighbors one needs to examine all hash table entries (or buckets) within some Hamming ball around the query. The problem is that the number of such buckets grows near-exponentially with the search radius. Even with a small search radius, the number of buckets to examine is often larger than the number of items in the database, and hence the search is slower than linear scan. Recent papers on binary codes mention the use of hash tables, but resort to linear scan when codes are longer than 32 bits (e.g., [35, 31, 20, 25]). Not surprisingly, code lengths are often significantly longer than 32 bits in order to achieve satisfactory retrieval performance (e.g., see Fig. 5).
This paper presents a new algorithm for exact $k$-nearest neighbor ($k$NN) search on binary codes that is dramatically faster than exhaustive linear scan. This has been an open problem since the introduction of hashing techniques with binary codes. Our new multi-index hashing algorithm exhibits sub-linear search times, while it is storage efficient, and straightforward to implement. Empirically, on databases of up to 1B codes we find that multi-index hashing is hundreds of times faster than linear scan. Extrapolation suggests that the speedup gain grows quickly with database size beyond 1B codes.
…, and parameter estimation. Sometimes the binary codes are generated directly as feature descriptors for images or image patches, such as BRIEF or FREAK, among others, and sometimes binary corpora are generated by discrete similarity-preserving mappings from high-dimensional data. Most such mappings are designed to preserve Euclidean distance (e.g., [11, 20, 29, 33, 38]). Others focus on semantic similarity (e.g., [25, 32, 31, 35, 26, 30, 21]). Our concern in this paper is not the algorithm used to generate the codes, but rather with fast search in Hamming space. (There do exist several other promising approaches to fast approximate NN search on large real-valued image features (e.g., [3, 17, 27, 24, 5]). Nevertheless, we restrict our attention in this paper to compact binary codes and exact search.)
We address two related search problems in Hamming space. Given a dataset of binary codes, $\mathcal{H} \equiv \{h_i\}_{i=1}^{n}$, the first problem is to find the $k$ codes in $\mathcal{H}$ that are closest in Hamming distance to a given query, i.e., $k$NN search in Hamming distance. The $1$-NN problem in Hamming space was called the Best Match problem by Minsky and Papert. They observed that there are no obvious approaches significantly better than exhaustive search, and asked whether such approaches might exist.
The second problem is to find all codes in a dataset $\mathcal{H}$ that are within a fixed Hamming distance of a query, sometimes called the Approximate Query problem, or Point Location in Equal Balls (PLEB). A binary code is an $r$-neighbor of a query code, denoted $g$, if it differs from $g$ in $r$ bits or less. We define the $r$-neighbor search problem as: find all $r$-neighbors of a query $g$ from $\mathcal{H}$.
One way to tackle $r$-neighbor search is to use a hash table populated with the binary codes $h \in \mathcal{H}$, and examine all hash buckets whose indices are within $r$ bits of a query $g$. For binary codes of $q$ bits, the number of distinct hash buckets to examine is

$$L(q, r) = \sum_{z=0}^{r} \binom{q}{z}. \tag{1}$$
As shown in Fig. 1 (top), $L(q, r)$ grows very rapidly with $r$. Thus, this approach is only practical for small radii or short code lengths. Some vision applications restrict search to exact matches (i.e., $r = 0$) or a small search radius (e.g., [14, 37]), but in most cases of interest the desired search radius is larger than is currently feasible (e.g., see Fig. 1 (bottom)).
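To get a feel for how quickly the number of buckets in a Hamming ball grows, the following minimal Python sketch sums the binomial coefficients for a few settings. The function name `num_buckets` and the chosen code lengths and radii are illustrative assumptions, not values taken from the experiments.

```python
from math import comb

def num_buckets(q: int, r: int) -> int:
    """Count the q-bit codes within Hamming distance r of a query,
    i.e. the sum of C(q, z) for z = 0..r."""
    return sum(comb(q, z) for z in range(r + 1))

if __name__ == "__main__":
    # Illustrative (code length, radius) pairs chosen for this sketch only.
    for q, r in [(32, 3), (64, 3), (64, 10), (128, 10)]:
        print(f"q={q:3d} r={r:2d} buckets={num_buckets(q, r):,}")
```

Even a radius of 3 bits on 64-bit codes already covers 43,745 buckets, and the count climbs steeply from there.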
Our work is inspired in part by the multi-index hashing results of Greene, Parnas, and Yao. Building on the classical Turán problem for hypergraphs, they construct a set of overlapping binary substrings such that any two codes that differ by at most $r$ bits are guaranteed to be identical in at least one of the constructed substrings. Accordingly, they propose an exact method for finding all $r$-neighbors of a query using multiple hash tables, one for each substring. At query time, candidate $r$-neighbors are found by using query substrings as indices into their corresponding hash tables. As explained below, while run-time efficient, the main drawback of their approach is the prohibitive storage required for the requisite number of hash tables. By comparison, the method we propose requires much less storage, and is only marginally slower in search performance.
While we focus on exact search, there also exist algorithms for finding approximate $r$-neighbors ($\epsilon$-PLEB), or approximate nearest neighbors ($\epsilon$-NN) in Hamming distance. One example is Hamming Locality Sensitive Hashing [15, 10], which aims to solve the $(r, \epsilon)$-neighbors decision problem: determine whether there exists a binary code $h \in \mathcal{H}$ such that $\|h - g\|_H \le r$, or whether all codes in $\mathcal{H}$ differ from $g$ in $(1 + \epsilon) r$ bits or more. Approximate methods are interesting, and the approach below could be made faster by allowing misses. Nonetheless, this paper will focus on the exact search problem.
This paper proposes a data-structure that applies to both $k$NN and $r$-neighbor search in Hamming space. We prove that for uniformly distributed binary codes of $q$ bits, and a search radius of $r$ bits when $r/q$ is small, our query time is sub-linear in the size of the dataset. We also demonstrate impressive performance on real-world datasets. To our knowledge this is the first practical data-structure solving exact $k$NN in Hamming distance.
Our approach is called multi-index hashing, as binary codes from the database are indexed $m$ times into $m$ different hash tables, based on $m$ disjoint substrings. Given a query code, entries that fall close to the query in at least one such substring are considered neighbor candidates. Candidates are then checked for validity using the entire binary code, to remove any non-$r$-neighbors. To be practical for large-scale datasets, the substrings must be chosen so that the set of candidates is small, and storage requirements are reasonable. We also require that all true neighbors will be found.
The key idea here stems from the fact that, with $n$ binary codes of $q$ bits, the vast majority of the $2^q$ possible buckets in a full hash table will be empty, since $2^q \gg n$. It seems expensive to examine all $L(q, r)$ buckets within $r$ bits of a query, since most of them contain no items. Instead, we merge many buckets together (most of which are empty) by marginalizing over different dimensions of the Hamming space. We do this by creating hash tables on substrings of the binary codes. The distribution of the code substring comprising the first $s$ bits is the outcome of marginalizing the distribution of binary codes over the last $q - s$ bits. As such, a given bucket of the substring hash table includes all codes with the same first $s$ bits, but having any of the $2^{q-s}$ values for the remaining $q - s$ bits. Unfortunately these larger buckets are not restricted to the Hamming volume of interest around the query. Hence not all items in the merged buckets are $r$-neighbors of the query, so we then need to cull any candidate that is not a true $r$-neighbor.
In more detail, each binary code $h$, comprising $q$ bits, is partitioned into $m$ disjoint substrings, $h^{(1)}, \ldots, h^{(m)}$, each of length $q/m$ bits. For convenience in what follows, we assume that $q$ is divisible by $m$, and that the substrings comprise contiguous bits. (When $q$ is not divisible by $m$, we use substrings of different lengths with either $\lfloor q/m \rfloor$ or $\lceil q/m \rceil$ bits, i.e., differing by at most 1 bit.) The key idea rests on the following statement: When two binary codes $h$ and $g$ differ by $r$ bits or less, then, in at least one of their $m$ substrings they must differ by at most $\lfloor r/m \rfloor$ bits. This leads to the first proposition:
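Proposition 1: If $\|h - g\|_H \le r$, where $\|h - g\|_H$ denotes the Hamming distance between $h$ and $g$, then

$$\exists\; 1 \le z \le m \quad \text{s.t.} \quad \|h^{(z)} - g^{(z)}\|_H \le r', \tag{2}$$

where $r' = \lfloor r/m \rfloor$.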
Proof of Proposition 1 follows straightforwardly from the Pigeonhole Principle. That is, suppose that the Hamming distance between each of the $m$ substrings is strictly greater than $r'$. Then, $\|h - g\|_H \ge m (r' + 1)$. Clearly, $m (r' + 1) > r$, since $r = m r' + a$ for some $a$ where $0 \le a < m$, which contradicts the premise.
The significance of Proposition 1 derives from the fact that the substrings have only $q/m$ bits, and that the required search radius in each substring is just $r' = \lfloor r/m \rfloor$. For example, if $h$ and $g$ differ by $3$ bits or less, and $m = 4$, at least one of the 4 substrings must be identical. If they differ by at most $7$ bits, then in at least one substring they differ by no more than $1$ bit; i.e., we can search a Hamming radius of $7$ bits by searching a radius of $1$ bit on each of 4 substrings. More generally, instead of examining $L(q, r)$ hash buckets, it suffices to examine $L(q/m, r')$ buckets in each of $m$ substring hash tables, i.e., a total of $m \, L(q/m, r')$ buckets.
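The pigeonhole argument can be checked numerically. The sketch below is a minimal illustration, assuming codes are stored as Python integers and that the code length is divisible by the number of substrings; the helper names `split_code`, `hamming`, and `min_substring_distance` are hypothetical and introduced only for this example.

```python
import random

def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")

def split_code(code: int, q: int, m: int) -> list:
    """Split a q-bit code into m contiguous substrings of q // m bits each."""
    s = q // m
    mask = (1 << s) - 1
    return [(code >> (j * s)) & mask for j in range(m)]

def min_substring_distance(h: int, g: int, q: int, m: int) -> int:
    """Smallest Hamming distance over the m aligned substring pairs."""
    return min(hamming(x, y)
               for x, y in zip(split_code(h, q, m), split_code(g, q, m)))

if __name__ == "__main__":
    rng = random.Random(0)
    q, m, r = 64, 4, 7          # illustrative values only
    g = rng.getrandbits(q)
    h = g
    for bit in rng.sample(range(q), r):   # flip exactly r distinct bits
        h ^= 1 << bit
    # The closest substring pair differs by at most r // m bits.
    print(hamming(h, g), min_substring_distance(h, g, q, m), r // m)
```

Whichever bits are flipped, the second printed value never exceeds the third, which is exactly the guarantee that lets each substring table be probed with the much smaller radius.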
While it suffices to examine all buckets within a radius of $r'$ in all $m$ hash tables, we next show that it is not always necessary. Rather, it is often possible to use a radius of just $r' - 1$ in some of the $m$ substring hash tables while still guaranteeing that all $r$-neighbors of $g$ will be found. In particular, with $r = m r' + a$, where $0 \le a < m$, to find any item within a radius of $r$ on $q$-bit codes, it suffices to search $a + 1$ substring hash tables to a radius of $r'$, and the remaining $m - (a + 1)$ substring hash tables up to a radius of $r' - 1$. Without loss of generality, since there is no order to the substring hash tables, we search the first $a + 1$ hash tables with radius $r'$, and all remaining hash tables with radius $r' - 1$.
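Proposition 2: If $\|h - g\|_H \le r = m r' + a$, then

$$\exists\; 1 \le z \le a + 1 \quad \text{s.t.} \quad \|h^{(z)} - g^{(z)}\|_H \le r' \tag{3a}$$

OR

$$\exists\; a + 1 < z \le m \quad \text{s.t.} \quad \|h^{(z)} - g^{(z)}\|_H \le r' - 1. \tag{3b}$$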
To prove Proposition 2, we show that when (3a) is false, (3b) must be true. If (3a) is false, then it must be that $a < m - 1$, since otherwise $a = m - 1$, in which case (3a) and Proposition 1 are equivalent. If (3a) is false, it also follows that $h$ and $g$ differ in each of their first $a + 1$ substrings by $r' + 1$ or more bits. Thus, the total number of bits that differ in the first $a + 1$ substrings is at least $(a + 1)(r' + 1)$. Because $\|h - g\|_H \le r$, it also follows that the total number of bits that differ in the remaining $m - (a + 1)$ substrings is at most $r - (a + 1)(r' + 1)$. Then, using Proposition 1, the maximum search radius required in each of the remaining $m - (a + 1)$ substring hash tables is

$$\left\lfloor \frac{r - (a + 1)(r' + 1)}{m - (a + 1)} \right\rfloor = \left\lfloor \frac{m r' + a - (a + 1) r' - (a + 1)}{m - (a + 1)} \right\rfloor = \left\lfloor r' - \frac{1}{m - (a + 1)} \right\rfloor = r' - 1, \tag{4}$$

and hence Proposition 2 is true. Because of the near exponential growth in the number of buckets for large search radii, the smaller substring search radius required by Proposition 2 is significant.
A special case of Proposition 2 is when $r < m$, hence $r' = 0$ and $a = r$. In this case, it suffices to search $r + 1$ substring hash tables for a radius of $0$ (i.e., exact matches), and the remaining $m - (r + 1)$ substring hash tables can be ignored. Clearly, if a code does not match exactly with a query in any of the $r + 1$ selected substrings, then the code must differ from the query in at least $r + 1$ bits.
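The per-table radii implied by Proposition 2 are easy to tabulate. The following sketch is one way to write them down, assuming a full radius $r$ and $m$ substring tables; the function name `substring_radii` is hypothetical, and a radius of `-1` is used here as a convention meaning "this table need not be probed at all".

```python
def substring_radii(r: int, m: int) -> list:
    """Radii to search in each of m substring tables so that every code
    within Hamming distance r of the query is reached.

    With r = m * r_prime + a (0 <= a < m), the first a + 1 tables use
    radius r_prime and the rest use r_prime - 1. A value of -1 means the
    table can be skipped (the special case r < m).
    """
    r_prime, a = divmod(r, m)
    return [r_prime] * (a + 1) + [r_prime - 1] * (m - a - 1)

if __name__ == "__main__":
    print(substring_radii(7, 4))   # [1, 1, 1, 1]
    print(substring_radii(5, 4))   # [1, 1, 0, 0]
    print(substring_radii(2, 4))   # [0, 0, 0, -1]
```

A useful sanity check on this layout is that the values of `radius + 1` always sum to `r + 1`, which is the pigeonhole budget that makes the guarantee work.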
In a pre-processing step, given a dataset of binary codes, one hash table is built for each of the $m$ substrings, as outlined in Algorithm 1. At query time, given a query $g$ with substrings $\{g^{(i)}\}_{i=1}^{m}$, we search the $i$-th substring hash table for entries that are within a Hamming distance of $\lfloor r/m \rfloor$ or $\lfloor r/m \rfloor - 1$ of $g^{(i)}$, as prescribed by (3). By doing so we obtain a set of candidates from the $i$-th substring hash table, denoted $\mathcal{N}_i(g)$. According to the propositions above, the union of the $m$ sets, $\mathcal{N}(g) = \bigcup_i \mathcal{N}_i(g)$, is necessarily a superset of the $r$-neighbors of $g$. The last step of the algorithm computes the full Hamming distance between $g$ and each candidate in $\mathcal{N}(g)$, retaining only those codes that are true $r$-neighbors of $g$. Algorithm 2 outlines the $r$-neighbor retrieval procedure for a query $g$.
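The build and query steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: it assumes codes are Python integers, that the code length is divisible by the number of substrings, and it uses dictionaries in place of purpose-built hash tables. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def split_code(code: int, q: int, m: int) -> list:
    """Split a q-bit code into m contiguous substrings of q // m bits."""
    s = q // m
    mask = (1 << s) - 1
    return [(code >> (j * s)) & mask for j in range(m)]

def build_tables(codes: list, q: int, m: int) -> list:
    """One table per substring position, mapping substring -> code indices."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for j, sub in enumerate(split_code(code, q, m)):
            tables[j][sub].append(idx)
    return tables

def keys_within(sub: int, s: int, radius: int):
    """Yield every s-bit value within Hamming distance `radius` of `sub`.
    A negative radius yields nothing, so that table is skipped."""
    for d in range(radius + 1):
        for bits in combinations(range(s), d):
            key = sub
            for b in bits:
                key ^= 1 << b
            yield key

def r_neighbors(query: int, r: int, codes: list, tables: list,
                q: int, m: int) -> list:
    """Indices of all codes within Hamming distance r of the query."""
    s = q // m
    r_prime, a = divmod(r, m)
    candidates = set()
    for j, sub in enumerate(split_code(query, q, m)):
        radius = r_prime if j <= a else r_prime - 1
        for key in keys_within(sub, s, radius):
            candidates.update(tables[j].get(key, ()))
    # Cull candidates using the full codes.
    return sorted(i for i in candidates if hamming(codes[i], query) <= r)
```

Because the candidate set is a superset of the true neighbors, the final filter on full codes is what makes the result exact; the substring lookups only decide how many codes reach that filter.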
The search cost depends on the number of lookups (i.e., the number of buckets examined), and the number of candidates tested. Not surprisingly there is a natural trade-off between them. With a large number of lookups one can minimize the number of extraneous candidates. By merging many buckets to reduce the number of lookups, one obtains a large number of candidates to test. In the extreme case with $m = q$, substrings are 1 bit long, so we can expect the candidate set to include almost the entire database.
We next develop an analytical model of search performance to help address two key questions: (1) How does search cost depend on substring length, and hence the number of substrings? (2) How do run-time and storage complexity depend on database size, code length, and search radius?
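To help answer these questions we exploit a well-known bound on the sum of binomial coefficients; i.e., for any $0 < \epsilon \le 1/2$ and $\eta \ge 1$,

$$\sum_{k=0}^{\lfloor \epsilon \eta \rfloor} \binom{\eta}{k} \le 2^{H(\epsilon)\, \eta}, \tag{5}$$

where $H(\epsilon) \equiv -\epsilon \log_2 \epsilon - (1 - \epsilon) \log_2 (1 - \epsilon)$ is the entropy of a Bernoulli distribution with probability $\epsilon$.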
In what follows, $n$ continues to denote the number of $q$-bit database codes, and $r$ is the Hamming search radius. Let $m$ denote the number of hash tables, and let $s$ denote the substring length $s = q/m$. Hence, the maximum substring search radius becomes $r' = \lfloor r/m \rfloor = \lfloor s\, r / q \rfloor$. As above, for the sake of model simplicity, we assume $q$ is divisible by $m$.
We begin by formulating an upper bound on the number of lookups. First, the number of lookups in Algorithm 2 is bounded above by the product of $m$, the number of substring hash tables, and the number of hash buckets within a radius of $\lfloor s\, r / q \rfloor$ on substrings of length $s$ bits. Accordingly, using (5), if the search radius is less than half the code length, $r \le q/2$, then the total number of lookups is given by

$$\text{lookups}(s) = \frac{q}{s} \sum_{z=0}^{\lfloor s r / q \rfloor} \binom{s}{z} \le \frac{q}{s}\, 2^{H(r/q)\, s}. \tag{6}$$
Clearly, as we decrease the substring length $s$, thereby increasing the number of substrings $m$, exponentially fewer lookups are needed.
To analyze the expected number of candidates per bucket, we consider the case in which the $n$ binary codes are uniformly distributed over the Hamming space. In this case, for a substring of $s$ bits, for which a substring hash table has $2^s$ buckets, the expected number of items per bucket is $n / 2^s$. The expected size of the candidate set therefore equals the number of lookups times $n / 2^s$.
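The total search cost per query is the cost for lookups plus the cost for candidate tests. While these costs will vary with the code length $q$ and the way the hash tables are implemented, empirically we find that, to a reasonable approximation, the costs of a lookup and a candidate test are similar (for the code lengths considered here). Accordingly, we model the total search cost per query, for retrieving all $r$-neighbors, in units of the time required for a single lookup, as

$$\text{cost}(s) = \left(1 + \frac{n}{2^s}\right) \frac{q}{s} \sum_{z=0}^{\lfloor s r / q \rfloor} \binom{s}{z} \tag{7}$$

$$\le \left(1 + \frac{n}{2^s}\right) \frac{q}{s}\, 2^{H(r/q)\, s}. \tag{8}$$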
In practice, database codes will not be uniformly distributed, nor are uniformly distributed codes ideal for multi-index hashing. Indeed, the cost of search with uniformly distributed codes is relatively high since the search radius increases as the density of codes decreases. Rather, the uniform distribution is primarily a mathematical convenience that facilitates the analysis of run-time, thereby providing some insight into the effectiveness of the approach and how one might choose an effective substring length.
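The cost model for uniformly distributed codes can be evaluated directly. The sketch below is a minimal illustration of the lookup count, the modelled cost, and its entropy-based upper bound, assuming the substring length divides the code length; the function names and the example values of $n$, $q$, and $r$ are assumptions made for this example.

```python
from math import comb, log2

def entropy(eps: float) -> float:
    """Binary entropy H(eps) in bits, with H(0) = H(1) = 0."""
    if eps <= 0.0 or eps >= 1.0:
        return 0.0
    return -eps * log2(eps) - (1.0 - eps) * log2(1.0 - eps)

def lookups(q: int, r: int, s: int) -> float:
    """q / s substring tables, each probed within radius floor(s * r / q)."""
    return (q / s) * sum(comb(s, z) for z in range((s * r) // q + 1))

def cost(n: int, q: int, r: int, s: int) -> float:
    """Lookups plus expected candidate tests, in units of one lookup."""
    return (1.0 + n / 2**s) * lookups(q, r, s)

def cost_bound(n: int, q: int, r: int, s: int) -> float:
    """Upper bound on cost via the binomial-sum bound (valid for r <= q / 2)."""
    return (1.0 + n / 2**s) * (q / s) * 2.0 ** (entropy(r / q) * s)

if __name__ == "__main__":
    n, q, r = 2**30, 128, 16       # illustrative values only
    for s in (8, 16, 32, 64):
        print(f"s={s:2d} cost={cost(n, q, r, s):.3e} "
              f"bound={cost_bound(n, q, r, s):.3e}")
```

For these example values the modelled cost is smallest at `s = 32`, the candidate closest to $\log_2 n = 30$: shorter substrings flood the candidate set, while longer ones blow up the number of lookups.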
As noted above in Sec. 2.2, finding a good substring length is central to the efficiency of multi-index hashing. When the substring length is too large or too small the approach will not be effective. In practice, an effective substring length for a given dataset can be determined by cross-validation. Nevertheless this can be expensive.
In the case of uniformly distributed codes, one can instead use the analytic cost model in (7) to find a near optimal substring length. As discussed below, we find that a substring length of $s = \log_2 n$ yields a near-optimal search cost. Further, with non-uniformly distributed codes in benchmark datasets, we find empirically that $s = \log_2 n$ is also a reasonable heuristic for choosing the substring length (e.g., see Table IV below).
In more detail, to find a good substring length using the cost model above, assuming uniformly distributed binary codes, we first note that dividing $\text{cost}(s)$ in (7) by $q$ has no effect on the optimal $s$. Accordingly, one can view the optimal $s$ as a function of two quantities, namely the number of items, $n$, and the search ratio $r/q$.
Figure 2 plots cost as a function of substring length $s$, for codes of a fixed length, different database sizes $n$, and different search radii (expressed as a fraction of the code length $q$). Dashed curves depict $\text{cost}(s)$ in (7) while solid curves of the same color depict the upper bound in (8). The tightness of the bound is evident in the plots, as are the quantization effects of the upper range $\lfloor s r / q \rfloor$ of the sum in (7). The small circles in Fig. 2 (top) depict cost when all quantization effects are included, and hence it is only shown at substring lengths that are integer divisors of the code length.
Fig. 2 (top) shows cost for three search radii, each expressed as a fraction of the code length, with the same database size $n$ in all cases. One striking property of these curves is that the cost is persistently minimal in the vicinity of $s = \log_2 n$, indicated by the vertical line close to 30 bits. This behavior is consistent over a wide range of database sizes.
Fig. 2 (bottom) shows the dependence of cost on $s$ for databases of three different sizes $n$, all with the same search ratio $r/q$ and code length $q$. In this case we have laterally displaced each curve by $-\log_2 n$; notice how this aligns the minima close to $0$. These curves suggest that, over a wide range of conditions, cost is minimal for $s = \log_2 n$. For this choice of the substring length, the expected number of items per substring bucket, i.e., $n / 2^s$, reduces to 1. As a consequence, the number of lookups is equal to the expected number of candidates. Interestingly, this choice of substring length is similar to that of Greene et al. A somewhat involved theoretical analysis based on Stirling's approximation, omitted here, also suggests that as $n$ goes to infinity, the optimal substring length converges asymptotically to $\log_2 n$.
Choosing $s$ in the vicinity of $\log_2 n$ also permits a simple characterization of retrieval run-time complexity, for uniformly distributed binary codes. When $s = \log_2 n$, the upper bound on the number of lookups (6) also becomes a bound on the number of candidates. In particular, if we substitute $\log_2 n$ for $s$ in (8), then we find the following upper bound on the cost, now as a function of database size, code length, and the search radius:

$$\text{cost}(\log_2 n) \le 2\, \frac{q}{\log_2 n}\, n^{H(r/q)}. \tag{9}$$
Thus, for a uniform distribution over binary codes, if we choose $m$ such that $s \approx \log_2 n$, the expected query time complexity is $O(q\, n^{H(r/q)} / \log_2 n)$. For a small ratio of $r/q$ this is sub-linear in $n$. For example, if $r/q \le 0.11$, then $H(0.11) < 0.5$, and the run-time complexity becomes $O(q \sqrt{n} / \log_2 n)$. That is, the search time increases with the square root of the database size when the search radius is approximately 10% of the code length. For $r/q \le 0.06$, this becomes $O(q \sqrt[3]{n} / \log_2 n)$. The time complexity with respect to $q$ is not as important as that with respect to $n$ since $q$ is not expected to vary significantly in most applications.
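The exponent in this complexity is simply the binary entropy of the search ratio, so it is straightforward to check when the query time drops below a square-root or cube-root law. The sketch below assumes uniformly distributed codes and a substring length of $\log_2 n$; the function names and the example database size are illustrative assumptions.

```python
from math import log2

def entropy(eps: float) -> float:
    """Binary entropy H(eps) in bits, with H(0) = H(1) = 0."""
    if eps <= 0.0 or eps >= 1.0:
        return 0.0
    return -eps * log2(eps) - (1.0 - eps) * log2(1.0 - eps)

def cost_bound_at_log2n(n: int, q: int, r: int) -> float:
    """Upper bound 2 * (q / log2 n) * n ** H(r / q) on the modelled cost
    when the substring length is log2 n."""
    return 2.0 * (q / log2(n)) * n ** entropy(r / q)

if __name__ == "__main__":
    for ratio in (0.06, 0.11, 0.25):
        print(f"r/q={ratio:.2f} exponent H(r/q)={entropy(ratio):.4f}")
    n, q = 2**30, 128               # illustrative values only
    for r in (4, 8, 14):
        print(f"r={r:2d} bound={cost_bound_at_log2n(n, q, r):.3e} (n={n})")
```

The exponent stays just under 1/2 at a search ratio of 0.11 and just under 1/3 at 0.06, which is where the square-root and cube-root statements come from.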
The storage complexity of our multi-index hashing algorithm is asymptotically optimal when $\lfloor q / \log_2 n \rfloor \le m \le \lceil q / \log_2 n \rceil$, as is suggested above. To store the full database of $n$ binary codes requires $O(nq)$ bits. For each of $m$ hash tables, we also need to store $n$ unique identifiers to the database items. This allows one to identify the retrieved items and fetch their full codes; this requires an additional $O(m\, n \log_2 n)$ bits. In sum, the storage required is $O(nq + m\, n \log_2 n)$. When $m \approx q / \log_2 n$, this storage cost reduces to $O(nq + n \log_2 n)$. Here, the $n \log_2 n$ term does not cancel as $m \ge 1$, but in most interesting cases $q > \log_2 n$, so the $n \log_2 n$ term does not matter.
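As a rough back-of-the-envelope illustration of this storage argument, the sketch below counts bits for the codes plus one identifier per code per table. It assumes identifiers of $\lceil \log_2 n \rceil$ bits and ignores all hash table overhead (bucket pointers, padding, alignment), so it is a lower bound on what a real implementation needs; the function name and example values are hypothetical.

```python
from math import ceil, log2

def storage_bits(n: int, q: int, m: int) -> int:
    """Bits for n codes of q bits, plus m tables each holding n identifiers.

    Assumption for this sketch: each identifier takes ceil(log2 n) bits and
    the hash tables add no overhead of their own.
    """
    id_bits = ceil(log2(n))
    return n * q + m * n * id_bits

if __name__ == "__main__":
    n, q = 10**9, 64                       # illustrative values only
    m = max(1, round(q / log2(n)))         # substring length near log2 n
    total = storage_bits(n, q, m)
    print(f"m={m} total={total / 8 / 1e9:.1f} GB "
          f"(codes alone: {n * q / 8 / 1e9:.1f} GB)")
```

With the substring length near $\log_2 n$, the identifier tables cost about as much as the codes themselves, so total storage stays linear in $nq$.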
While the storage cost for our multi-index hashing algorithm is linear in $nq$, the related multi-index hashing algorithm of Greene et al. entails storage complexity that is super-linear in $n$. To find all $r$-neighbors, for a given search radius $r$, they construct $m = O(2^{s r / q})$ substrings of length $s$ bits per binary code. Their suggested substring length is also $s = \log_2 n$, so the number of substring hash tables becomes $m = O(n^{r/q})$, each of which requires $O(n \log_2 n)$ in storage. As a consequence, for large values of $n$, even with small $r$, this technique requires a prohibitive amount of memory to store the hash tables.
Our approach is more memory-efficient than that of Greene et al. because we do not enforce exact equality in substring matching. In essence, instead of creating all of the hash tables off-line, and then having to store them, we flip bits of each substring at run-time and implicitly create some of the substring hash tables on-line. This increases run-time slightly, but greatly reduces storage costs.
To use the above multi-index hashing in practice, one must specify a Hamming search radius $r$. For many tasks, the value of $r$ is chosen such that queries will, on average, retrieve $k$ near neighbors. Nevertheless, as expected, we find that for many hashing techniques and different sources of visual data, the distribution of binary codes is such that a single search radius for all queries will not produce similar numbers of neighbors.
Figure 3 depicts empirical distributions of search radii needed for $k$NN, at two values of $k$, on three sets of binary codes obtained from 1B SIFT descriptors [18, 22]. In all cases, for both code lengths shown, and for hash functions based on LSH and MLH, there is a substantial variance in the search radius. This suggests that binary codes are not uniformly distributed over the Hamming space. For example, within a single set of LSH codes, a sizeable fraction of the queries require a much larger search radius than others. Also evident from Fig. 3 is the growth in the required search radius as one moves from shorter to longer codes, and from smaller to larger $k$.
A fixed radius for all queries would produce too many neighbors for some queries, and too few for others. It is therefore more natural for many tasks to fix the number of required neighbors, i.e., $k$, and let the search radius depend on the query. Fortunately, our multi-index hashing algorithm is easily adapted to accommodate query-dependent search radii.
Given a query, one can progressively increase the Hamming search radius per substring, until a specified number of neighbors is found. For example, if one examines all $r'$-neighbors of a query's substrings, from which more than $k$ candidates are found to be within a Hamming distance of $(r' + 1) m - 1$ bits (using the full codes for validation), then it is guaranteed that the $k$ nearest neighbors have been found. Indeed, if all $k$NNs of a query $g$ differ from $g$ in $r$ bits or less, then Propositions 1 and 2 above guarantee that all such neighbors will be found if one searches the substring hash tables with the prescribed radii.
In our experiments, we follow this progressive increment of the search radius until we can find $k$NN in the guaranteed neighborhood of a query. This approach, outlined in Algorithm 3, is helpful because it uses a query-specific search radius depending on the distribution of codes in the neighborhood of the query.
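A progressive search of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 3: codes are Python integers, the code length is divisible by the number of substrings, dictionaries stand in for hash tables, and the full radius grows one bit at a time so that each step only probes the newly required buckets of a single substring table. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def split_code(code: int, q: int, m: int) -> list:
    s = q // m
    mask = (1 << s) - 1
    return [(code >> (j * s)) & mask for j in range(m)]

def build_tables(codes: list, q: int, m: int) -> list:
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for j, sub in enumerate(split_code(code, q, m)):
            tables[j][sub].append(idx)
    return tables

def knn(query: int, k: int, codes: list, tables: list, q: int, m: int) -> list:
    """Indices of the k codes nearest to the query, in order of distance."""
    s = q // m
    subs = split_code(query, q, m)
    seen = set()
    pending = defaultdict(list)   # full Hamming distance -> candidate indices
    result = []
    for r in range(q + 1):
        r_prime, a = divmod(r, m)
        # Growing the radius from r - 1 to r only requires the buckets at
        # distance exactly r_prime from the query substring in table a.
        for bits in combinations(range(s), r_prime):
            key = subs[a]
            for b in bits:
                key ^= 1 << b
            for idx in tables[a].get(key, ()):
                if idx not in seen:
                    seen.add(idx)
                    pending[hamming(codes[idx], query)].append(idx)
        # Every code at full distance r has now been seen, so emit them.
        result.extend(sorted(pending.pop(r, [])))
        if len(result) >= k:
            break
    return result[:k]
```

Candidates farther than the current radius are held back in `pending` until the radius reaches their distance, which is what keeps the returned list exact rather than approximate. Ties at the same distance are broken by index, which is an arbitrary choice made for this sketch.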
Our implementation of multi-index hashing is available on-line. Experiments are run on two different architectures. The first is a mid- to low-end machine with a dual quad-core AMD Opteron processor. The second is a high-end machine with a dual quad-core Intel Xeon processor and a much larger L2 cache. The difference in the size of the L2 cache has a major impact on the run-time of linear scan, since the effectiveness of linear scan depends greatly on L2 cache lines. With roughly ten times the L2 cache, linear scan on the Intel platform is roughly twice as fast as on the AMD machines. By comparison, multi-index hashing does not have a serial memory access pattern and so the cache size does not have such a pronounced effect. Actual run-times for multi-index hashing on the Intel and AMD platforms are within 20% of one another.
Both linear scan and multi-index hashing were implemented in C++ and compiled with identical compiler flags. To accommodate the large memory footprint required for 1B codes, we used the libhugetlbfs package and Linux kernel support for large page sizes. Further details about the implementations are given in Section 6. Despite the existence of multiple cores, all experiments are run on a single core to simplify run-time measurements.
The memory requirements for multi-index hashing are described in detail in Section 6. Memory use for 1B codes depends on the code length, with the longer codes requiring approximately twice the memory of the shorter ones. Figure 4 shows how the memory footprint depends on the database size for linear scan and multi-index hashing. As explained in Sec. 3.3, and demonstrated in Figure 4, the memory requirements of multi-index hashing grow linearly in the database size, as do those of linear scan. While we use a single computer in our experiments, one could implement a distributed version of multi-index hashing on computers with much less memory by placing each substring hash table on a separate computer.
SIFT vectors are 128D descriptors of local image structure in the vicinity of feature points. Gist features extracted from images capture global image structure in high-dimensional vectors. These two feature types cover a spectrum of NN search problems in vision from feature to image indexing.
LSH is considered a baseline random projection method, closely related to cosine similarity. MLH is a state-of-the-art learning algorithm that, given a set of similarity labels, optimizes a mapping by minimizing a loss function over pairs or triplets of binary codes.
Both the Gist and SIFT corpora comprise three disjoint sets, namely, a training set, a base set for populating the database, and a test query set. Using a random permutation, Gist descriptors are divided into a training set, a base set, and a query set. The SIFT corpus comes with its own training set, base set, and test queries.
For LSH we subtract the mean, and pick a set of coefficients from the standard normal density for a linear projection, followed by quantization. For MLH the training set is used to optimize hash function parameters. After learning is complete, we remove the training data and use the resulting hash function with the base set to create the database of binary codes. With two image corpora (SIFT and Gist), up to three code lengths (64, 128, and 256 bits), and two hashing methods (LSH and MLH), we obtain several datasets of binary codes with which to evaluate our multi-index hashing algorithm. Note that 256-bit codes are only used with LSH and SIFT vectors.
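The LSH recipe described above (mean subtraction, a random Gaussian projection, then quantization) can be sketched in plain Python. This is an illustrative sketch only: it assumes quantization means thresholding each projection at zero, which the text does not spell out, and the names `make_lsh_encoder` and `encode` are hypothetical.

```python
import random

def make_lsh_encoder(dim: int, q: int, mean: list, seed: int = 0):
    """Return a function mapping a dim-dimensional vector to a q-bit code.

    Assumption for this sketch: each bit is the sign of one random
    projection of the mean-centred vector (threshold at zero).
    """
    rng = random.Random(seed)
    # One row of standard normal coefficients per output bit.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(q)]

    def encode(x: list) -> int:
        centred = [xi - mi for xi, mi in zip(x, mean)]
        code = 0
        for j, w in enumerate(weights):
            if sum(wi * ci for wi, ci in zip(w, centred)) >= 0.0:
                code |= 1 << j
        return code

    return encode

if __name__ == "__main__":
    dim, q = 16, 32                      # illustrative sizes only
    rng = random.Random(1)
    data = [[rng.random() for _ in range(dim)] for _ in range(100)]
    mean = [sum(col) / len(data) for col in zip(*data)]
    encode = make_lsh_encoder(dim, q, mean)
    print(f"{encode(data[0]):0{q}b}")
```

Because each bit depends only on which side of a random hyperplane the centred vector falls, vectors at a small angle to one another tend to share most bits, which is why this mapping is described as closely related to cosine similarity.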
Figure 5 shows Euclidean NN recall rates for $k$NN search on binary codes generated from SIFT descriptors. In particular, we plot the fraction of Euclidean nearest neighbors found by $k$NN in LSH and MLH binary codes of two lengths. As expected, the longer codes are more accurate, and MLH outperforms LSH. Note that the multi-index hashing algorithm solves exact $k$NN search in Hamming distance; the approximation that reduces recall is due to the mapping from the original Euclidean space to the Hamming space. To preserve the Euclidean structure in the original SIFT descriptors, it seems useful to use longer codes, and exploit data-dependent hash functions such as MLH. Interestingly, as described below, the speedup factors of multi-index hashing on MLH codes are better than those for LSH.
Hamming distance computed on $q$-bit binary codes is an integer between $0$ and $q$. Thus, the nearest neighbors in Hamming distance can be divided into subsets of elements that have equal Hamming distance (at most $q + 1$ subsets). Although Hamming distance does not provide a means to distinguish between equi-distant elements, often a re-ranking phase using Asymmetric Hamming distance or other distance measures is helpful in practice. Nevertheless, this paper is solely concerned with the exact Hamming $k$NN problem up to a selection of equi-distant elements in the top $k$ elements.
Each experiment below involves queries, for which we report the average run-time. Our implementation of the linear scan baseline searches million -bit codes in just under one second on the AMD machine. On the Intel machine it examines over million -bit codes per second. This is remarkably fast compared to Euclidean NN search with D SIFT vectors. The speed of linear scan is in part due to memory caching, without which it would be much slower.
Table I: speedup factors for $k$NN vs. linear scan. Columns: dataset, # bits, mapping, speedup for four values of $k$, linear scan run-time.
Table II: speedup factors for $k$NN vs. linear scan. Columns: dataset, # bits, mapping, speedup for four values of $k$, linear scan run-time.
Tables I and II show run-time per query for the linear scan baseline, along with speedup factors of multi-index hashing for different $k$NN problems and nine different datasets. Despite the remarkable speed of linear scan, the multi-index hashing implementation is hundreds of times faster, including on the datasets of 1B codes (see Table I). Performance for smaller values of $k$ is even more impressive. On some MLH code databases, multi-index hashing executes the 1NN search task over 1000 times faster than the linear scan baseline.
The run-time of linear scan does not depend on the number of neighbors, nor on the underlying distribution of binary codes. The run-time for multi-index hashing, however, depends on both factors. In particular, as the desired number of NNs increases, the Hamming radius of the search also increases (e.g., see Figure 3). This implies longer run-times for multi-index hashing. Indeed, notice that going from smaller to larger $k$ along each row of the tables shows a decrease in the speedup factors.
The multi-index hashing run-time also depends on the distribution of binary codes. Indeed, one can see from Table I that MLH code databases yield faster run times than the LSH codes. Figure 3 depicts the histograms of search radii needed for $k$NN with 1B MLH and LSH codes. Interestingly, while the mean search radii for MLH and LSH codes are similar, the variances are not. The longer tail of the distribution of search radii for LSH plays an important role in the expected run-time. In fact, queries that require relatively large search radii tend to dominate the average query cost.
It is also interesting to look at the multi-index hashing run-times as a function of $n$, the number of binary codes in the database. To that end, Figures 6, 7, and 8 depict run-times for linear scan and multi-index $k$NN search on 64, 128, and 256-bit codes on the AMD machine. The left two panels in each figure show different vertical scales (since the behavior of multi-index $k$NN and linear scan are hard to see at the same scale). The right-most panels show the same data on log-log axes. First, it is clear from these plots that multi-index hashing is much faster than linear scan for a wide range of dataset sizes and $k$. Just as importantly, it is evident from the log-log plots that as we increase the database size, the speedup factors improve. The dashed lines on the log-log plots depict $\sqrt{n}$ (up to a scalar constant). The similarity between the slopes of the multi-index hashing curves and the square root curves shows that multi-index hashing exhibits sub-linear query time, even for the empirical, non-uniform distributions of codes.
An alternative to linear scan and multi-index hashing is to hash the entire codes into a single hash table (SHT), and then use direct hashing with each query. As suggested in the introduction and Figure 1, although this approach avoids the need for any candidate checking, it may require a prohibitive number of lookups. Nevertheless, for sufficiently small code lengths or search radii, it may be effective in practice.
Given the complexity associated with efficiently implementing collision detection in large hash tables, we do not directly experiment with the single hash table approach. Instead, we consider the empirical number of lookups one would need, as compared to the number of items in the database. If the number of lookups is vastly greater than the size of the dataset, one can readily conclude that linear scan is likely to be as fast as or faster than direct indexing into a single hash table.
Fortunately, the statistics of neighborhood sizes and required search radii for NN tasks are available from the linear scan and multi-index hashing experiments reported above. For a given query, one can use the nearest neighbor’s Hamming distance to compute the number of lookups from a single hash table that are required to find all of the query’s nearest neighbors. Summed over the set of queries, this provides an indication of the expected run-time.
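As a rough illustration of this computation, the sketch below counts the buckets that lie within a given Hamming radius of a query, using the standard counting argument (one bucket for every bit pattern that differs from the query in at most r positions). The function names are hypothetical and are not part of the authors' implementation. For comparison it also gives the simple pigeonhole bound for multi-index hashing, in which every substring table is searched to radius floor(r / m); this assumes the code length is divisible by the number of tables, and the refined search used by the actual algorithm may need fewer lookups than this bound.

```python
from math import comb


def single_table_lookups(code_bits, radius):
    """Buckets within Hamming distance `radius` of a query in one full-code table."""
    return sum(comb(code_bits, k) for k in range(radius + 1))


def multi_index_lookups(code_bits, radius, num_tables):
    """Simple upper bound on lookups when the code is split into `num_tables` substrings.

    Assumption: code_bits is divisible by num_tables, and every substring
    table is searched to radius floor(radius / num_tables).
    """
    substring_bits = code_bits // num_tables
    substring_radius = radius // num_tables
    return num_tables * single_table_lookups(substring_bits, substring_radius)


if __name__ == "__main__":
    # Illustrative values only (not taken from the experiments).
    for r in (2, 6, 10):
        print(r, single_table_lookups(64, r), multi_index_lookups(64, r, 2))
```

Comparing either count against the number of database items gives the kind of lookup-versus-dataset-size comparison described above.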
Figure 9 shows the average number of lookups required for 1-NN and 1000-NN tasks on - and -bit codes (from LSH on SIFT) using a single hash table. They are plotted as a function of the size of the dataset, from to items. For comparison, the plots also show the number of database items, and the number of lookups that were needed for multi-index hashing. Note that Figure 9 has logarithmic scales.
Table III: Optimized speedup vs. linear scan (consecutive speedup, % improvement).
It is evident that with a single hash table the number of lookups is almost always several orders of magnitude larger than the number of items in the dataset. And not surprisingly, this is also several orders of magnitude more lookups than required for multi-index hashing. Although the relative speed of a lookup operation compared to a Hamming distance comparison, as used in linear scan, depends on the implementation, there are a few important considerations. Linear scan has an exactly serial memory access pattern and so can make very efficient use of cache, whereas lookups in a hash table are inherently random. Furthermore, in any plausible implementation of a single hash table for 64-bit or longer codes, there will be some penalty for collision detection.
As illustrated in Figure 9, the only cases where a single hash table might potentially be more efficient than linear scan are those with very small codes (64 bits or less), a large dataset (1 billion items or more), and small search distances (e.g., for 1-NN). In all other cases, linear scan requires orders of magnitude fewer operations. With any code length longer than 64 bits, a single hash table approach is completely infeasible to run, requiring upwards of 15 orders of magnitude more operations than linear scan for 128-bit codes.
The substring hash tables used above have been formed by simply dividing the full codes into disjoint and consecutive sequences of bits. For LSH and MLH, this is equivalent to randomly assigning bits to substrings.
It is natural to ask whether further gains in efficiency are possible by optimizing the assignment of bits to substrings. In particular, by careful substring optimization one may be able to maximize the discriminability of the different substrings. In other words, the radius of the substring searches, and hence the number of lookups, is determined by the desired search radius on the full codes and will remain fixed; but by optimizing the assignment of bits to substrings one might be able to reduce the number of candidates one needs to validate.
To explore this idea we considered a simple method in which bits are assigned to substrings one at a time in a greedy fashion based on the correlation between the bits. We initialize the substrings greedily. A random bit is assigned to the first substring. Then, a bit is assigned to substring , which is maximally correlated with the bit assigned to substring . Next, we iterate over the substrings, and assign more bits to them, one at a time. An unused bit is assigned to substring , if the maximum correlation between that bit and other bits already assigned to substring is minimal. This approach significantly decreases the correlation between bits within a single substring. This should make the distribution of codes within substring buckets more uniform, and thereby lower the number of candidates within a given search radius. Arguably, a better approach consists of maximizing the entropy of the entries within each substring hash table, thereby making the distribution of substrings as uniform as possible. However, this entropic approach is left to future work.
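The greedy procedure can be sketched as follows. This is one reading of the description above, not the authors' code: the substring indices are missing from the text, so the sketch assumes that the seed bit of each substring is the unused bit most correlated with the seed of the previous substring, that correlations are compared by absolute value, that all substrings have equal length, and that a precomputed bit-by-bit correlation matrix `corr` is available. The name `greedy_assign` is hypothetical.

```python
import random


def greedy_assign(corr, num_substrings, seed=0):
    """Assign bit indices to substrings so that correlated bits are kept apart.

    corr[a][b] is the correlation between bits a and b (assumed precomputed).
    Returns a list of `num_substrings` lists of bit indices.
    """
    num_bits = len(corr)
    assert num_bits % num_substrings == 0, "sketch assumes equal-length substrings"
    rng = random.Random(seed)
    unused = set(range(num_bits))

    # Initialization: a random bit seeds the first substring; each later
    # substring is seeded with the unused bit most correlated with the
    # previous substring's seed, so that strongly correlated bits are split up.
    first = rng.choice(sorted(unused))
    unused.remove(first)
    substrings = [[first]]
    for j in range(1, num_substrings):
        prev_seed = substrings[j - 1][0]
        best = max(sorted(unused), key=lambda b: abs(corr[b][prev_seed]))
        unused.remove(best)
        substrings.append([best])

    # Fill: visit the substrings in turn, each time adding the unused bit
    # whose maximum correlation with the bits already in that substring is smallest.
    while unused:
        for group in substrings:
            best = min(
                sorted(unused),
                key=lambda b: max(abs(corr[b][a]) for a in group),
            )
            unused.remove(best)
            group.append(best)
    return substrings
```

With four bits in which bits 0 and 1 are perfectly correlated and bits 2 and 3 are perfectly correlated, this sketch places each correlated pair in different substrings, whatever the random starting bit.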
The results obtained with the correlation-based greedy algorithm show that optimizing substrings can provide overall run-time reductions on the order of relative to consecutive substrings in some cases. Table III displays the improvements achieved by optimizing substrings for different code lengths and different values of . Clearly, as the code length increases, substring optimization has a bigger impact. Figure 10 shows the run-time behavior of optimized substrings as a function of dataset size.
Our implementation of multi-index hashing is publicly available at . Nevertheless, for the interested reader we describe some of the important details here.
As explained above, the algorithm hinges on hash tables built on disjoint -bit substrings of the binary codes. We use direct address tables for the substring hash tables because the substrings are usually short (). Direct address tables explicitly allocate memory for buckets and store all data points associated with each substring in its corresponding bucket. There is a one-to-one mapping between buckets and substrings, so no time is spent on collision detection.
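A minimal sketch of this indexing step is given below, assuming codes are held as Python integers, substrings are taken as consecutive bit ranges starting from the most significant bits, and the code length is divisible by the number of tables. The helper names `split_code` and `build_direct_tables` are hypothetical. Each table is a plain list with one slot per possible substring value, so a substring is its own bucket address and no collision handling is needed.

```python
def split_code(code, code_bits, num_tables):
    """Split an integer code into `num_tables` consecutive, disjoint substrings."""
    substring_bits = code_bits // num_tables
    mask = (1 << substring_bits) - 1
    return [
        (code >> (substring_bits * (num_tables - 1 - j))) & mask
        for j in range(num_tables)
    ]


def build_direct_tables(codes, code_bits, num_tables):
    """Build one direct address table per substring position.

    tables[j][v] lists the ids of all codes whose j-th substring equals v.
    Memory grows as 2**substring_bits per table, so this dense layout is
    only practical for short substrings.
    """
    substring_bits = code_bits // num_tables
    tables = [
        [[] for _ in range(1 << substring_bits)] for _ in range(num_tables)
    ]
    for code_id, code in enumerate(codes):
        for j, value in enumerate(split_code(code, code_bits, num_tables)):
            tables[j][value].append(code_id)
    return tables
```

For example, `build_direct_tables([0xABCD, 0xAB00], 16, 2)` places both codes in bucket `0xAB` of the first table and in separate buckets of the second.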
One could implement direct address tables with an array of pointers, some of which may be null (for empty buckets). On a -bit machine, pointers are bytes long, so just storing an empty address table for requires GB (as done in ). For greater efficiency here, we use sparse direct address tables by grouping buckets into subsets of elements. For each bucket group, a -bit binary vector encodes whether each bucket in the group is empty or not. Then, a single pointer per group is used to point to a single resizable array that stores the data points associated with that bucket group. Data points within each array are ordered by their bucket index. To facilitate fast access, for each non-empty bucket we store the index of the beginning and the end of the corresponding segment of the array. Compared to the direct address tables in , for , and bucket groups of size , an empty address table requires only GB. Also note that accessing elements in any bucket of the sparse address table has a worst case run-time of .
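The grouping scheme can be illustrated with the following sketch. It is a simplified stand-in for the structure described above rather than the authors' implementation: the class name is hypothetical, only the start index of each non-empty bucket is stored (the end is taken from the next start), the position of a bucket within its group's array is found by counting the set bits below it in the group's bit vector, and, for brevity, an array is allocated for every group even when it is empty.

```python
class SparseAddressTable:
    """Direct address table with buckets grouped into fixed-size groups."""

    def __init__(self, substring_bits, group_size, entries):
        """`entries` is an iterable of (bucket_index, item_id) pairs."""
        self.group_size = group_size
        num_groups = -(-(1 << substring_bits) // group_size)  # ceiling division
        self.masks = [0] * num_groups                  # one bit per bucket in the group
        self.items = [[] for _ in range(num_groups)]   # items ordered by bucket index
        self.starts = [[] for _ in range(num_groups)]  # start offset per non-empty bucket
        for bucket, item_id in sorted(entries, key=lambda e: e[0]):
            group, pos = divmod(bucket, group_size)
            if not (self.masks[group] >> pos) & 1:
                # First item of this bucket: mark it non-empty, record its start.
                self.masks[group] |= 1 << pos
                self.starts[group].append(len(self.items[group]))
            self.items[group].append(item_id)

    def lookup(self, bucket):
        """Return the item ids stored in `bucket` (empty list if none)."""
        group, pos = divmod(bucket, self.group_size)
        mask = self.masks[group]
        if not (mask >> pos) & 1:
            return []
        # Rank of this bucket among the non-empty buckets of its group.
        rank = bin(mask & ((1 << pos) - 1)).count("1")
        starts = self.starts[group]
        begin = starts[rank]
        end = starts[rank + 1] if rank + 1 < len(starts) else len(self.items[group])
        return self.items[group][begin:end]
```

A lookup inspects only the bit vector and offsets of a single group, so its cost is bounded by the group size, in line with the worst-case behavior noted above.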
Memory Requirements: We store one -bit pointer for each bucket group, and a -bit binary vector to encode whether buckets in a group are empty; this entails bytes for an empty -bit hash table (), or GB when . Bookkeeping for each resizable array entails -bit integers. In our experiments, most bucket groups have at least one non-empty bucket. Taking this into account, the total storage for an -bit address table becomes bytes (GB for ).
For each non-empty bucket within a bucket group, we store a -bit integer to indicate the index of the beginning of the segment of the resizable array corresponding to that bucket. The number of non-empty buckets is at most , where is the number of hash tables, and is the number of codes. Thus we need an extra bytes. For each data point per hash table we store an ID to reference the full binary code; each ID is bytes since for our datasets. This entails bytes. Finally, storing the full binary codes themselves requires bytes, since .
The total memory cost is bytes. For , this cost is . For 1B -bit codes, and hash tables ( bits each), the cost is GB. For -bit and -bit codes our implementation requires GB and GB. Note that the last two terms in the memory cost for storing IDs and codes are irreducible, but the first terms can be reduced in a more memory efficient implementation.
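The empty-table part of this accounting can be checked with a few lines of arithmetic. The concrete figures are missing from the text above, so the values used below (8-byte pointers, 32-bit substrings, bucket groups of 32) are illustrative assumptions rather than the authors' settings, and the function names are hypothetical. The sketch counts one pointer per bucket for a dense table, and one pointer plus one bit per bucket for each group of a sparse table.

```python
def dense_table_bytes(substring_bits, pointer_bytes=8):
    """Empty dense direct address table: one pointer per bucket."""
    return (1 << substring_bits) * pointer_bytes


def sparse_empty_table_bytes(substring_bits, group_size, pointer_bytes=8):
    """Empty sparse table: per group, one pointer plus a group_size-bit vector."""
    num_groups = (1 << substring_bits) // group_size
    return num_groups * (pointer_bytes + group_size // 8)


if __name__ == "__main__":
    gib = 1 << 30
    # Assumed example: 32-bit substrings, groups of 32 buckets, 8-byte pointers.
    print(dense_table_bytes(32) / gib)             # dense table size in GiB
    print(sparse_empty_table_bytes(32, 32) / gib)  # sparse table size in GiB
```

Under these assumptions the grouping cuts the cost of an empty table by more than an order of magnitude, before any bookkeeping for the resizable arrays is added.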
Duplicate Candidates: When retrieving candidates from the substring hash tables, some codes will be found multiple times. To detect duplicates, and discard them, we allocate one bit-string with bits. When a candidate is found we check the corresponding bit and discard the candidate if it is marked as a duplicate. Before each query we initialize the bit-string to zero. In practice this has negligible run-time. In theory clearing an -bit vector requires , but there exist ways to initialize an array in constant time.
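The duplicate check can be sketched as below, assuming candidates are identified by integer ids in the range from 0 to the number of database codes minus one. The function name is hypothetical. A `bytearray` plays the role of the bit-string and is allocated afresh (all zeros) for each query, which corresponds to the per-query re-initialization described above.

```python
def unique_candidates(candidate_ids, num_codes):
    """Return candidates in first-seen order, dropping repeats via a bit-string."""
    seen = bytearray((num_codes + 7) // 8)  # one bit per database code, all zero
    unique = []
    for cid in candidate_ids:
        byte, bit = divmod(cid, 8)
        if seen[byte] & (1 << bit):
            continue  # already retrieved from another substring table
        seen[byte] |= 1 << bit
        unique.append(cid)
    return unique
```

For instance, `unique_candidates([5, 2, 5, 9, 2], 10)` keeps the ids 5, 2, and 9.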
Hamming Distance: To compare a query with a candidate (for multi-index search or linear scan), we compute Hamming distance on the full -bit codes, with one xor operation for every 64 bits, followed by a pop count to tally the ones. We used the built-in __builtin_popcount for this purpose.
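The same computation can be sketched in Python, assuming each code is stored as a sequence of unsigned 64-bit words. Here `bin(x).count("1")` stands in for the hardware population count used by the compiled implementation, and the function name is hypothetical.

```python
def hamming_distance(query_words, candidate_words):
    """Hamming distance between two codes given as equal-length lists of 64-bit words."""
    return sum(
        bin(q ^ c).count("1")  # xor marks differing bits; count the ones
        for q, c in zip(query_words, candidate_words)
    )
```

A 128-bit code would be passed as two words, a 256-bit code as four.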
Number of Substrings: The number of substring hash tables we use is determined with a hold-out validation set of database entries. From that set we estimate the running time of the algorithm for different choices of near , and select the that yields the minimum run-time. As shown in Table IV this empirical value for is usually the closest integer to .
Translation Lookaside Buffer and Huge Pages: Modern processors have an on-chip cache that holds a lookup table of memory addresses, for mapping virtual addresses to physical addresses for each running process. Typically, memory is split into 4KB pages, and a process that allocates memory is given pages by the operating system. The Translation Lookaside Buffer (TLB) keeps track of these pages. For processes that have large memory footprints (tens of GB), the number of pages quickly overtakes the size of the TLB (typically about 1500 entries). For processes using random memory access this means that almost every memory access produces a TLB miss - the requested address is in a page not cached in the TLB, hence the TLB entry must be fetched from slow RAM before the requested page can be accessed. This slows down memory access, and causes volatility in run-times for memory-access intensive processes.
To avoid this problem, we use the libhugetlbfs Linux library. This allows the operating system to allocate Huge Pages (2MB each) rather than 4KB pages. This reduces the number of pages and the frequency of TLB misses, which improves memory access speed, and reduces run-time volatility. The increase in speed of multi-index hashing compared to the results reported in  is partly attributed to the use of libhugetlbfs.
This paper describes a new algorithm for exact nearest neighbor search on large-scale datasets of binary codes. The algorithm is a form of multi-index hashing that has provably sub-linear run-time behavior for uniformly distributed codes. It is storage efficient and easy to implement. We show empirical performance on datasets of binary codes obtained from billion SIFT, and million Gist features. With these datasets we find that, for -bit and -bit codes, our new multi-index hashing implementation is often more than two orders of magnitude faster than a linear scan baseline.
While the basic algorithm is developed in this paper, there are several interesting avenues for future research. For example, our preliminary research shows that is a good choice for the substring length, and it should be possible to formulate a sound mathematical basis for this choice. The assignment of bits to substrings was shown above to be important; however, the algorithm used for this assignment is clearly suboptimal. It is also likely that different substring lengths might be useful for the different hash tables.
Our theoretical analysis proves sub-linear run-time behavior of the multi-index hashing algorithm on uniformly distributed codes when the search radius is small. Our experiments demonstrate sub-linear run-time behavior of the algorithm on real datasets, even though the binary codes in our experiments are clearly not uniformly distributed. (Footnote: In some of our experiments with Billion binary codes, tens of thousands of codes fall into the same bucket of -bit substring hash tables. This is extremely unlikely with uniformly distributed codes.) Bridging the gap between theoretical analysis and empirical findings for the proposed algorithm remains an open problem. In particular, we are interested in more realistic assumptions on the binary codes, which still allow for theoretical analysis of the algorithm.
While the current paper concerns exact nearest-neighbor tasks, it would also be interesting to consider approximate methods based on the same multi-index hashing framework. Indeed, there are several ways that one could find approximate rather than exact nearest neighbors for a given query. For example, one could stop at a given radius of search, even though items may not have been found. Alternatively, one might search until a fixed number of unique candidates have been found, even though not all substring hash tables have been inspected to the necessary radius. Such approximate algorithms have the potential for even greater efficiency, and would be the most natural methods to compare to most existing methods, which are approximate, such as binary LSH. That said, such comparisons are more difficult than for exact methods since one must take into account not only the storage and run-time costs, but also some measure of the cost of errors (usually in terms of recall and precision).
Finally, recent results have shown that for many datasets in which the binary codes are the result of some form of vector quantization, an asymmetric Hamming distance is attractive [12, 17]. In such methods, rather than converting the query into a binary code, one directly compares a real-valued query to the database of binary codes. The advantage is that the quantization noise entailed in converting the query to a binary string is avoided, and one can more accurately use distances in the binary code space to approximate the desired distances in the feature space of the query. One simple way to do this is to use multi-index hashing and then only use an asymmetric distance when culling candidates. The potential for more interesting and effective methods is yet another promising avenue for future work.
This research was financially supported in part by NSERC Canada, the GRAND Network Centre of Excellence, and the Canadian Institute for Advanced Research (CIFAR). The authors would also like to thank Mohamed Aly, Rob Fergus, Ryan Johnson, Abbas Mehrabian, and Pietro Perona for useful discussions about this work.
ACM Symposium on Theory of Computing. ACM, 2002.
Approximate nearest neighbors: towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing, pages 604–613, 1998.
Segmentation propagation in ImageNet. In Proc. European Conference on Computer Vision, 2012.
Proc. International Conference on Machine Learning, 2011.
80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 30(11):1958–1970, 2008.