A general framework for similarity search.
Neyshabur and Srebro proposed Simple-LSH, which is the state-of-the-art hashing method for maximum inner product search (MIPS) with a performance guarantee. We found that the performance of Simple-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose Norm-ranging LSH, which addresses the excessive normalization problem caused by long tails in Simple-LSH by partitioning a dataset into multiple sub-datasets and building a hash index for each sub-dataset independently. We prove that Norm-ranging LSH has lower query time complexity than Simple-LSH. We also show that the idea of partitioning the dataset can improve other hashing-based methods for MIPS. To support efficient query processing on the hash indexes of the sub-datasets, a novel similarity metric is formulated. Experiments show that Norm-ranging LSH achieves an order of magnitude speedup over Simple-LSH for the same recall, thus significantly benefiting applications that involve MIPS.
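Given a dataset $\mathcal{S} \subset \mathbb{R}^d$ containing $n$ vectors (also called items) and a query $q \in \mathbb{R}^d$, maximum inner product search (MIPS) finds the vector in $\mathcal{S}$ that has the maximum inner product with $q$,
$$p = \arg\max_{x \in \mathcal{S}} q^\top x. \tag{1}$$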
MIPS may require the items with the top $k$ inner products, and it usually suffices to return approximate results (i.e., items with inner products close to the maximum). MIPS has many important applications, including recommendation based on user and item embeddings obtained from matrix factorization (Koren et al., 2009), multi-class classification with linear classifiers (Dean et al., 2013), filtering in computer vision (Felzenszwalb et al., 2010), etc.
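As a point of reference for the definition above, exact MIPS can always be solved by a linear scan. The following is a minimal sketch of that baseline; the function names and the toy data are illustrative assumptions, not part of the original work.

```python
import heapq

def inner_product(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mips_top_k(items, query, k=1):
    """Exact top-k MIPS by linear scan: indices of the k largest inner products."""
    scores = [inner_product(x, query) for x in items]
    return heapq.nlargest(k, range(len(items)), key=scores.__getitem__)

# Toy data (hypothetical). The winner is not the item closest to the query in
# Euclidean distance, because a large norm can dominate the inner product.
items = [[1.0, 0.0], [0.5, 0.5], [3.0, -1.0], [0.0, 2.5]]
query = [1.0, 1.0]
top2 = mips_top_k(items, query, k=2)
print(top2)
```

The scan costs time linear in the number of items for every query, which is the cost that the hashing methods discussed in this paper aim to avoid.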
MIPS is a challenging problem as modern datasets often have high dimensionality and large cardinality. Initially, tree-based methods (Ram and Gray, 2012; Koenigstein et al., 2012) were proposed for MIPS, which use the idea of branch and bound similar to the k-d tree (Friedman and Tukey, 1974). However, these methods suffer from the curse of dimensionality and their performance can be even worse than linear scan for feature dimensions as low as 20 (Weber et al., 1998). Shrivastava and Li proposed l2-alsh (2014), which attains the first provable sub-linear query time complexity guarantee for approximate MIPS that is independent of dimensionality. l2-alsh applies an asymmetric transformation to transform MIPS into $L_2$ similarity search, which can be solved with well-known LSH functions. Here, an asymmetric transformation means that the transformations for the queries and the items are different, while a symmetric transformation means that the same transformation is applied to the items and the queries. Following the idea of l2-alsh, Shrivastava and Li formulated another pair of asymmetric transformations called sign-alsh (2015) to transform MIPS into angular similarity search and obtained a better query time complexity guarantee than that of l2-alsh.
Neyshabur and Srebro showed that asymmetry is not necessary when queries are normalized and items have bounded 2-norm (2015). They proposed simple-lsh, which adopts a symmetric transformation and transforms MIPS into angular similarity search, similar to sign-alsh. Unlike l2-alsh and sign-alsh, however, simple-lsh was proved to be a universal LSH for MIPS. simple-lsh is also parameter-free and avoids the parameter tuning of l2-alsh and sign-alsh. Most importantly, simple-lsh achieves superior performance over l2-alsh and sign-alsh in both theory and practice, and thus is the state-of-the-art hashing algorithm for MIPS.
simple-lsh requires the 2-norms of the items to be bounded, which is achieved by normalizing the items by the largest 2-norm in the dataset. However, real datasets often have long tails in the distribution of 2-norm, meaning that the largest 2-norm can be much larger than the 2-norms of the majority of the items. As we will show in Section 3.1, the excessive normalization process of simple-lsh makes the maximum inner product between the query and the items small, which degrades the performance of simple-lsh in both theory and practice.
To solve this problem, we propose norm-ranging lsh. The idea is to partition the original dataset into multiple sub-datasets according to the percentiles of the 2-norm distribution. For each sub-dataset, norm-ranging lsh uses simple-lsh as a subroutine to build an index independent of other sub-datasets. As each sub-dataset is normalized by its own maximum 2-norm, which is usually significantly smaller than the maximum 2-norm in the entire dataset, norm-ranging lsh achieves a lower asymptotic query time complexity bound than simple-lsh. To support efficient query processing, we also formulate a novel similarity metric which defines a probing order for buckets from different sub-datasets. We compare norm-ranging lsh with simple-lsh and l2-alsh on three real datasets and show empirically that norm-ranging lsh offers an order of magnitude speedup for achieving the same recall.
A formal definition of locality sensitive hashing (LSH) (Indyk and Motwani, 1998) is given as follows:
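(Locality Sensitive Hashing) A family $\mathcal{H}$ is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for any two vectors $x, y \in \mathbb{R}^d$:
if $\mathrm{sim}(x, y) \ge S_0$, then $P_{\mathcal{H}}[h(x) = h(y)] \ge p_1$,
if $\mathrm{sim}(x, y) \le cS_0$, then $P_{\mathcal{H}}[h(x) = h(y)] \le p_2$.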
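For a family of LSH functions to be useful, it is required that $p_1 > p_2$ and $0 < c < 1$. Given a family of $(S_0, cS_0, p_1, p_2)$-sensitive hash functions, one can answer a query for $c$-approximate nearest neighbor search with a time complexity of $O(n^{\rho} \log n)$, where $\rho = \log p_1 / \log p_2$. Here, $c$-approximate nearest neighbor search solves the following problem: given parameters $S_0 > 0$ and $\delta > 0$, if there exists an $S_0$-near neighbor of $q$ in $\mathcal{S}$, return some $cS_0$-near neighbor in $\mathcal{S}$ with probability at least $1 - \delta$. For $L_2$ distance, there exists a well-known family of LSH functions defined as follows:
$$h^{L_2}_{a,b}(x) = \left\lfloor \frac{a^\top x + b}{r} \right\rfloor, \tag{2}$$
where $\lfloor \cdot \rfloor$ is the floor operation, $a$ is a random vector whose entries follow i.i.d. standard normal distribution and $b$ is generated by a uniform distribution over $[0, r]$. The probability that two vectors $x$ and $y$ are hashed to the same value under (2) is given as:
$$F_r(d) = 1 - 2\Phi(-r/d) - \frac{2d}{\sqrt{2\pi}\, r}\left(1 - e^{-(r/d)^2/2}\right), \tag{3}$$
in which $\Phi(x)$ is the cumulative distribution function of the standard normal distribution and $d = \|x - y\|$ is the $L_2$ distance between $x$ and $y$. For angular similarity, sign random projection is an LSH. Its expression and collision probability can be given as (Goemans and Williamson, 1995):
$$h_a(x) = \mathrm{sign}(a^\top x), \qquad P[h_a(x) = h_a(y)] = 1 - \frac{1}{\pi}\cos^{-1}\left(\frac{x^\top y}{\|x\|\,\|y\|}\right), \tag{4}$$
where the entries of $a$ follow i.i.d. standard normal distribution.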
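The two hash families above can be sketched in a few lines. The snippet below is an illustrative sketch rather than the authors' implementation: it implements the $L_2$ hash and sign random projection, then compares the empirical collision rate of sign random projection with the closed form $1 - \theta/\pi$, where $\theta$ is the angle between the two vectors. The helper names, the seed, and the number of trials are assumptions made for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_hash(x, a, b, r):
    """Bucket id floor((a.x + b) / r) of the L2 hash family."""
    return math.floor((dot(a, x) + b) / r)

def sign_hash(x, a):
    """One bit of sign random projection."""
    return 1 if dot(a, x) >= 0 else 0

def srp_collision_probability(x, y):
    """Closed-form collision probability 1 - angle(x, y) / pi."""
    cos = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))
    return 1.0 - math.acos(max(-1.0, min(1.0, cos))) / math.pi

rng = random.Random(0)
x, y = [1.0, 0.0], [1.0, 1.0]            # 45 degrees apart
trials = 20000
hits = 0
for _ in range(trials):
    a = [rng.gauss(0.0, 1.0) for _ in x]  # fresh random projection per trial
    hits += sign_hash(x, a) == sign_hash(y, a)

empirical = hits / trials
theory = srp_collision_probability(x, y)
print(round(empirical, 3), theory)
```

For vectors 45 degrees apart, the closed form gives 0.75, and the empirical rate should land close to it.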
Shrivastava and Li proved that there exists no symmetric LSH for MIPS if the domain of the item $x$ and query $q$ are both $\mathbb{R}^d$ (2014). They proposed to apply a pair of asymmetric transformations, $P(x)$ and $Q(q)$, to the items and the query, respectively.
$$P(x) = [Ux; \|Ux\|^2; \|Ux\|^4; \ldots; \|Ux\|^{2^m}], \qquad Q(q) = [q; 1/2; 1/2; \ldots; 1/2]. \tag{5}$$
The scaling factor $U$ should ensure that $\|Ux\| < 1$ for all $x \in \mathcal{S}$ and the query $q$ is normalized to unit 2-norm before the transformation. After the transformation, we have:
$$\|P(x) - Q(q)\|^2 = 1 + \frac{m}{4} - 2U x^\top q + \|Ux\|^{2^{m+1}}. \tag{6}$$
As the scaling factor $U$ is common for all items and the last term vanishes with sufficiently large $m$ because $\|Ux\| < 1$, (6) shows that MIPS is transformed into finding the nearest neighbor of $Q(q)$ in $L_2$ distance, which can be solved using the hash function in (2). Given $S_0$ and $c$, a query time guarantee of $O(n^{\rho} \log n)$ can be obtained for $c$-approximate MIPS with:
It is suggested to use a grid search to find the values of the parameters ($m$, $U$ and $r$) that minimize $\rho$.
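The reduction from inner product to $L_2$ distance can be checked numerically. The sketch below assumes the l2-alsh transformations as published by Shrivastava and Li (2014): $P(x)$ appends $\|Ux\|^2, \|Ux\|^4, \ldots, \|Ux\|^{2^m}$ to $Ux$, and $Q(q)$ appends $m$ entries of $1/2$ to the unit-norm query. It verifies that $\|P(x) - Q(q)\|^2 = 1 + m/4 - 2Ux^\top q + \|Ux\|^{2^{m+1}}$. The function names and the sample vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alsh_item_transform(x, U, m):
    """P(x) = [Ux; ||Ux||^2; ||Ux||^4; ...; ||Ux||^(2^m)]."""
    ux = [U * v for v in x]
    sq = dot(ux, ux)                                  # ||Ux||^2
    return ux + [sq ** (2 ** i) for i in range(m)]    # powers 2, 4, ..., 2^m

def alsh_query_transform(q, m):
    """Q(q) = [q; 1/2; ...; 1/2] with q scaled to unit 2-norm first."""
    n = math.sqrt(dot(q, q))
    return [v / n for v in q] + [0.5] * m

x, q, U, m = [0.3, -1.2, 0.5], [2.0, 1.0, -1.0], 0.6, 3   # ||Ux|| is about 0.8
px, qq = alsh_item_transform(x, U, m), alsh_query_transform(q, m)

lhs = sum((a - b) ** 2 for a, b in zip(px, qq))
q_unit = qq[:len(q)]
ux_sq = dot([U * v for v in x], [U * v for v in x])
rhs = 1 + m / 4 - 2 * U * dot(x, q_unit) + ux_sq ** (2 ** m)  # last term: ||Ux||^(2^(m+1))
print(lhs, rhs)
```

With $\|Ux\| \approx 0.8$ and $m = 3$, the residual term $\|Ux\|^{16}$ is already below 0.03, which illustrates why it can be ignored for moderately large $m$.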
Neyshabur and Srebro proved that l2-alsh is not a universal LSH for MIPS, that is, for any setting of $m$, $U$ and $r$, there always exists a pair of $S_0$ and $c$ such that $x^\top q = S_0$ and $y^\top q = cS_0$ but $P[h(P(x)) = h(Q(q))] < P[h(P(y)) = h(Q(q))]$ (2015). Moreover, they showed that asymmetry is not necessary if the items have bounded 2-norm and the query is normalized, which is exactly the assumption of l2-alsh. They proposed a symmetric transformation to transform MIPS into angular similarity search as follows:
$$P(x) = \left[x; \sqrt{1 - \|x\|^2}\right]. \tag{8}$$
They apply the sign random projection in (4) to $P(x)$ and $P(q)$ to obtain an LSH for $c$-approximate MIPS with a query time complexity of $O(n^{\rho} \log n)$, where $\rho$ is given as:
$$\rho = G(c, S_0) = \frac{\log\left(1 - \frac{\cos^{-1}(S_0)}{\pi}\right)}{\log\left(1 - \frac{\cos^{-1}(cS_0)}{\pi}\right)}. \tag{9}$$
They called their scheme simple-lsh as it avoids the parameter tuning process of l2-alsh. Moreover, simple-lsh is proved to be a universal LSH for MIPS under any valid configuration of $S_0$ and $c$. simple-lsh also obtains better (lower) $\rho$ values than l2-alsh and sign-alsh in theory and outperforms both of them empirically on real datasets (Shrivastava and Li, 2015).
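A small sketch may help make the symmetric transformation concrete. It assumes the simple-lsh transformation $P(x) = [x; \sqrt{1 - \|x\|^2}]$ and the exponent $\rho = \log(1 - \cos^{-1}(S_0)/\pi) / \log(1 - \cos^{-1}(cS_0)/\pi)$ that follows from the sign random projection collision probability. The function names and sample values are assumptions made for illustration.

```python
import math

def simple_lsh_transform(x):
    """P(x) = [x; sqrt(1 - ||x||^2)], defined for ||x|| <= 1."""
    sq = sum(v * v for v in x)
    if sq > 1.0 + 1e-12:
        raise ValueError("items must be scaled to 2-norm at most 1")
    return list(x) + [math.sqrt(max(0.0, 1.0 - sq))]

def rho(c, s0):
    """Query-time exponent of simple-lsh for c-approximate MIPS at threshold s0."""
    p1 = 1.0 - math.acos(s0) / math.pi
    p2 = 1.0 - math.acos(c * s0) / math.pi
    return math.log(p1) / math.log(p2)

x = [0.3, 0.4]     # ||x|| = 0.5
q = [0.6, 0.8]     # unit 2-norm, so the appended entry of P(q) is 0
px, pq = simple_lsh_transform(x), simple_lsh_transform(q)

inner_before = sum(a * b for a, b in zip(x, q))
inner_after = sum(a * b for a, b in zip(px, pq))
norm_px = math.sqrt(sum(v * v for v in px))
print(inner_before, inner_after, norm_px)
print(rho(0.5, 0.9), rho(0.5, 0.1))
```

The transformed item has unit norm and the inner product with the query is unchanged, so ranking by angle between $P(x)$ and $P(q)$ is the same as ranking by inner product. The last line shows that $\rho$ grows toward 1 as $S_0$ shrinks.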
Figure 1: (a) The relation between $\rho$ and $S_0$ for simple-lsh under different values of $c$; (b) 2-norm distribution of the SIFT descriptors from the ImageNet dataset (maximum 2-norm scaled to 1); (c) The distribution of the maximum inner product of the queries after the normalization process of simple-lsh; (d) The distribution of the maximum inner product of the queries after the normalization process of range-lsh (32 sub-datasets).
In this section, we first motivate norm-ranging LSH by showing the problem of simple-lsh on real datasets, then introduce how norm-ranging LSH (or range-lsh for short) solves the problem.
We plot the relation between $\rho$ and $S_0$ for simple-lsh under different values of $c$ in Figure 1(a). Recall that the theoretical query time complexity of simple-lsh is $O(n^{\rho} \log n)$ and observe that $\rho$ is a decreasing function of $S_0$. As $\rho$ is large when $S_0$ is small, simple-lsh suffers from poor query performance when the maximum inner product between a query and the items is small. To conduct the transformation in (8), simple-lsh requires the 2-norm of the items to be bounded by 1, which is usually achieved by dividing the items by the maximum 2-norm $U = \max_{x \in \mathcal{S}} \|x\|$. Assuming $q^\top x = S$ for an original item vector $x$, we have $q^\top x = S/U$ after normalization. If $U$ is significantly larger than $\|x\|$, the inner product will be scaled to a small value.
We plot the distribution of the 2-norm of a real dataset in Figure 1(b). The distribution has a long tail and the maximum 2-norm is much larger than the 2-norms of the majority of the items. We also plot in Figure 1(c) the distribution of the maximum inner product of the queries after the normalization process of simple-lsh. The results show that for the majority of the queries, the maximum inner product is small, which translates into a large $\rho$ and poor theoretical query performance.
The long tail distribution of real datasets also harms the performance of simple-lsh in practice. If $\|x\|$ is small after normalization, the $\sqrt{1 - \|x\|^2}$ term, which is irrelevant to the inner product between $x$ and $q$, will be dominant in $P(x)$. In this case, the result of sign random projection in (4) will be largely determined by the last entry of $a$, causing many items to be gathered in the same bucket. In our sample run of simple-lsh on the ImageNet dataset (Deng et al., 2009) with a code length of 32, there are only 60,000 buckets and the largest bucket holds about 200,000 items. Considering that the ImageNet dataset contains roughly 2 million items and a 32-bit code offers approximately $4 \times 10^9$ buckets, these statistics show that the large $\sqrt{1 - \|x\|^2}$ term severely degrades bucket balance in simple-lsh. Bucket balance is important for the performance of binary hashing algorithms such as simple-lsh because they use Hamming distance to determine the probing order of the buckets (Cai, 2016; Gong et al., 2013). If the number of buckets is small or some buckets contain too many items, Hamming distance cannot define a good probing order for the items, which results in poor query performance.
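The bucket-balance effect described above can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a replication of the ImageNet run: it hashes the same random directions twice with simple-lsh, once scaled to a 2-norm of 0.05 (heavy normalization) and once to 0.95 (mild normalization), and counts the distinct buckets. The dimensions, code length, norms, and seed are all hypothetical choices.

```python
import math
import random

def simple_lsh_code(x, planes):
    """Sign-random-projection code of P(x) = [x; sqrt(1 - ||x||^2)]."""
    px = list(x) + [math.sqrt(max(0.0, 1.0 - sum(v * v for v in x)))]
    return tuple(sum(a * v for a, v in zip(plane, px)) >= 0 for plane in planes)

def count_buckets(items, planes):
    """Number of distinct hash codes, i.e. non-empty buckets."""
    return len({simple_lsh_code(x, planes) for x in items})

def random_direction(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

rng = random.Random(0)
dim, bits, n_items = 8, 16, 2000
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim + 1)] for _ in range(bits)]
directions = [random_direction(dim, rng) for _ in range(n_items)]

# Same directions, two different 2-norms after normalization.
heavily_scaled = [[0.05 * t for t in d] for d in directions]
mildly_scaled = [[0.95 * t for t in d] for d in directions]

buckets_small_norm = count_buckets(heavily_scaled, planes)
buckets_large_norm = count_buckets(mildly_scaled, planes)
print(buckets_small_norm, buckets_large_norm)
```

When the norm is 0.05, the appended entry is close to 1 and most bits are decided by the last coordinate of each projection vector alone, so the items collapse into far fewer buckets than in the mildly scaled case.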
The index building and query processing procedures of range-lsh are presented in Algorithm 1 and Algorithm 2, respectively. To solve the excessive normalization problem of simple-lsh, range-lsh partitions the items into sub-datasets according to the percentiles of the 2-norm distribution so that each sub-dataset contains items with similar 2-norms. Note that ties are broken arbitrarily in the ranking process of Algorithm 1 to ensure that the percentile-based partitioning works even when many items have the same 2-norm. Instead of using $U$, i.e., the maximum 2-norm in the entire dataset, range-lsh uses the local maximum 2-norm $U_j = \max_{x \in \mathcal{S}_j} \|x\|$ in each sub-dataset $\mathcal{S}_j$ for normalization, so as to keep the inner products of the queries large. In Figure 1(d), we plot the maximum inner product of the queries after the normalization process of range-lsh. Compared with Figure 1(c), the values of the inner product are significantly larger. As a result, the $\rho$ of sub-dataset $\mathcal{S}_j$ becomes $\rho_j = G(c, S_0/U_j)$, where $G$ is the function in (9), which is smaller than $\rho = G(c, S_0/U)$ if $U_j < U$. The smaller $\rho$ values translate into better query performance. In the following, we prove that range-lsh achieves a lower query time complexity bound than simple-lsh while providing the same quality guarantee.
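Since Algorithm 1 and Algorithm 2 are not reproduced here, the following is a hedged sketch of what the two procedures could look like based only on the description above: items are ranked by 2-norm, split into equal-size sub-datasets, scaled by the local maximum 2-norm, and hashed with simple-lsh; a query probes every sub-dataset and keeps the candidate with the largest exact inner product. The data layout, the per-sub-dataset random projections, the Hamming-order probing, and all names are assumptions of this sketch.

```python
import math
import random
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def srp_code(vec, planes):
    return tuple(dot(a, vec) >= 0 for a in planes)

def hamming(a, b):
    return sum(s != t for s, t in zip(a, b))

def build_range_lsh(items, num_parts, code_len, rng):
    """Rank by 2-norm, split into equal-size parts, hash each part after
    scaling by its local maximum 2-norm U_j."""
    order = sorted(range(len(items)), key=lambda i: norm(items[i]))  # ties: arbitrary
    size = math.ceil(len(items) / num_parts)
    dim = len(items[0])
    index = []
    for start in range(0, len(order), size):
        ids = order[start:start + size]
        u_j = max(norm(items[i]) for i in ids) or 1.0   # guard for all-zero items
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim + 1)] for _ in range(code_len)]
        table = defaultdict(list)
        for i in ids:
            x = [v / u_j for v in items[i]]
            px = x + [math.sqrt(max(0.0, 1.0 - dot(x, x)))]
            table[srp_code(px, planes)].append(i)
        index.append({"u_j": u_j, "planes": planes, "table": table})
    return index

def query_range_lsh(index, items, q, probes_per_part):
    """Probe each sub-dataset's buckets in Hamming order and return the
    candidate with the largest exact inner product."""
    pq = [v / norm(q) for v in q] + [0.0]
    best, best_score = None, -math.inf
    for part in index:
        code = srp_code(pq, part["planes"])
        buckets = sorted(part["table"], key=lambda b: hamming(b, code))
        for bucket in buckets[:probes_per_part]:
            for i in part["table"][bucket]:
                score = dot(items[i], q)
                if score > best_score:
                    best, best_score = i, score
    return best

rng = random.Random(1)
scales = [rng.choice([0.1, 0.2, 0.3, 3.0]) for _ in range(400)]
items = [[rng.gauss(0.0, 1.0) * s for _ in range(6)] for s in scales]
index = build_range_lsh(items, num_parts=4, code_len=8, rng=rng)
q = [rng.gauss(0.0, 1.0) for _ in range(6)]
exact = max(range(len(items)), key=lambda i: dot(items[i], q))
found = query_range_lsh(index, items, q, probes_per_part=2 ** 8)
print(exact, found)
```

With `probes_per_part` set to the total number of possible codes, every bucket is probed and the sketch degenerates to an exact scan, which makes it easy to sanity-check; smaller budgets trade recall for speed.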
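For any and , the query time complexity upper bound of range-lsh for solving $c$-approximate MIPS is lower than that of simple-lsh with sufficiently large $n$, if the dataset is divided into $n^{\alpha}$ sub-datasets with $0 < \alpha < \min\{\rho, \frac{\rho - \rho^*}{1 - \rho^*}\}$ and there are at most $n^{\beta}$ sub-datasets satisfying $U_j = U$ with $0 \le \beta < \alpha\rho$. The parameters are given as $\rho = G(c, S_0/U)$, $\rho^* = \max_{\rho_j < \rho} \rho_j$ and $\rho_j = G(c, S_0/U_j)$, where $G(c, S_0)$ denotes the $\rho$ of simple-lsh in (9).
Firstly, we prove the correctness of range-lsh, that is, it indeed returns a $cS_0$-approximate answer with probability at least $1 - \delta$. Note that $S_0$ is a pre-specified parameter common to all sub-datasets rather than the actual maximum inner product in each sub-dataset. If there is an item $x^*$ having an inner product of $S_0$ with $q$ in the original dataset, it is certainly contained in one of the sub-datasets. When we conduct MIPS on all the sub-datasets, the sub-dataset containing $x^*$ will return an item having an inner product of at least $cS_0$ with $q$ with probability at least $1 - \delta$ according to the guarantee of simple-lsh. The final query result is obtained by selecting the optimal one (the one having the largest inner product with $q$) from the query answers of all sub-datasets according to Algorithm 2, and its inner product is guaranteed to be no less than $cS_0$ with probability at least $1 - \delta$.
Now we analyze the complexity of range-lsh. Each sub-dataset $\mathcal{S}_j$ contains $n^{1-\alpha}$ items and the query time complexity upper bound of $c$-approximate MIPS on it is $O(n^{(1-\alpha)\rho_j} \log n^{1-\alpha})$ with $\rho_j = G(c, S_0/U_j)$. As there are $n^{\alpha}$ sub-datasets, the time complexity of selecting the optimal one from the answers of all sub-datasets is $O(n^{\alpha})$. Considering $\rho_j$ is an increasing function of $U_j$ and there are $n^{\beta}$ sub-datasets with $U_j = U$, the query time complexity of range-lsh can be expressed as:
$$\begin{aligned} f(n) &= n^{\alpha} + \sum_{j=1}^{n^{\alpha}} n^{(1-\alpha)\rho_j} \log n^{1-\alpha} \\ &\le n^{\alpha} + n^{\beta}\, n^{(1-\alpha)\rho} \log n^{1-\alpha} + (n^{\alpha} - n^{\beta})\, n^{(1-\alpha)\rho^*} \log n^{1-\alpha} \\ &< n^{\alpha} + n^{\beta + (1-\alpha)\rho} \log n + n^{\alpha + (1-\alpha)\rho^*} \log n. \end{aligned} \tag{10}$$
Strictly speaking, the equal sign in the first line of (10) is not rigorous as the constants and non-dominant terms in the complexity of querying each sub-dataset are ignored. However, we are interested in the order rather than the precise value of query time complexity, so the equal sign is used for the conciseness of expression. Comparing with the complexity of simple-lsh, $O(n^{\rho} \log n)$:
$$\frac{f(n)}{n^{\rho} \log n} < \frac{n^{\alpha - \rho}}{\log n} + n^{\beta - \alpha\rho} + n^{\alpha + (1-\alpha)\rho^* - \rho}. \tag{11}$$
(11) tends to 0 with sufficiently large $n$ when $\alpha < \rho$, $\beta < \alpha\rho$ and $\alpha < \frac{\rho - \rho^*}{1 - \rho^*}$, which is satisfied by $\alpha < \min\{\rho, \frac{\rho - \rho^*}{1 - \rho^*}\}$ and $\beta < \alpha\rho$. ∎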
Note that the conditions of Theorem 1 can be easily satisfied. Theorem 1 imposes an upper bound instead of a lower bound on the number of sub-datasets, which is favorable as we do not want to partition the original dataset into a large number of sub-datasets. Moreover, the condition that the number of sub-datasets with $U_j = U$ is smaller than $n^{\alpha\rho}$ is easily satisfied, as very often only the sub-dataset that contains the items with the largest 2-norms has $U_j = U$. The proof also shows that range-lsh is not limited to datasets with a long tail in the 2-norm distribution. As long as $U_j < U$ holds for most sub-datasets, range-lsh can provide better performance than simple-lsh. We acknowledge that range-lsh and simple-lsh are equivalent when all items have the same 2-norm. However, MIPS is also equivalent to angular similarity search without transformation in this case, and thus can be solved with sign random projection instead of simple-lsh.
The lower theoretical query time complexity of range-lsh also translates into much better bucket balance in practice. On the ImageNet dataset, range-lsh with 32-bit code maps the items to approximately 2 million buckets and most buckets contain only 1 item. Compared with the statistics of simple-lsh in Section 3.1, these numbers show that range-lsh has much better bucket balance, and thus a better ability to define a good probing order for the items. This can be explained by the fact that range-lsh uses more moderate scaling factors for each sub-dataset than simple-lsh, thus significantly reducing the magnitude of the $\sqrt{1 - \|x\|^2}$ term in $P(x)$.
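To illustrate how local normalization changes the exponent, the sketch below draws hypothetical long-tailed 2-norms, partitions them by percentile, and evaluates the simple-lsh exponent $\log(1 - \cos^{-1}(s)/\pi) / \log(1 - \cos^{-1}(cs)/\pi)$ at $s = S_0/U$ for the whole dataset and at $s = S_0/U_j$ for each sub-dataset. The log-normal norm distribution, the choice of $S_0$, and the number of sub-datasets are assumptions made only for this example.

```python
import math
import random

def rho(c, s0):
    """simple-lsh exponent at normalized threshold s0 (0 < s0 <= 1)."""
    p1 = 1.0 - math.acos(s0) / math.pi
    p2 = 1.0 - math.acos(c * s0) / math.pi
    return math.log(p1) / math.log(p2)

rng = random.Random(0)
# Hypothetical long-tailed 2-norms, sorted so that slices are percentile ranges.
norms = sorted(rng.lognormvariate(0.0, 1.0) for _ in range(10000))

num_parts = 8
size = len(norms) // num_parts
local_max = [norms[(j + 1) * size - 1] for j in range(num_parts)]   # U_j
U = norms[-1]                                                      # global maximum

c = 0.5
S0 = 0.9 * local_max[0]      # hypothetical threshold, kept below every U_j
rho_global = rho(c, S0 / U)
rho_local = [rho(c, S0 / u_j) for u_j in local_max]

print(round(rho_global, 4))
print([round(r, 4) for r in rho_local])
```

Only the last sub-dataset, which contains the largest norms, has the same exponent as the global normalization; every other sub-dataset gets a smaller one.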
Algorithm 2 needs to probe the hash indexes of all sub-datasets independently and the query result can only be obtained after all hash indexes are probed. In practice, it is more favorable to probe the buckets of different sub-datasets simultaneously in a manner similar to Hamming ranking, which motivates the design of a metric to rank the buckets across the indexes of different sub-datasets.
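Combining the index building process of range-lsh and the collision probability of sign random projection in (4), the probability that an item $x \in \mathcal{S}_j$ and the query $q$ collide on one bit can be given as $p = 1 - \frac{1}{\pi}\cos^{-1}\left(\frac{q^\top x}{U_j}\right)$, where $U_j$ is the maximum 2-norm in sub-dataset $\mathcal{S}_j$. When the code length is $L$, the expectation of the number of identical bits $l$ between query $q$ and item $x$ can be expressed as:
$$E[l] = L\left[1 - \frac{1}{\pi}\cos^{-1}\left(\frac{q^\top x}{U_j}\right)\right]. \tag{12}$$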
For , we have . Plugging it into (12) and rearranging the terms:
(13) shows that $U_j \cos[\pi(1 - E[l]/L)]$ can serve as an indicator of the inner product between item $x$ and query $q$ (extending to the case is trivial). In practice, we cannot obtain $E[l]$, but the actual number of identical bits between bucket $b$ and query $q$, denoted as $l$, can be used as a surrogate. When $l > L/2$, a larger $U_j$ indicates a higher inner product, while the opposite is true when $l < L/2$. Since the code length is limited and $l$ can diverge from $E[l]$, it is possible that a bucket has a large $U_j$ and a large inner product with $q$, but it happens that $l < L/2$. In this case, it will be probed late in the query process, hurting query performance. To alleviate this problem, we adjust the similarity indicator to , where is a small positive integer.
Note that our similarity metric can be manipulated with a complexity similar to Hamming distance. We can calculate the values of the similarity indicator for all possible combinations of $U_j$ and $l$, and sort them during index building. Note that the sorted structure is common for all queries and does not take too much space: $U_j$ takes one value per sub-dataset and $l$ takes one of $L + 1$ values, so the size of the sorted structure is the number of sub-datasets times $L + 1$. When a query comes, query processing can be conducted by traversing the sorted structure in ascending order. For a pair of $(U_j, l)$, $U_j$ determines the sub-dataset while $l$ is used to choose the buckets to probe in that sub-dataset with standard hash lookup. Moreover, an analysis (see the supplementary material) shows that our similarity metric is also amenable to the generate-to-probe scheme based on the inverted multi-index (Babenko and Lempitsky, 2012).
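The precomputed structure described above can be sketched as follows. This is a hedged illustration that uses the unadjusted indicator $U_j \cos[\pi(1 - l/L)]$, which is what one obtains by inverting the sign random projection collision probability; the adjusted indicator mentioned in the text is not modeled here. The function names, the sample $U_j$ values, and the choice to store the most promising pair first are assumptions of this sketch.

```python
import math

def similarity_indicator(u_j, l, code_len):
    """U_j * cos(pi * (1 - l / L)): inner-product estimate implied by l matching bits."""
    return u_j * math.cos(math.pi * (1.0 - l / code_len))

def build_probe_order(local_max_norms, code_len):
    """All (sub-dataset index, matching-bit count) pairs, most promising first."""
    pairs = [(j, l)
             for j in range(len(local_max_norms))
             for l in range(code_len + 1)]
    return sorted(
        pairs,
        key=lambda p: similarity_indicator(local_max_norms[p[0]], p[1], code_len),
        reverse=True,
    )

local_max_norms = [0.2, 0.5, 1.0]     # hypothetical U_j values
order = build_probe_order(local_max_norms, code_len=8)
print(order[:4], len(order))
```

The structure depends only on the $U_j$ values and the code length, so it can be built once and shared by all queries. At query time, each pair selects a sub-dataset and a Hamming radius of $L - l$ around the query code in that sub-dataset's table.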
We used three popular datasets, i.e., Netflix, Yahoo!Music and ImageNet. For the Netflix and Yahoo!Music datasets, the user and item embeddings were obtained using alternating least squares based matrix factorization (Yun et al., 2013), and each embedding has 300 dimensions. We used the item embeddings as dataset items and the user embeddings as queries. The ImageNet dataset contains more than 2 million SIFT descriptors of the ImageNet images; some randomly selected SIFT descriptors are used as queries and the rest are used as dataset items. Note that the 2-norm distributions of the Netflix and Yahoo!Music embeddings do not have a long tail and the maximum 2-norm is close to the median (see the supplementary material), which helps verify the robustness of range-lsh to different 2-norm distributions. Different from previous empirical studies (Neyshabur and Srebro, 2015; Shrivastava and Li, 2014), which used small datasets containing no more than 20,000 items, our experiments covered the spectrum from small-scale to large-scale datasets with Netflix (17,770 items), Yahoo!Music (136,736 items), and ImageNet (2,340,373 items). For each dataset, we report the average performance of 1,000 randomly selected queries.
We compared range-lsh with simple-lsh and l2-alsh. For l2-alsh, we used the parameter setting recommended by its authors, i.e., $m = 3$, $U = 0.83$, $r = 2.5$. For range-lsh, part of the bits in the binary code are used to encode the index of the sub-datasets and the rest are generated by hashing. For example, if the code length is 16 and the dataset is partitioned into 32 sub-datasets, the 16-bit code of range-lsh consists of 5 bits for indexing the 32 sub-datasets, while the remaining 11 bits are generated by hashing. We partitioned the dataset into 32, 64 and 128 sub-datasets under a code length of 16, 32 and 64, respectively. For fairness of comparison, all algorithms use the same total code length.
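The bit budget in the example above is simple arithmetic, sketched here for the three configurations used in the experiments. The helper name is hypothetical.

```python
def split_code_bits(total_bits, num_sub_datasets):
    """Split a code into bits that index the sub-dataset and bits left for hashing."""
    index_bits = (num_sub_datasets - 1).bit_length()   # ceil(log2(num_sub_datasets))
    hash_bits = total_bits - index_bits
    if hash_bits <= 0:
        raise ValueError("code too short for this many sub-datasets")
    return index_bits, hash_bits

for total_bits, parts in [(16, 32), (32, 64), (64, 128)]:
    print(total_bits, parts, split_code_bits(total_bits, parts))
```

Doubling the number of sub-datasets costs one hash bit, so the number of sub-datasets trades hashing resolution against how tightly each sub-dataset can be normalized.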
We plot the time-recall curves in Figure 2, as it is critical to reach a high recall in a short time for many applications. The results show that range-lsh achieves significant speedups compared with simple-lsh and l2-alsh at the same recall. To be more specific, simple-lsh takes approximately 16, 18 and 18 times the running time of range-lsh to achieve a 90% recall under 32-bit code on the three datasets. Due to space limitations, we only report the performance of top-10 MIPS; the performance under more configurations can be found in the supplementary material.
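For readers reproducing such curves, recall for top-$k$ MIPS is commonly computed as the fraction of the true top-$k$ items that appear among the retrieved items. The sketch below assumes that definition, which the text does not spell out; the function name and the sample item ids are hypothetical.

```python
def recall_at_k(retrieved, ground_truth):
    """Fraction of the true top-k items that were retrieved."""
    truth = set(ground_truth)
    return len(truth.intersection(retrieved)) / len(truth)

true_top10 = list(range(10))                       # hypothetical item ids
retrieved = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]        # one true item missed
print(recall_at_k(retrieved, true_top10))
```

A time-recall curve is then obtained by recording this value, averaged over queries, at increasing probing budgets.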
Recall that Algorithm 1 partitions a dataset into sub-datasets according to percentiles in the 2-norm distribution. We tested an alternative partitioning scheme, which divides the domain of 2-norms into uniformly spaced ranges so that items falling in the same range are placed in the same sub-dataset. The results are plotted in Figure 3, which shows that uniform partitioning achieves slightly better performance than percentile partitioning. This suggests that range-lsh is general and robust to different partitioning methods as long as items with similar 2-norms are grouped into the same sub-dataset. We also examined the influence of the number of sub-datasets on performance in Figure 3. The results show that performance improves with the number of sub-datasets when the number of sub-datasets is still small, but stabilizes when the number of sub-datasets is sufficiently large.
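The two partitioning schemes compared above can be sketched side by side. This is an illustrative sketch with hypothetical function names and toy norms: percentile partitioning yields equal-size groups by rank, while uniform partitioning yields equal-width norm ranges.

```python
import math

def percentile_partition(norms, num_parts):
    """Equal-size groups of item indices by rank of the 2-norm."""
    order = sorted(range(len(norms)), key=norms.__getitem__)
    size = math.ceil(len(norms) / num_parts)
    return [order[s:s + size] for s in range(0, len(order), size)]

def uniform_partition(norms, num_parts):
    """Equal-width ranges over [min norm, max norm]; groups may be uneven or empty."""
    lo, hi = min(norms), max(norms)
    width = (hi - lo) / num_parts or 1.0      # avoid zero width when all norms are equal
    groups = [[] for _ in range(num_parts)]
    for i, v in enumerate(norms):
        groups[min(int((v - lo) / width), num_parts - 1)].append(i)
    return groups

norms = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 2.0, 4.0]   # toy long-tailed 2-norms
by_rank = percentile_partition(norms, 4)
by_range = uniform_partition(norms, 4)
print(by_rank)
print(by_range)
```

On long-tailed norms, uniform ranges put most items in the first group and can leave some groups empty, whereas percentile groups stay balanced; both keep items with similar 2-norms together.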
In this section, we show that the idea of range-lsh, which divides the original dataset into sub-datasets with similar 2-norms, can also be applied to l2-alsh (Shrivastava and Li, 2014) to obtain more favorable (lower) $\rho$ values than (7). Note that we get (7) from (6) as we only have if the entire dataset is considered. For a sub-dataset $\mathcal{S}_j$, if we have the range of its 2-norms as and , we can obtain the $\rho_j$ of $\mathcal{S}_j$ as:
As and , the denominator in (14) increases in absolute value while the numerator decreases in absolute value compared with (7). Therefore, we have $\rho_j < \rho$. Moreover, dividing the original dataset into sub-datasets allows us to use a different normalization factor $U_j$ for each sub-dataset and we only need to satisfy rather than , which allows more flexibility for parameter optimization. Similar to Theorem 1, it can also be proved that dividing the dataset into sub-datasets results in an algorithm with lower query time complexity than the original l2-alsh. The same idea can also be trivially extended to obtain more favorable $\rho$ values for sign-alsh (Shrivastava and Li, 2015).
Maximum inner product search (MIPS) has many important applications such as collaborative filtering and computer vision. We showed that simple-lsh, the state-of-the-art hashing method for MIPS, has critical performance limitations due to the long tail in the 2-norm distribution of real datasets. To tackle this problem, we proposed range-lsh, which attains provably lower query time complexity than simple-lsh. In addition, we formulated a novel similarity metric that can be processed with low complexity. The experimental results showed that range-lsh significantly outperforms simple-lsh, and that range-lsh is robust to the shape of the 2-norm distribution and to different partitioning methods. We also showed that the idea of range-lsh is general and can be applied to boost the performance of other hashing-based methods for MIPS. The superior performance of range-lsh can benefit many applications that involve MIPS.
Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35:2916–2929, 2013.