Quantization based methods are popular for solving large scale maximum inner product search problems. However, in most traditional quantization works, the objective is to minimize the reconstruction error for datapoints to be searched. In this work, we focus directly on minimizing error in inner product approximation and derive a new class of quantization loss functions. One key aspect of the new loss functions is that we weight the error term based on the value of the inner product, giving more importance to pairs of queries and datapoints whose inner products are high. We provide theoretical grounding to the new quantization loss function, which is simple, intuitive and able to work with a variety of quantization techniques, including binary quantization and product quantization. We conduct experiments on standard benchmarking datasets to demonstrate that our method using the new objective outperforms other state-of-the-art methods.
Maximum inner product search (MIPS) has become a popular paradigm for solving large scale classification and retrieval tasks. For example, in recommendation systems, user queries and documents are embedded into a dense vector space of the same dimensionality and MIPS is used to find the most relevant documents given a user query Cremonesi et al. (2010). Similarly, in extreme classification tasks Dean et al. (2013), MIPS is used to predict the class label when a large number of classes, often on the order of millions or even billions, are involved. Lately, MIPS has also been applied to training tasks such as scalable gradient computation in large output spaces Yen et al. (2018), efficient sampling for speeding up softmax computation Mussmann and Ermon (2016) and sparse updates in end-to-end trainable memory systems Pritzel et al. (2017).
To formally define the Maximum Inner Product Search (MIPS) problem, consider a database $X = \{x_i\}_{i=1,\dots,N}$ with $N$ datapoints, where each datapoint $x_i \in \mathbb{R}^d$ lies in a $d$-dimensional vector space. In the MIPS setup, given a query $q \in \mathbb{R}^d$, we would like to find the datapoint $x \in X$ that has the highest inner product with $q$, i.e., we would like to identify
$$x_i^* := \arg\max_{x_i \in X} \langle q, x_i \rangle.$$
Exhaustively computing the exact inner product between $q$ and $N$ datapoints is often very expensive and sometimes infeasible. Several techniques have been proposed in the literature based on hashing and quantization to solve the approximate maximum inner product search problem efficiently, and the quantization based techniques have shown strong performance Ge et al. (2014); Babenko and Lempitsky (2014); Johnson et al. (2017). Quantizing each datapoint $x_i$ to $\tilde{x}_i$ not only reduces storage costs and memory bandwidth bottlenecks, but also permits efficient computation of distances. It avoids memory bandwidth intensive floating point operations through Hamming distance computation and look up table operations Norouzi et al. (2014); Jegou et al. (2011); Wu et al. (2017). In most traditional quantization works, the objective in the quantization procedures is to minimize the reconstruction error for the datapoints to be searched.
In this paper, we propose a new class of loss functions in quantization to improve the performance of MIPS. Our contribution is threefold:
We derive a novel class of loss functions for quantization, which departs from regular reconstruction loss by weighting each pair of $q$ and $x$ based on its inner product value. We prove that such weighting leads to an effective loss function, which can be used by a wide class of quantization algorithms.
We devise algorithms for learning the codebook, as well as quantizing new datapoints, using the new loss functions. In particular, we give details for two families of quantization algorithms, binary quantization and product quantization.
We show that on large scale standard benchmark datasets, such as Glove100, the change of objective yields a significant gain on the approximation of true inner product, as well as the retrieval performance.
This paper is organized as follows. We first briefly review previous literature on quantization for Maximum Inner Product Search, as well as its links to nearest neighbor search in Section 2. Next, we give our main result, which is the derivation of our objective in Section 3. Applications of the new loss functions to binary quantization and product quantization are given in Section 4. Finally, we present the experimental results in Section 5.
There is a large body of similarity search literature on inner product and nearest neighbor search. We refer readers to Wang et al. (2014, 2016) for a comprehensive survey. Some methods also transform the MIPS problem into an equivalent nearest neighbor problem using transformations such as those of Shrivastava and Li (2014); Neyshabur and Srebro (2014), but these are in general less successful than the ones that work directly in the original space. Broadly, these bodies of work can be divided into two families: (1) representing the data as quantized codes so that similarity computation becomes more efficient, and (2) pruning the dataset during the search so that only a subset of data points is considered.
Typical works in the first family include binary quantization (or binary hashing) techniques Indyk and Motwani (1998); Shrivastava and Li (2014) and product quantization techniques Jegou et al. (2011); Guo et al. (2016), although other families such as additive quantization Babenko and Lempitsky (2014); Martinez et al. (2016) and ternary quantization Zhu et al. (2016) also apply. There are many subsequent papers that extend these base approaches to more sophisticated codebook learning strategies, such as He et al. (2013); Erin Liong et al. (2015); Dai et al. (2017) for binary quantization and Zhang et al. (2014); Wu et al. (2017) for product quantization. There are also lines of work that focus on learning transformations before quantization Gong et al. (2013); Ge et al. (2014). Different from these methods, which essentially minimize the reconstruction error of the database points, we argue in Section 3 that reconstruction loss is suboptimal in the MIPS context, and any quantization method can potentially benefit from our proposed objective.
The second family includes non-exhaustive search techniques such as tree search Muja and Lowe (2014); Dasgupta and Freund (2008), graph search Malkov and Yashunin (2016); Harwood and Drummond (2016), or hash bucketing Andoni et al. (2015) in the nearest neighbor search literature. There also exist variants of these for the MIPS problem Ram and Gray (2012); Shrivastava and Li (2014). Some of these approaches lead to larger memory requirements, or random access patterns, due to the cost of constructing index structures in addition to storing original vectors. Thus they are usually used in combination with linear search quantization methods, in ways similar to inverted index Jegou et al. (2011); Babenko and Lempitsky (2012); Matsui et al. (2015).
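Common quantization techniques focus on minimizing the reconstruction error (sum of squared error) when $x$ is quantized to $\tilde{x}$. It can be shown that minimizing the reconstruction errors is equivalent to minimizing the expected inner product quantization error under a mild condition on the query distribution. Indeed, consider the quantization objective of minimizing the expected total inner product quantization errors over the query distribution:
$$\mathbb{E}_q \sum_{i=1}^{N} \big(\langle q, x_i\rangle - \langle q, \tilde{x}_i\rangle\big)^2 = \mathbb{E}_q \sum_{i=1}^{N} \langle q, x_i - \tilde{x}_i\rangle^2. \tag{1}$$
Under the assumption that $q$ is isotropic, i.e., $\mathbb{E}[qq^T] = cI$, where $I$ is the identity matrix and $c \in \mathbb{R}^+$, the objective function becomes
$$\sum_{i=1}^{N} \mathbb{E}_q \langle q, x_i - \tilde{x}_i\rangle^2 = \sum_{i=1}^{N} (x_i - \tilde{x}_i)^T\, \mathbb{E}_q[qq^T]\, (x_i - \tilde{x}_i) = c \sum_{i=1}^{N} \|x_i - \tilde{x}_i\|^2.$$
Therefore, the objective becomes minimizing the reconstruction errors of the database points $\sum_{i=1}^{N} \|x_i - \tilde{x}_i\|^2$, and this has been considered extensively in the literature.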
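The equivalence above can be checked numerically. The following is a minimal sketch, assuming queries drawn uniformly from the unit sphere (for which $\mathbb{E}[qq^T] = I/d$, so $c = 1/d$); the function names and the residual vector are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_unit_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def expected_ip_error(residual, queries):
    """Monte Carlo estimate of E_q <q, x - x_tilde>^2 for one residual vector."""
    return float(np.mean((queries @ residual) ** 2))

rng = np.random.default_rng(0)
d = 16
queries = sample_unit_sphere(200_000, d, rng)
residual = 0.1 * rng.standard_normal(d)        # hypothetical x - x_tilde

estimate = expected_ip_error(residual, queries)
closed_form = float(residual @ residual) / d   # c * ||x - x_tilde||^2 with c = 1/d
```

Under these assumptions the two quantities agree up to Monte Carlo noise, which is the sense in which reconstruction error and unweighted inner product error are the same objective.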
One key observation about the above objective function (1) is that it takes expectation over all possible combinations of datapoints $x$ and queries $q$. However, it is easy to see that not all pairs of $(x, q)$ are equally important. The approximation error on the pairs which have a high inner product is far more important since they are likely to be among the top ranked pairs and can greatly affect the search result, while for the pairs whose inner product is low the approximation error matters much less. In other words, for a given datapoint $x$, we should quantize it with a bigger focus on its error with those queries which have high inner product with $x$.
Following this key observation, we propose a new loss function by weighting the approximation error of the inner product based on the value of true inner product. More precisely, let $w(t) \ge 0$ be a monotonically non-decreasing function, and consider the following inner-product weighted quantization error
$$\sum_{i=1}^{N} \mathbb{E}_q\big[w(\langle q, x_i\rangle)\,\langle q, x_i - \tilde{x}_i\rangle^2\big] = \sum_{i=1}^{N} \int w(t)\, \mathbb{E}_q\big[\langle q, x_i - \tilde{x}_i\rangle^2 \mid \langle q, x_i\rangle = t\big]\, dP(\langle q, x_i\rangle \le t). \tag{2}$$
One common choice can be $w(t) = I(t \ge T)$, in which case we care about all pairs whose inner product is greater than or equal to a certain threshold $T$, and disregard the rest of the pairs.
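One key result of this work is to decompose the inner-product weighted quantization errors based on the direction of the datapoints. We show that the new loss function (2) can be expressed as a weighted sum of the parallel and orthogonal components of the residual errors with respect to the raw datapoints. Formally, let $r(x, \tilde{x}) := x - \tilde{x}$ denote the quantization residual function. Given the datapoint $x$ and its quantizer $\tilde{x}$, we can decompose the residual error into two parts, the one parallel to $x$, denoted $r_\parallel$, and the one orthogonal to $x$, denoted $r_\perp$:
$$r_\parallel(x, \tilde{x}) := \frac{\langle x - \tilde{x}, x\rangle}{\|x\|^2}\, x, \qquad r_\perp(x, \tilde{x}) := (x - \tilde{x}) - r_\parallel(x, \tilde{x}),$$
so that $r(x, \tilde{x}) = r_\parallel(x, \tilde{x}) + r_\perp(x, \tilde{x})$. Because the norm of $q$ does not matter to the ranking result, without loss of generality we can assume $\|q\| = 1$ to simplify the derivation below.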
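As a small illustration of this decomposition, the sketch below projects the residual onto the datapoint and takes the remainder as the orthogonal part. The helper name `decompose_residual` and the example vectors are hypothetical.

```python
import numpy as np

def decompose_residual(x, x_tilde):
    """Split r = x - x_tilde into parts parallel and orthogonal to x."""
    r = x - x_tilde
    r_par = (r @ x) / (x @ x) * x   # projection of the residual onto x
    r_perp = r - r_par              # remainder, orthogonal to x
    return r_par, r_perp

x = np.array([3.0, 4.0, 0.0])
x_tilde = np.array([2.0, 4.5, 1.0])   # hypothetical quantized version of x
r_par, r_perp = decompose_residual(x, x_tilde)
```

For these vectors the parallel part is $(0.12, 0.16, 0)$, and the two parts sum back to the full residual.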
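Theorem 3.1. Assume the query $q$ is uniformly distributed on the $d$-dimensional unit sphere. Given the datapoint $x$ and its quantizer $\tilde{x}$, conditioned on the inner product $\langle q, x\rangle = t$ for some $t > 0$, we have
$$\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \mid \langle q, x\rangle = t\big] = \frac{t^2}{\|x\|^2}\,\|r_\parallel\|^2 + \frac{1 - t^2/\|x\|^2}{d-1}\,\|r_\perp\|^2.$$
Proof. First, we can decompose $q := q_\parallel + q_\perp$ with $q_\parallel := \langle q, x\rangle \cdot \frac{x}{\|x\|^2}$ and $q_\perp := q - q_\parallel$, where $q_\parallel$ is parallel to $x$ and $q_\perp$ is orthogonal to $x$. Then, we have
$$\begin{aligned}
\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \mid \langle q, x\rangle = t\big]
&= \mathbb{E}_q\big[\langle q_\parallel + q_\perp,\, r_\parallel + r_\perp\rangle^2 \mid \langle q, x\rangle = t\big] \\
&= \mathbb{E}_q\big[(\langle q_\parallel, r_\parallel\rangle + \langle q_\perp, r_\perp\rangle)^2 \mid \langle q, x\rangle = t\big] \\
&= \mathbb{E}_q\big[\langle q_\parallel, r_\parallel\rangle^2 \mid \langle q, x\rangle = t\big] + \mathbb{E}_q\big[\langle q_\perp, r_\perp\rangle^2 \mid \langle q, x\rangle = t\big].
\end{aligned} \tag{6}$$
The last step uses the fact that $\mathbb{E}_q\big[\langle q_\parallel, r_\parallel\rangle \langle q_\perp, r_\perp\rangle \mid \langle q, x\rangle = t\big] = 0$ due to symmetry. The first term of (6) is $\mathbb{E}_q\big[\langle q_\parallel, r_\parallel\rangle^2 \mid \langle q, x\rangle = t\big] = \|r_\parallel\|^2 \frac{t^2}{\|x\|^2}$. For the second term, since $q_\perp$ is uniformly distributed in the $(d-1)$-dimensional subspace orthogonal to $x$ with the norm $\sqrt{1 - t^2/\|x\|^2}$, we have $\mathbb{E}_q\big[\langle q_\perp, r_\perp\rangle^2 \mid \langle q, x\rangle = t\big] = \frac{1 - t^2/\|x\|^2}{d-1}\|r_\perp\|^2$. Therefore,
$$\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \mid \langle q, x\rangle = t\big] = \frac{t^2}{\|x\|^2}\,\|r_\parallel\|^2 + \frac{1 - t^2/\|x\|^2}{d-1}\,\|r_\perp\|^2. \qquad ∎$$
In the common scenario of $x$ being unit-normed, i.e., $\|x\| = 1$, we have
$$\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \mid \langle q, x\rangle = t\big] = t^2\,\|r_\parallel\|^2 + \frac{1 - t^2}{d-1}\,\|r_\perp\|^2.$$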
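The unit-norm case can be sanity-checked by simulation. The sketch below assumes the conditional expectation takes the form $t^2\|r_\parallel\|^2 + \frac{1-t^2}{d-1}\|r_\perp\|^2$ and samples unit-norm queries with a fixed inner product $t$ against $x$; the sampler name, the dimension, and the perturbation used as a stand-in quantizer are all hypothetical.

```python
import numpy as np

def sample_queries_with_ip(x, t, n, rng):
    """Sample unit-norm q with <q, x> = t, for a unit-norm x."""
    g = rng.standard_normal((n, x.shape[0]))
    g -= np.outer(g @ x, x)                              # remove the component along x
    u = g / np.linalg.norm(g, axis=1, keepdims=True)     # uniform on the sphere orthogonal to x
    return t * x + np.sqrt(1.0 - t * t) * u

rng = np.random.default_rng(1)
d, t = 32, 0.3
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
x_tilde = x + 0.05 * rng.standard_normal(d)              # hypothetical quantized point

r = x - x_tilde
r_par = (r @ x) * x
r_perp = r - r_par

q = sample_queries_with_ip(x, t, 100_000, rng)
empirical = float(np.mean((q @ r) ** 2))
predicted = float(t**2 * (r_par @ r_par) + (1 - t**2) / (d - 1) * (r_perp @ r_perp))
```

The empirical average should match the predicted value up to sampling noise, which illustrates why the parallel residual dominates once $t$ is large relative to $1/\sqrt{d}$.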
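Now we are ready to compute the inner-product weighted quantization error (2) for the case when $w(t) = I(t \ge T)$ (one can do similar derivations for any reasonable $w(t)$). Without loss of generality, for simplicity we show results below for when $q$ and $x$ are unit-normed.
Theorem 3.2. Assuming the query $q$ is uniformly distributed on the $d$-dimensional unit sphere with $\|q\| = 1$, and all datapoints $x_i$ are unit-normed, given $T > 0$,
$$\sum_{i} \mathbb{E}_q\big[I(\langle q, x_i\rangle \ge T)\,\langle q, x_i - \tilde{x}_i\rangle^2\big] \propto (d-1)\,\lambda(T) \sum_{i} \|r_\parallel(x_i, \tilde{x}_i)\|^2 + \sum_{i} \|r_\perp(x_i, \tilde{x}_i)\|^2,$$
where $\lambda(T)$ is defined as
$$\lambda(T) := \frac{\int_0^{\arccos(T)} \sin^{d-2}\theta \, d\theta}{\int_0^{\arccos(T)} \sin^{d}\theta \, d\theta} - 1.$$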
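We can show that $\lambda(T)$ can be analytically computed, and $\lambda(T) \to \frac{T^2}{1 - T^2}$ as the dimension $d \to \infty$.
Proof. Let $\alpha := \arccos(T)$, and $\theta := \arccos(\langle q, x\rangle)$. Note that the density of $\theta$ is proportional to the surface area of a $(d-1)$-dimensional hypersphere with a radius of $\sin\theta$. Thus we have $dP(\theta) \propto S_{d-1} \sin^{d-2}\theta \, d\theta$, where $S_{d-1}$ is the surface area of the $(d-1)$-sphere with unit radius.
Thus, using the unit-norm conditional expectation derived above, the weighted error of each datapoint can be re-written as:
$$\mathbb{E}_q\big[I(\langle q, x\rangle \ge T)\,\langle q, x - \tilde{x}\rangle^2\big] \propto \int_0^{\alpha} \Big(\cos^2\theta\, \|r_\parallel\|^2 + \frac{\sin^2\theta}{d-1}\, \|r_\perp\|^2\Big) \sin^{d-2}\theta \, d\theta = (I_{d-2} - I_d)\, \|r_\parallel\|^2 + \frac{I_d}{d-1}\, \|r_\perp\|^2,$$
where $I_d := \int_0^{\alpha} \sin^{d}\theta \, d\theta$. Dividing by $I_d / (d-1)$ gives the stated form with $\lambda(T) = I_{d-2}/I_d - 1$. Integration by parts gives us a recursive formula to compute $I_d$ when $d$ is a positive integer:
$$I_d = -\frac{\cos\alpha \, \sin^{d-1}\alpha}{d} + \frac{d-1}{d}\, I_{d-2}. \tag{8}$$
With the base case of $I_0 = \alpha$, and $I_1 = 1 - \cos\alpha$, the exact value of $\lambda(T)$ can be computed explicitly in $O(d)$ time. ∎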
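The recursion can be sketched in a few lines of Python. This sketch assumes the reading $\lambda(T) = I_{d-2}/I_d - 1$ with $I_d = \int_0^{\arccos T} \sin^d\theta\, d\theta$, the integration-by-parts recursion $I_d = \frac{d-1}{d} I_{d-2} - \frac{\cos\alpha \sin^{d-1}\alpha}{d}$, and the base cases $I_0 = \alpha$, $I_1 = 1 - \cos\alpha$. The function names are ours, and the midpoint-rule routine is included only as an independent check.

```python
import math

def sin_power_integrals(alpha, d_max):
    """Return [I_0, ..., I_dmax] with I_d = integral of sin^d(theta) over [0, alpha]."""
    I = [alpha, 1.0 - math.cos(alpha)]
    for d in range(2, d_max + 1):
        boundary = math.cos(alpha) * math.sin(alpha) ** (d - 1) / d
        I.append((d - 1) / d * I[d - 2] - boundary)
    return I[: d_max + 1]

def lam(T, d):
    """lambda(T) = I_{d-2} / I_d - 1 in dimension d >= 2, via the O(d) recursion."""
    I = sin_power_integrals(math.acos(T), d)
    return I[d - 2] / I[d] - 1.0

def lam_quadrature(T, d, n=200_000):
    """The same quantity from a midpoint rule, used only to cross-check the recursion."""
    alpha = math.acos(T)
    h = alpha / n
    num = sum(math.sin((k + 0.5) * h) ** (d - 2) for k in range(n))
    den = sum(math.sin((k + 0.5) * h) ** d for k in range(n))
    return num / den - 1.0

# Illustrative values only: T = 0.2 and d = 100 are arbitrary choices here.
value = lam(0.2, 100)
limit = 0.2**2 / (1 - 0.2**2)
```

One caveat on the design: the forward recursion subtracts nearly equal quantities once $\sin^{d}\alpha$ becomes tiny, so for very large $d$ or $T$ close to 1 it may lose precision in floating point, and a quadrature or higher-precision variant would be a safer choice there.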
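We furthermore prove that the limit of $\lambda(T)$ exists and that it equals $\frac{T^2}{1 - T^2}$ as $d \to \infty$. In Figure 1a, we plot $\lambda(T)$ for a fixed $T$ and we can see it approaches its limit quickly as $d$ grows.
Theorem 3.3. When $T > 0$, we have $\lim_{d \to \infty} \lambda(T) = \frac{T^2}{1 - T^2}$.
Proof. See Section 7.1 of the Appendix. ∎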
Therefore, motivated by minimizing the inner product approximation errors for pairs of queries and datapoints whose inner product is significant, given a quantization scheme (e.g., vector quantization, product quantization, additive quantization), we propose to minimize the weighted quantization error
$$\mu \sum_{i=1}^{N} \|r_\parallel(x_i, \tilde{x}_i)\|^2 + \sum_{i=1}^{N} \|r_\perp(x_i, \tilde{x}_i)\|^2, \qquad \mu := (d-1)\,\lambda(T). \tag{9}$$
Here $\mu$ is a hyperparameter depending on the datapoint dimension $d$ and the threshold $T$ imposed on the inner product between queries and datapoints. Note that when the hyperparameter $\mu$ is set to be 1, (9) is reduced to the traditional reconstruction errors of the datapoints.
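To make the objective concrete, the following sketch evaluates a weighted quantization error of the form $\mu \sum_i \|r_\parallel\|^2 + \sum_i \|r_\perp\|^2$ for a batch of datapoints, where $\mu$ denotes the weight on the parallel component. The function name and the random data are hypothetical; only the loss formula comes from the text.

```python
import numpy as np

def weighted_quantization_loss(X, X_tilde, mu):
    """Sum over datapoints of mu * ||r_par||^2 + ||r_perp||^2."""
    R = X - X_tilde
    # squared norm of the component of each residual along its own datapoint
    par_sq = np.sum(R * X, axis=1) ** 2 / np.sum(X * X, axis=1)
    perp_sq = np.sum(R * R, axis=1) - par_sq
    return float(np.sum(mu * par_sq + perp_sq))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
X_tilde = X + 0.1 * rng.standard_normal((5, 8))   # hypothetical quantized datapoints

reconstruction = float(np.sum((X - X_tilde) ** 2))
loss_mu1 = weighted_quantization_loss(X, X_tilde, mu=1.0)
loss_mu4 = weighted_quantization_loss(X, X_tilde, mu=4.0)
```

With a weight of 1 the value coincides with the plain reconstruction error, and a larger weight can only increase the penalty, since it acts on a non-negative term.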
In this section, we derive algorithms for applying new loss functions in (9) to common quantization techniques, including vector quantization, product quantization and binary quantization.
Recall that in vector quantization, given a set of $N$ datapoints, we want to find a codebook of size $k$ and quantize each datapoint as one of the $k$ codes. The goal is to minimize the total squared quantization error. Formally, the traditional vector quantization solves
$$\min_{c_1, \dots, c_k \in \mathbb{R}^d} \ \sum_{i=1}^{N} \ \min_{\tilde{x}_i \in \{c_1, \dots, c_k\}} \|x_i - \tilde{x}_i\|^2.$$
One of the most popular quantization algorithms is the $k$-Means algorithm, where we iteratively partition the datapoints among $k$ quantizers, and the centroid of each partition is set to be the mean of the datapoints assigned to the partition.
Motivated by minimizing the inner product quantization error for cases when the inner product between queries and datapoints is high, our proposed objective solves:
$$\min_{c_1, \dots, c_k \in \mathbb{R}^d} \ \sum_{i=1}^{N} \ \min_{\tilde{x}_i \in \{c_1, \dots, c_k\}} \Big(\mu\, \|r_\parallel(x_i, \tilde{x}_i)\|^2 + \|r_\perp(x_i, \tilde{x}_i)\|^2\Big), \tag{10}$$
where $\mu$ is a hyperparameter as a function of $d$ and $T$ following (9).
We solve (10) through a $k$-Means style Lloyd’s algorithm, which iteratively minimizes the new loss functions by assigning datapoints to partitions and updating the partition quantizer in each iteration. The assignment step is computed by enumerating each quantizer and finding the quantizer that minimizes (10). The update step finds the new quantizer $\tilde{c}^*$ for a partition of datapoints $\{x_{p_1}, \dots, x_{p_n}\}$, i.e.,
$$\tilde{c}^* = \arg\min_{\tilde{c} \in \mathbb{R}^d} \ \sum_{j=1}^{n} \Big(\mu\, \|r_\parallel(x_{p_j}, \tilde{c})\|^2 + \|r_\perp(x_{p_j}, \tilde{c})\|^2\Big). \tag{11}$$
Because of the changed objective, the best quantizer is no longer the center of the partition. Since (11) is a convex function of $\tilde{c}$, there exists an optimal solution. The update rule given a fixed partitioning can be found by setting the partial derivative of (11) with respect to each codebook entry to zero. This algorithm provably converges in a finite number of steps. See Algorithm 1 in the Appendix for a complete outline of the algorithm. Note that, in the special case that $\mu = 1$, it reduces to the regular $k$-Means algorithm.
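The optimal solution of (11), for a partition of $n$ datapoints $\{x_{p_1}, \dots, x_{p_n}\}$, is
$$\tilde{c}^* = \mu \Big(n I + (\mu - 1) \sum_{j=1}^{n} \frac{x_{p_j} x_{p_j}^T}{\|x_{p_j}\|^2}\Big)^{-1} \sum_{j=1}^{n} x_{p_j}.$$
Proof. See Section 7.2 of the Appendix. ∎
Algorithm 1 converges in a finite number of steps.
Proof. This immediately follows from the fact that the loss defined in (10) is always non-increasing during both assignment and averaging steps under the changed objective, and that there are only finitely many ways to partition the datapoints. ∎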
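The assignment and update steps can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the per-pair loss is taken to be $\mu\|r_\parallel\|^2 + \|r_\perp\|^2$, and the update solves the first-order condition $\big(nI + (\mu-1)\sum_j x_j x_j^T/\|x_j\|^2\big)\,\tilde{c} = \mu \sum_j x_j$ for each partition. All function names, the random initialization, and the handling of empty partitions are our own choices.

```python
import numpy as np

def pair_loss(X, c, mu):
    """mu * ||r_par||^2 + ||r_perp||^2 of every row of X against one center c."""
    R = X - c
    par_sq = np.sum(R * X, axis=1) ** 2 / np.sum(X * X, axis=1)
    return mu * par_sq + (np.sum(R * R, axis=1) - par_sq)

def assign(X, C, mu):
    """Assignment step: each datapoint picks the center with the smallest loss."""
    losses = np.stack([pair_loss(X, c, mu) for c in C], axis=1)   # shape (N, k)
    return np.argmin(losses, axis=1), losses

def update_center(Xp, mu):
    """Update step: solve the first-order condition for one partition Xp."""
    n, d = Xp.shape
    Xn = Xp / np.linalg.norm(Xp, axis=1, keepdims=True)
    A = n * np.eye(d) + (mu - 1.0) * (Xn.T @ Xn)
    return np.linalg.solve(A, mu * Xp.sum(axis=0))

def weighted_lloyd(X, k, mu, n_iter=20, seed=0):
    """Alternate assignment and update steps; record the loss after each assignment."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(n_iter):
        labels, losses = assign(X, C, mu)
        history.append(float(losses[np.arange(len(X)), labels].sum()))
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a partition is empty
                C[j] = update_center(members, mu)
    labels, _ = assign(X, C, mu)
    return C, labels, history

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-normed toy data
C, labels, history = weighted_lloyd(X, k=4, mu=4.0)
```

In this sketch the recorded loss never increases from one iteration to the next, and setting the weight to 1 makes the update return the plain mean, which matches the reduction to regular $k$-Means noted above.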
A natural extension of vector quantization is product quantization, which works better in high dimensional spaces. In product quantization, the original vector space $\mathbb{R}^d$ is decomposed as the Cartesian product of $m$ distinct subspaces of dimension $d/m$, and vector quantizations are applied in each subspace separately (random rotation or permutation of the original vectors can be done before taking the Cartesian product). For example, let $x \in \mathbb{R}^d$ be written as
$$x = (x^{(1)}, x^{(2)}, \dots, x^{(m)}),$$
where $x^{(j)} \in \mathbb{R}^{d/m}$ is denoted as the sub-vector for the $j$-th subspace. We can quantize each of the $x^{(j)}$ to $\tilde{x}^{(j)}$ with its vector quantizer in subspace $j$, for $1 \le j \le m$. With product quantization, $x$ is quantized as $(\tilde{x}^{(1)}, \dots, \tilde{x}^{(m)})$ and can be represented compactly using the assigned codes.
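The encoding described above can be sketched as follows. This example uses the traditional per-subspace nearest-center assignment (that is, the reconstruction objective, not the weighted loss proposed in this paper), and the function names and the tiny hand-written codebooks are hypothetical.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode x as one code index per subspace (nearest center in each subspace)."""
    subvectors = np.split(x, len(codebooks))   # requires len(x) divisible by the number of subspaces
    return [int(np.argmin(np.sum((C - s) ** 2, axis=1)))
            for s, C in zip(subvectors, codebooks)]

def pq_decode(codes, codebooks):
    """Concatenate the selected centers to obtain the quantized vector."""
    return np.concatenate([C[j] for j, C in zip(codes, codebooks)])

# Two subspaces of dimension 2, each with a codebook of two centers.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0]]),
]
x = np.array([0.9, 1.2, 0.8, 0.1])
codes = pq_encode(x, codebooks)          # [1, 1]
x_tilde = pq_decode(codes, codebooks)    # [1.0, 1.0, 1.0, 0.0]
```

Only the list of code indices needs to be stored per datapoint, which is where the compactness of the representation comes from.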
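Using our proposed loss objective (9), we minimize the following loss function instead of the usual objective of reconstruction error:
$$\min \ \sum_{i=1}^{N} \Big(\mu\, \|r_\parallel(x_i, \tilde{x}_i)\|^2 + \|r_\perp(x_i, \tilde{x}_i)\|^2\Big), \tag{13}$$
where the minimization is over the codebooks of all subspaces and $\tilde{x}_i$ denotes the product quantization of $x_i$, i.e.,
$$\tilde{x}_i := \big(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)}, \dots, \tilde{x}_i^{(m)}\big),$$
with $\tilde{x}_i^{(j)}$ the code assigned to the sub-vector $x_i^{(j)}$ in subspace $j$.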
To optimize (13), we apply the vector quantization of Section 4.1 over all subspaces, except that the subspace assignment is chosen to minimize the global objective over all subspaces (13), instead of using the objective in each subspace independently. Similarly, the update rule is found by setting the derivative of loss in (13) with respect to each codebook entry to zero. The complete algorithm box of Algorithm (1) is found in Section 7.4 of the Appendix.
Another popular family of quantization functions is binary quantization. In such a setting, a function $h(x)$ is learned to quantize datapoints into binary codes, which saves storage space and can speed up distance computation. There are many possible ways to design such a binary quantization function. We follow the setting of Dai et al. (2017), which explicitly minimizes reconstruction loss and has been shown to outperform earlier baselines. In their paper, a binary auto-encoder is learned to quantize and dequantize binary codes:
$$\tilde{x} = g(h(x)),$$
where $h(\cdot)$ is the “encoder” part, which binarizes the original datapoint into binary space, and $g(\cdot)$ is the “decoder” part, which reconstructs the datapoint given the binary code. The encoder and decoder take the parametric forms given in Dai et al. (2017). The learning objective is to minimize the reconstruction error $\|x - \tilde{x}\|^2$, and the weights in the encoder and decoder are optimized end-to-end using standard stochastic gradient descent. Following our discussion in Section 3, this is a suboptimal objective in the MIPS setting, and the learning objective given in (9) should be preferred. This requires only a minimal change to the algorithm of Dai et al. (2017): the loss function is replaced, while everything else is held unchanged.
In this section, we show our proposed quantization objective leads to improved performance on MIPS, when applied to binary and product quantization tasks. Other more sophisticated methods can also benefit from the new objective. Our experiments analyze the usefulness of proposed objectives relative to the typical reconstruction loss. Finally, we also discuss the speed-recall trade off in Section 5.5.
Figure 1: (a) $\lambda(T)$ quickly approaches its limit as the dimension grows. (b) The relative error of inner product estimation for the true Top-1 on the Glove1.2M dataset, across multiple number-of-bits settings. (c) The retrieval Recall1@10 for different $T$.
In addition to retrieval, many application scenarios also require estimating the value of the inner product $\langle q, x\rangle$. For example, in softmax functions, inner product values are often used to compute the probability directly; in many textual or multimedia retrieval systems, $\langle q, x\rangle$ is used as a score for downstream classification. One direct consequence of (9) is that the objective weighs pairs by their importance and thus leads to lower estimation error on top-ranking pairs. We compare the estimation error of the true inner product value on the Top-1 pair, under the same bitrate, with product quantization on the Glove1.2M dataset. Glove1.2M is a collection of 1.2 million 100-dimensional word embeddings trained with the method described in Pennington et al. (2014). We measure $\frac{|\langle q, x\rangle - \langle q, \tilde{x}\rangle|}{|\langle q, x\rangle|}$ as the relative error on the true inner product. The new objective clearly produces smaller relative error over all bitrate settings (Figure 1b).
Next, we show our MIPS retrieval results, comparing quantization methods that use our proposed loss function against state-of-the-art methods that use reconstruction loss. The goal of the comparison is to show that the new objective is compatible with common quantization techniques, such as product and binary quantization, and can benefit their performance. To evaluate retrieval performance, we use the Recall$M$@$N$ metric, which measures the fraction of the ground truth top-$M$ MIPS results recalled in the first $N$ datapoints returned by the retrieval algorithm.
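For a single query, this metric can be sketched as follows, assuming exact inner products are available as ground truth and the retrieval algorithm ranks datapoints by approximate scores. The function name and the toy scores are hypothetical; in practice the value would be averaged over a set of queries.

```python
import numpy as np

def recall_m_at_n(true_scores, approx_scores, m, n):
    """Fraction of the true top-m items found among the top-n items ranked by approx_scores."""
    true_top = np.argsort(-true_scores)[:m]
    retrieved = np.argsort(-approx_scores)[:n]
    return len(set(true_top.tolist()) & set(retrieved.tolist())) / m

true_scores = np.array([0.9, 0.1, 0.8, 0.3])     # exact inner products with the query
approx_scores = np.array([0.5, 0.2, 0.7, 0.6])   # inner products with quantized datapoints

r_1_at_1 = recall_m_at_n(true_scores, approx_scores, m=1, n=1)   # 0.0
r_1_at_3 = recall_m_at_n(true_scores, approx_scores, m=1, n=3)   # 1.0
r_2_at_2 = recall_m_at_n(true_scores, approx_scores, m=2, n=2)   # 0.5
```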
We follow the formulation in Section 4.2 to evaluate the effect of the new objective for MIPS, testing on the Glove1.2M dataset and reporting Recall1@1, Recall1@10, Recall10@10 and Recall10@100 in Figure 2a. The number of centers in each subspace codebook is fixed to a value compatible with SIMD optimized ADC computation. The experiments show that the proposed loss function significantly outperforms existing state-of-the-art methods.
Extreme classification with a large number of classes requires evaluating the last layer (classification layer) with all possible classes. When the number of classes is large, this becomes a major computation bottleneck, as it involves a huge matrix multiplication. Thus this is often solved using Maximum Inner Product Search to speed up the inference. We evaluate our methods on extreme classification using the Amazon-670k dataset Bhatia et al. (2015). An MLP classifier is trained over 670,091 classes, where the last layer has a dimensionality of 1,024. We evaluate retrieval performance on the classification layer and show the results in Figure 2c, by comparing it against brute force matrix multiplication.
Another interesting question is how the retrieval performance relates to the threshold $T$. Intuitively, if $T$ is set too low, then the objective takes almost all pairs of query and database points into consideration, and becomes similar to the standard reconstruction loss. If $T$ is too high, then very few pairs will be considered and the quantization may be inaccurate for low value inner product pairs. Figure 1c shows the retrieval evaluation on the Glove1.2M dataset under different $T$. We use a single fixed value of $T$ for all of our retrieval experiments.
Although MIPS and nearest neighbor search are different problems, comparisons between them are often requested, especially in the case of unit norm data, where the two problems become equivalent. To evaluate our methods on the speed-recall trade-off, we adopted the methodology of the public benchmark suite ANN-benchmarks Aumüller et al. (2019), which plots a comprehensive set of algorithms for comparison. The benchmarks are conducted on the same platform, an Intel Xeon W-2135, using a single CPU thread. Our implementation builds on product quantization with the proposed method and SIMD based ADC Guo et al. (2016). This is further combined with a vector quantization based tree Wu et al. (2017), and the curve is plotted by varying the number of leaves to search in the tree. Figure 2d shows that our method significantly outperforms the competing methods on Glove1.2M, especially in the high recall region, where Recall10 is over 80%.
In this paper, we propose a new quantization loss function for inner product search, which replaces traditional reconstruction error. The new loss function is weighted based on the inner product values, giving more weight to the pairs of query and database points with higher inner product values. The proposed loss function is theoretically grounded and can be applied to a wide range of quantization methods, for example product and binary quantization. Our experiments show superior performance on retrieval recall and inner product value estimation, compared to methods that use reconstruction error. The speed-recall benchmark on public datasets further indicates that the proposed method outperforms state-of-the-art baselines which are known to be hard to beat.
Dai et al. (2017). Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 913–922.
Dasgupta and Freund (2008). Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 537–546.
Gong et al. (2013). Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929.
Guo et al. (2016). Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 482–490.
Indyk and Motwani (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613.
Muja and Lowe (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2227–2240.
Pennington et al. (2014). Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
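Recall that $\alpha = \arccos(T)$, $I_d = \int_0^{\alpha} \sin^d\theta \, d\theta$ and $\lambda(T) = I_{d-2}/I_d - 1$. First, it is easy to see that because $0 \le \sin\theta \le 1$ on $[0, \alpha]$, $\frac{I_{d-2}}{I_d} \ge 1$. Next, from Cauchy–Schwarz inequality for integrals, we have
$$I_d^2 = \Big(\int_0^{\alpha} \sin^{\frac{d-2}{2}}\theta \, \sin^{\frac{d+2}{2}}\theta \, d\theta\Big)^2 \le I_{d-2}\, I_{d+2}.$$
Rearranging this we have $\frac{I_d}{I_{d+2}} \le \frac{I_{d-2}}{I_d}$, which proves that $\frac{I_{d-2}}{I_d}$ is monotonically non-increasing. Given that it has a lower bound and is monotonically non-increasing, the limit of $\frac{I_{d-2}}{I_d}$ exists.
Dividing both sides of Equation (8), $I_d = -\frac{\cos\alpha \sin^{d-1}\alpha}{d} + \frac{d-1}{d} I_{d-2}$, by $I_d$, we have
$$1 = -\frac{\cos\alpha \, \sin^{d-1}\alpha}{d\, I_d} + \frac{d-1}{d} \cdot \frac{I_{d-2}}{I_d}.$$
Thus $\lim_{d \to \infty} \frac{\cos\alpha \, \sin^{d-1}\alpha}{d\, I_d}$ exists. And therefore $\lim_{d \to \infty} \frac{\sin^{d-1}\alpha}{d\, I_d}$ also exists. Furthermore, this limit is non-zero: otherwise the identity above would force $\frac{I_{d-2}}{I_d} \to 1$, contradicting $\frac{I_{d-2}}{I_d} \ge \frac{1}{\sin^2\alpha} > 1$, which holds because $\sin^2\theta \le \sin^2\alpha$ on $[0, \alpha]$. Taking the ratio of consecutive terms,
$$\lim_{d \to \infty} \frac{\sin^{d+1}\alpha / \big((d+2)\, I_{d+2}\big)}{\sin^{d-1}\alpha / \big(d\, I_d\big)} = \sin^2\alpha \cdot \lim_{d \to \infty} \frac{I_d}{I_{d+2}} = 1.$$
Finally we have $\lim_{d \to \infty} \lambda(T) = \frac{1}{\sin^2\alpha} - 1 = \frac{\cos^2\alpha}{\sin^2\alpha} = \frac{T^2}{1 - T^2}$, and this proves Theorem 3.3. ∎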
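Consider a partition of $n$ datapoints $\{x_{p_1}, \dots, x_{p_n}\}$ with quantizer $\tilde{c}$. Using $\|r_\parallel(x, \tilde{c})\|^2 = \langle x, x - \tilde{c}\rangle^2 / \|x\|^2$ and $\|r_\perp(x, \tilde{c})\|^2 = \|x - \tilde{c}\|^2 - \|r_\parallel(x, \tilde{c})\|^2$, the objective in (11) can be written as
$$f(\tilde{c}) = \sum_{j=1}^{n} \Big(\|x_{p_j} - \tilde{c}\|^2 + (\mu - 1)\, \frac{\langle x_{p_j}, x_{p_j} - \tilde{c}\rangle^2}{\|x_{p_j}\|^2}\Big).$$
Therefore, the derivative of $f$ with respect to $\tilde{c}$ is
$$\frac{\partial f}{\partial \tilde{c}} = 2 \sum_{j=1}^{n} \Big(I + (\mu - 1)\, \frac{x_{p_j} x_{p_j}^T}{\|x_{p_j}\|^2}\Big)(\tilde{c} - x_{p_j}).$$
Setting $\frac{\partial f}{\partial \tilde{c}} = 0$, and noting that $\frac{x x^T}{\|x\|^2}\, x = x$, we have
$$\tilde{c}^* = \mu \Big(n I + (\mu - 1) \sum_{j=1}^{n} \frac{x_{p_j} x_{p_j}^T}{\|x_{p_j}\|^2}\Big)^{-1} \sum_{j=1}^{n} x_{p_j}.$$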
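Input:
A set of $N$ datapoints $x_1, \dots, x_N \in \mathbb{R}^d$.
A scalar $\mu$, the weight for quantization error components tradeoff (selection guided by the desired inner product threshold $T$ in Section 3).
A positive integer $k$, the size of the codebook.
Output:
A codebook $\{c_1, \dots, c_k\}$.
Partition assignment $a_1, \dots, a_N \in \{1, \dots, k\}$ for the datapoints such that $x_i$ is quantized as $c_{a_i}$.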
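For example, consider the first vector $c$ in the codebook for the first subspace, and let $x$ be one of the datapoints the first subspace of which is encoded as $c$, i.e., $\tilde{x}^{(1)} = c$. Write $x$ as $(x^{(1)}, x^{(-1)})$, and $\tilde{x}$ as $(c, \tilde{x}^{(-1)})$, where the superscript $(-1)$ collects all remaining subspaces. Then
$$\mu\, \|r_\parallel(x, \tilde{x})\|^2 + \|r_\perp(x, \tilde{x})\|^2 = \|x^{(1)} - c\|^2 + \|x^{(-1)} - \tilde{x}^{(-1)}\|^2 + \frac{\mu - 1}{\|x\|^2} \Big(\langle x^{(1)}, x^{(1)} - c\rangle + \langle x^{(-1)}, x^{(-1)} - \tilde{x}^{(-1)}\rangle\Big)^2.$$
Let $S$ denote the set of indices $\{i : \tilde{x}_i^{(1)} = c\}$. Therefore, the partial derivative of (13) with respect to $c$ is
$$2 \sum_{i \in S} \Big[(c - x_i^{(1)}) - \frac{\mu - 1}{\|x_i\|^2} \Big(\langle x_i^{(1)}, x_i^{(1)} - c\rangle + \langle x_i^{(-1)}, x_i^{(-1)} - \tilde{x}_i^{(-1)}\rangle\Big)\, x_i^{(1)}\Big]. \tag{14}$$
Set (14) to be zero, and we have
$$c^* = \Big(|S|\, I + (\mu - 1) \sum_{i \in S} \frac{x_i^{(1)} x_i^{(1)T}}{\|x_i\|^2}\Big)^{-1} \sum_{i \in S} \Big(1 + \frac{\mu - 1}{\|x_i\|^2} \big(\|x_i^{(1)}\|^2 + \langle x_i^{(-1)}, x_i^{(-1)} - \tilde{x}_i^{(-1)}\rangle\big)\Big)\, x_i^{(1)}.$$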