New Loss Functions for Fast Maximum Inner Product Search

08/27/2019 · Ruiqi Guo, et al.

Quantization based methods are popular for solving large scale maximum inner product search problems. However, in most traditional quantization works, the objective is to minimize the reconstruction error for datapoints to be searched. In this work, we focus directly on minimizing error in inner product approximation and derive a new class of quantization loss functions. One key aspect of the new loss functions is that we weight the error term based on the value of the inner product, giving more importance to pairs of queries and datapoints whose inner products are high. We provide theoretical grounding to the new quantization loss function, which is simple, intuitive and able to work with a variety of quantization techniques, including binary quantization and product quantization. We conduct experiments on standard benchmarking datasets to demonstrate that our method using the new objective outperforms other state-of-the-art methods.


1 Introduction

Maximum inner product search (MIPS) has become a popular paradigm for solving large scale classification and retrieval tasks. For example, in recommendation systems, user queries and documents are embedded into a dense vector space of the same dimensionality, and MIPS is used to find the most relevant documents given a user query Cremonesi et al. (2010). Similarly, in extreme classification tasks Dean et al. (2013), MIPS is used to predict the class label when a large number of classes, often on the order of millions or even billions, is involved. Lately, MIPS has also been applied to training tasks such as scalable gradient computation in large output spaces Yen et al. (2018), efficient sampling for speeding up softmax computation Mussmann and Ermon (2016), and sparse updates in end-to-end trainable memory systems Pritzel et al. (2017).

To formally define the Maximum Inner Product Search (MIPS) problem, consider a database $X = \{x_i\}_{i=1,\dots,N}$ of $N$ datapoints, where each datapoint $x_i$ lies in a $d$-dimensional vector space. In the MIPS setup, given a query $q$, we would like to find the datapoint $x \in X$ that has the highest inner product with $q$, i.e., we would like to identify

$x^*_q := \arg\max_{x \in X}\; \langle q, x \rangle.$
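As a concrete reference point, exact MIPS is a single matrix-vector product followed by an argmax. The short sketch below (plain NumPy, with arbitrary array shapes of our choosing) is the brute-force baseline that the quantization methods in this paper aim to approximate:

```python
import numpy as np

def exact_mips(query, database, top_k=1):
    """Brute-force MIPS: indices of the top_k datapoints with the
    highest inner product <query, x_i>."""
    scores = database @ query              # (N,) exact inner products
    return np.argsort(-scores)[:top_k]

# toy usage with random data (shapes are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 100))          # N = 10000 datapoints, d = 100
q = rng.normal(size=100)
print(exact_mips(q, X, top_k=5))
```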

Exhaustively computing the exact inner product between $q$ and all $N$ datapoints is often very expensive and sometimes infeasible. Several techniques based on hashing and quantization have been proposed in the literature to solve the approximate maximum inner product search problem efficiently, and quantization based techniques have shown strong performance Ge et al. (2014); Babenko and Lempitsky (2014); Johnson et al. (2017). Quantizing each datapoint $x_i$ to $\tilde{x}_i$ not only reduces storage costs and memory bandwidth bottlenecks, but also permits efficient computation of distances: it avoids memory-bandwidth-intensive floating point operations through Hamming distance computation and lookup table operations Norouzi et al. (2014); Jegou et al. (2011); Wu et al. (2017). In most traditional quantization works, the objective of the quantization procedure is to minimize the reconstruction error for the datapoints to be searched.

In this paper, we propose a new class of loss functions in quantization to improve the performance of MIPS. Our contribution is threefold:

  • We derive a novel class of loss functions for quantization, which departs from the regular reconstruction loss by weighting the error of each pair of query $q$ and datapoint $x$ based on its inner product value. We prove that such weighting leads to an effective loss function, which can be used by a wide class of quantization algorithms.

  • We devise algorithms for learning the codebook, as well as quantizing new datapoints, using the new loss functions. In particular, we give details for two families of quantization algorithms, binary quantization and product quantization.

  • We show that on large scale standard benchmark datasets, such as Glove1.2M, the change of objective yields a significant gain in the accuracy of inner product approximation, as well as in retrieval performance.

This paper is organized as follows. We first briefly review previous literature on quantization for Maximum Inner Product Search, as well as its links to nearest neighbor search in Section 2. Next, we give our main result, which is the derivation of our objective in Section 3. Applications of the new loss functions to binary quantization and product quantization are given in Section 4. Finally, we present the experimental results in Section 5.

2 Related Works

There is a large body of similarity search literature on inner product and nearest neighbor search. We refer readers to Wang et al. (2014, 2016) for a comprehensive survey. Some methods transform the MIPS problem into an equivalent nearest neighbor search problem using transformations such as those of Shrivastava and Li (2014); Neyshabur and Srebro (2014), but these are in general less successful than methods that work directly in the original space. In general, these bodies of work can be divided into two families: (1) representing the data as quantized codes so that similarity computation becomes more efficient, and (2) pruning the dataset during the search so that only a subset of datapoints is considered.

Typical works in the first family include binary quantization (or binary hashing) techniques Indyk and Motwani (1998); Shrivastava and Li (2014) and product quantization techniques Jegou et al. (2011); Guo et al. (2016), although other families such as additive quantization Babenko and Lempitsky (2014); Martinez et al. (2016) and ternary quantization Zhu et al. (2016) also apply. Many subsequent papers extend these base approaches with more sophisticated codebook learning strategies, such as He et al. (2013); Erin Liong et al. (2015); Dai et al. (2017) for binary quantization and Zhang et al. (2014); Wu et al. (2017) for product quantization. There are also lines of work that focus on learning transformations before quantization Gong et al. (2013); Ge et al. (2014). In contrast to these methods, which essentially minimize the reconstruction error of the database points, we argue in Section 3 that reconstruction loss is suboptimal in the MIPS context, and any quantization method can potentially benefit from our proposed objective.

The second family includes non-exhaustive search techniques such as tree search Muja and Lowe (2014); Dasgupta and Freund (2008), graph search Malkov and Yashunin (2016); Harwood and Drummond (2016), and hash bucketing Andoni et al. (2015) in the nearest neighbor search literature. There also exist variants of these for the MIPS problem Ram and Gray (2012); Shrivastava and Li (2014). Some of these approaches lead to larger memory requirements or random access patterns, due to the cost of constructing index structures in addition to storing the original vectors. Thus they are usually used in combination with linear-search quantization methods, in a manner similar to an inverted index Jegou et al. (2011); Babenko and Lempitsky (2012); Matsui et al. (2015).

3 Problem Formulation

Common quantization techniques focus on minimizing the reconstruction error (sum of squared errors) when $x_i$ is quantized to $\tilde{x}_i$. It can be shown that minimizing the reconstruction error is equivalent to minimizing the expected inner product quantization error under a mild condition on the query distribution. Indeed, consider the quantization objective of minimizing the expected total inner product quantization error over the query distribution:

(1) $\mathbb{E}_q\Big[\sum_i \big(\langle q, x_i\rangle - \langle q, \tilde{x}_i\rangle\big)^2\Big] = \mathbb{E}_q\Big[\sum_i \langle q, x_i - \tilde{x}_i\rangle^2\Big].$

Under the assumption that $q$ is isotropic, i.e., $\mathbb{E}_q[q q^T] = c\,I$, where $I$ is the identity matrix and $c$ is a positive constant, the objective function becomes

$\sum_i \mathbb{E}_q\big[(x_i - \tilde{x}_i)^T q q^T (x_i - \tilde{x}_i)\big] = c\sum_i \|x_i - \tilde{x}_i\|^2.$

Therefore, the objective reduces to minimizing the reconstruction error of the database points, $\sum_i \|x_i - \tilde{x}_i\|^2$, and this has been considered extensively in the literature.

One key observation about the above objective function (1) is that it takes the expectation over all possible combinations of datapoints $x_i$ and queries $q$. However, it is easy to see that not all pairs $(q, x_i)$ are equally important. The approximation error on pairs with a high inner product is far more important, since such pairs are likely to be among the top-ranked ones and can greatly affect the search result, while for pairs whose inner product is low the approximation error matters much less. In other words, for a given datapoint $x_i$, we should quantize it with a bigger focus on its error with respect to queries that have a high inner product with $x_i$.

Following this key observation, we propose a new loss function that weights the approximation error of the inner product based on the value of the true inner product. More precisely, let $w: \mathbb{R} \mapsto \mathbb{R}^+$ be a monotonically non-decreasing function, and consider the following inner-product-weighted quantization error:

(2) $\sum_i \mathbb{E}_q\big[w(\langle q, x_i\rangle)\,\langle q, x_i - \tilde{x}_i\rangle^2\big].$

One common choice is $w(t) = \mathbb{I}(t \ge T)$, in which case we care about all pairs whose inner product is greater than or equal to a threshold $T$, and disregard the rest of the pairs.

One key result of this work is a decomposition of the inner-product-weighted quantization error based on the direction of the datapoints. We show that the new loss function (2) can be expressed as a weighted sum of the parallel and orthogonal components of the residual error with respect to the raw datapoints. Formally, let $r(x, \tilde{x}) := x - \tilde{x}$ denote the quantization residual. Given the datapoint $x$ and its quantizer $\tilde{x}$, we can decompose the residual error into two parts, one parallel to $x$ and one orthogonal to $x$:

(3) $r_\parallel(x, \tilde{x}) = \dfrac{\langle x - \tilde{x},\, x\rangle}{\|x\|^2}\,x,$

(4) $r_\perp(x, \tilde{x}) = (x - \tilde{x}) - r_\parallel(x, \tilde{x}),$

so that $r_\parallel(x, \tilde{x}) + r_\perp(x, \tilde{x}) = x - \tilde{x}$. Because the norm of $q$ does not affect the ranking result, without loss of generality we assume $\|q\| = 1$ to simplify the derivation below.
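For concreteness, a minimal NumPy sketch of the decomposition in (3)-(4); the helper name is ours:

```python
import numpy as np

def residual_decomposition(x, x_tilde):
    """Split the quantization residual x - x_tilde into the component
    parallel to x (Eq. 3) and the component orthogonal to x (Eq. 4)."""
    r = x - x_tilde
    r_par = (np.dot(r, x) / np.dot(x, x)) * x
    r_orth = r - r_par
    return r_par, r_orth

# sanity check: the two components sum back to the residual
x = np.array([0.6, 0.8, 0.0])
x_tilde = np.array([0.5, 0.7, 0.1])
r_par, r_orth = residual_decomposition(x, x_tilde)
assert np.allclose(r_par + r_orth, x - x_tilde)
assert abs(np.dot(r_orth, x)) < 1e-12      # orthogonal component has no overlap with x
```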

Theorem 3.1.

Assuming the query $q$ is uniformly distributed on the $d$-dimensional unit sphere, given the datapoint $x$ and its quantizer $\tilde{x}$, conditioned on the inner product $\langle q, x\rangle = t$ for some $t \ge 0$, we have

(5) $\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \,\big|\, \langle q, x\rangle = t\big] = \dfrac{t^2}{\|x\|^2}\,\|r_\parallel(x, \tilde{x})\|^2 + \dfrac{1 - t^2/\|x\|^2}{d - 1}\,\|r_\perp(x, \tilde{x})\|^2.$
Proof.

First, we decompose $q = q_\parallel + q_\perp$ with $q_\parallel = \frac{\langle q, x\rangle}{\|x\|^2}\,x$ and $q_\perp = q - q_\parallel$, where $q_\parallel$ is parallel to $x$ and $q_\perp$ is orthogonal to $x$. Then, we have

(6) $\mathbb{E}\big[\langle q, x - \tilde{x}\rangle^2\big] = \mathbb{E}\big[\big(\langle q_\parallel, r_\parallel\rangle + \langle q_\perp, r_\perp\rangle\big)^2\big] = \mathbb{E}\big[\langle q_\parallel, r_\parallel\rangle^2\big] + \mathbb{E}\big[\langle q_\perp, r_\perp\rangle^2\big],$

where all expectations are conditioned on $\langle q, x\rangle = t$. The last step uses the fact that $\mathbb{E}\big[\langle q_\parallel, r_\parallel\rangle\langle q_\perp, r_\perp\rangle\big] = 0$ due to symmetry. The first term of (6) equals $\frac{t^2}{\|x\|^2}\|r_\parallel\|^2$. For the second term, since $q_\perp$ is uniformly distributed in the $(d-1)$-dimensional subspace orthogonal to $x$ with norm $\|q_\perp\| = \sqrt{1 - t^2/\|x\|^2}$, we have $\mathbb{E}\big[\langle q_\perp, r_\perp\rangle^2\big] = \frac{1 - t^2/\|x\|^2}{d - 1}\|r_\perp\|^2$. Therefore, (5) follows. ∎

In the common scenario of $x$ being unit-normed, i.e., $\|x\| = 1$, we have

$\mathbb{E}_q\big[\langle q, x - \tilde{x}\rangle^2 \,\big|\, \langle q, x\rangle = t\big] = t^2\,\|r_\parallel(x, \tilde{x})\|^2 + \dfrac{1 - t^2}{d - 1}\,\|r_\perp(x, \tilde{x})\|^2.$

Now we are ready to compute the inner-product-weighted quantization error (2) for the case $w(t) = \mathbb{I}(t \ge T)$ (similar derivations can be carried out for any reasonable $w$). Without loss of generality and for simplicity, we show the results below for the case where $q$ and all datapoints $x_i$ are unit-normed.

Proposition 1.

Assuming the query $q$ is uniformly distributed on the $d$-dimensional unit sphere with $\|q\| = 1$, and all datapoints $x_i$ are unit-normed, given $T > 0$,

(7) $\sum_i \mathbb{E}_q\big[\mathbb{I}(\langle q, x_i\rangle \ge T)\,\langle q, x_i - \tilde{x}_i\rangle^2\big] \;\propto\; \sum_i \Big[\lambda(d, T)\,\|r_\parallel(x_i, \tilde{x}_i)\|^2 + \|r_\perp(x_i, \tilde{x}_i)\|^2\Big],$

where $\lambda(d, T)$ is defined as

$\lambda(d, T) = (d - 1)\,\dfrac{\int_T^1 t^2\,(1 - t^2)^{(d-3)/2}\,dt}{\int_T^1 (1 - t^2)^{(d-1)/2}\,dt}.$

We can show that $\lambda(d, T)$ can be analytically computed, and we analyze its behavior as the dimension $d \to \infty$.

Theorem 3.2.

$\lambda(d, T)$ admits a closed form that can be computed in $O(d)$ time via the recursion (8) below.

Proof.

Let $I_m(T) := \int_T^1 (1 - t^2)^{m/2}\,dt$, and note that $\int_T^1 t^2(1 - t^2)^{(d-3)/2}\,dt = I_{d-3}(T) - I_{d-1}(T)$. Note also that $(1 - t^2)^{(d-3)/2}$ is proportional to the surface area of the $(d-2)$-dimensional hypersphere with a radius of $\sqrt{1 - t^2}$; this is why it appears as the density of $t = \langle q, x\rangle$, up to the normalizing constant given by the surface area of the unit $(d-1)$-sphere.

Thus, $\lambda(d, T)$ can be re-written as:

$\lambda(d, T) = (d - 1)\,\dfrac{I_{d-3}(T) - I_{d-1}(T)}{I_{d-1}(T)}.$

Integrating $I_m(T)$ by parts,

$I_m(T) = -T(1 - T^2)^{m/2} + m\int_T^1 t^2(1 - t^2)^{(m-2)/2}\,dt = -T(1 - T^2)^{m/2} + m\big(I_{m-2}(T) - I_m(T)\big).$

This gives us a recursive formula to compute $I_m(T)$ when $m$ is a positive integer:

(8) $I_m(T) = \dfrac{m\,I_{m-2}(T) - T(1 - T^2)^{m/2}}{m + 1}.$

With the base cases $I_0(T) = 1 - T$ and $I_1(T) = \tfrac{1}{2}\big(\arccos(T) - T\sqrt{1 - T^2}\big)$, the exact value of $\lambda(d, T)$ can be computed explicitly in $O(d)$ time. ∎

We furthermore prove that the limit of $\lambda(d, T)$ as $d \to \infty$ exists and identify its value. In Figure 1(a), we plot $\lambda(d, T)$ and observe that it approaches its limit quickly as $d$ grows.

Theorem 3.3.

When , we have .

Proof.

See Section 7.1 of the Appendix. ∎
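As a quick numerical illustration of $\lambda(d, T)$ as reconstructed in (7), the sketch below evaluates the two defining integrals by simple trapezoidal quadrature; the helper name, grid resolution, and example values are ours:

```python
import numpy as np

def lam(d, T, grid=100001):
    """lambda(d, T) from Eq. (7):
    (d-1) * int_T^1 t^2 (1-t^2)^((d-3)/2) dt / int_T^1 (1-t^2)^((d-1)/2) dt."""
    t = np.linspace(T, 1.0, grid)
    num = np.trapz(t ** 2 * (1.0 - t ** 2) ** ((d - 3) / 2.0), t)
    den = np.trapz((1.0 - t ** 2) ** ((d - 1) / 2.0), t)
    return (d - 1) * num / den

# evaluate lambda for a few dimensions at an arbitrary threshold (illustrative values only)
for d in (16, 32, 64, 128):
    print(d, round(lam(d, T=0.2), 3))
```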

Therefore, motivated by minimizing the inner product approximation error for pairs of queries and datapoints whose inner product is significant, given a quantization scheme (e.g., vector quantization, product quantization, additive quantization), we propose to minimize the weighted quantization error

(9) $\sum_i \Big[\lambda\,\|r_\parallel(x_i, \tilde{x}_i)\|^2 + \|r_\perp(x_i, \tilde{x}_i)\|^2\Big],$

where $\lambda > 0$ is a hyperparameter depending on the datapoint dimension $d$ and the threshold $T$ imposed on the inner product between queries and datapoints (for instance, $\lambda = \lambda(d, T)$ from Proposition 1). Note that when the hyperparameter $\lambda$ is set to 1, (9) reduces to the traditional reconstruction error of the datapoints.
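A small sketch of the proposed loss, assuming the form reconstructed in (9); the helper name is ours. Setting `lam=1.0` recovers plain squared reconstruction error, which is checked in the usage example:

```python
import numpy as np

def weighted_quantization_error(X, X_tilde, lam):
    """Eq. (9): sum_i  lam * ||r_par(x_i, x~_i)||^2 + ||r_orth(x_i, x~_i)||^2."""
    R = X - X_tilde                                         # residuals, shape (N, d)
    norm_sq = np.sum(X * X, axis=1, keepdims=True)
    R_par = (np.sum(R * X, axis=1, keepdims=True) / norm_sq) * X
    R_orth = R - R_par
    return np.sum(lam * np.sum(R_par ** 2, axis=1) + np.sum(R_orth ** 2, axis=1))

# with lam = 1 this is exactly the usual reconstruction error
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
X_tilde = X + 0.01 * rng.normal(size=X.shape)
assert np.isclose(weighted_quantization_error(X, X_tilde, 1.0),
                  np.sum((X - X_tilde) ** 2))
```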

4 Application to Quantization Techniques

In this section, we derive algorithms for applying new loss functions in  (9) to common quantization techniques, including vector quantization, product quantization and binary quantization.

4.1 Vector Quantization

Recall that in vector quantization, given a set of $N$ datapoints $\{x_1, \dots, x_N\}$, we want to find a codebook $\{c_1, \dots, c_k\}$ of size $k$ and quantize each datapoint as one of the $k$ codes. The goal is to minimize the total squared quantization error. Formally, traditional vector quantization solves

$\min_{c_1, \dots, c_k}\; \sum_i \min_{j \in \{1, \dots, k\}} \|x_i - c_j\|^2.$

One of the most popular algorithms for this objective is the $k$-Means algorithm, which iteratively partitions the datapoints among the $k$ quantizers, setting the centroid of each partition to the mean of the datapoints assigned to it.

Motivated by minimizing the inner product quantization error for cases where the inner product between queries and datapoints is high, our proposed objective solves:

(10) $\min_{c_1, \dots, c_k}\; \sum_i \min_{j \in \{1, \dots, k\}} \Big[\lambda\,\|r_\parallel(x_i, c_j)\|^2 + \|r_\perp(x_i, c_j)\|^2\Big],$

where $\lambda$ is a hyperparameter set as a function of $d$ and $T$, following (9).

We solve (10) with a $k$-Means-style Lloyd's algorithm, which iteratively minimizes the new loss function by assigning datapoints to partitions and updating each partition's quantizer in every iteration. The assignment step is computed by enumerating the quantizers and, for each datapoint, picking the one that minimizes the loss in (10). The update step finds the new quantizer $c^*$ for a partition $P$ of datapoints, i.e.,

(11) $c^* = \arg\min_{c}\; \sum_{x_i \in P} \Big[\lambda\,\|r_\parallel(x_i, c)\|^2 + \|r_\perp(x_i, c)\|^2\Big].$

Because of the changed objective, the best quantizer is no longer the centroid of the partition. Since (11) is a convex function of $c$, there exists an optimal solution; the update rule for a fixed partitioning is found by setting the partial derivative of (11) with respect to each codebook entry to zero. This algorithm provably converges in a finite number of steps. See Algorithm 1 in the Appendix for a complete outline. Note that in the special case $\lambda = 1$, it reduces to the regular $k$-Means algorithm.

Theorem 4.1.

The optimal solution of (11) is

(12) $c^* = \Big(\sum_{x_i \in P}\Big(I + (\lambda - 1)\,\dfrac{x_i x_i^T}{\|x_i\|^2}\Big)\Big)^{-1}\Big(\lambda \sum_{x_i \in P} x_i\Big).$
Proof.

See Section 7.2 of the Appendix. ∎
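As a sanity check on the closed-form update reconstructed in (12), the short sketch below compares it against a generic numerical minimizer of (11) on random data; the comparison setup is ours:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 8))                   # a partition of datapoints
lam = 4.0
norm_sq = np.sum(P * P, axis=1, keepdims=True)

def loss(c):                                   # objective of Eq. (11)
    R = P - c
    par = np.sum(R * P, axis=1) ** 2 / norm_sq[:, 0]
    return np.sum((lam - 1.0) * par + np.sum(R * R, axis=1))

# closed form of Eq. (12)
A = len(P) * np.eye(P.shape[1]) + (lam - 1.0) * (P / norm_sq).T @ P
c_closed = np.linalg.solve(A, lam * P.sum(axis=0))

c_numeric = minimize(loss, x0=P.mean(axis=0)).x
print(np.max(np.abs(c_closed - c_numeric)))    # should be close to zero
```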

Theorem 4.2.

Algorithm 1 converges in finite number of steps.

Proof.

This follows from the fact that the loss defined in (10) is non-increasing during both the assignment and the centroid update steps under the changed objective, and that there are only finitely many possible partitionings of the datapoints. ∎

4.2 Product Quantization

A natural extension of vector quantization is product quantization, which works better in high dimensional spaces. In product quantization, the original vector space is decomposed as the Cartesian product of $K$ distinct subspaces of dimension $d/K$, and vector quantization is applied in each subspace separately (a random rotation or permutation of the original vectors can be applied before taking the Cartesian product). For example, let $x \in \mathbb{R}^d$ be written as

$x = \big(x^{(1)}, x^{(2)}, \dots, x^{(K)}\big),$

where $x^{(k)} \in \mathbb{R}^{d/K}$ denotes the sub-vector of $x$ in the $k$-th subspace. We quantize each $x^{(k)}$ to $\tilde{x}^{(k)}$ with the vector quantizer of subspace $k$, for $k = 1, \dots, K$. With product quantization, $x$ is quantized as $\tilde{x} = \big(\tilde{x}^{(1)}, \dots, \tilde{x}^{(K)}\big)$ and can be represented compactly using the $K$ assigned codes.

Using our proposed loss objective (9), we minimize the following loss function instead of the usual reconstruction error:

(13) $\sum_i \Big[\lambda\,\|r_\parallel(x_i, \phi(x_i))\|^2 + \|r_\perp(x_i, \phi(x_i))\|^2\Big],$

where $\phi(x_i)$ denotes the product quantization of $x_i$, i.e.,

$\phi(x_i) = \big(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)}, \dots, \tilde{x}_i^{(K)}\big).$

To optimize (13), we apply the vector quantization of Section 4.1 in each subspace, except that the code assignment is chosen to minimize the global objective (13) over all subspaces, instead of the objective of each subspace independently. Similarly, the update rule is found by setting the derivative of the loss in (13) with respect to each codebook entry to zero; this derivation is given in Section 7.4 of the Appendix.
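A sketch of the assignment step just described, assuming per-subspace codebooks are already available: codes are picked by coordinate descent, re-choosing each subspace's code so as to minimize the global weighted loss of (13) with the other subspaces held fixed. All helper names and the number of sweeps are ours:

```python
import numpy as np

def weighted_loss(x, x_tilde, lam):
    # lam * ||r_par||^2 + ||r_orth||^2 for a single datapoint, as in Eq. (9)
    r = x - x_tilde
    par = np.dot(r, x) ** 2 / np.dot(x, x)
    return (lam - 1.0) * par + np.dot(r, r)

def pq_assign(x, codebooks, lam, sweeps=3):
    """codebooks: list of K arrays, each of shape (codes_per_book, d/K).
    Greedy coordinate descent over subspaces on the global objective (13)."""
    K = len(codebooks)
    codes = [0] * K
    for _ in range(sweeps):
        for k in range(K):
            best, best_loss = codes[k], np.inf
            for j in range(len(codebooks[k])):
                codes[k] = j
                x_tilde = np.concatenate([codebooks[m][codes[m]] for m in range(K)])
                l = weighted_loss(x, x_tilde, lam)
                if l < best_loss:
                    best, best_loss = j, l
            codes[k] = best
    return codes
```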

4.3 Binary Quantization

Another popular family of quantization functions is binary quantization. In this setting, a function is learned to quantize datapoints into binary codes, which saves storage space and speeds up distance computation. There are many possible ways to design such a binary quantization function. We follow the setting of Dai et al. (2017), which explicitly minimizes the reconstruction loss and has been shown to outperform earlier baselines. In their paper, a binary auto-encoder is learned to quantize and dequantize binary codes:

$\min\; \sum_i \big\|x_i - g\big(h(x_i)\big)\big\|^2,$

where $h(\cdot)$ is the "encoder", which binarizes the original datapoint into a binary code, and $g(\cdot)$ is the "decoder", which reconstructs the datapoint given its binary code. The authors use a linear transformation followed by binarization as the encoder and a linear transformation as the decoder. The learning objective is to minimize the reconstruction error $\|x_i - g(h(x_i))\|^2$, and the weights of the encoder and decoder are optimized end-to-end using standard stochastic gradient descent. Following our discussion in Section 3, this is a suboptimal objective in the MIPS setting, and the learning objective given in (9) should be preferred. This requires only minimal changes to the algorithm of Dai et al. (2017): the loss function is replaced, while everything else is kept unchanged.
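A minimal PyTorch-style sketch of the modification described above: a linear-encoder/linear-decoder binary autoencoder (a simplification of Dai et al. (2017), which uses stochastic binarization) trained end-to-end with the weighted loss of (9) instead of plain reconstruction error. The layer sizes, the straight-through trick for the sign function, and the value of `lam` are our choices, not the paper's exact configuration:

```python
import torch

def anisotropic_loss(x, x_rec, lam):
    # lam * ||r_par||^2 + ||r_orth||^2, per Eq. (9), averaged over the batch
    x_norm_sq = (x * x).sum(dim=1, keepdim=True).clamp_min(1e-12)
    r = x - x_rec
    r_par = ((r * x).sum(dim=1, keepdim=True) / x_norm_sq) * x
    r_orth = r - r_par
    return (lam * (r_par ** 2).sum(dim=1) + (r_orth ** 2).sum(dim=1)).mean()

class BinaryAutoEncoder(torch.nn.Module):
    def __init__(self, dim, bits):
        super().__init__()
        self.enc = torch.nn.Linear(dim, bits)              # linear encoder (hypothetical sizes)
        self.dec = torch.nn.Linear(bits, dim, bias=False)  # linear decoder

    def forward(self, x):
        z = self.enc(x)
        b = torch.sign(z)
        b = b + z - z.detach()   # straight-through estimator: forward = sign(z), backward = identity
        return self.dec(b)

# usage sketch on a stand-in batch of datapoints
model = BinaryAutoEncoder(dim=128, bits=64)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(256, 128)
loss = anisotropic_loss(x, model(x), lam=4.0)
opt.zero_grad(); loss.backward(); opt.step()
```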

5 Experiments

In this section, we show that our proposed quantization objective leads to improved MIPS performance when applied to product and binary quantization; other, more sophisticated methods can also benefit from the new objective. Our experiments analyze the usefulness of the proposed objective relative to the typical reconstruction loss. Finally, we discuss the speed-recall trade-off in Section 5.5.

5.1 Estimation of True Inner Product

Figure 1: (a) $\lambda(d, T)$ in (7), computed analytically as a function of $d$ using the recursion of (8), quickly approaches its limit. (b) The relative error of inner product estimation for the true Top-1 on the Glove1.2M dataset, across multiple number-of-bits settings. (c) Retrieval Recall1@10 for different values of $T$.

In addition to retrieval, many application scenarios also require estimating the value of the inner product $\langle q, x\rangle$. For example, in softmax functions, inner product values are often used to compute probabilities directly; in many textual or multimedia retrieval systems, the inner product is used as a score for downstream classification. One direct consequence of (9) is that the objective weighs pairs by their importance and thus leads to lower estimation error on top-ranking pairs. We compare the estimation error of the true inner product on the Top-1 pair, at the same bitrate, with product quantization on the Glove1.2M dataset. Glove1.2M is a collection of 1.2 million 100-dimensional word embeddings trained with the method described in Pennington et al. (2014). We measure the relative error $|\langle q, \tilde{x}\rangle - \langle q, x\rangle| / |\langle q, x\rangle|$ with respect to the true inner product. The new objective clearly produces a smaller relative error across all bitrate settings (Figure 1b).
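For reference, the relative-error measurement used here can be computed as in this short sketch (our own helper), given the exact datapoints and their quantized versions:

```python
import numpy as np

def top1_relative_error(q, X, X_tilde):
    """Relative error of the approximated inner product on the true Top-1."""
    true_scores = X @ q
    i = np.argmax(true_scores)                         # true Top-1 datapoint
    return abs(X_tilde[i] @ q - true_scores[i]) / abs(true_scores[i])
```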

5.2 Maximum Inner Product Search Retrieval

Next, we present our MIPS retrieval results, comparing quantization methods that use our proposed loss function against state-of-the-art methods that use reconstruction loss. The goal of the comparison is to show that the new objective is compatible with common quantization techniques, such as product and binary quantization, and can improve their performance. To evaluate retrieval performance, we use the RecallM@N metric, which measures the fraction of the true top-M MIPS results recalled in the first N datapoints returned by the retrieval algorithm.
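The RecallM@N numbers reported below can be computed as in this small sketch (our own implementation of the standard definition):

```python
import numpy as np

def recall_m_at_n(true_scores, approx_scores, m, n):
    """Fraction of the true top-m datapoints found among the top-n
    datapoints ranked by the approximate (quantized) scores."""
    true_top = set(np.argsort(-true_scores)[:m])
    retrieved = set(np.argsort(-approx_scores)[:n])
    return len(true_top & retrieved) / m
```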

5.2.1 Product Quantization Retrieval

We follow the formulation in Section 4.2 to evaluate the effect of the new objective for MIPS, testing on the Glove1.2M dataset and reporting Recall1@1, Recall1@10, Recall10@10, and Recall10@100 in Figure 2a. To be compatible with SIMD-optimized ADC computation, we set the number of centers in each subspace codebook to 16. The experiments clearly show that the proposed loss function significantly outperforms existing state-of-the-art methods.

5.2.2 Binary Quantization Retrieval

We use SIFT1M, a standard benchmark dataset of 1 million 128-dimensional normalized datapoints extracted from image descriptors. We apply the learning algorithm of Section 4.3 to MIPS and report the results in Figure 2b.

5.3 Extreme Classification Inference

Extreme classification with a large number of classes requires evaluating the last (classification) layer against all possible classes. When there are hundreds of thousands of classes, this becomes a major computational bottleneck, as it involves a huge matrix multiplication; it is therefore often accelerated with maximum inner product search. We evaluate our method on extreme classification using the Amazon-670k dataset Bhatia et al. (2015). An MLP classifier is trained over 670,091 classes, where the last layer has a dimensionality of 1,024. We evaluate retrieval performance on the classification layer, comparing against brute-force matrix multiplication, and show the results in Figure 2c.

5.4 Sensitivity to the choice of T

Another interesting question is how the retrieval performance depends on $T$, given $w(t) = \mathbb{I}(t \ge T)$. Intuitively, if $T$ is set too low, the objective takes almost all pairs of queries and database points into consideration and becomes similar to the standard reconstruction loss. If $T$ is set too high, very few pairs are considered and the quantization may become inaccurate for pairs with lower inner products. Figure 1c shows the retrieval performance on the Glove1.2M dataset for different values of $T$. We use a fixed $T$ for all of our retrieval experiments.

(a) Retrieval recall on Glove1.2M
Glove1.2M        1@1    1@10   10@10  10@100
100 bits, PQ     0.201  0.486  0.237  0.587
100 bits, Ours   0.243  0.550  0.268  0.635
200 bits, PQ     0.427  0.833  0.487  0.915
200 bits, Ours   0.535  0.907  0.559  0.955
400 bits, PQ     0.732  0.992  0.782  0.999
400 bits, Ours   0.805  0.998  0.833  1.000

(b) Retrieval recall on SIFT1M
SIFT1M           1@1    1@10   10@10  10@100
64 bits, SGH     0.028  0.096  0.053  0.220
64 bits, Ours    0.071  0.185  0.093  0.327
128 bits, SGH    0.073  0.195  0.105  0.376
128 bits, Ours   0.196  0.406  0.209  0.574
256 bits, SGH    0.142  0.331  0.172  0.539
256 bits, Ours   0.362  0.662  0.363  0.820

(c) Retrieval recall on Amazon670k
Amazon670k       1@1    1@10   10@10  10@100
256 bits, PQ     0.652  0.995  0.782  0.974
256 bits, Ours   0.656  0.996  0.787  0.977
1024 bits, PQ    0.778  1.000  0.901  1.000
1024 bits, Ours  0.812  1.000  0.899  1.000
4096 bits, PQ    0.867  1.000  0.973  1.000
4096 bits, Ours  0.950  1.000  0.980  1.000

(d) Speed-recall trade-off on Glove1.2M (Recall@10).
Figure 2: (a)-(c) Recall on the Glove1.2M, SIFT1M, and Amazon670k datasets, comparing state-of-the-art methods using reconstruction loss with the proposed objective. Retrieval is computed using asymmetric distance computation (ADC). (d) Speed benchmark with baselines from Aumüller et al. (2019); details of the baselines can be found on their website: http://ann-benchmarks.com/. Our approach provides the best speed-recall trade-off among popular state-of-the-art methods.

5.5 Speed benchmark

Although MIPS and nearest neighbor search are different problems, the two are often compared, especially in the case of unit-norm data, where the problems become equivalent. To evaluate the speed-recall trade-off of our method, we adopt the methodology of the public benchmark suite ANN-Benchmarks Aumüller et al. (2019), which compares a comprehensive set of algorithms. The benchmarks are conducted on the same platform, an Intel Xeon W-2135 using a single CPU thread. Our implementation builds on the proposed product quantization method with SIMD-based ADC Guo et al. (2016). This is further combined with a vector quantization based tree Wu et al. (2017), and the curve is plotted by varying the number of leaves to search in the tree. Figure 2d shows that our performance on Glove1.2M significantly outperforms the competing methods, especially in the high-recall region where Recall@10 is over 80%.

6 Conclusion

In this paper, we propose a new quantization loss function for inner product search that replaces the traditional reconstruction error. The new loss function weights the error based on the inner product values, giving more weight to pairs of queries and database points with higher inner product values. The proposed loss function is theoretically grounded and can be applied to a wide range of quantization methods, for example product and binary quantization. Our experiments show superior performance on retrieval recall and inner product value estimation compared to methods that use reconstruction error. The speed-recall benchmark on public datasets further indicates that the proposed method outperforms state-of-the-art baselines that are known to be hard to beat.

References

  • A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt (2015) Practical and optimal lsh for angular distance. In Advances in Neural Information Processing Systems, pp. 1225–1233. Cited by: §2.
  • M. Aumüller, E. Bernhardsson, and A. Faithfull (2019) ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Information Systems. Cited by: Figure 2, §5.5.
  • A. Babenko and V. Lempitsky (2012) The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3069–3076. Cited by: §2.
  • A. Babenko and V. Lempitsky (2014) Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 931–938. Cited by: §1, §2.
  • K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain (2015) Sparse local embeddings for extreme multi-label classification. In Advances in neural information processing systems, pp. 730–738. Cited by: §5.3.
  • P. Cremonesi, Y. Koren, and R. Turrin (2010) Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 39–46. Cited by: §1.
  • B. Dai, R. Guo, S. Kumar, N. He, and L. Song (2017) Stochastic generative hashing. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 913–922. Cited by: §2, §4.3.
  • S. Dasgupta and Y. Freund (2008) Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 537–546. Cited by: §2.
  • T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik (2013) Fast, accurate detection of 100,000 object classes on a single machine: technical supplement. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou (2015) Deep hashing for compact binary codes learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • T. Ge, K. He, Q. Ke, and J. Sun (2014) Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (4), pp. 744–755. Cited by: §1, §2.
  • Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929. Cited by: §2.
  • R. Guo, S. Kumar, K. Choromanski, and D. Simcha (2016) Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 482–490. Cited by: §2, §5.5.
  • B. Harwood and T. Drummond (2016) FANNG: fast approximate nearest neighbour graphs. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 5713–5722. Cited by: §2.
  • K. He, F. Wen, and J. Sun (2013) K-means hashing: an affinity-preserving quantization method for learning binary compact codes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2938–2945. Cited by: §2.
  • P. Indyk and R. Motwani (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. Cited by: §2.
  • H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §1, §2, §2.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §1.
  • Y. A. Malkov and D. A. Yashunin (2016) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR abs/1603.09320. Cited by: §2.
  • J. Martinez, J. Clement, H. H. Hoos, and J. J. Little (2016) Revisiting additive quantization. In European Conference on Computer Vision, pp. 137–153. Cited by: §2.
  • Y. Matsui, T. Yamasaki, and K. Aizawa (2015) Pqtable: fast exact asymmetric distance neighbor search for product quantization using hash tables. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1940–1948. Cited by: §2.
  • M. Muja and D. G. Lowe (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2227–2240. Cited by: §2.
  • S. Mussmann and S. Ermon (2016) Learning and inference via maximum inner product search. In Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 2587–2596. Cited by: §1.
  • B. Neyshabur and N. Srebro (2014) On symmetric and asymmetric lshs for inner product search. arXiv preprint arXiv:1410.5518. Cited by: §2.
  • M. Norouzi, A. Punjani, and D. J. Fleet (2014) Fast exact search in hamming space with multi-index hashing. IEEE transactions on pattern analysis and machine intelligence 36 (6), pp. 1107–1119. Cited by: §1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §5.1.
  • A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell (2017) Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2827–2836. Cited by: §1.
  • P. Ram and A. G. Gray (2012) Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 931–939. Cited by: §2.
  • A. Shrivastava and P. Li (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329. Cited by: §2, §2, §2.
  • J. Wang, H. T. Shen, J. Song, and J. Ji (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927. Cited by: §2.
  • J. Wang, W. Liu, S. Kumar, and S. Chang (2016) Learning to hash for indexing big data survey. Proceedings of the IEEE 104 (1), pp. 34–57. Cited by: §2.
  • X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu (2017) Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems 30, pp. 5745–5755. Cited by: §1, §2, §5.5.
  • I. E. Yen, S. Kale, F. Yu, D. Holtmann-Rice, S. Kumar, and P. Ravikumar (2018) Loss decomposition for fast learning in large output spaces. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 5640–5649. Cited by: §1.
  • T. Zhang, C. Du, and J. Wang (2014) Composite quantization for approximate nearest neighbor search.. In ICML, Vol. 2, pp. 3. Cited by: §2.
  • C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.

7 Appendix

7.1 Proof of Theorem 3.3

Proof of Theorem 3.3.

First, it is easy to see that because , . Next, from Cauchy–Schwarz inequality for integrals, we have

Rearranging this we have , which proves that is monotonically non-increasing. Given that it has a lower bound and is monotonically non-increasing, the limit of exists.

Dividing both sides of Equation. 8 by , we have

Thus exists. And therefore also exists. Furthermore,

Finally we have , and this proves Theorem 3.3. ∎

7.2 Proof of Theorem 4.1

Proof of Theorem 4.1.

Indeed, for a single datapoint $x_i$ and candidate quantizer $c$,

$\lambda\,\|r_\parallel(x_i, c)\|^2 + \|r_\perp(x_i, c)\|^2 = (\lambda - 1)\,\dfrac{\langle x_i - c,\, x_i\rangle^2}{\|x_i\|^2} + \|x_i - c\|^2.$

Therefore, the derivative of (11) with respect to $c$ is

$-2\sum_{x_i \in P}\Big[(\lambda - 1)\,\dfrac{\langle x_i - c,\, x_i\rangle}{\|x_i\|^2}\,x_i + (x_i - c)\Big].$

Setting it to zero, we have

$\Big(\sum_{x_i \in P}\Big(I + (\lambda - 1)\,\dfrac{x_i x_i^T}{\|x_i\|^2}\Big)\Big)\,c^* = \lambda\sum_{x_i \in P} x_i,$

which gives (12). ∎

7.3 Algorithm for Vector Quantization with the modified objective

Input:
  • A set of datapoints $\{x_1, x_2, \dots, x_N\}$.
  • A scalar $\lambda > 0$, the weight trading off the two quantization error components (its selection is guided by the desired inner product threshold $T$; see Section 3).
  • A positive integer $k$, the size of the codebook.

Output:
  • A codebook $\{c_1, c_2, \dots, c_k\}$.
  • Partition assignments $\{j_1, \dots, j_N\}$ for the datapoints such that $x_i$ is quantized as $c_{j_i}$.

Algorithm:
Initialize the centroids $c_1, \dots, c_k$ by choosing $k$ random datapoints.
Set $L_{\mathrm{old}} = \infty$.
do
     [Partition Assignment]
     for each $x_i$ do
          $j_i \leftarrow \arg\min_{j} \big[\lambda\,\|r_\parallel(x_i, c_j)\|^2 + \|r_\perp(x_i, c_j)\|^2\big]$
     end for
     [Centroid Update]
     for each partition $P_j = \{x_i : j_i = j\}$ do
          $c_j \leftarrow \Big(|P_j|\,I + (\lambda - 1)\sum_{x \in P_j}\frac{x x^T}{\|x\|^2}\Big)^{-1}\Big(\lambda\sum_{x \in P_j} x\Big)$, where $|P_j|$ denotes the cardinality of the set $P_j$.
     end for
     Compute the loss $L_{\mathrm{new}}$ of (10) under the current codebook and assignments, then set $L_{\mathrm{old}} \leftarrow L_{\mathrm{new}}$.
while the loss keeps decreasing ($L_{\mathrm{new}} < L_{\mathrm{old}}$)
Output the codebook $\{c_j\}_{j=1}^{k}$ and the assignments $\{j_i\}_{i=1}^{N}$.
Algorithm 1 Proposed Vector Quantization Algorithm For Minimizing Weighted Quantization Errors

7.4 Codebook Optimization in Product Quantization

For example, consider the first vector $c^{(1)}_1$ in the codebook for the first subspace, and let $x_i$ be one of the datapoints whose first subspace is encoded as $c^{(1)}_1$, i.e., $\tilde{x}_i^{(1)} = c^{(1)}_1$. Write $x_i$ as $\big(x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(K)}\big)$, and $\phi(x_i)$ as $\big(c^{(1)}_1, \tilde{x}_i^{(2)}, \dots, \tilde{x}_i^{(K)}\big)$. Then the loss contributed by $x_i$ is

$(\lambda - 1)\,\dfrac{\langle x_i - \phi(x_i),\, x_i\rangle^2}{\|x_i\|^2} + \|x_i - \phi(x_i)\|^2.$

Let $S$ denote the set of indices $i$ such that $\tilde{x}_i^{(1)} = c^{(1)}_1$. Therefore, the partial derivative of (13) with respect to $c^{(1)}_1$ is

(14) $-2\sum_{i \in S}\Big[(\lambda - 1)\,\dfrac{\langle x_i - \phi(x_i),\, x_i\rangle}{\|x_i\|^2}\,x_i^{(1)} + \big(x_i^{(1)} - c^{(1)}_1\big)\Big].$

Set (14) to zero, and we have

(15) $\Big(\sum_{i \in S}\Big(I + (\lambda - 1)\,\dfrac{x_i^{(1)} \big(x_i^{(1)}\big)^T}{\|x_i\|^2}\Big)\Big)\,c^{(1)}_1 = \sum_{i \in S}\Big[x_i^{(1)} + (\lambda - 1)\,\dfrac{\|x_i^{(1)}\|^2 + \big\langle x_i^{(\setminus 1)} - \tilde{x}_i^{(\setminus 1)},\, x_i^{(\setminus 1)}\big\rangle}{\|x_i\|^2}\,x_i^{(1)}\Big],$

where $x_i^{(\setminus 1)}$ and $\tilde{x}_i^{(\setminus 1)}$ denote the concatenation of the remaining subspaces of $x_i$ and $\phi(x_i)$, respectively. This linear system is solved for $c^{(1)}_1$ with the other codebook entries held fixed, and analogously for every other codebook entry.