Local Orthogonal Decomposition for Maximum Inner Product Search

03/25/2019, by Xiang Wu, et al.

Inverted file and asymmetric distance computation (IVFADC) have been successfully applied to approximate nearest neighbor search and subsequently maximum inner product search. In such a framework, vector quantization is used for coarse partitioning while product quantization is used for quantizing residuals. In the original IVFADC as well as all of its variants, after residuals are computed, the second product quantization step is completely independent of the first vector quantization step. In this work, we seek to exploit the connection between these two steps when we perform non-exhaustive search. More specifically, we decompose a residual vector locally into two orthogonal components and perform uniform quantization and multiscale quantization to each component respectively. The proposed method, called local orthogonal decomposition, combined with multiscale quantization consistently achieves higher recall than previous methods under the same bitrates. We conduct comprehensive experiments on large scale datasets as well as detailed ablation tests, demonstrating the effectiveness of our method.


1 Introduction

Maximum inner product search (MIPS) has become a popular paradigm for solving large scale classification and retrieval tasks. For example, in recommendation systems, user queries and documents are embedded into a dense vector space of the same dimensionality and MIPS is used to find the most relevant documents given a user query [9]. Similarly, in extreme classification tasks [10], MIPS is used to predict the class label when a large number of classes are involved, often on the order of millions or even billions. Lately it has also been applied to training tasks such as scalable gradient computation in large output spaces [23], efficient sampling for speeding up softmax computation [17], and sparse updates in end-to-end trainable memory systems [20].

Formally, MIPS solves the following problem: given a database of vectors $X = \{x_1, \ldots, x_N\}$ and a query vector $q$, where both $x_i, q \in \mathbb{R}^d$, we want to find $x^* \in X$ such that $x^* = \arg\max_{x \in X} \langle q, x \rangle$.

Although related, MIPS differs from nearest neighbor search in that the inner product (IP) is not a metric and the triangle inequality does not apply. We discuss this further in Section 2.

1.1 Background

We refer to several quantization techniques in this work and briefly introduce them and their notation:

  • Scalar Quantization (SQ): The codebook of SQ, $C = \{c_1, \ldots, c_k\}$, contains $k$ scalars. A scalar $x$ is quantized into its closest codeword, $\phi_{SQ}(x) = \arg\min_{c \in C} |x - c|$. The bitrate per input is $\lceil \log_2 k \rceil$.

  • Uniform Quantization (UQ): UQ is a specialization of SQ whose codebook is parameterized with only 2 scalars, an offset $a$ and a step size $\Delta$: $C = \{a + i\Delta : i = 0, 1, \ldots, k-1\}$. Though the UQ codebook is restricted to this structure, its major advantage over general SQ is that the codebook can be compactly represented with only 2 scalars.

  • Vector Quantization (VQ): VQ is the natural extension of scalar quantization to vector spaces. Given a codebook $C = \{c_1, \ldots, c_k\} \subset \mathbb{R}^d$ with $k$ codewords, an input vector $x$ is quantized into its closest codeword, $\phi_{VQ}(x) = \arg\min_{c \in C} \|x - c\|$. The code that we store for vector $x$ is the index of the closest codeword in the VQ codebook: $\arg\min_i \|x - c_i\|$.

  • Product Quantization (PQ): To apply PQ, we first divide a vector into $K$ subspaces, $x = (x^{(1)}, \ldots, x^{(K)})$, and within each subspace we apply an independent VQ with $k$ codewords, i.e., $\phi_{PQ}(x) = (\phi_{VQ}^{(1)}(x^{(1)}), \ldots, \phi_{VQ}^{(K)}(x^{(K)}))$. The bitrate per input for PQ is thus $K \lceil \log_2 k \rceil$. A minimal code sketch of these quantizers follows this list.
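As a concrete reference, the following is a minimal numpy sketch of the quantizers above; the function names are illustrative (not from any particular MIPS library) and codebook learning is delegated to scipy's k-means.

```python
# Minimal numpy sketch of SQ/UQ/VQ/PQ. Codebook learning is delegated to scipy's
# k-means; all function names here are illustrative.
import numpy as np
from scipy.cluster.vq import kmeans2

def uq_codebook(a, step, k):
    """Uniform quantization codebook: k evenly spaced scalars a, a+step, ..."""
    return a + step * np.arange(k)

def sq_quantize(x, codebook):
    """Scalar quantization: index of the nearest codeword for each scalar in x."""
    return np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)

def vq_quantize(X, codebook):
    """Vector quantization: index of the closest codeword for each row of X."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return np.argmin(d2, axis=1)

def pq_train_and_encode(X, num_subspaces=4, num_codewords=16):
    """Product quantization: independent VQ in each subspace; returns codebooks and codes."""
    subspaces = np.array_split(np.arange(X.shape[1]), num_subspaces)
    codebooks, codes = [], []
    for dims in subspaces:
        cb, code = kmeans2(X[:, dims], num_codewords, minit="points")
        codebooks.append(cb)
        codes.append(code)
    return subspaces, codebooks, np.stack(codes, axis=1)  # codes: (N, num_subspaces)

# Example: 100-dimensional vectors encoded at num_subspaces * log2(num_codewords) = 16 bits each.
X = np.random.randn(1000, 100)
subspaces, codebooks, codes = pq_train_and_encode(X)
```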

The IVFADC [12] framework combines VQ for coarse partitioning and PQ for residual quantization:

  • IVF: An inverted file is generated via a VQ partitioning with $m$ centers $c_1, \ldots, c_m$. Each VQ partition $P_j$ contains all database vectors whose closest VQ center is $c_j$, i.e., $P_j = \{x : j = \arg\min_i \|x - c_i\|\}$. Within each partition $P_j$, residual vectors $r = x - c_j$ are further quantized with PQ, and we denote the quantized approximation of the residual $r$ as $\tilde{r}$.

  • ADC: Asymmetric distance computation refers to an efficient table lookup algorithm that computes the approximate IP. For VQ, $\langle q, x \rangle \approx \langle q, c_j \rangle + \langle q, \tilde{r} \rangle$. For PQ with multiple subspaces, we can decompose the residual IP as $\langle q, \tilde{r} \rangle = \sum_{i=1}^{K} \langle q^{(i)}, \tilde{r}^{(i)} \rangle$, where each term is read from a table of precomputed IPs between the query subvector and the PQ codewords of that subspace (see the sketch after this list).

  • Non-Exhaustive Search: When processing a query $q$, we use the IVF to rank partitions according to $\langle q, c_j \rangle$. We select the top $t$ partitions to search into and then apply ADC to the residuals in these top partitions.
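The sketch below illustrates the ADC table lookup for PQ-coded residuals of a single partition; the PQ layout and all names are illustrative stand-ins, not the implementation used in our experiments.

```python
# Minimal sketch of asymmetric distance computation (ADC) for PQ codes: the query is
# compared against each subspace codebook once, and per-vector inner products are then
# accumulated by table lookup. The PQ layout and names are illustrative.
import numpy as np
from scipy.cluster.vq import kmeans2

def build_lookup_tables(query, subspaces, codebooks):
    """tables[m][c] = <query restricted to subspace m, codeword c of subspace m>."""
    return [codebooks[m] @ query[dims] for m, dims in enumerate(subspaces)]

def adc_inner_products(codes, tables):
    """Approximate <query, residual> for every coded residual in a partition."""
    ip = np.zeros(codes.shape[0])
    for m, table in enumerate(tables):
        ip += table[codes[:, m]]          # one table lookup per subspace
    return ip

# Toy usage: PQ-encode the residuals of one partition, then score them against a query.
d, N, M, k = 64, 5000, 8, 16
residuals = np.random.randn(N, d)
subspaces = np.array_split(np.arange(d), M)
codebooks, codes = [], []
for dims in subspaces:
    cb, code = kmeans2(residuals[:, dims], k, minit="points")
    codebooks.append(cb)
    codes.append(code)
codes = np.stack(codes, axis=1)

q = np.random.randn(d)
approx_ip = adc_inner_products(codes, build_lookup_tables(q, subspaces, codebooks))
exact_ip = residuals @ q                  # for comparison only
```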

There are many variations of the IVFADC setup. For example, the codebooks of the VQ partitioning and the PQ quantization can be (jointly) learned, and asymmetric distance computation can be implemented with SIMD instructions [22, 8]. We discuss these variations and their relation to this work in Section 2.

In large scale applications, as the database size increases, larger numbers of partitions are generally used in the IVF. Auvolat et al. [2] propose scaling the number of partitions with the database size for 1-level and 2-level VQ partitionings. Recent publications [5, 13] likewise report numbers of partitions for large datasets that far exceed the vector dimension. Hence in the following discussion, we focus on the case where the number of partitions $m$ is much larger than the vector dimension $d$, i.e., $m \gg d$.

The scale of modern MIPS systems is often limited by the cost of storing the quantized vectors in main memory. Therefore, we focus on methods that operate under low bitrate and can still achieve high recall. This is reflected in our experiments in Section 5.

1.2 Empirical Study of Inner Product Variance

The overall quality of the IP approximation depends crucially on the joint distribution of the query $q$ and the residual $r = x - c_j$, where $c_j$ is the center of the partition that $x$ is assigned to. In the non-exhaustive setup, the fact that we search into partition $P_j$ reveals strong information about the local conditional query distribution. Nonetheless, previous methods approximate $\langle q, r \rangle$ by first quantizing $r$ independently of this distribution. A close analysis of the IP $\langle q, r \rangle$ clearly shows that its variance is distributed non-uniformly across directions. Formally, a direction is a unit norm vector $u \in \mathbb{R}^d$ with $\|u\| = 1$, and the projected IP on direction $u$ is defined as $\langle q, u \rangle \langle u, r \rangle$. Within a partition $P_j$, we define the projected IP variance along $u$ as $\mathrm{Var}_{q,\, r \in P_j}\left[ \langle q, u \rangle \langle u, r \rangle \right]$. Note that the empirical first moment $\sum_{r \in P_j} r = 0$ by construction of VQ partitions.

We conduct two different analyses with the public Netflix [7] dataset. In Figure 1(a), we fix the query $q$ and thus its top partition $P_j$ and its center $c_j$. We pick a first direction $e_1$ and a second direction $e_2$ orthogonal to $e_1$ at random. We then generate evenly spaced directions in the subspace spanned by $\{e_1, e_2\}$ as $u_\theta = e_1 \cos\theta + e_2 \sin\theta$. We finally plot the projected IP variance along each $u_\theta$ at angle $\theta$, i.e., the distance between each point and the origin represents the projected IP variance on its direction. The elongated peanut shape demonstrates clearly that the variance of projected IPs is more concentrated on some directions than on others.
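The following sketch reproduces the Figure 1(a) computation on synthetic data (the Netflix embeddings are not included here); the residual and query distributions are assumed placeholders.

```python
# Sketch of the Figure 1(a) analysis on synthetic data: estimate the projected IP
# variance Var[<q, u><u, r>] along evenly spaced directions u in a 2-d subspace
# spanned by two orthogonal directions e1, e2. The data generation is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d = 50
residuals = rng.normal(size=(2000, d))            # stand-in for the residuals of one partition
residuals -= residuals.mean(axis=0)               # zero mean, as for VQ partitions
queries = rng.normal(size=(500, d)) + 3.0         # stand-in for queries routed to this partition

e1 = rng.normal(size=d); e1 /= np.linalg.norm(e1)
e2 = rng.normal(size=d); e2 -= e1 * (e1 @ e2); e2 /= np.linalg.norm(e2)

angles = np.linspace(0.0, np.pi, 64, endpoint=False)
variances = []
for theta in angles:
    u = np.cos(theta) * e1 + np.sin(theta) * e2
    # projected IP for every (query, residual) pair along direction u
    proj_ip = np.outer(queries @ u, residuals @ u)
    variances.append(proj_ip.var())
# Plotting each variance at its angle in polar coordinates reproduces the peanut-shaped
# plot described above.
```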

In Figure 1(b), we fix a partition and plot 1) the residuals in the partition and 2) the queries that have maximum IPs with the partition center. We project all residuals and all such queries onto the 2-dimensional subspace spanned by the partition center direction and the first principal direction of the residuals. The residuals (in blue) are scattered uniformly in this subspace, but the queries (in black) are much more concentrated along the direction of the partition center $c_j$.

Figure 1: Non-uniform distribution of projected IP variance: (a) projected IP variance vs. angle in the 2-dimensional subspace spanned by $\{e_1, e_2\}$. The projected IP variance is represented by the distance from the origin to the corresponding blue point at that angle. Variances are linearly scaled so that they fit the aspect ratio. (b) Scatter plot of residuals and of queries that have maximum IPs with the partition center.

1.3 Contributions

This paper makes the following main contributions:

  • Introduces a novel quantization scheme that directly takes advantage of the non-uniform distribution of the variance of projected IPs.

  • Identifies the optimal direction for projection within each partition and proposes an effective approximation to it, supported both theoretically and empirically.

  • Designs complete indexing and search algorithms that achieve higher recall than existing techniques on widely tested public datasets.

2 Related Work

The MIPS problem is closely related to the nearest neighbor search problem, as there are multiple ways to transform MIPS into equivalent instances of nearest neighbor search. For example, Shrivastava and Li [21] proposed augmenting the original vector with a few extra dimensions. Neyshabur and Srebro proposed a simpler transformation that augments the original vector with just one dimension, appending $\sqrt{M^2 - \|x\|^2}$ to each database vector $x$ (where $M$ is the maximum database norm) and a zero to the query. Empirically, the augmentation strategies do not perform strongly against strategies that work in the unaugmented space.

Learning Rotation and Codebooks. Learning-based variations of the IVFADC framework have been proposed. One focus is learning a rotation matrix that is applied before vectors are quantized. Such a rotation reduces intra-subspace statistical dependence, as analyzed in OPQ [11, 18] and its variant [14], and thus leads to smaller quantization error. Another focus is learning codebooks that are additive, such as [3, 4, 15, 25, 26, 16]. In these works, codewords are learned in the full vector space instead of a subspace and are thus more expressive. Empirically, such additive codebooks perform better than OPQ at low bitrates, but the gain diminishes at higher bitrates.

ADC Implementation. ADC transforms inner product computation into a lookup-table-based operation, which can be implemented in different ways. The original ADC paper [12] used an L1-cache-resident lookup table. Johnson et al. [13] used a GPU implementation for ADC lookups. A SIMD-based approach was also developed in [1, 22]. Again, this is orthogonal to the local decomposition idea of this work, as any ADC implementation can be used with it.

Rotations and codebooks are often applied in IVFADC variations, but there are significant costs associated with them. In the most extreme case, Locally Optimized Product Quantization (LOPQ) [14] learns a separate rotation matrix and codebook for each partition. This leads to an extra memory cost of $O(m d^2)$ for storing the per-partition rotations and $O(t d^2)$ extra multiplications for each query at search time, where $t$ is the number of VQ partitions we search. When $d$ and $t$ increase, the overhead quickly becomes noticeable and may become even more expensive than ADC itself; for example, under one typical parameter setting, performing the rotation once is as expensive as performing 6,400 ADC computations under an optimized implementation. In practice, it is often desirable to avoid per-partition rotations or codebooks and instead learn a global codebook and rotation.

3 Methods

Existing approaches based on the IVFADC framework mainly focus on minimizing the squared quantization error of the residuals, i.e., they seek quantization parameters $\theta^* = \arg\min_\theta \sum_x \| r_x - \phi_\theta(r_x) \|^2$. As we have discussed in previous sections, our "signal", i.e., the residual IP $\langle q, r \rangle$, exhibits strong non-uniformity locally within a partition. By directly taking advantage of this skewed distribution, our proposed method achieves higher recall at the same bitrate than methods that are agnostic of this phenomenon.

3.1 Local Orthogonal Decomposition

Given a unit norm vector or direction $u$, we define:

$\Pi_u = u u^\top, \qquad \Pi_u^\perp = I - u u^\top.$

Hence $\Pi_u$ is the projection matrix onto direction $u$ and $\Pi_u^\perp$ is the projection matrix onto its complement subspace. We can thus decompose a residual $r$ as $r = \Pi_u r + \Pi_u^\perp r$.

Similar to the original IVFADC framework, we first decompose the IP between a query and a database vector into $\langle q, x \rangle = \langle q, c_j \rangle + \langle q, r \rangle$. With our new insight into the non-uniformity of the distribution of our signal, we propose to further decompose the residual IP with respect to a learned direction $u$ as:

$\langle q, r \rangle = \langle q, \Pi_u r \rangle + \langle q, \Pi_u^\perp r \rangle = \langle q, u \rangle \langle u, r \rangle + \langle q, \Pi_u^\perp r \rangle.$

We name $\Pi_u r$ the projected component of the residual and $\Pi_u^\perp r$ the orthogonal component. Note that the projected component resides in the 1-dimensional subspace spanned by $u$ and can be quantized very efficiently with existing scalar quantization techniques.
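A minimal sketch of the decomposition, assuming only a residual vector and a candidate direction; names are illustrative.

```python
# Minimal sketch of the local orthogonal decomposition of a residual r with respect
# to a unit-norm direction u: r = (u u^T) r + (I - u u^T) r. Names are illustrative.
import numpy as np

def lod_decompose(r, u):
    """Return (projected component, orthogonal component) of residual r w.r.t. u."""
    u = u / np.linalg.norm(u)
    projected = u * (u @ r)          # Pi_u r = u <u, r>
    orthogonal = r - projected       # Pi_u^perp r = (I - u u^T) r
    return projected, orthogonal

# The two components are orthogonal and sum back to r:
r = np.random.randn(128)
u = np.random.randn(128)
p, o = lod_decompose(r, u)
assert np.allclose(p + o, r) and abs(p @ o) < 1e-8
```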

3.2 Multiscale Quantization of Orthogonal Component

To simplify notation, we write $r_\perp = \Pi_u^\perp r$ for the orthogonal component and $\hat{r}_\perp = r_\perp / \|r_\perp\|$ for its normalized version. Multiscale quantization (MSQ), proposed in [22], learns a per-vector scale and a rotation matrix that are applied to the product quantized residual as $\lambda_x R\, \phi_{PQ}(\hat{r}_\perp)$, where $R$ is a learned rotation matrix and $\phi_{PQ}$ is the product quantizer learned from the normalized orthogonal components. Differently from the original MSQ, our scale is chosen to preserve the norm of the orthogonal component $r_\perp$, not of the whole residual $r$:

$\lambda_x = \|r_\perp\| = \|\Pi_u^\perp r\|.$

The rotation is omitted here as it does not affect the norm. Another scalar quantization (SQ) is learned on the scales to further reduce the storage cost and speed up ADC. The final MSQ-quantized residual is then:

$\tilde{r}_\perp = \phi_{SQ}(\lambda_x)\, R\, \phi_{PQ}(\hat{r}_\perp),$

where $\phi_{SQ}$ is the non-uniform scalar quantizer for partition $P_j$ learned via Lloyd's algorithm. The number of codewords in the SQ codebook is fixed at 16 in our experiments, hence its storage cost is negligible.
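The following is a simplified sketch of this MSQ step on the orthogonal components of one partition; the learned rotation is omitted and the codebooks are learned with plain k-means, so it is an illustration rather than the exact procedure.

```python
# Sketch of multiscale quantization of the orthogonal components of one partition:
# PQ on the unit-normalized orthogonal components, plus a per-vector scale equal to the
# component's norm, scalar-quantized with a 16-level codebook (Lloyd's algorithm is
# 1-d k-means). The learned rotation is omitted; names are illustrative.
import numpy as np
from scipy.cluster.vq import kmeans2

def msq_encode(R_perp, num_subspaces=8, num_codewords=16, num_scale_levels=16):
    norms = np.linalg.norm(R_perp, axis=1)
    unit = R_perp / np.maximum(norms[:, None], 1e-12)
    subspaces = np.array_split(np.arange(R_perp.shape[1]), num_subspaces)
    codebooks, codes = [], []
    for dims in subspaces:
        cb, code = kmeans2(unit[:, dims], num_codewords, minit="points")
        codebooks.append(cb)
        codes.append(code)
    scale_cb, scale_code = kmeans2(norms[:, None], num_scale_levels, minit="points")
    return subspaces, codebooks, np.stack(codes, axis=1), scale_cb.ravel(), scale_code

def msq_decode(subspaces, codebooks, codes, scale_cb, scale_code):
    parts = [codebooks[m][codes[:, m]] for m in range(len(subspaces))]
    # the scale approximately restores the norm of each orthogonal component
    return scale_cb[scale_code][:, None] * np.hstack(parts)

R_perp = np.random.randn(4000, 64)
approx = msq_decode(*msq_encode(R_perp))
```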

3.3 Adjustment to Projected Component

In general, unlike $\Pi_u^\perp r$, its quantization $\tilde{r}_\perp$ is no longer exactly orthogonal to $u$. Recall that we want to approximate the residual IP in the orthogonal subspace as $\langle q, \Pi_u^\perp r \rangle \approx \langle q, \Pi_u^\perp \tilde{r}_\perp \rangle$. Now a subtle performance issue arises. A critical improvement to ADC introduced since the original OPQ [11] is to move the rotation multiplication to the query side so that it is done only once globally. Formally, with MSQ we can compute $\langle q, \lambda R\, \phi_{PQ}(\hat{r}_\perp) \rangle = \lambda \langle R^\top q, \phi_{PQ}(\hat{r}_\perp) \rangle$.

With LOD, the extra projection $\Pi_u^\perp$ in front of $R$ prevents us from moving $R$ to the query side, as the two matrices $\Pi_u^\perp$ and $R$ are not commutative in general. However, we have $\langle q, \Pi_u^\perp \tilde{r}_\perp \rangle = \langle q, \tilde{r}_\perp \rangle - \langle q, u \rangle \langle u, \tilde{r}_\perp \rangle$. We can perform fast ADC on the term $\langle q, \tilde{r}_\perp \rangle$ as proposed in the original MSQ [22] and multiply $R^\top$ to $q$ only once. The extra term $\langle q, u \rangle \langle u, \tilde{r}_\perp \rangle$ can be removed by subtracting $\langle u, \tilde{r}_\perp \rangle$ from the projected component before quantization.

3.4 Uniform Quantization of Projected Component

Following the procedure above, after projection onto direction $u$, the original residual contributes $\langle u, r \rangle$ and its quantized orthogonal component contributes an extra term $\langle u, \tilde{r}_\perp \rangle$. We thus need to quantize the difference between the two, $s = \langle u, r \rangle - \langle u, \tilde{r}_\perp \rangle$. We propose to learn a uniform quantization:

$\phi_{UQ}(s) = \mathrm{round}\left( (s - s_{\min}) \cdot \frac{2^b - 1}{s_{\max} - s_{\min}} \right),$

whereby:

  • $s_{\max}$ and $s_{\min}$ are the maximum and minimum of the finite input set $\{ s_x : x \in P_j \}$, and $b$ is the number of bits for uniform quantization;

  • the factor $\frac{2^b - 1}{s_{\max} - s_{\min}}$ scales the input into the range $[0, 2^b - 1]$;

  • subtracting $s_{\min}$ centers the input;

  • $\mathrm{round}(\cdot)$ is the function that rounds a floating point number to its nearest integer.

$\phi_{UQ}(s)$ is the integer code in $b$ bits that we store for each residual. In practice, we may relax $s_{\max}$ to a high quantile of the input to guard against outliers, and similarly $s_{\min}$ to a low quantile. We clip rounded outputs to within $[0, 2^b - 1]$.

The main advantage of UQ over other scalar quantization techniques is that its codebook size is independent of the number of bits used for its codes. This is critical as we use up to 8 bits in our experiments. It also enables fast computation of the approximate IP between the query and the projected component as $\langle q, u \rangle \left( s_{\min} + \phi_{UQ}(s) \cdot \frac{s_{\max} - s_{\min}}{2^b - 1} \right)$.
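A small sketch of this uniform quantizer (fit, encode with rounding and clipping, decode); the quantile-based outlier handling is left as a comment and all names are illustrative.

```python
# Sketch of the uniform quantization of the (adjusted) projected components of one
# partition: b-bit codes over the [min, max] range of the inputs, with rounding and
# clipping. Names are illustrative.
import numpy as np

def uq_fit(values, num_bits=8):
    lo, hi = values.min(), values.max()   # optionally: robust quantiles to guard against outliers
    return lo, hi, num_bits

def uq_encode(values, lo, hi, num_bits):
    scale = (2 ** num_bits - 1) / max(hi - lo, 1e-12)
    codes = np.rint((values - lo) * scale)            # center, scale, round
    return np.clip(codes, 0, 2 ** num_bits - 1).astype(np.uint32)

def uq_decode(codes, lo, hi, num_bits):
    scale = (hi - lo) / (2 ** num_bits - 1)
    return lo + codes * scale

s = np.random.randn(10000)                            # adjusted projected components
lo, hi, b = uq_fit(s)
approx_s = uq_decode(uq_encode(s, lo, hi, b), lo, hi, b)
# At search time, the contribution of the projected component is simply <q, u> * approx_s.
```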

Putting both quantization schemes together, we approximate the residual IP by replacing each component with its quantized counterpart:

$\langle q, r \rangle \approx \langle q, u \rangle\, \tilde{s} + \langle q, \tilde{r}_\perp \rangle,$

where $\tilde{s}$ is the dequantized UQ value of $s = \langle u, r \rangle - \langle u, \tilde{r}_\perp \rangle$ and $\tilde{r}_\perp$ is the MSQ quantization of the orthogonal component. For each term, we can perform efficient ADC.

3.5 Preserving Norms

We design the LOD+MSQ framework with the objective of preserving the norms of residuals. Note that:

$\|r\|^2 = \|\Pi_u r\|^2 + \|\Pi_u^\perp r\|^2.$

In the projected subspace, UQ preserves $\langle u, r \rangle$, and hence $\|\Pi_u r\|$, up to the uniform quantization error. In the orthogonal subspace, the scale is chosen as $\lambda_x = \|\Pi_u^\perp r\|$, so the norm of the orthogonal component is preserved up to the scalar quantization error on the scale. Hence we preserve the norm of $r$ up to small scalar quantization errors in the UQ and SQ steps. Empirically, preserving norms improves recall when there is considerable variation in residual norms [22].

3.6 Indexing and Search Algorithms

We list all parameters of the overall indexing and search algorithms besides their inputs in Table 1.

$m$: #partitions in the inverted file
$K$: #codebooks used for PQ encoding
$k$: #codewords used in each PQ codebook
$b$: #bits for UQ encoding
$b_{SQ}$: #bits for SQ encoding
$t$: #partitions to apply ADC to
Table 1: Parameters for the overall indexing and search algorithms.
Index($X$) begin
        input : database $X = \{x_1, \ldots, x_N\}$ and function ProjDir
        output : partitions $\{P_j\}$ with centers $\{c_j\}$, projection directions $\{u_j\}$, uniform quantizers $\{\phi_{UQ}^{(j)}\}$ and multiscale quantizers $\{\phi_{MSQ}^{(j)}\}$
        Learn $m$ VQ centers $\{c_j\}$ on $X$
        for $i = 1$ to $N$ do
               Assign $x_i$ to the partition $P_j$ with the closest center $c_j$; compute the residual $r_i = x_i - c_j$
        for $j = 1$ to $m$ do
               Compute the projection direction $u_j =$ ProjDir($P_j$, $c_j$)
               Compute the orthogonal components $\Pi_{u_j}^\perp r$ of all residuals in $P_j$ and learn the multiscale quantizer $\phi_{MSQ}^{(j)}$
               Compute the adjusted projected components $\langle u_j, r \rangle - \langle u_j, \tilde{r}_\perp \rangle$ and learn the uniform quantizer $\phi_{UQ}^{(j)}$
       return $\{P_j\}, \{c_j\}, \{u_j\}, \{\phi_{UQ}^{(j)}\}, \{\phi_{MSQ}^{(j)}\}$
Algorithm 1 Index database with local orthogonal decomposition and multiscale quantization. The projection direction is parameterized with the function ProjDir.
Search($q$) begin
        input : query $q$, number $t$ and the outputs of Index()
        output : approximate top maximum inner products
        Compute $\langle q, c_j \rangle$ for all centers and select the top $t$ partitions
        for each selected partition $P_j$ do
               Compute the projected query component $\langle q, u_j \rangle$
               For every residual in $P_j$, compute $\langle q, c_j \rangle + \langle q, u_j \rangle\, \tilde{s} + \langle q, \tilde{r}_\perp \rangle$ via ADC, where $\tilde{s}$ is the dequantized UQ code
       return the top approximate inner products
Algorithm 2 Search top inner products in an indexed database with query $q$.
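The following is a compact, self-contained sketch of Algorithms 1 and 2 under several simplifications: the projection direction is fixed to the partition center direction (anticipating Section 4), plain PQ with an unquantized per-vector scale stands in for MSQ, and no learned rotation is used. It is meant to show the data flow, not to reproduce our optimized implementation.

```python
# Compact end-to-end sketch of Algorithms 1 and 2 with simplified quantizers.
import numpy as np
from scipy.cluster.vq import kmeans2

def build_index(X, num_partitions=16, num_subspaces=8, num_codewords=16, uq_bits=8):
    centers, assign = kmeans2(X, num_partitions, minit="points")
    parts = []
    for j in range(num_partitions):
        ids = np.where(assign == j)[0]
        c = centers[j]
        u = c / np.linalg.norm(c)                          # projection direction = center direction
        R = X[ids] - c                                     # residuals
        R_perp = R - np.outer(R @ u, u)                    # orthogonal components
        norms = np.linalg.norm(R_perp, axis=1)
        unit = R_perp / np.maximum(norms[:, None], 1e-12)
        subspaces = np.array_split(np.arange(X.shape[1]), num_subspaces)
        codebooks, codes = [], []
        for dims in subspaces:
            cb, code = kmeans2(unit[:, dims], num_codewords, minit="points")
            codebooks.append(cb)
            codes.append(code)
        codes = np.stack(codes, axis=1)
        # decoded orthogonal components (needed only to compute the adjusted UQ inputs)
        R_tilde = norms[:, None] * np.hstack([codebooks[m][codes[:, m]]
                                              for m in range(num_subspaces)])
        s = R @ u - R_tilde @ u                            # adjusted projected components
        lo, hi = s.min(), s.max()
        step = max((hi - lo) / (2 ** uq_bits - 1), 1e-12)
        uq = np.clip(np.rint((s - lo) / step), 0, 2 ** uq_bits - 1)
        parts.append(dict(ids=ids, u=u, subspaces=subspaces, codebooks=codebooks,
                          codes=codes, norms=norms, lo=lo, step=step, uq=uq))
    return centers, parts

def search(q, centers, parts, num_probed=2, topk=10):
    center_ips = centers @ q
    cand_ids, cand_ips = [], []
    for j in np.argsort(-center_ips)[:num_probed]:
        p = parts[j]
        proj_q = center_ips[j] / np.linalg.norm(centers[j])     # <q, u>, free at search time
        tables = [p["codebooks"][m] @ q[dims] for m, dims in enumerate(p["subspaces"])]
        orth = np.zeros(len(p["ids"]))
        for m, table in enumerate(tables):
            orth += table[p["codes"][:, m]]                     # ADC lookups
        orth *= p["norms"]                                      # <q, r_perp~>
        proj = proj_q * (p["lo"] + p["uq"] * p["step"])         # <q, u> * dequantized UQ code
        cand_ips.append(center_ips[j] + proj + orth)
        cand_ids.append(p["ids"])
    cand_ips, cand_ids = np.concatenate(cand_ips), np.concatenate(cand_ids)
    order = np.argsort(-cand_ips)[:topk]
    return cand_ids[order], cand_ips[order]

X = np.random.randn(20000, 64)
q = np.random.randn(64)
centers, parts = build_index(X)
ids, ips = search(q, centers, parts)
```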

We want to highlight that in memory-bandwidth-limited large scale MIPS, the search time is well approximated by the number of bits read. In our experiments, we fix the fraction of partitions searched with ADC at 10%. The bitrate of the original dataset is 32 bits per dimension and we use either 0.5 or 1 bit per dimension in our quantization schemes. Hence we achieve over 2 orders of magnitude of speedup.
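A worked instance of this bits-read argument, assuming the 10% probing fraction and the 0.5 to 1 bit-per-dimension rates used in our experiments:

```latex
% Bits read per query: brute force scans all N vectors at 32 bits/dim; IVFADC scans
% 10% of the vectors at 0.5--1 bit/dim.
\[
  \text{speedup} \;\approx\;
  \frac{32\ \text{bits/dim} \times N d}
       {(0.5\ \text{to}\ 1)\ \text{bits/dim} \times 0.1\, N d}
  \;\in\; [320,\ 640],
\]
% i.e., more than two orders of magnitude fewer bits are read per query.
```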

4 Analysis

We left the projection direction function ProjDir as an input to our indexing algorithm in the previous section. In this section, we formally investigate the optimal projection direction $u^*$ for a partition $P_j$ with center $c_j$, conditional on the fact that $c_j$ is the query's maximum-IP center, i.e., that partition $P_j$ is searched.

We start by analyzing the error introduced by our quantization scheme into the approximate residual IP. Let $\epsilon_{UQ} = \langle u, r \rangle - \langle u, \tilde{r}_\perp \rangle - \tilde{s}$ denote the scalar error of the uniform quantization and $\epsilon_\perp = \Pi_u^\perp r - \Pi_u^\perp \tilde{r}_\perp$ the error of the quantized orthogonal component. Consider the quantization error of the residual IP within partition $P_j$:

$\langle q, r \rangle - \langle q, u \rangle\, \tilde{s} - \langle q, \tilde{r}_\perp \rangle = \langle q, u \rangle\, \epsilon_{UQ} + \langle q, \epsilon_\perp \rangle.$

First, UQ achieves a much lower error bound in its 1-dimensional subspace than MSQ can achieve in the orthogonal $(d-1)$-dimensional subspace. Moreover, UQ and MSQ are two completely separate quantization steps, so the cross term of their quantization errors is expected to be small. Therefore we focus on minimizing the last quantization error term, averaged over queries $q$ and residuals $r \in P_j$:

$\mathbb{E}\left[ \langle q, \epsilon_\perp \rangle^2 \right] = \mathbb{E}_q\left[ q^\top\, \mathbb{E}_r\left[ \epsilon_\perp \epsilon_\perp^\top \right]\, q \right].$

Notice that the matrix in the middle, $\mathbb{E}_r[\epsilon_\perp \epsilon_\perp^\top]$, also depends on the direction $u$, which makes this optimization problem very challenging.

However, the learned rotation in MSQ serves two purposes: 1) it reduces correlation between dimensions and 2) it evens out the variance allocation across PQ subspaces [11]. Hence it is reasonable to expect the errors to be close to isotropic across dimensions, assuming the subspace spanned by the orthogonal components does not degenerate into a low dimensional subspace. This leads to the following assumption:

Assumption 1.

The empirical covariance matrix of orthogonal component errors is isotropic.

This assumption allows us to approximate $\mathbb{E}_r[\epsilon_\perp \epsilon_\perp^\top]$ with $\sigma^2 \Pi_u^\perp$ for some constant $\sigma^2$. We thus arrive at minimizing $\sigma^2\, \mathbb{E}_q[\, q^\top \Pi_u^\perp q\, ]$, which is equivalent to maximizing $\mathbb{E}_q[\langle q, u \rangle^2]$. Introducing the shorthand $\Sigma_j = \mathbb{E}[\, q q^\top \mid c_j \text{ is the query's maximum-IP center}\,]$ for the conditional second moment, we need to solve the maximization problem

$u^* = \arg\max_{\|u\| = 1} u^\top \Sigma_j\, u.$

The matrix in the middle, $\Sigma_j$, is the conditional covariance matrix of all queries that have maximum IPs with center $c_j$. If we can estimate this matrix accurately, we can simply take its first principal direction as our optimal direction $u^*$.

In real applications, however, for any partition center we can only sample a very limited number of queries whose maximum IP falls in that partition, and this approach cannot scale to a large number of partitions. This makes the estimation of $\Sigma_j$ inherently high-variance. To overcome this noisy estimation issue, we provide both theoretical and empirical support for approximating the optimal direction with the partition center direction $c_j / \|c_j\|$.
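The sketch below illustrates this direct estimator on synthetic data: it gathers the queries routed to one partition, forms their empirical second-moment matrix, and extracts the first principal direction; the query and center distributions are synthetic placeholders.

```python
# Sketch of the direct estimator discussed above, on synthetic data: gather the queries
# whose maximum-IP center is c_j, estimate their second-moment matrix, and take its top
# eigenvector as the candidate optimal direction.
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256
centers = rng.normal(size=(m, d))
queries = rng.normal(size=(100000, d))

ips = queries @ centers.T                     # (num_queries, m)
best = ips.argmax(axis=1)                     # index of each query's maximum-IP center

j = 0
Q_j = queries[best == j]                      # queries routed to partition j
Sigma_j = Q_j.T @ Q_j / max(len(Q_j), 1)      # empirical second-moment matrix
eigvals, eigvecs = np.linalg.eigh(Sigma_j)
u_star = eigvecs[:, -1]                       # first principal direction

u_center = centers[j] / np.linalg.norm(centers[j])
alignment = abs(u_star @ u_center)            # large when the center direction approximates u*
```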

4.1 Alignment of Query and Partition Center

We first estimate the magnitude of the projected query component along the partition center direction. In the original setup, we have a set of fixed centers and a random query. To facilitate our analysis, we can fix the query and instead rotate the centers with respect to the query. We start by studying the case where both the centers and the query are normalized, and later lift this constraint. We consider the scenario where the centers, after rotation, follow a uniform distribution over the unit sphere $S^{d-1}$. This provides a more conservative bound than that of real datasets, because real queries tend to be tightly clustered around the "topics" in the database due to the formulation of the training objectives and regularizers [27].

Theorem 4.1. Given a normalized query $q$ and $m$ random centers uniformly sampled from the unit sphere $S^{d-1}$, with probability at least $1 - \delta$, the maximum cosine similarity between the query and the centers is at least on the order of $\sqrt{\log m / d}$ (the exact bound is derived in the appendix).

In practical settings we have $m \gg d$. Choosing a constant confidence $\delta$, we can weaken the bound to a more intuitive form and contrast it with the following lemma:

Lemma 1.

If we uniformly sample 2 vectors $u$ and $v$ from the unit sphere $S^{d-1}$, we have $\mathbb{E}\left[ \langle u, v \rangle^2 \right] = 1/d$, i.e., their cosine similarity is typically on the order of $1/\sqrt{d}$.

A few comments on these 2 results:

  • From Theorem 4.1, we can see that the dependency of the maximum cosine similarity on the confidence parameter $\delta$ is rather weak.

  • If we choose $\delta = 1/2$, we can thus show that for at least half of the queries, the largest cosine similarity with a center is significantly larger than the typical $O(1/\sqrt{d})$ cosine similarity between two randomly sampled vectors.

Next, we allow the centers to have varying norms:

Theorem 4.2. Suppose the directions of the centers are uniformly sampled from the unit sphere $S^{d-1}$, and their norms sorted in decreasing order are $\|c_{(1)}\| \geq \|c_{(2)}\| \geq \cdots \geq \|c_{(m)}\|$. With probability at least $1 - \delta$, the maximum cosine similarity between the query and the centers admits a lower bound obtained by maximizing a product of two factors over the sorted-norm index $j$ (see the appendix).

Intuitively, as the index $j$ increases, the first factor decreases but the second one increases, so the maximum is achieved somewhere in the middle. This bound is robust to a small number of outliers with small norms, but it can be influenced by the largest norm $\|c_{(1)}\|$.

However, we remark that when the largest center norm $\|c_{(1)}\|$ is significantly larger than the median norm, the MIPS problem itself becomes much easier. As the relative magnitude of $\|c_{(1)}\|$ increases, its partition becomes more likely than the rest to contain the maximum IP. Furthermore, the gap between the maximum IP in $c_{(1)}$'s partition and the maximum IP from other partitions becomes wider. Both the concentration of the maximum IP in one partition and the large gap contribute to better recall. Hence LOD helps adversarial instances more than easy instances, which explains the consistent recall improvement in our experiments. Exact quantification of this behavior is one of our future research directions.

We conclude this section with the observation that real queries tend to be even more clustered along partition centers than Theorem 4.1 suggests, i.e., the observed maximum cosine similarity is much higher than the bound. We hypothesize that this is due to the training process, which aligns query vectors with the natural topical structure of the database vector space.

4.2 Asymptotically Optimal Projection Direction

Let $q_\perp = \Pi_{c_j}^\perp q$ be the orthogonal query component in the complement subspace of the partition center. Under the same assumption as Theorem 4.1, we are ready to state our main result. Let $\kappa$ be the ratio between the largest and smallest non-zero eigenvalues of the second-moment matrix of $q_\perp$. Then the optimal direction $u^*$ is equal to the partition center direction $c_j / \|c_j\|$ with probability at least $1 - \delta$ if the number of partitions $m$ exceeds a threshold that depends on the ratio $\kappa$, the confidence $\delta$, and the dimension $d$.

This theorem states that when the number of partitions increases above a threshold dependent on the ratio $\kappa$ and the confidence $\delta$, the optimal direction is equal to the partition center direction with probability at least $1 - \delta$. Hence, asymptotically, the optimal direction approaches the partition center direction for our LOD+MSQ framework as the number of partitions grows.

4.3 Approximation with Benefits

Approximating the optimal direction with the partition center direction also brings practical benefits:

  • No extra storage cost, as we do not have to store a separate direction vector per partition.

  • Free projection at search time, as we have already computed all IPs between the query and the centers for partition selection. We only need to divide $\langle q, c_j \rangle$ by the center norm $\|c_j\|$ to get the projected query component $\langle q, u \rangle$ (see the snippet below).
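A tiny sketch of this search-time shortcut (array shapes are illustrative):

```python
# With u_j = c_j / ||c_j||, the projected query component <q, u_j> is obtained from the
# center inner products already computed for partition selection, at the cost of one divide.
import numpy as np

centers = np.random.randn(1000, 200)
q = np.random.randn(200)

center_ips = centers @ q                          # computed anyway to pick partitions
center_norms = np.linalg.norm(centers, axis=1)    # precomputed once at indexing time
projected_q = center_ips / center_norms           # <q, u_j> for every partition j
```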

5 Experiments

5.1 Datasets

We apply our method along with other state-of-the-art MIPS techniques to the public datasets Netflix [7] and Glove [19]. The Netflix dataset is generated with a regularized matrix factorization model similar to [24]. The Glove dataset is downloaded from https://nlp.stanford.edu/projects/glove/. For word embeddings, the cosine similarity between embeddings reflects their semantic relatedness. Hence we $\ell_2$-normalize the Glove dataset, after which cosine similarity is equal to inner product.

We list details of these datasets in Table 2.

Dataset   #Vectors    #Dims  #Partitions  #Partitions searched
Netflix   17,770      200    20           2 (10%)
Glove     1,183,514   200    1,000        100 (10%)
Table 2: Datasets used for MIPS experiments.

5.2 Recalls

We apply the following algorithms to both of our datasets:

  • MIPS-PQ: implements the PQ [12] quantization scheme proposed in the original IVFADC framework.

  • MIPS-OPQ: implements the OPQ [11] quantization scheme that learns a global rotation matrix.

  • L2-OPQ: implements the OPQ quantization scheme together with the MIPS-to-$\ell_2$-NNS conversion proposed in [6]. We do not transform the Glove dataset, since after normalization $\ell_2$-NNS retrieves the same set of database vectors as MIPS.

  • MIPS-LOD-MSQ: implements our proposed method with both LOD and MSQ. The projection direction is set to the partition center as an effective approximation to the optimal direction.

We set parameters to the following values for all our recall experiments:

  • IVF: we keep average partition size at around 1,000 and we always search 10% of the partitions with ADC. This is in-line with other practices reported in benchmarks and industrial applications [5, 13].

  • Product Quantization: we use either 25 or 50 codebooks, each of which contains 16 codewords, for PQ and OPQ, corresponding to 100-bit and 200-bit encodings. For LOD+MSQ, we reduce the number of PQ codebooks accordingly so that the number of bits spent on each database vector stays the same. The number of codewords is fixed at 16 for an efficient SIMD-based implementation of in-register table lookup [22, 8].

  • UQ: we use $b = 8$ bits for uniform quantization for Netflix and $b = 4$ bits for Glove, which results in 256 and 16 levels in the codebook respectively.

  • MSQ: we use $b_{SQ} = 4$ bits, and accordingly 16 levels, for the scalar quantization of scales in MSQ for all experiments. We apply the same technique as in [22] to avoid explicitly storing these codes, hence they incur no extra storage cost.

The combination of LOD+MSQ consistently outperforms other existing techniques under the same bitrate. Its relative improvement is higher on Netflix because the residual norms of the Netflix dataset exhibit larger variance than those of the Glove dataset.

Figure 2: Experiments on the Netflix dataset: (a) recall for 100-bit encoding of database vectors and (b) recall for 200-bit encoding.
Figure 3: Experiments on the Glove dataset: (a) recall for 100-bit encoding of database vectors and (b) recall for 200-bit encoding.
Figure 4: Ablation study of both LOD and MSQ on Netflix and Glove. All plots are generated with 100 bits per database vector.

5.3 Ablation

To systematically investigate the contributions of LOD and MSQ in isolation, we perform an ablation study with both datasets.

  • MIPS-OPQ, MIPS-LOD-MSQ: repeated from the experiments reported in the previous section.

  • MIPS-MSQ: implements the MSQ quantization scheme directly on the residuals without LOD.

  • MIPS-LOD-OPQ: first applies LOD and then applies the OPQ quantization scheme to the orthogonal component $\Pi_u^\perp r$.

The combination of LOD+MSQ consistently outperforms either technique in isolation. Interestingly, LOD alone performs much better than MSQ alone on Netflix but worse on Glove. This is because in the normalized Glove dataset, the orthogonal components of residuals have larger norms than the projected components. With LOD only, OPQ is applied to the orthogonal components and fails to preserve their norms at a low bitrate, and the resulting decrease in recall is clearly discernible in Figure 4(b).

6 Conclusion

In this work, we propose a novel quantization scheme that decomposes a residual into two orthogonal components with respect to a learned projection direction. We then apply UQ to the projected component and MSQ to the orthogonal component. We provide theoretical and empirical support for approximating the optimal projection direction with the partition center direction, which avoids estimating the noisy conditional covariance matrix. The combination of local orthogonal decomposition and MSQ consistently outperforms other quantization techniques on widely tested public datasets.

References

  • [1] F. André, A.-M. Kermarrec, and N. Le Scouarnec. Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan. Proceedings of the VLDB Endowment, 9(4):288–299, 2015.
  • [2] A. Auvolat, S. Chandar, P. Vincent, H. Larochelle, and Y. Bengio. Clustering is efficient for approximate maximum inner product search. CoRR, abs/1507.05910, 2015.
  • [3] A. Babenko and V. Lempitsky. Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 931–938. IEEE, 2014.
  • [4] A. Babenko and V. Lempitsky. Tree quantization for large-scale similarity search and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4240–4248, 2015.
  • [5] A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2055–2063, June 2016.
  • [6] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 257–264, 2014.
  • [7] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
  • [8] D. W. Blalock and J. V. Guttag. Bolt: Accelerated data mining with fast vector compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 727–735, 2017.
  • [9] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46, 2010.
  • [10] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine: Technical supplement. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [11] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, April 2014.
  • [12] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.
  • [13] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
  • [14] Y. Kalantidis and Y. Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2329–2336. IEEE, 2014.
  • [15] J. Martinez, J. Clement, H. H. Hoos, and J. J. Little. Revisiting additive quantization. In European Conference on Computer Vision, pages 137–153. Springer, 2016.
  • [16] J. Martinez, H. H. Hoos, and J. J. Little. Stacked quantizers for compositional vector compression. CoRR, abs/1411.2173, 2014.
  • [17] S. Mussmann and S. Ermon. Learning and inference via maximum inner product search. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2587–2596, 2016.
  • [18] M. Norouzi and D. J. Fleet. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3017–3024, 2013.
  • [19] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [20] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2827–2836, 2017.
  • [21] A. Shrivastava and P. Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pages 2321–2329, 2014.
  • [22] X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems 30, pages 5745–5755. 2017.
  • [23] I. E.-H. Yen, S. Kale, F. Yu, D. Holtmann-Rice, S. Kumar, and P. Ravikumar. Loss decomposition for fast learning in large output spaces. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 5640–5649, 2018.
  • [24] H.-F. Yu, C.-J. Hsieh, Q. Lei, and I. S. Dhillon. A greedy approach for budgeted maximum inner product search. In Advances in Neural Information Processing Systems 30, pages 5453–5462. 2017.
  • [25] T. Zhang, C. Du, and J. Wang. Composite quantization for approximate nearest neighbor search. In ICML, number 2, pages 838–846, 2014.
  • [26] T. Zhang, G.-J. Qi, J. Tang, and J. Wang. Sparse composite quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4548–4556, 2015.
  • [27] X. Zhang, F. X. Yu, S. Kumar, and S. Chang. Learning spread-out local feature descriptors. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4605–4613, 2017.

7 Appendix

7.1 Proof of Theorem 4.1

Without loss of generality, we can assume the query is fixed at $e_1 = (1, 0, \ldots, 0)$. Thus the inner product between the query and a center becomes the value of the center's first coordinate, whose density is proportional to $(1 - t^2)^{(d-3)/2}$ on $[-1, 1]$; the normalization constant is expressed in terms of the volume of the $(d-1)$-dimensional unit hyperball.

Ideally, we want to find the largest threshold that still satisfies the required confidence. For any such threshold, the tail probability of the maximum factorizes over the independent centers, and we have:

Substituting and rearranging, we have:

Note that if we replace one factor by 1, the LHS increases, so we can replace it with this stronger guarantee:

which becomes:

Using an elementary bound, we can replace the RHS with this stronger guarantee:

For some positive constant and sufficiently large $d$, the ratio of normalization constants can be bounded via the two-sided Stirling formula. Plugging this stronger guarantee into the RHS:

With sufficiently large $d$, we can increase the constant slightly so that:

To make the RHS more comprehensible, we note that the relevant function is concave on the interval of interest and lies entirely above its chord. Plugging this into the RHS, we arrive at the bound stated in Theorem 4.1.

7.2 Proof of Theorem 4.2

Fix an index $j$. We can divide the centers into two groups: those with norms at least $\|c_{(j)}\|$ and those with smaller norms. Let $i^*$ be the index of the center achieving the maximum inner product with the query; we have two cases:

  • $i^* > j$, i.e., the maximum inner product center is in the second group. We know that its inner product is at least the largest inner product among the $j$ centers that all have norms at least $\|c_{(j)}\|$, which implies the claimed bound with probability at least $1 - \delta$.

  • $i^* \leq j$. Now the maximum inner product center is in the first group. We generate a new set of centers by dividing every center in the first group by the smallest norm in the group, i.e.,