Library for fast classification in problems with large number of classes
We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical analysis of the proposed approach, consisting of the concentration results under mild assumptions. Furthermore, if a small sample of example queries is given at the training time, we propose a modified codebook learning procedure which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep neural networks show that the proposed approach significantly outperforms the existing state-of-the-art.
Many information processing tasks such as retrieval and classification involve computing the inner product of a query vector with a set of database vectors, with the goal of returning the database instances having the largest inner products. This is often called the Maximum Inner Product Search (MIPS) problem. Formally, given a database $X = \{x_i\}_{i=1,\dots,n}$ and a query vector $q$ drawn from the query distribution $\mathbf{Q}$, where $x_i, q \in \mathbb{R}^d$, we want to find $x_q^* \in X$ such that $x_q^* = \arg\max_{x \in X} q^T x$. This definition can be trivially extended to return the top-$N$ largest inner products.
The MIPS problem is particularly appealing for large scale applications. For example, a recommendation system needs to retrieve the most relevant items to a user from an inventory of millions of items, whose relevance is commonly represented as inner products. Similarly, a large scale classification system needs to classify an item into one of the categories, where the number of categories may be very large. A brute-force computation of inner products via a linear scan requires $O(nd)$ time and space, which becomes computationally prohibitive when the number of database vectors and the data dimensionality are large. Therefore it is valuable to consider algorithms that can compress the database and compute an approximate answer much faster than the brute-force search.
The problem of MIPS is related to that of Nearest Neighbor Search with respect to $L_2$ distance ($L_2$NNS) or angular distance ($\theta$NNS) between a query and a database vector:
$$\arg\min_{x \in X} \|q - x\|_2^2 = \arg\max_{x \in X} \left( q^T x - \tfrac{1}{2}\|x\|_2^2 \right) \quad (L_2\text{NNS}), \qquad \arg\max_{x \in X} \frac{q^T x}{\|q\|_2 \|x\|_2} = \arg\max_{x \in X} \frac{q^T x}{\|x\|_2} \quad (\theta\text{NNS}),$$
where $\|\cdot\|_2$ is the $L_2$ norm. Indeed, if the database vectors are scaled such that $\|x\|_2$ is the same constant for all $x \in X$, the MIPS problem becomes equivalent to the $L_2$NNS or $\theta$NNS problems, which have been studied extensively in the literature. However, when the norms of the database vectors vary, as is often true in practice, the MIPS problem becomes quite challenging. The inner product (distance) does not satisfy the basic axioms of a metric such as triangle inequality and co-incidence. For instance, it is possible to have $x^T x \le x^T y$ for some $y \ne x$. In this paper, we focus on the MIPS problem where both database and the query vectors can have arbitrary norms.
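The difference between MIPS and nearest neighbor search under varying norms can be seen on a two-vector toy database. The following is a minimal sketch; the helper names `mips` and `l2_nns` and the numeric values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mips(X, q):
    """Index of the database row with the largest inner product with q."""
    return int(np.argmax(X @ q))

def l2_nns(X, q):
    """Index of the database row closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Toy database (hypothetical values): the second vector has a much larger norm.
X = np.array([[1.0, 0.0],
              [3.0, 3.0]])
q = np.array([1.0, 0.1])

mips_idx = mips(X, q)     # the large-norm vector wins on inner product
nns_idx = l2_nns(X, q)    # the nearby small-norm vector wins on distance

# After scaling every row to unit norm, the two problems agree.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
same_after_scaling = mips(X_unit, q) == l2_nns(X_unit, q)

# A vector need not be its own maximum-inner-product neighbor.
x, y = X[0], X[1]
self_not_max = (x @ x) <= (x @ y)
```

With unequal norms the two searches return different rows, which is why a dedicated treatment of MIPS is useful.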
As the main contribution of this paper, we develop a Quantization-based Inner Product (QUIP) search method to address the MIPS problem. We formulate the problem of quantization as that of codebook learning, which directly minimizes the quantization error in inner products (Sec. 3). Furthermore, if a small sample of example queries is provided at the training time, we propose a constrained optimization framework which further improves the accuracy (Sec. 3.2). We also provide a concentration-based theoretical analysis of the proposed method (Sec. 4). Extensive experiments on four real-world datasets, involving recommendation (Movielens, Netflix) and deep-learning based classification (ImageNet and VideoRec) tasks show that the proposed approach consistently outperforms the state-of-the-art techniques under both fixed space and fixed time scenarios (Sec. 5).
The MIPS problem has been studied for more than a decade. For instance, Cohen et al.  studied it in the context of document clustering and presented a method based on randomized sampling without computing the full matrix-vector multiplication. In [10, 13], the authors described a procedure to modify tree-based search to adapt to MIPS criterion. Recently, Bachrach et al.  proposed an approach that transforms the input vectors such that the MIPS problem becomes equivalent to the NNS problem in the transformed space, which they solved using a PCA-Tree.
The MIPS problem has received renewed attention with the recent seminal work from Shrivastava and Li, which introduced an Asymmetric Locality Sensitive Hashing (ALSH) technique with provable search guarantees. They also transform MIPS into $L_2$NNS, and use the popular LSH technique. Specifically, ALSH applies different vector transformations to a database vector $x$ and the query $q$, respectively:
$$P(x) = [\tilde{x};\ \|\tilde{x}\|_2^2;\ \|\tilde{x}\|_2^4;\ \cdots;\ \|\tilde{x}\|_2^{2^m}], \qquad Q(q) = [q;\ 1/2;\ 1/2;\ \cdots;\ 1/2],$$
where $\tilde{x} = U_0 \frac{x}{\max_{x \in X} \|x\|_2}$, $U_0$ is some constant that satisfies $U_0 \in (0, 1)$, and $m$ is a nonnegative integer. Hence, $x$ and $q$ are mapped to a new $(d+m)$-dimensional space asymmetrically. Shrivastava and Li showed that when $m \to \infty$, MIPS in the original space is equivalent to $L_2$NNS in the new space. The proposed hash function followed the $L_2$LSH form: $h_i(x) = \lfloor (P(x)^T a_i + b_i) / r \rfloor$, where $a_i$ is a $(d+m)$-dimensional vector whose entries are sampled i.i.d. from the standard Gaussian, $\mathcal{N}(0, 1)$, and $b_i$ is sampled uniformly from $[0, r]$. The same authors later proposed an improved version of ALSH based on Signed Random Projection (SRP). It transforms each vector using a slightly different procedure and represents it as a binary code. Then, Hamming distance is used for MIPS.
Recently, Neyshabur and Srebro argued that a symmetric transformation was sufficient to develop a provable LSH approach for the MIPS problem if the query was restricted to unit norm. They used a transformation similar to the one used by Bachrach et al. to augment the original vectors:
$$P(x) = [\tilde{x};\ \sqrt{1 - \|\tilde{x}\|_2^2}], \qquad Q(q) = [\tilde{q};\ 0],$$
where $\tilde{x} = \frac{x}{\max_{x \in X} \|x\|_2}$ and $\tilde{q} = \frac{q}{\|q\|_2}$. They showed that this transformation led to significantly improved results over the SRP based LSH. In this paper, we take a quantization based view of the MIPS problem and show that it leads to even better accuracy under both fixed space and fixed time budgets on a variety of real world tasks.
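A short numerical check may help make the symmetric augmentation concrete. This is only a sketch under an assumption: it takes the transformation to be "scale the database by its maximum norm and append $\sqrt{1 - \|\tilde{x}\|^2}$; normalize the query and append $0$", and the function names are hypothetical.

```python
import numpy as np

def augment_database(X):
    """P(x) = [x~; sqrt(1 - ||x~||^2)] with x~ = x / max_x ||x||."""
    X_tilde = X / np.linalg.norm(X, axis=1).max()
    extra = np.sqrt(np.clip(1.0 - np.sum(X_tilde ** 2, axis=1), 0.0, None))
    return np.hstack([X_tilde, extra[:, None]])

def augment_query(q):
    """Q(q) = [q~; 0] with q~ = q / ||q||."""
    return np.append(q / np.linalg.norm(q), 0.0)

rng = np.random.default_rng(0)
# Database vectors with widely varying norms (synthetic data).
X = rng.normal(size=(200, 16)) * rng.uniform(0.1, 5.0, size=(200, 1))
q = rng.normal(size=16)

P, Qq = augment_database(X), augment_query(q)

# Every augmented database vector has unit norm, so ranking by angle
# in the new space matches ranking by inner product in the original one.
mips_idx = int(np.argmax(X @ q))
angular_idx = int(np.argmax(P @ Qq))
```

The price of this reduction is one extra dimension per vector; the quantization approach developed next avoids any augmentation.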
Instead of augmenting the input vectors to a higher dimensional space as in [12, 15], we approximate the inner products by mapping each vector to a set of subspaces, followed by independent quantization of database vectors in each subspace. In this work, we use a simple procedure for generating the subspaces. Each vector’s elements are first permuted using a random (but fixed) permutation. (Another possible choice is a random rotation of the vectors, which is slightly more expensive than permutation but leads to improved theoretical guarantees, as discussed in the appendix.) Then each permuted vector is mapped to $K$ subspaces using simple chunking, as done in product codes [14, 9]. For ease of notation, in the rest of the paper we will assume that both query and database vectors have been permuted. Chunking leads to block-decomposition of the query $q \sim \mathbf{Q}$ and each database vector $x \in X$:
$$x = [x^{(1)}; x^{(2)}; \cdots; x^{(K)}], \qquad q = [q^{(1)}; q^{(2)}; \cdots; q^{(K)}],$$
where each $x^{(k)}, q^{(k)} \in \mathbb{R}^l$, $l = \lceil d/K \rceil$. (One can do zero-padding wherever necessary, or use different dimensions in each block.)
The $k$-th subspace containing the $k$-th blocks of all the database vectors, $\{x^{(k)}\}_{x \in X}$, is then quantized by a codebook $U^{(k)} \in \mathbb{R}^{l \times C_k}$, where $C_k$ is the number of quantizers in subspace $k$. Without loss of generality, we assume $C_k = C$ for all $k$. Then, each database vector $x$ is quantized in the $k$-th subspace as $x^{(k)} \approx U^{(k)} \alpha_x^{(k)}$, where $\alpha_x^{(k)}$ is a $C$-dimensional one-hot assignment vector with exactly one $1$ and the rest $0$. Thus, a database vector is quantized by a single dictionary element in the $k$-th subspace. Given the quantized database vectors, the exact inner product is approximated as:
$$q^T x = \sum_k q^{(k)T} x^{(k)} \approx \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)}. \qquad (1)$$
Note that this approximation is 'asymmetric' in the sense that only the database vectors $x$ are quantized, not the query vector $q$. One can quantize $q$ as well but it will lead to increased approximation error. In fact, the above asymmetric computation for all the database vectors can still be carried out very efficiently via lookup tables similar to those used with product codes, except that each entry in the table is a dot product between $q^{(k)}$ and the columns of $U^{(k)}$.
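The lookup-table computation can be sketched as follows. The code assumes equal-size blocks, random codebooks, and random codes purely for illustration; `build_lookup_tables` and `approx_inner_products` are hypothetical names, and a real system would use learned codebooks.

```python
import numpy as np

def build_lookup_tables(q, codebooks):
    """One table per subspace: entry c is the dot product of q^(k) with column c of U^(k)."""
    l = codebooks[0].shape[0]
    return [q[k * l:(k + 1) * l] @ U for k, U in enumerate(codebooks)]

def approx_inner_products(tables, codes):
    """Sum the table entries selected by each database vector's K codeword indices."""
    # codes has shape (n, K); codes[i, k] is the codeword index of x_i in subspace k.
    return sum(table[codes[:, k]] for k, table in enumerate(tables))

rng = np.random.default_rng(0)
K, l, C, n = 4, 8, 16, 1000          # hypothetical sizes
codebooks = [rng.normal(size=(l, C)) for _ in range(K)]
codes = rng.integers(0, C, size=(n, K))
q = rng.normal(size=K * l)

scores = approx_inner_products(build_lookup_tables(q, codebooks), codes)

# Reference: rebuild each quantized vector explicitly and take the dot product.
X_hat = np.hstack([codebooks[k][:, codes[:, k]].T for k in range(K)])
```

Building the tables costs on the order of $lCK$ multiplications per query, after which each database vector needs only $K$ table reads and additions.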
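Before describing the learning procedure for the codebooks $U^{(k)}$ and the assignment vectors $\alpha_x^{(k)}$, we first show an interesting property of the approximation in (1). Let $S_c^{(k)}$ be the $c$-th partition of the database vectors in subspace $k$ such that $S_c^{(k)} = \{x : \alpha_x^{(k)}[c] = 1\}$, where $\alpha_x^{(k)}[c]$ is the $c$-th element of $\alpha_x^{(k)}$ and $U_c^{(k)}$ is the $c$-th column of $U^{(k)}$.
Lemma 3.1. If $U_c^{(k)} = \frac{1}{|S_c^{(k)}|} \sum_{x \in S_c^{(k)}} x^{(k)}$ for every $c$ and $k$, then (1) is an unbiased estimator of $q^T x$.
Proof.
$$E_{x \in X}\Big[q^T x - \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)}\Big] = \sum_k q^{(k)T} \frac{1}{n} \sum_{x \in X} \sum_c \mathbb{I}[x \in S_c^{(k)}]\,(x^{(k)} - U_c^{(k)}) = 0,$$
where $\mathbb{I}[\cdot]$ is the indicator function, and the last equality holds because for each $c$, $U_c^{(k)}$ is the mean of the vectors in $S_c^{(k)}$ by definition. ∎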
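The unbiasedness property can be checked numerically for a single subspace. The sketch below assumes an arbitrary (random) partition and synthetic data; the only requirement it relies on is that each codeword equals the Euclidean mean of its partition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, C = 500, 6, 8                       # hypothetical sizes for one subspace
Xk = rng.normal(size=(n, l)) * rng.uniform(0.5, 3.0, size=(n, 1))
assign = np.arange(n) % C                 # arbitrary partition; every cell is non-empty
rng.shuffle(assign)

# Codewords are the Euclidean means of their partitions (columns of U).
U = np.stack([Xk[assign == c].mean(axis=0) for c in range(C)], axis=1)

q = rng.normal(size=l)
exact = Xk @ q
quantized = U[:, assign].T @ q
mean_error = float(np.mean(exact - quantized))   # zero up to floating-point rounding
```

The average error vanishes for any query because the residuals within each partition sum to zero, regardless of how the partition was chosen.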
We will provide the concentration inequalities for the estimator in (1) in Sec. 4. Next we describe the learning of quantization codebooks in different subspaces. We focus on two different training scenarios: when only the database vectors are given (Sec. 3.1), and when a sample of example queries is also provided (Sec. 3.2). The latter can result in significant performance gain when queries do not follow the same distribution as the database vectors. Note that the actual queries used at the test time are different from the example queries, and hence unknown at the training time.
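Our goal is to learn data quantizers that minimize the quantization error due to the inner product approximation given in (1). Assuming each subspace to be independent, the expected squared error can be expressed as:
$$E_{q \sim \mathbf{Q}} \sum_{x \in X} \Big[q^T x - \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)}\Big]^2 = \sum_k \sum_{x \in X} (x^{(k)} - U^{(k)} \alpha_x^{(k)})^T\, \Sigma_{\mathbf{Q}}^{(k)}\, (x^{(k)} - U^{(k)} \alpha_x^{(k)}), \qquad (2)$$
where $\Sigma_{\mathbf{Q}}^{(k)} = E_{q \sim \mathbf{Q}}[q^{(k)} q^{(k)T}]$ is the non-centered query covariance matrix in subspace $k$. Minimizing the error in (2) is equivalent to solving a modified k-Means problem in each subspace independently. Instead of the Euclidean distance, the Mahalanobis distance specified by $\Sigma_{\mathbf{Q}}^{(k)}$ is used for assignment. One can use the standard Lloyd’s algorithm to find the solution for each subspace $k$ iteratively by alternating between two steps:
(i) assign each database vector to its nearest codeword in the Mahalanobis sense, $c_x^{(k)} = \arg\min_c (x^{(k)} - U_c^{(k)})^T \Sigma_{\mathbf{Q}}^{(k)} (x^{(k)} - U_c^{(k)})$, where $U_c^{(k)}$ is the $c$-th column of $U^{(k)}$, and set $\alpha_x^{(k)}[c_x^{(k)}] = 1$;
(ii) update each codeword to the mean of the vectors assigned to it, $U_c^{(k)} = \frac{1}{|S_c^{(k)}|} \sum_{x \in S_c^{(k)}} x^{(k)}$, where $S_c^{(k)}$ is the set of database vectors assigned to codeword $c$ in subspace $k$.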
Lloyd’s algorithm is known to converge to a local minimum (except in pathological cases where it may oscillate between equivalent solutions). Also, note that the resulting quantizers are always the Euclidean means of their corresponding partitions, and hence, Lemma 3.1 is applicable to (2) as well, leading to an unbiased estimator.
The above procedure requires the non-centered query covariance matrix $\Sigma_{\mathbf{Q}}$, which will not be known if query samples are not available at the training time. In that case, one possibility is to assume that the queries come from the same distribution as the database vectors, i.e., $\Sigma_{\mathbf{Q}} = \Sigma_X$, where $\Sigma_X$ is the corresponding non-centered covariance matrix of the database vectors. In the experiments we will show that this version performs reasonably well. However, if a small set of example queries is available at the training time, besides estimating the query covariance matrix, we propose to impose novel constraints that lead to improved quantization, as described next.
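The k-Means-like procedure for one subspace can be sketched as below, using the database covariance as a stand-in for the query covariance when no queries are available. This is an illustrative sketch, not the authors' implementation: the function names, the random initialization, and the iteration count are assumptions, and codewords are stored as rows rather than columns for convenience.

```python
import numpy as np

def mahalanobis_kmeans(Xk, Sigma, C, n_iter=20, seed=0):
    """Lloyd-style iterations for one subspace with Mahalanobis assignment."""
    rng = np.random.default_rng(seed)
    U = Xk[rng.choice(len(Xk), size=C, replace=False)].copy()   # (C, l) codewords as rows
    for _ in range(n_iter):
        # Assignment step: (x - u)^T Sigma (x - u) for every point/codeword pair.
        diff = Xk[:, None, :] - U[None, :, :]                   # (n, C, l)
        dist = np.einsum("ncl,lm,ncm->nc", diff, Sigma, diff)
        assign = dist.argmin(axis=1)
        # Update step: each codeword becomes the Euclidean mean of its partition.
        for c in range(C):
            members = Xk[assign == c]
            if len(members) > 0:
                U[c] = members.mean(axis=0)
    return U, assign

def quantization_error(Xk, Sigma, U, assign):
    """Sum over points of the Mahalanobis error in (2) for one subspace."""
    diff = Xk - U[assign]
    return float(np.einsum("nl,lm,nm->", diff, Sigma, diff))

rng = np.random.default_rng(1)
Xk = rng.normal(size=(400, 5))           # synthetic block of database vectors
Sigma_X = Xk.T @ Xk / len(Xk)            # non-centered covariance of the database block
U, assign = mahalanobis_kmeans(Xk, Sigma_X, C=8)
```

Both steps can only lower the error when the covariance matrix is positive semi-definite, so running more iterations from the same initialization never makes the objective worse.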
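In most applications, it is possible to have access to a small set of example queries, $\mathcal{Q}$. Of course, the actual queries used at the test-time are different from this set. Given these exemplar queries, we propose to modify the learning criterion by imposing additional constraints while minimizing the expected quantization error. Given a query $q$, since we are interested in finding the database vector $x_q^*$ with the highest dot product, ideally we want the dot product of the query with the quantizer of $x_q^*$ to be larger than the dot product with any other quantizer. Let us denote the matrix containing the $k$-th subspace assignment vectors $\alpha_x^{(k)}$ for all the database vectors by $A^{(k)}$. Thus, the modified optimization is given as,
$$\arg\min_{U^{(k)}, A^{(k)}} \sum_k \sum_{x \in X} (x^{(k)} - U^{(k)} \alpha_x^{(k)})^T\, \Sigma_{\mathbf{Q}}^{(k)}\, (x^{(k)} - U^{(k)} \alpha_x^{(k)}) \quad \text{s.t.} \quad \forall q \in \mathcal{Q},\ \forall x \in X:\ \sum_k q^{(k)T} U^{(k)} \alpha_{x_q^*}^{(k)} \ge \sum_k q^{(k)T} U^{(k)} \alpha_x^{(k)}, \qquad (3)$$
where $x_q^* = \arg\max_{x \in X} q^T x$.
We relax the above hard constraints using slack variables to allow for some violations, which leads to the following equivalent objective:
$$\arg\min_{U^{(k)}, A^{(k)}} \sum_k \sum_{x \in X} (x^{(k)} - U^{(k)} \alpha_x^{(k)})^T\, \Sigma_{\mathbf{Q}}^{(k)}\, (x^{(k)} - U^{(k)} \alpha_x^{(k)}) + \lambda \sum_{q \in \mathcal{Q}} \sum_{x \in X} \Big[\sum_k q^{(k)T} U^{(k)} (\alpha_x^{(k)} - \alpha_{x_q^*}^{(k)})\Big]_+, \qquad (4)$$
where $[z]_+ = \max(z, 0)$ is the standard hinge loss, and $\lambda$ is a nonnegative coefficient. We use an iterative procedure to solve the above optimization, which alternates between solving $U^{(k)}$ and $A^{(k)}$ for each $k$. In the beginning, each codebook is initialized with a set of random database vectors mapped to the $k$-th subspace. Then, we iterate through the following three steps: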
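Step 1. Find a set of violated constraints $W$ with each element as a triplet, i.e., $W_j = \{q_j, x_{q_j}^*, x_j^-\}_{j=1,\dots,J}$, where $q_j$ is an exemplar query, $x_{q_j}^*$ is the database vector having the maximum dot product with $q_j$, and $x_j^-$ is a vector such that $q_j^T x_{q_j}^* \ge q_j^T x_j^-$ but $\sum_k q_j^{(k)T} U^{(k)} \alpha_{x_j^-}^{(k)} > \sum_k q_j^{(k)T} U^{(k)} \alpha_{x_{q_j}^*}^{(k)}$.
Step 2. Fixing $U^{(k)}$ and all columns of $A^{(k)}$ except $\alpha_x^{(k)}$, one can update $\alpha_x^{(k)}$ as:
$$\alpha_x^{(k)} = \arg\min_{\alpha} (x^{(k)} - U^{(k)} \alpha)^T\, \Sigma_{\mathbf{Q}}^{(k)}\, (x^{(k)} - U^{(k)} \alpha) + \lambda \sum_j q_j^{(k)T} U^{(k)} \alpha \big(\mathbb{I}[x = x_j^-] - \mathbb{I}[x = x_{q_j}^*]\big).$$
Since $C$ is typically small (256 in our experiments), we can find $\alpha_x^{(k)}$ by enumerating all $C$ possible one-hot values of $\alpha$.
Step 3. Fixing $A^{(k)}$ and all the columns of $U^{(k)}$ except $U_c^{(k)}$, one can update $U_c^{(k)}$ by gradient descent, where the gradient can be computed as:
$$\nabla_{U_c^{(k)}} = 2\, \Sigma_{\mathbf{Q}}^{(k)} \sum_{x \in S_c^{(k)}} (U_c^{(k)} - x^{(k)}) + \lambda \sum_j q_j^{(k)} \big(\alpha_{x_j^-}^{(k)}[c] - \alpha_{x_{q_j}^*}^{(k)}[c]\big).$$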
Note that if no violated constraint is found, step 2 is equivalent to finding the nearest neighbor of $x^{(k)}$ in $U^{(k)}$ in the Mahalanobis space specified by $\Sigma_{\mathbf{Q}}^{(k)}$. Also, in that case, with a suitable step size, the update rule in step 3 becomes $U_c^{(k)} = \frac{1}{|S_c^{(k)}|} \sum_{x \in S_c^{(k)}} x^{(k)}$, which is the stationary point for the first term. Thus, if no constraints are violated, the above procedure becomes identical to the k-Means-like procedure described in Sec. 3.1. Steps 2 and 3 are guaranteed not to increase the value of the objective in (4). In practice, we have found that the iterative procedure can be significantly sped up by modifying step 3 to be a perturbation of the stationary point of the first term with a single gradient step of the second term. The time complexity of step 1 is at most , but in practice it is much cheaper because we limit the number of constraints in each iteration to be at most . Step 2 takes and step 3 time. In all the experiments, we use at most constraints in each iteration. Also, we fix , step size at each iteration , and the maximum number of iterations .
In this section we present concentration results about the quality of the quantization-based inner product search method. Due to the space constraints, proofs of the theorems are provided in the appendix. We start by defining a few quantities.
Given fixed $a, \epsilon > 0$, let $\mathcal{F}(a, \epsilon)$ be an event such that the exact dot product $q^T x$ is at least $a$, but the quantized version is either smaller than $q^T x (1 - \epsilon)$ or larger than $q^T x (1 + \epsilon)$.
Intuitively, the probability of this event measures the chance that the difference between the exact and the quantized dot product is large, when the exact dot product is large. We would like this probability to be small. Next, we introduce the concept of balancedness for subspaces.
Let be a vector which is chunked into subspaces: . We say that chunking is -balanced if the following holds for every :
Since the input data may not satisfy the balancedness condition, we next show that random permutation tends to create more balanced subspaces. Obviously, a (fixed) random permutation applied to vector entries does not change the dot product.
Let be a vector of dimensionality and let be its version after applying random permutation of its dimensions. Then the expected is -balanced.
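The effect of a random permutation on block norms can be illustrated with a small simulation. The sketch assumes equal-size blocks and only checks the expectation: on average each of the $K$ blocks receives a $1/K$ share of the squared norm. The vector, the sizes, and the helper name are hypothetical.

```python
import numpy as np

def block_sq_norms(v, K):
    """Squared L2 norm of each of the K equal-size chunks of v."""
    return (v.reshape(K, -1) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
d, K = 64, 8
v = np.zeros(d)
v[:8] = rng.normal(size=8) * 10.0       # all the energy sits in the first block

before = block_sq_norms(v, K) / np.sum(v ** 2)

# Average the per-block share of the squared norm over many random permutations.
trials = 20000
shares = np.zeros(K)
for _ in range(trials):
    shares += block_sq_norms(rng.permutation(v), K)
shares /= trials * np.sum(v ** 2)
```

Before permutation the first block holds the entire squared norm; after averaging over permutations every block holds close to $1/K$ of it. A single fixed permutation only enjoys this property in expectation, which is what motivates the stronger random-rotation alternative.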
Another way of creating balancedness is via a (fixed) random rotation, which also does not change the dot product. This leads to an even better balancedness property, as discussed in the appendix (see Theorem 2.1). Next we show that the probability of the event defined above can be upper bounded by an exponentially small quantity in $K$, indicating that the quantized dot products accurately approximate large exact dot products when the quantizers are the means obtained from Mahalanobis k-Means as described in Sec. 3.1. Note that in this case the quantized dot product is an unbiased estimator of the exact dot product, as shown in Lemma 3.1.
Assume that the dataset of dimensionality resides entirely in the ball of radius , centered at . Further, let be -balanced for some , where is applied pointwise, and let be a martingale. Denote . Then, there exist sets of codebooks, each with quantizers, such that the following is true:
The above theorem shows that the probability of the event defined above decreases exponentially as the number of subspaces (i.e., blocks) $K$ increases. This is consistent with the experimental observation that increasing $K$ leads to more accurate retrieval.
Furthermore, if we assume that each subspace is independent, which is a slightly more restrictive assumption than the martingale assumption made in Theorem 4.2, we can use the Berry-Esseen inequality to obtain an even stronger upper bound as given below.
Suppose, , where is the maximum distance between a datapoint and its quantizer in subspace . Assume . Then,
where and is some universal constant.
We conducted experiments with 4 datasets which are summarized below:
Movielens. This dataset consists of user ratings collected by the MovieLens site from web users. We use the same SVD setup as described in the ALSH paper and extract 150 latent dimensions from the SVD results. This dataset contains 10,681 database vectors and 71,567 query vectors.
ImageNet. This dataset comes from the state-of-the-art GoogLeNet image classifier trained on ImageNet. (The original paper ensembled 7 models and used 144 different crops. In our experiment, we focus on one global crop using one model.) The goal is to speed up the maximum dot-product search in the last, i.e., classification, layer. Thus, the weight vectors for different categories form the database while the query vectors are the last hidden layer embeddings from the ImageNet validation set. The data has 1025 dimensions (1024 weights and 1 bias term). There are 1,000 database and 49,999 query vectors.
VideoRec. This dataset consists of embeddings of user interests, trained via a deep neural network to predict a set of relevant videos for a user. The number of videos in the repository is 500,000. The network is trained with a multi-label logistic loss. As for the ImageNet dataset, the last hidden layer embedding of the network is used as the query vector, and the classification layer weights are used as database vectors. The goal is to speed up the maximum dot product search between a query and 500,000 database vectors. Each database vector has 501 dimensions (500 weights and 1 bias term). The query set contains 1,000 vectors.
Following prior work, we focus on retrieving the Top-1, 5 and 10 highest inner product neighbors for the Movielens and Netflix experiments. For the ImageNet dataset, we retrieve the top-5 categories, as is common in the literature. For the VideoRec dataset, we retrieve the Top-50 videos for recommendation to a user. We experiment with three variants of our technique: (1) QUIP-cov(x): uses only database vectors at training, and replaces $\Sigma_{\mathbf{Q}}$ by $\Sigma_X$ in the k-Means like codebook learning in Sec. 3.1, (2) QUIP-cov(q): uses $\Sigma_{\mathbf{Q}}$ estimated from a held-out exemplar query set for k-Means like codebook learning, and (3) QUIP-opt: uses full optimization based quantization (Sec. 3.2). We compare the performance (precision-recall curves) with 3 state-of-the-art methods: (1) Signed ALSH, (2) L2 ALSH (the recommended parameters were used in the implementation), and (3) Simple LSH. We also compare against the PCA-tree version adapted to inner product search, which has shown better results than IP-tree. The proposed quantization based methods perform much better than PCA-tree, as shown in the appendix.
We conduct two sets of experiments: (i) fixed bit - the number of bits used by all the techniques is kept the same, (ii) fixed time - the time taken by all the techniques is fixed to be the same. In the fixed bit experiments, we fix the number of bits to be . For all the QUIP variants, the codebook size for each subspace, $C$, was fixed to be $256$, leading to an 8-bit representation of a database vector in each subspace. The number of subspaces (i.e., blocks) was varied to be leading to bit representation, respectively. For the fixed time experiments, we first note that the proposed QUIP variants use table lookup based distance computation while the LSH based techniques use POPCNT-based Hamming distance computation. Depending on the number of bits used, we found POPCNT to be 2 to 3 times faster than table lookup. Thus, in the fixed-time experiments, we increase the number of bits for LSH-based techniques by 3 times to ensure that the time taken by all the methods is the same.
Figure 1 shows the precision recall curves for Movielens and Netflix, and Figure 2 shows the same for the ImageNet and VideoRec datasets. All the quantization based approaches outperform LSH based methods significantly when all the techniques use the same number of bits. Even in the fixed time experiments, the quantization based approaches remain superior to the LSH-based approaches (shown with dashed curves), even though the former use 3 times fewer bits than the latter, leading to a significant reduction in memory footprint. Among the quantization methods, QUIP-cov(q) typically performs better than QUIP-cov(x), but the gap in performance is not that large. In theory, the non-centered covariance matrix of the queries ($\Sigma_{\mathbf{Q}}$) can be quite different from that of the database ($\Sigma_X$), leading to drastically different results. However, the comparable performance implies that it is often safe to use $\Sigma_X$ when learning a codebook. On the other hand, when a small set of example queries is available, QUIP-opt outperforms both QUIP-cov(x) and QUIP-cov(q) on all four datasets. This is because it learns the codebook with constraints that steer learning towards retrieving the maximum dot product neighbors in addition to minimizing the quantization error. The overall training for QUIP-opt was quite fast, requiring 3 to 30 minutes using a single-thread implementation, depending on the dataset size.
The quantization based inner product search techniques described above provide a significant speedup over the brute force search while retaining high accuracy. However, the search complexity is still linear in the number of database points, similar to that for the binary embedding methods that do an exhaustive scan using Hamming distance. When the database size is very large, such a linear scan even with fast computation may not be able to provide the required search efficiency. In this section, we describe a simple procedure to further enhance the speed of QUIPS based on data partitioning. The basic idea of tree-quantization hybrids is to combine tree-based recursive data partitioning with QUIPS applied to each partition. At the training time, one first learns a locality-preserving tree such as a hierarchical k-means tree, followed by applying QUIPS to each partition. In practice only a shallow tree is learned such that each leaf contains a few thousand points. Of course, a special case of tree-based partitioners is a flat partitioner such as k-means. At the query time, a query is assigned to more than one partition to deal with the errors caused by hard partitioning of the data. This soft assignment of the query to multiple partitions is crucial for achieving good accuracy for high-dimensional data.
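The partitioning idea described above can be sketched with a flat k-means partitioner and multi-partition probing. To keep the example short, exact inner products are used inside the probed partitions where a real system would use the quantized lookup-table scores; the function names, sizes, and data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, n_parts, n_iter=10, seed=0):
    """Plain Euclidean k-means used here as a flat partitioner."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_parts, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X ** 2).sum(1)[:, None] - 2.0 * X @ centers.T
              + (centers ** 2).sum(1)[None, :])
        labels = d2.argmin(axis=1)
        for p in range(n_parts):
            if np.any(labels == p):
                centers[p] = X[labels == p].mean(axis=0)
    return centers, labels

def search(q, X, centers, labels, n_probe, top_n):
    """Scan only the n_probe partitions whose centers have the largest dot product with q."""
    probe = np.argsort(-(centers @ q))[:n_probe]
    cand = np.flatnonzero(np.isin(labels, probe))
    scores = X[cand] @ q      # a quantized lookup-table score would replace this exact product
    return cand[np.argsort(-scores)[:top_n]]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))           # synthetic database
q = rng.normal(size=16)
centers, labels = kmeans(X, n_parts=20)

exact_top = np.argsort(-(X @ q))[:10]
all_parts = search(q, X, centers, labels, n_probe=20, top_n=10)
few_parts = search(q, X, centers, labels, n_probe=5, top_n=10)
```

Probing every partition reproduces the exhaustive result, while probing only a few trades some recall for a scan over a fraction of the database.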
In the VideoRec dataset, where , the quantization approaches (including QUIP-cov(x), QUIP-cov(q), QUIP-opt) reduce the search time by a factor of , compared to that of brute force search. The tree-quantization hybrid approaches (Tree-QUIP-cov(x), Tree-QUIP-cov(q), Tree-QUIP-opt) use 2000 partitions, and each query is assigned to the nearest 100 partitions based on its dot-product with the partition centers. These Tree-QUIP hybrids lead to a further speed up of x over QUIPS, leading to an overall end-to-end speed up of x over brute force search. To illustrate the effectiveness of the hybrid approach, we plot the precision recall curve in Fixed-bit and Fixed-time experiment on VideoRec in Figure 3. From the Fixed-bit experiments, Tree-Quantization methods have almost the same accuracy as their non-hybrid counterparts (note that the curves almost overlap in Fig. 3(a) for these two versions), while resulting in about 6x speed up. From the fixed-time experiments, it is clear that with the same time budget the hybrid approaches return much better results because they do not scan all the datapoints when searching.
We have described a quantization based approach for fast approximate inner product search, which relies on robust learning of codebooks in multiple subspaces. One of the proposed variants leads to a very simple k-Means-like learning procedure and yet outperforms the existing state-of-the-art by a significant margin. We have also introduced novel constraints in the quantization error minimization framework that lead to even better codebooks, tuned to the problem of highest dot-product search. Extensive experiments on retrieval and classification tasks show the advantage of the proposed method over the existing techniques. In the future, we would like to analyze the theoretical guarantees associated with the constrained optimization procedure. In addition, in the tree-quantization hybrid approach, the tree partitioning and the quantization codebooks are trained separately. As future work, we will consider training them jointly.
The results on ImageNet and VideoRec datasets for different number of top neighbors and different number of bits are shown in Figure 4. In addition, we compare the performance of our approach against PCA-Tree. The recall curves with respect to different number of returned neighbors are shown in Figure 5.
In this section we present proofs of all the theorems presented in the main body of the paper. We also show some additional theoretical results on our quantization based method.
In this section we prove Theorem 4.1 and show that one can also obtain balancedness property with the use of the random rotation.
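Let us denote $v = (v_1, \dots, v_d)$ and $\mathrm{perm}(v) = (B_1, \dots, B_K)$, where $B_j$ is the $j$-th block ($j = 1, \dots, K$). Let us fix some block $B_j$. For a given $i$ denote by $\xi_i^j$ a random variable such that $\xi_i^j = 1$ if $v_i$ is in the block $B_j$ after applying the random permutation and $\xi_i^j = 0$ otherwise. Notice that the random variable $N^j = \sum_{i=1}^d \xi_i^j v_i^2$ captures the part of the squared norm of the vector $v$ that resides in block $B_j$. Since the blocks have equal size, $E[\xi_i^j] = 1/K$, and we have:
$$E[N^j] = \sum_{i=1}^d E[\xi_i^j]\, v_i^2 = \frac{1}{K} \|v\|_2^2.$$
Since the analysis presented above can be conducted for every block $B_j$, we complete the proof.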
Another possibility is to use a random rotation, which can be performed for instance by applying a random normalized Hadamard matrix. The Hadamard matrix $H$ is a matrix with entries taken from the set $\{-1, 1\}$, where the rows form an orthogonal system. A random normalized Hadamard matrix can be obtained from it by first multiplying by a random diagonal matrix $D$ (where the entries on the diagonal are taken uniformly and independently from the set $\{-1, 1\}$) and then rescaling by the factor $\frac{1}{\sqrt{d}}$, where $d$ is the dimensionality of the data. Since the dot product is invariant under permutations and rotations, we end up with an equivalent problem.
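A random normalized Hadamard rotation is easy to construct when the dimensionality is a power of two. The sketch below uses the Sylvester construction and random sign flips; the function names are hypothetical, and padding for other dimensionalities is left out.

```python
import numpy as np

def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix (d must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def random_normalized_hadamard(d, rng):
    """H D / sqrt(d): random sign flips followed by a rescaled Hadamard transform."""
    signs = rng.choice([-1.0, 1.0], size=d)
    return hadamard(d) * signs[None, :] / np.sqrt(d)

rng = np.random.default_rng(0)
d = 64
R = random_normalized_hadamard(d, rng)

x = np.zeros(d)
x[0] = 5.0                              # a maximally unbalanced vector
q = rng.normal(size=d)
x_rot, q_rot = R @ x, R @ q
```

The matrix is orthogonal, so dot products are preserved, and the single nonzero coordinate of `x` is spread evenly over all coordinates, which is the balancing effect the rotation is meant to provide.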
If we take the random rotation approach then we have the following:
Let be a vector of dimensionality and let . Then after applying to linear transformation , the transformed vector is -balanced with probability at least , where is the number of blocks.
We start with the following Azuma’s concentration inequality that we will also use later:
Let be random variables such that , and for and some . Then is a martingale and the following holds for any :
Let us denote: . The th entry of the transformed is of the form: , where is the th row of and thus each (for the fixed ) takes uniformly at random and independently a value from the set .
Let us consider random variable that captures the squared -norm of the first block of the transformed vector . We have:
where the last inequality comes from the fact that for . Of course the same argument is valid for other blocks, thus we can conclude that in expectation the transformed vector is -balanced. Let us prove now some concentration inequalities regarding this result. Let us fix some . Denote . Let us find an upper bound on the probability for some fixed . We have already noted that .
Thus, by applying Lemma 8.1, we get the following:
Therefore, by the union bound, . Let us fix . Thus by taking , and again applying the union bound (over all the blocks) we conclude that the transformed vector is not -balanced with probability at most . That completes the proof. ∎
If some boundedness and balancedness conditions regarding datapoints can be assumed, we can obtain exponentially-strong concentration results regarding unbiased estimator considered in the paper. Next we show some results that can be obtained even if the boundedness and balancedness conditions do not hold. Below we present the proof of Theorem 4.2.
Let us define: , where: . We have:
Note that from Eq. (9), we get:
Let us fix now the th block (). From the -balancedness we get that every datapoint truncated to its th block is within distance to (i.e. truncated to its th block). Now consider in the linear space related to the th block the ball . Note that since the dimensionality of each datapoint truncated to the th block is , we can conclude that all datapoints truncated to their th blocks that reside in can be covered by balls of radius each, where: . We take as the set of quantizers for the th block the centers of mass of sets consisting of points from these balls. We will show now that sets: () defined in such a way are the codebooks we are looking for.
From the triangle inequality and Cauchy-Schwarz inequality, we get:
This comes straightforwardly from the way we defined sets: for .
Therefore, from Lemma 8.1, we get:
and that, by (10), completes the proof.
The following result is of its own interest since it does not assume anything about balancedness or boundedness. It shows that minimizing the objective function , where: , leads to concentration results regarding error made by the algorithm.
The following is true:
Fix some . Let us consider first the expression
that our algorithm aims to minimize. We will show that it is a rescaled version of the variance of the random variable.
where the last inequality comes from the unbiasedness of the estimator (Lemma 3.1).
Thus we obtain:
Therefore, by minimizing we minimize the variance of the random variable that measures the discrepancy between exact answer and quantized answer to the dot product query for the space truncated to the fixed th block. Denote . We are ready to give an upper bound on .