Library for fast classification in problems with large number of classes
Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.
The Maximum Inner Product Search (MIPS) problem has recently received increased attention, as it arises naturally in many large scale tasks. In recommendation systems (Koenigstein et al., 2012; Bachrach et al., 2014), users and items to be recommended are represented as vectors that are learnt at training time based on the user-item rating matrix. At test time, when the model is deployed for suggesting recommendations, given a user vector, the model will perform a dot product of the user vector with all the item vectors and pick the top $K$ items with maximum dot product to recommend. With millions of candidate items to recommend, it is usually not possible to do a full linear search within the available time frame of only a few milliseconds. This problem amounts to solving a $K$-MIPS problem.
Another common instance where the $K$-MIPS problem arises is in extreme classification tasks (Vijayanarasimhan et al., 2014), with a huge number of classes. At inference time, predicting the top-$K$ most likely class labels for a given data point can be cast as a $K$-MIPS problem. Such extreme (probabilistic) classification problems occur often in Natural Language Processing (NLP) tasks where the classes are words in a predetermined vocabulary. For example, in neural probabilistic language models (Bengio et al., 2003) the probabilities of a next word given the context of the few previous words are computed, in the last layer of the network, as a multiplication of the last hidden layer representation with a very large matrix (an embedding dictionary) that has as many columns as there are words in the vocabulary. Each such column can be seen as corresponding to the embedding of a vocabulary word in the hidden layer space. Thus an inner product is taken between each of these and the hidden representation, to yield an inner product "score" for each vocabulary word. Passed through a softmax nonlinearity, these yield the predicted probabilities for all possible words. The ranking of these probability values is unaffected by the softmax layer, so finding the $K$ most probable words is exactly equivalent to finding the ones with the largest inner product scores, i.e. solving a $K$-MIPS problem.
In many cases the retrieved result need not be exact: it may be sufficient to obtain a subset of $K$ vectors whose inner product with the query is very high, and thus highly likely (though not guaranteed) to contain some of the exact $K$-MIPS vectors. These examples motivate research on approximate $K$-MIPS algorithms. If we can obtain large speedups over a full linear search without sacrificing too much on precision, it will have a direct impact on such large-scale applications.
Formally the $K$-MIPS problem is stated as follows: given a set $\mathcal{X} = \{x_1, \dots, x_n\}$ of points and a query vector $q$, find

$$\operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; q^\top x_i \qquad (1)$$

where the $\operatorname{argmax}^{(K)}$ notation corresponds to the set of the $K$ indices providing the maximum values. Such a problem can be solved exactly in linear time by calculating all the $q^\top x_i$ and selecting the $K$ maximum items, but such a method is too costly to be used on large applications where we typically have hundreds of thousands of entries in the set.
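As a point of reference for the approximate methods, the exact linear-time solution can be sketched in a few lines. This is a minimal pure-Python illustration; the function names `dot` and `exact_k_mips` are chosen here for the example and do not come from the paper.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def exact_k_mips(points, query, K):
    """Indices of the K points with the largest inner product with `query`.

    Costs one dot product per point, i.e. linear time in len(points).
    """
    scores = [dot(query, x) for x in points]
    return heapq.nlargest(K, range(len(points)), key=scores.__getitem__)

points = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.5]]
print(exact_k_mips(points, [1.0, 1.0], K=2))  # [2, 1]
```

Every approximate method discussed here replaces the full scan over `points` with a scan over a much smaller candidate set.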
All the methods discussed in this article are based on the notion of a candidate set, i.e. a subset of the dataset that they return, and on which we will do an exact $K$-MIPS, making its computation much faster. There is no guarantee that the candidate set contains the target elements, therefore these methods solve approximate $K$-MIPS. Better algorithms will provide us with candidate sets that are both smaller and have larger intersections with the actual $K$ maximum inner product vectors.
MIPS is related to nearest neighbor search (NNS), and to maximum similarity search. But it is considered a harder problem because the inner product neither satisfies the triangle inequality as distances usually do, nor does it satisfy a basic property of similarity functions, namely that the similarity of an entry with itself is at least as large as its similarity with anything else: for a vector $x$, there is no guarantee that $x^\top x \geq x^\top y$ for all $y$. Thus we cannot directly apply efficient nearest neighbor search or maximum similarity search algorithms to the MIPS problem.
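Given a set $\mathcal{X} = \{x_1, \dots, x_n\}$ of points and a query vector $q$, the $K$-NNS problem with Euclidean distance is defined as:

$$\operatorname{argmin}^{(K)}_{i \in \mathcal{X}} \; \|q - x_i\|_2^2 \;=\; \operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; \left( q^\top x_i - \frac{\|x_i\|_2^2}{2} \right) \qquad (2)$$

and the maximum cosine similarity problem ($K$-MCSS) is defined as:

$$\operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; \frac{q^\top x_i}{\|q\|_2 \, \|x_i\|_2} \;=\; \operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; \frac{q^\top x_i}{\|x_i\|_2} \qquad (3)$$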
$K$-NNS and $K$-MCSS are different problems than $K$-MIPS, but it is easy to see that all three become equivalent provided all data vectors $x_i$ have the same Euclidean norm. Several approaches to MIPS make use of this observation and first transform a MIPS problem into an NNS or MCSS problem.
In this paper, we propose and empirically investigate a very simple approach for the approximate $K$-MIPS problem. It consists in first reducing the problem to an approximate $K$-MCSS problem (as has been previously done in Shrivastava and Li (2015)), on top of which we perform a spherical $k$-means clustering. The few clusters whose centers best match the query yield the candidate set.
The rest of the paper is organized as follows: In section 2, we review previously proposed approaches for MIPS. Section 3 describes our proposed simple solution, $k$-means MIPS, in more detail and section 4 discusses ways to further improve the performance by using a hierarchical $k$-means version. In section 5, we empirically compare our methods to the state-of-the-art in tree-based and hashing-based approaches, on two standard collaborative filtering benchmarks and on a larger word embedding dataset. Section 6 concludes the paper with a discussion of future work.
There are two common types of solution for MIPS in the literature: tree-based methods and hashing-based methods. Tree-based methods are data dependent (i.e. first trained to adapt to the specific data set) while hash-based methods are mostly data independent.
Tree-based approaches: The Maximum Inner Product Search problem was first formalized in Ram and Gray (2012), who provided a tree-based solution for the problem. Specifically, they constructed a ball tree with vectors in the database and bounded the maximum inner product with a ball. Their novel analytical upper bound for the maximum inner product of a given point with points in a ball made it possible to design a branch and bound algorithm to solve MIPS using the constructed ball tree. Ram and Gray (2012) also propose a dual-tree based search using cone trees for the case of a batch of queries. One issue with this ball-tree based approach (IP-Tree) is that it partitions the set of data points based on the Euclidean distance, while the problem hasn't effectively been converted to NNS. In contrast, PCA-Tree (Bachrach et al., 2014), the current state-of-the-art tree-based approach to MIPS, first converts MIPS to NNS by appending an additional component to the vector that ensures that all vectors are of constant norm. This is followed by PCA and by a balanced kd-tree style tree construction.
Hashing based approaches: Shrivastava and Li (2014) is the first work to propose an explicit Asymmetric Locality Sensitive Hashing (ALSH) construction to perform MIPS. They converted MIPS to NNS and used the L2-LSH algorithm (Datar et al., 2004). Subsequently, Shrivastava and Li (2015) proposed another construction to convert MIPS to MCSS and used the Signed Random Projection (SRP) hashing method. Both works were based on the assumption that a symmetric-LSH family does not exist for the MIPS problem. Later, Neyshabur and Srebro (2015) showed an explicit construction of a symmetric-LSH algorithm for MIPS which had better performance than the previous ALSH algorithms. Finally, Vijayanarasimhan et al. (2014) propose to use Winner-Take-All hashing to pick the top-$K$ classes to consider during training and inference in large classification problems.
A notable approach to address the problem of scaling classifiers to a huge number of classes is the hierarchical softmax (Morin and Bengio, 2005). It is based on prior clustering of the words into a binary, or more generally $n$-ary tree that serves as a fixed structure for the learning process of the model. The complexity of training is reduced from $O(n)$ to $O(\log n)$. Due to its clustering and tree structure, it resembles the MIPS techniques we explore in this paper. However, the approaches differ at a fundamental level. Hierarchical softmax defines the probability of a leaf node as the product of all the probabilities computed by all the intermediate softmaxes on the way to that leaf node. By contrast, an approximate MIPS search imposes no such constraining structure on the probabilistic model, and is better thought of as efficiently searching for top winners of what amounts to a large ordinary flat softmax.
In this section, we propose a simple $k$-means clustering based solution for approximate MIPS.
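We follow the previous work by Shrivastava and Li (2015) for reducing the MIPS problem to the MCSS problem by ingeniously rescaling the vectors and adding new components, making the norms of all the vectors approximately the same. Let $\mathcal{X} = \{x_1, \dots, x_n\}$ be our dataset. Let $U < 1$ and $m \in \mathbb{N}^*$ be parameters of the algorithm. The first step is to scale all the vectors in our dataset by the same factor such that $\max_i \|x_i\|_2 = U$. We then apply two mappings $P$ and $Q$, one on the data points and another on the query vector. These two mappings simply concatenate $m$ new components to the vectors, making the norms of the data points all roughly the same. The mappings are defined as follows:

$$P(x) = \left[ x, \; \tfrac{1}{2} - \|x\|_2^{2}, \; \tfrac{1}{2} - \|x\|_2^{4}, \; \dots, \; \tfrac{1}{2} - \|x\|_2^{2^m} \right] \qquad (4)$$

$$Q(x) = \left[ x, \; 0, \; 0, \; \dots, \; 0 \right] \qquad (5)$$

As shown in Shrivastava and Li (2015), mapping $P$ brings all the vectors to roughly the same norm: we have $\|P(x)\|_2^2 = \frac{m}{4} + \|x\|_2^{2^{m+1}}$, with the last term vanishing as $m \to +\infty$, since $\|x\|_2 \leq U < 1$. We thus have the following approximation of MIPS by MCSS for any query vector $q$,

$$\operatorname{argmax}^{(K)}_{i} \; q^\top x_i \;\simeq\; \operatorname{argmax}^{(K)}_{i} \; \frac{Q(q)^\top P(x_i)}{\|Q(q)\|_2 \, \|P(x_i)\|_2} \qquad (6)$$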
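The reduction can be checked numerically with a small sketch. The code below is a minimal pure-Python illustration of the scaling step and the two mappings; the defaults `U=0.8` and `m=3` are arbitrary illustrative values, not settings taken from the paper, and the helper names are made up for this example.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale_dataset(points, U=0.8):
    """Scale every point by the same factor so the largest norm equals U < 1."""
    factor = U / max(norm(x) for x in points)
    return [[factor * a for a in x] for x in points]

def P(x, m=3):
    """Data mapping: append the m components 1/2 - ||x||^(2^j), j = 1..m."""
    n = norm(x)
    return list(x) + [0.5 - n ** (2 ** j) for j in range(1, m + 1)]

def Q(q, m=3):
    """Query mapping: append m zeros, so inner products are unchanged."""
    return list(q) + [0.0] * m

m = 3
data = scale_dataset([[3.0, 4.0], [1.0, 0.0], [0.5, 0.5]])
for x in data:
    # Squared norm of P(x) versus the closed form m/4 + ||x||^(2^(m+1))
    print(norm(P(x, m)) ** 2, m / 4 + norm(x) ** (2 ** (m + 1)))
```

Because the appended query components are zero, `dot(Q(q), P(x))` equals `dot(q, x)` exactly; only the normalisation by $\|P(x_i)\|_2$ is approximate, and the approximation tightens as `m` grows.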
Assuming all data points have been transformed as $x_i \leftarrow P(x_i)$ so as to be scaled to a norm of approximately 1, the spherical $k$-means algorithm (Zhong, 2005) can efficiently be used to do approximate MCSS. Algorithm 1 is a formal specification of the spherical $k$-means algorithm, where we denote by $c_j$ the centroid of cluster $j$ ($j \in \{1, \dots, k\}$) and by $a_i$ the index of the cluster assigned to each point $x_i$. Note that we use $K$ to refer to the number of top-$K$ items to retrieve in search and $k$ for the number of clusters in $k$-means; these two quantities are otherwise unrelated.
The difference between standard $k$-means clustering and spherical $k$-means is that in the spherical variant, the data points are clustered not according to their position in the Euclidean space, but according to their direction.
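To make the clustering-by-direction idea concrete, here is a minimal pure-Python sketch of a spherical $k$-means iteration. It is not a transcription of Algorithm 1: the random initialisation, the fixed iteration count, and the handling of empty clusters are assumptions made for this example, and a practical implementation would use an array library.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Rescale v to unit norm (a zero vector is returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v] if n > 0 else list(v)

def assign_points(data, centroids):
    """Assign each unit vector to the centroid with the largest cosine similarity."""
    k = len(centroids)
    return [max(range(k), key=lambda j: dot(x, centroids[j])) for x in data]

def spherical_kmeans(points, k, iters=20, seed=0):
    """Cluster points by direction; returns (unit-norm centroids, assignments)."""
    rng = random.Random(seed)
    data = [normalize(x) for x in points]
    centroids = [list(c) for c in rng.sample(data, k)]  # assumed initialisation
    for _ in range(iters):
        assign = assign_points(data, centroids)
        for j in range(k):
            members = [x for x, a in zip(data, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return centroids, assign_points(data, centroids)

points = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
centroids, assign = spherical_kmeans(points, k=2)
print(assign)  # points 0 and 1 share one cluster, points 2 and 3 the other
```

The only differences from ordinary $k$-means are that points are normalised, similarity is a dot product between unit vectors, and each centroid is renormalised after averaging.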
To find the one vector that has maximum cosine similarity to a query point $q$ in a dataset clustered by this method, we first find the cluster $j$ whose centroid $c_j$ has the best cosine similarity with the query vector (i.e. the $j$ such that $\frac{q^\top c_j}{\|q\| \, \|c_j\|}$ is maximal) and consider all the points belonging to that cluster as the candidate set. We then simply take the point $x_i$ of that cluster that maximizes $\frac{q^\top x_i}{\|q\| \, \|x_i\|}$ as an approximation for our maximum cosine similarity vector. This method can be extended for finding the $K$ maximum cosine similarity vectors: we compute the cosine similarity between the query and all the vectors of the candidate set and take the $K$ best matches.
One issue with constructing a candidate set from a single cluster is that the quality of the set will be poor for points close to the boundaries between clusters. To alleviate this problem, we can increase the size of candidate sets by constructing them instead from the top-$p$ best matching clusters.
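The query procedure, including the boundary effect just described, can be sketched as follows. This is an illustrative pure-Python example with hand-built unit vectors in the plane; the function names and the toy data are assumptions of the example. Since all vectors have unit norm here, ranking by dot product is the same as ranking by cosine similarity.

```python
import heapq
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    t = math.radians(deg)
    return [math.cos(t), math.sin(t)]

def candidate_set(query, centroids, assign, p=1):
    """Indices of all points belonging to the p clusters that best match the query."""
    best = heapq.nlargest(p, range(len(centroids)),
                          key=lambda j: dot(query, centroids[j]))
    chosen = set(best)
    return [i for i, a in enumerate(assign) if a in chosen]

def approx_top_k(query, points, centroids, assign, K=1, p=1):
    """Exact top-K search restricted to the candidate set."""
    cand = candidate_set(query, centroids, assign, p)
    return heapq.nlargest(K, cand, key=lambda i: dot(query, points[i]))

points = [unit(10), unit(40), unit(50), unit(80)]
centroids = [unit(25), unit(65)]   # hand-built cluster directions
assign = [0, 0, 1, 1]
query = unit(44)                   # lies close to the cluster boundary

print(approx_top_k(query, points, centroids, assign, K=2, p=1))  # [1, 0]
print(approx_top_k(query, points, centroids, assign, K=2, p=2))  # [1, 2]
```

With a single cluster the second-best match (the point at 50 degrees) is missed because it sits just across the boundary; probing two clusters recovers it at the price of a larger candidate set.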
We note that other approximate search methods exploit similar ideas. For example, Bachrach et al. (2014) propose a so-called neighborhood boosting method for PCA-Tree: the path to each leaf is considered as a binary vector (based on the decision to go left or right) and, given a target leaf, all other leaves at Hamming distance one from it are also considered.
While using a single-level clustering of the data points might yield a sufficiently fast search procedure for moderately large databases, it can be insufficient for much larger collections.
Indeed, if we have $n$ points, by clustering our dataset into $\sqrt{n}$ clusters so that each cluster contains approximately $\sqrt{n}$ points, we reduce the complexity of the search from $O(n)$ to roughly $O(\sqrt{n})$. If we use the single closest cluster as a candidate set, then the candidate set size is of the order of $\sqrt{n}$. But as mentioned earlier, we will typically want to consider the two or three closest clusters as a candidate set, in order to limit problems arising from query points close to the boundary between clusters, or when doing approximate $K$-MIPS with fairly big $K$ (for example 100). A consequence of increasing candidate sets this way is that they can quickly grow wastefully big, containing many unwanted items. To restrict the candidate sets to a smaller count of better targeted items, we would need to have smaller clusters, but then the search for the best matching clusters becomes the most expensive part. To address this situation, we propose an approach where we cluster our dataset into many small clusters, and then cluster the small clusters into bigger clusters, and so on any number of times. Our approach is thus a bottom-up clustering approach.
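The trade-off between the number of clusters and the cluster size can be made explicit with a short calculation. The following is a back-of-the-envelope sketch that assumes perfectly balanced clusters and counts cost in dot products; the symbol $C(k)$ is introduced here for illustration only.

```latex
\begin{align*}
C(k) &= \underbrace{k}_{\text{centroid comparisons}}
      + \underbrace{p\,\frac{n}{k}}_{\text{candidate set}}, \\
\frac{dC}{dk} &= 1 - \frac{p\,n}{k^{2}} = 0
  \;\Longrightarrow\; k^{*} = \sqrt{p\,n},
  \qquad C(k^{*}) = 2\sqrt{p\,n}.
\end{align*}
```

For instance, with $n = 10^6$ points and $p = 1$, this gives $k^{*} = 1000$ clusters and about $2000$ dot products per query instead of $10^6$. With $p = 4$ the optimum moves to $k^{*} = 2000$ and the cost doubles to about $4000$ dot products, which illustrates why probing more clusters in a flat clustering quickly erodes the speedup.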
For example, we can cluster our dataset into small first-level clusters, and then cluster the centroids of the first-level clusters into second-level clusters, making our data structure a two-layer hierarchical clustering. This approach can be generalized to as many levels of clustering as necessary.
To search for the small clusters that best match the query point and will constitute a good candidate set, we go down the hierarchy keeping at each level only the $p$ best matching clusters. This process is illustrated in Figure 1. Since at all levels the clusters are of much smaller size, we can take much larger values for $p$.
Formally, if we have $L$ levels of clustering, let $I_l$ be a set of indices for the clusters at level $l \in \{0, \dots, L\}$. Let $c_i^{(l)}$, $i \in I_l$, be the centroids of the clusters at level $l$, with $\{c_i^{(0)}\}$ conveniently defined as being the data points themselves, and let $a_i^{(l)} \in I_{l+1}$ be the assignment of the centroids $c_i^{(l)}$ to the clusters of layer $l+1$. The candidate set is found using the method described in Algorithm 2. Our candidate set is the set of data point indices obtained at the end of the algorithm.
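The top-down search through the hierarchy can be sketched as below. This is one plausible reading of the procedure described in the text, not a transcription of Algorithm 2; the data layout (`centroids[level]` and `children[level]`), the function name, and the hand-built two-level hierarchy are all assumptions of the example.

```python
import heapq
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    t = math.radians(deg)
    return [math.cos(t), math.sin(t)]

def hierarchical_candidates(query, centroids, children, p):
    """Walk down the hierarchy, keeping the p best matching clusters per level.

    centroids[l][j]: centroid j at level l (level 0 holds the data points).
    children[l][j]:  indices at level l-1 assigned to cluster j of level l.
    Returns the indices of the data points in the final candidate set.
    """
    top = len(centroids) - 1
    active = list(range(len(centroids[top])))  # start from all top-level clusters
    for level in range(top, 0, -1):
        best = heapq.nlargest(p, active,
                              key=lambda j: dot(query, centroids[level][j]))
        active = [i for j in best for i in children[level][j]]
    return active

centroids = [
    [unit(a) for a in (0, 10, 30, 40, 60, 70, 85, 90)],  # level 0: data points
    [unit(a) for a in (5, 35, 65, 87.5)],                # level 1: small clusters
    [unit(a) for a in (20, 75)],                         # level 2: big clusters
]
children = [
    None,                              # data points have no children
    [[0, 1], [2, 3], [4, 5], [6, 7]],  # level-1 cluster -> data points
    [[0, 1], [2, 3]],                  # level-2 cluster -> level-1 clusters
]
query = unit(48)
print(hierarchical_candidates(query, centroids, children, p=1))  # [4, 5]
print(hierarchical_candidates(query, centroids, children, p=2))  # [2, 3, 4, 5]
```

In this toy example the best match (the point at 40 degrees, index 3) is missed with `p=1` because the search commits to the wrong top-level branch, and is recovered with `p=2`, where several branches are explored in parallel.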
In our approach, we do a bottom-up clustering, i.e. we first cluster the dataset into small clusters, then we cluster the small clusters into bigger clusters, and so on until we get to the top level which is only one cluster. Other approaches have been suggested, such as in Mnih and Hinton (2009), where the method employed is a top-down clustering strategy in which at each level the points assigned to the current cluster are divided into smaller clusters. The approach of Mnih and Hinton (2009) also addresses the problem that using a single lowest-level cluster as a candidate set is an inaccurate solution, by having the data points be in multiple clusters. We use an alternative solution that consists in exploring several branches of the clustering hierarchy in parallel.
In this section, we will evaluate the proposed algorithm for approximate MIPS. Specifically, we analyze the following characteristics: speedup, compared to the exact full linear search, of retrieving the top-$K$ items with largest inner product, and robustness of retrieved results to noise in the query.
We have used 2 collaborative filtering datasets and 1 word embedding dataset, which are described below:
Movielens-10M: A collaborative filtering dataset with 10,677 movies (items) and 69,888 users. Given the user-item matrix $Z$, we follow the pureSVD procedure described in Cremonesi et al. (2010) to generate user and movie vectors. Specifically, we subtracted the average rating of each user from their individual ratings and considered unobserved entries as zeros. Then we compute an SVD approximation of $Z$ with its top 150 singular components, $Z \simeq W \Sigma R^\top$. Each row in $W \Sigma$ is used as the vector representation of the user and each row in $R$ is the vector representation of the movie. We construct a database of all 10,677 movies and consider 60,000 randomly selected users as queries.
Netflix: Another standard collaborative filtering dataset with 17,770 movies (items) and 480,189 users. We follow the same procedure as described for Movielens-10M but construct 300-dimensional vector representations, as is standard in the literature (Neyshabur and Srebro, 2015). We consider 60,000 randomly selected users as queries.
Word2vec embeddings: We use the 300-dimensional word2vec embeddings released by Mikolov et al. (2013). We construct a database composed of the first 100,000 word embedding vectors. We consider two types of queries: 2,000 randomly selected word vectors from that database, and 2,000 randomly selected word vectors from the database corrupted with Gaussian noise. This acts as a test bench to evaluate the performance of different algorithms based on the characteristics of the queries.
We consider the following baselines to compare with.
PCA-Tree (Bachrach et al., 2014): This method first converts MIPS to NNS by appending an additional component to the vectors to make them of constant norm. Then the principal directions are learnt and the data is projected using these principal directions. Finally, a balanced tree is constructed using as splitting criterion at each level the median of component values along the corresponding principal direction. Each level uses a different principal direction, in decreasing order of variance.
SRP-Hash: This is the signed random projection hashing method for MIPS proposed in Shrivastava and Li (2015). SRP-Hash converts MIPS to MCSS by vector augmentation. We consider $h$ hash functions and each hash function considers $b$ random projections of the vector to compute the hash.
WTA-Hash: Winner Takes All hashing (Vijayanarasimhan et al., 2014) is another hashing-based baseline which also converts MIPS to MCSS by vector augmentation. We consider $h$ hash functions and each hash function does $r$ different random permutations of the vector. Then the prefix constituted by the first $\ell$ elements of each permuted vector is used to construct the hash for the vector.
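For readers unfamiliar with the hashing baselines, the core of a signed random projection hash can be sketched in a few lines. This is a generic pure-Python illustration of the technique, not the configuration used in the experiments: the vector augmentation step is omitted, and the function names and the Gaussian sampling of projection directions are assumptions of the example.

```python
import random

def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def srp_hash(v, projections):
    """One bit per projection: 1 if the projection of v is non-negative, else 0."""
    return tuple(1 if sum(a * b for a, b in zip(v, r)) >= 0 else 0
                 for r in projections)

# Fixed directions make the behaviour easy to follow by hand.
projections = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
print(srp_hash([2.0, 1.0], projections))    # (1, 1, 1)
print(srp_hash([4.0, 2.0], projections))    # (1, 1, 1): same direction, same hash
print(srp_hash([-2.0, -1.0], projections))  # (0, 0, 0): opposite direction
```

The hash depends only on the direction of the vector, which is why such schemes target cosine similarity and need the augmentation step before they can be applied to MIPS. Each bit costs one dot product, which is how the cost of this baseline is counted in the speedup measurements.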
In these first experiments, we consider the two collaborative filtering tasks and evaluate the speedup provided by the different approximate $K$-MIPS algorithms compared to the exact full search. Note that this section does not include the hierarchical version of $k$-means in the experiments, as the databases were small enough (fewer than 20,000 items) for flat $k$-means to perform well.
Specifically, speedup is defined as

$$\text{speedup}_{A_0}(A) = \frac{\text{Time taken by Algorithm } A_0}{\text{Time taken by Algorithm } A} \qquad (7)$$

where $A_0$ is the exact linear search algorithm that consists in computing the inner product with all training items. Because we want to compare the performance of algorithms, rather than of specifically optimized implementations, we approximate the time with the number of dot product operations computed by the algorithm (for example, the $k$-means algorithm was run on GPU while PCA-Tree was run on CPU, so wall-clock times would not be comparable). In other words, our unit of time is the time taken by a dot product. All algorithms return a set of candidates for which we do exact linear search. This induces a number of dot products at least as large as the size of the identified candidate set. In addition to the candidate set size, the following operations count towards the count of dot products:
$k$-means: dot products done with all cluster centroids involved in finding the top-$p$ clusters of the (hierarchical) search.
PCA-Tree: dot products done to project the query to the PCA space. Note that if the tree is of depth $d$, then we need to do $d$ dot products to project the query.
SRP-Hash: total number of random projections of the data (each random projection is considered a single dot product). If we have $h$ hashes with $b$ random projections each, then the cost is $h \cdot b$.
WTA-Hash: a full random permutation of the vector involves the same number of query element access operations as a single dot product. However, we consider only prefixes in the permutations, which means we only need to do a fraction of a dot product. While a dot product involves accessing all $D$ components of the vector, each permutation in WTA-Hash only needs to access $\ell$ elements of the vector. So we consider its cost to be a fraction $\ell / D$ of the cost of a dot product. Specifically, if we have $h$ hash functions each with $r$ random permutations and consider prefixes of length $\ell$, then the total cost would be $h \cdot r \cdot \ell / D$, where $D$ is the dimension of the vector.
Let us call true top-$K$ the actual $K$ elements from the database that have the largest inner products with the query. Let us call retrieved top-$K$ the $K$ elements, among the candidate set retrieved by a specific approximate MIPS, that have the largest inner products with the query. We define precision for $K$-MIPS as the number of elements in the intersection of the true top-$K$ and retrieved top-$K$ vectors, divided by $K$.
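Both evaluation quantities are simple enough to state as code. The sketch below is a minimal pure-Python illustration; the function names are made up for the example, and the speedup helper covers only the flat $k$-means cost model described above (centroid comparisons plus candidate set size).

```python
def precision_at_k(true_top, retrieved_top):
    """Size of the intersection of true and retrieved top-K sets, divided by K."""
    K = len(true_top)
    return len(set(true_top) & set(retrieved_top)) / K

def kmeans_speedup(n_points, n_centroids_compared, candidate_set_size):
    """Speedup over exact search, with time measured in dot products.

    Exact search costs n_points dot products; the clustering-based search
    costs one dot product per centroid compared plus one per candidate.
    """
    return n_points / (n_centroids_compared + candidate_set_size)

print(precision_at_k([3, 7, 9, 1], [7, 1, 4, 3]))  # 0.75
print(kmeans_speedup(10_000, 100, 150))            # 40.0
```

Varying a hyper-parameter such as the number of probed clusters changes both numbers at once, which is what produces a precision versus speedup curve.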
We varied hyper-parameters of each algorithm ($p$ in $k$-means, depth in PCA-Tree, number of hash functions in SRP-Hash and WTA-Hash), and computed the precision and speedup in each case. The resulting precision vs. speedup curves obtained for the Movielens-10M and Netflix datasets are reported in Figure 2. We make the following observations from these results:
Hashing-based methods perform better at lower speedups, but their performance decreases rapidly beyond a 10x speedup.
PCA-Tree performs better than SRP-Hash.
WTA-Hash performs better than PCA-Tree at lower speedups. However, its performance degrades faster as the speedup increases, and PCA-Tree outperforms WTA-Hash at higher speedups.
$k$-means is a clear winner as the speedup increases. Also, the performance of $k$-means degrades very slowly as the speedup increases, in contrast to the rapid decrease in performance of the other algorithms.
In this experiment, we consider a word embedding retrieval task. As a first experiment, we consider using a query set of 2,000 embeddings, corresponding to a subset of a large database of pretrained embeddings. Note that while a query is thus present in the database, it is not guaranteed to correspond to the top-1 MIPS result. Also, we'll be interested in the top-10 and top-100 MIPS performance. Algorithms which perform better in top-10 and top-100 MIPS for queries which already belong to the database preserve the neighborhood of data points better. Figure 3 shows the precision vs. speedup curve for top-1, top-10 and top-100 MIPS. From the results, we can see that data dependent algorithms ($k$-means and PCA-Tree) better preserve the neighborhood, compared to data independent algorithms (SRP-Hash, WTA-Hash), which is not surprising. However, $k$-means and hierarchical $k$-means perform significantly better than PCA-Tree in top-10 and top-100 MIPS, suggesting that they are better than PCA-Tree at capturing the neighborhood. One reason might be that $k$-means has the global view of the vector at every step while PCA-Tree considers one dimension at a time.
As the next experiment, we would like to study how different algorithms behave with respect to the noise in the query. For a fair comparison, we chose hyper-parameters for each model such that the speedup is the same (we set it to 30x) for all algorithms. We take 2,000 random word embeddings from the database and corrupt them with random Gaussian noise. We vary the scale of the noise from 0 to 0.4 and plot the performance. Figure 4 shows the performance of various algorithms on the top-1, top-10, top-100 MIPS problems, as the noise increases. We can see that $k$-means always performs better than other algorithms, even with increase in noise. Also, the performance of $k$-means remains reasonable, compared to other algorithms. These results suggest that our approach might be particularly appropriate in a scenario where word embeddings are simultaneously being trained, and are thus not fixed. In such a scenario, having a robust MIPS method would allow us to update the MIPS model less frequently.
Figure 4: Precision in top-$K$ retrieval as the noise in the query increases. We increase the standard deviation of the Gaussian noise and we see that $k$-means performs better than other algorithms.
In this paper, we have proposed a new and efficient way of solving approximate $K$-MIPS based on a simple clustering strategy, and showed it can be a good alternative to the more popular LSH or tree-based techniques. We regard the simplicity of this approach as one of its strengths. Empirical results on three real-world datasets show that this simple approach clearly outperforms the other families of techniques. It achieves a larger speedup while maintaining precision, and is more robust to input corruption, an important property for generalization, as query test points are expected to not be exactly equal to training data points. Clustering MIPS generalizes better to related, but unseen data than the hashing approaches we evaluated.
In future work, we plan to research ways to adapt the clustering on the fly for our approximate $K$-MIPS as its input representation evolves during the learning of a model, to leverage efficient $K$-MIPS to speed up extreme classifier training, and to improve precision and speedup by combining multiple clusterings.
Finally, we mention that, while putting the final touches to this paper, another very recent and different MIPS approach, based on vector quantization, came to our knowledge (Guo et al., 2015). We highlight that the first arXiv post of our work predates their work. Nevertheless, while we did not have time to empirically compare to this approach here, we hope to do so in future work.
The authors would like to thank the developers of Theano (Bergstra et al., 2010) for developing such a powerful tool. We acknowledge the support of the following organizations for research funding and computing support: Samsung, NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR.
Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252, 2005.
Proceedings of the 31st International Conference on Machine Learning, 2015.