Efficient Inner Product Approximation in Hybrid Spaces

03/20/2019 ∙ by Xiang Wu, et al.

Many emerging use cases of data mining and machine learning operate on large datasets with data from heterogeneous sources, specifically with both sparse and dense components. For example, dense deep neural network embedding vectors are often used in conjunction with sparse textual features to provide high dimensional hybrid representations of documents. Efficient search in such hybrid spaces is very challenging, as the techniques that perform well for sparse vectors have little overlap with those that work well for dense vectors. Popular techniques like Locality Sensitive Hashing (LSH) and its data-dependent variants also do not give good accuracy in high dimensional hybrid spaces. Even though hybrid scenarios are becoming more prevalent, there currently exist no techniques in the literature that are both fast and accurate. In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy. We also propose efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines. The performance of the proposed method is demonstrated on several datasets, including a very large scale industrial dataset containing one billion vectors in a billion dimensional space, achieving over 10x speedup and higher accuracy over competitive baselines.


1. Introduction

An astronomical amount of data has been produced by the human race in recent years compared with all of past history (Lynch, 2008). It is valuable to understand and process this data using data mining and machine learning. In practice, this data is generated from diverse sources: for example, it may come from web services, mobile devices, location services or digital imagery. The feature vectors representing these records have varying properties, e.g., dimensionality (high or low), sparsity (sparse or dense) and modality (categorical or continuous).

We are interested in developing effective methods to process, index and search large scale datasets of heterogeneous nature. Specifically, we focus on datasets consisting of feature vectors that combine sparse and dense features. Such representations are increasingly popular in several applications. For instance, documents have traditionally been represented as a vector of normalized n-grams (e.g., unigrams and bigrams). It is easy to have millions or billions of n-grams, leading to high dimensional representations. For each document only a few such n-grams are present, yielding a very sparse vector representation. In contrast, with the widespread use of deep learning techniques, it is common to represent text as embeddings, e.g., word2vec (Mikolov et al., 2013) or LSTM hidden states (Palangi et al., 2016). Such embeddings are typically low dimensional (in the hundreds) but dense, i.e., all features are nonzero. Both n-gram (sparse) and embedding (dense) representations have pros and cons. Embeddings capture correlations among n-gram features, which is not possible with direct n-gram representations. On the other hand, direct n-grams can "memorize" specific rare features that are helpful in document representations. To get the best of both worlds, concatenations of sparse and dense features are increasingly used in learning tasks. For instance, in the recent "Wide-and-Deep" framework (Cheng et al., 2016), data has both a dense component (from the "deep" neural network) and a sparse component (from the "wide" sparse logistic regression). Fast computation of inner products (or equivalently cosine similarity, since cosine similarity between two vectors is their inner product after normalizing both to unit L2 norm) in sparse-dense hybrid spaces is useful in several applications:

  • Finding similar items in a hybrid dataset such as DBLP (Ley, 2002);

  • Collaborative filtering with a hybrid of sparse and dense features (Cheng et al., 2016; Xiao et al., 2017);

  • Extreme classification where the classifier makes use of both dense and sparse features while the number of classes is large, as commonly seen in recommendation tasks (Shan et al., 2016; Wang et al., 2017).

1.1. Insufficiency of the existing methods

Historically, disjoint approaches to efficient inner product search have been applied to sparse and dense datasets. The most popular way to compute inner products for high dimensional sparse vectors is based on variants of inverted indices (Manning et al., 2008), which exploit the sparsity of non-zero ("active") dimensions in the data. However, the performance of inverted indices degrades quickly when even a few dimensions are dense. In contrast, search in dense data is usually based on either tree-based or hashing-based techniques. For example, KD-Tree (Bentley, 1975), VP-Tree (Yianilos, 1993) and Spill-Tree (Liu et al., 2005) have been widely used for lower data dimensionalities (typically less than one hundred). For higher dimensional data, hashing methods (Wang et al., 2016; Shrivastava and Li, 2014; Andoni et al., 2015) or quantization methods (Jegou et al., 2011; Guo et al., 2016; Babenko and Lempitsky, 2014; Kalantidis and Avrithis, 2014; Zhang et al., 2014; Norouzi and Fleet, 2013) are more popular. But none of these techniques scales to billions of dimensions without significantly degrading search recall.

Another approach is to search the sparse and dense spaces separately, using appropriate data structures, and then combine the results. However, such heuristics can easily fail when the most query-similar items in the combined space are only middling in the dense and sparse spaces individually. We also note that projection based methods such as LSH (Indyk and Motwani, 1998) and RPTree (Dasgupta and Freund, 2008) can be applied to hybrid data because they first project the data onto a low-dimensional space, but such techniques often result in sizable recall loss.

Currently, no good solutions exist for efficiently searching with high recall in the hybrid spaces that are the focus of this paper. We highlight the insufficiency of the existing methods by extensive experiments (Section 7). For example, for the large-scale QuerySim dataset consisting of a billion datapoints with 200 dense and 1 billion sparse dimensions (with about 200 nonzeros per vector), treating all the dimensions as dense is infeasible. Moreover, treating all the dimensions as sparse leads to abysmally slow search (a few minutes per query). This is because the dense dimensions of the dataset are active in all vectors, leading to full inverted lists for these dimensions. Similarly, projection based methods like LSH give very poor performance even with a large number of bits.

1.2. Contributions

This paper makes the following main contributions:

  • introduces the challenging problem of efficient search in sparse-dense hybrid vector spaces and addresses it by developing a fast approximation of inner products in such spaces,

  • identifies memory access as the bottleneck of fast inner product approximation in the sparse component, and designs a novel data structure called cache-sorted inverted index to overcome it. This data structure allows inner product approximation to be streamlined with SIMD operations on modern CPU architectures and avoids expensive cache misses (Section 3.1),

  • approximates the inner product in the dense component by learning data-dependent quantization in subspaces, indexed with cache-friendly lookup tables (LUT16) and coupled with fast SIMD computations (Section 4.1),

  • proposes residual reordering, which reduces the impact of approximation error in inner products and improves accuracy substantially (Section 5),

  • demonstrates the effectiveness of the proposed method on two public datasets and one very large industrial dataset, significantly outperforming competitive baselines (Section 7).

2. Our Approach

2.1. Overview

We start with a high-level overview of our approach. As mentioned before, we address the challenge of efficient search in hybrid spaces via fast approximation of inner products. Due to the decomposability of the inner product measure, one can approximate the inner product in the sparse and dense spaces separately. Formally, let us consider a hybrid vector $x = (x_S, x_D)$, with sparse subvector $x_S$ and dense subvector $x_D$, where $x_S \in \mathbb{R}^{d_S}$ and $x_D \in \mathbb{R}^{d_D}$. Given a query $q = (q_S, q_D)$, the inner product of $q$ and $x$ can be written as

$$\langle q, x \rangle = \langle q_S, x_S \rangle + \langle q_D, x_D \rangle, \quad (1)$$

where we refer to $\langle q_S, x_S \rangle$ and $\langle q_D, x_D \rangle$ as the sparse and dense inner products, respectively. Similarity search in this paper refers to finding the item(s) from a hybrid dataset $X = \{x_1, \ldots, x_n\}$ with the highest inner product(s) with the query $q$, i.e.,

$$x^* = \operatorname{argmax}_{x \in X} \langle q, x \rangle.$$

In this work, we approximate sparse and dense inner products independently. Specifically, we propose a novel cache-sorted inverted index for sparse inner-product, and a product code based quantization method with cache-efficient LookUp Table (LUT16) for dense inner product. The errors induced due to approximate inner-product computations are alleviated by using a second pass reordering with residual indices (Section 5), leading to substantial increase in recall. We start with a description of inverted index and product codes based inner product computation below. The proposed efficient data structures are described in Section 3.
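As a toy numerical check of Eq. (1), with made-up dimensions and random values chosen purely for illustration, the hybrid inner product decomposes exactly into its sparse and dense parts:

import numpy as np

# Toy check of Eq. (1): a hybrid vector is the concatenation of a sparse and
# a dense subvector, and its inner product with a query decomposes exactly.
rng = np.random.default_rng(0)
d_S, d_D = 1000, 200                      # assumed dimensions for the example
x_S = np.zeros(d_S); x_S[rng.choice(d_S, 5, replace=False)] = rng.standard_normal(5)
q_S = np.zeros(d_S); q_S[rng.choice(d_S, 5, replace=False)] = rng.standard_normal(5)
x_D = rng.standard_normal(d_D)
q_D = rng.standard_normal(d_D)

full = np.dot(np.concatenate([q_S, q_D]), np.concatenate([x_S, x_D]))
split = np.dot(q_S, x_S) + np.dot(q_D, x_D)
assert np.isclose(full, split)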

2.2. Inverted Index for Sparse Inner Product

Given the set of sparse vectors $X_S = \{x_{1,S}, \ldots, x_{n,S}\}$ consisting of just the sparse dimensions of the hybrid set $X$, an inverted index constructs $d_S$ inverted lists, each corresponding to a dimension $j$ from the sparse dimension set. The list $L_j$ consists of the indices of all the datapoints whose $j$-th dimension is nonzero, together with their values for this dimension, i.e., $L_j = \{(i, x_{i,j}) : x_{i,j} \neq 0\}$, where $x_{i,j}$ denotes the $j$-th element of the $i$-th vector in $X_S$. The complete inverted index is defined by $\mathcal{I} = \{L_j\}_{j=1}^{d_S}$. The inner product of the sparse component $q_S$ of the query with the $i$-th vector of the dataset can be computed by summing over the nonzero dimensions $j$ of $q_S$:

$$\langle q_S, x_{i,S} \rangle = \sum_{j:\, q_{S,j} \neq 0,\ (i,\, x_{i,j}) \in L_j} q_{S,j}\, x_{i,j}.$$

Figure 1. Basics of the inverted index for indexing the sparse component of the dataset. Essentially, an inverted index keys the dataset by its nonzero dimensions. Inner products of a query $q_S$ with all datapoints are computed by accumulating the partial products $q_{S,j}\, x_{i,j}$ as described in Section 2.2. Each small tile indicates a nonzero value from a datapoint, and the color codes indicate values on the same dimension.

While iterating over each nonzero dimension of $q_S$, one can accumulate the inner products for all points in $X_S$ that appear in the corresponding inverted list, as illustrated in Figure 1. This accumulation-based approach was first suggested in the information retrieval literature (Wong and Lee, 1993; Moffat and Zobel, 1996; Zobel and Moffat, 2006), and was also employed by (Bayardo et al., 2007).
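The sketch below is a minimal illustration of this accumulation under assumed names, using a SciPy column-compressed matrix as the inverted index: one accumulator per datapoint, updated only for the inverted lists of the query's nonzero dimensions.

import numpy as np
from scipy.sparse import csc_matrix

def sparse_inner_products(X_S: csc_matrix, q_dims, q_vals) -> np.ndarray:
    # One inverted list per dimension j: datapoint ids and values with x_{i,j} != 0.
    acc = np.zeros(X_S.shape[0], dtype=np.float32)      # one accumulator per datapoint
    for j, qv in zip(q_dims, q_vals):                    # nonzero dimensions of the query
        lo, hi = X_S.indptr[j], X_S.indptr[j + 1]
        ids, vals = X_S.indices[lo:hi], X_S.data[lo:hi]  # inverted list L_j
        acc[ids] += qv * vals                            # accumulate partial products
    return acc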

Although the idea of an inverted index is fairly simple, it is non-trivial to implement a practically efficient solution. Several implementation heuristics have been proposed in the literature (Wong and Lee, 1993; Moffat and Zobel, 1996; Zobel and Moffat, 2006), but they are mostly based on assumptions about older computer architectures, such as the slowness of floating-point operations, which is no longer an issue on modern CPUs. Another line of heuristics was aimed at making disk-based access more efficient, which is again not relevant to modern distributed systems that tend to use in-memory inverted indices. In this paper, we claim that the major bottleneck in a practical inverted index implementation is memory access, and we propose a new technique called cache sorting which can give several-fold gains in sparse inner product computation. More details are given in Section 3.2.

2.3. Product Codes for Dense Inner Product

For indexing high-dimensional dense data, product codes based on quantization in subspaces (Gersho and Gray, 2012) have become very popular. Product codes were used for fast Euclidean distance based search by (Jegou et al., 2011), which along with its variants have regularly claimed top spots on public benchmarks such as GIST1M, SIFT1B (Jegou et al., 2011) and DEEP1B (Babenko and Lempitsky, 2016). Product codes were later extended to the inner-product similarity measure by (Guo et al., 2016), which showed dramatic improvement in performance over LSH based techniques. Recently product codes have also been used for general matrix compression in (Blalock and Guttag, 2017).

We approximate the inner products for the dense component of hybrid vectors via product quantization, similar to (Guo et al., 2016). The dense dimensions are split into $K$ blocks of contiguous subvectors and vector quantization is applied to each subspace separately. Here, Vector Quantization (VQ) approximates a $d$-dimensional vector $x$ by finding its closest quantizer in a codebook $U$:

$$\varphi_{VQ}(x; U) = \operatorname{argmin}_{u \in U} \|x - u\|_2,$$

where $U \in \mathbb{R}^{d \times c}$ is the quantization codebook with $c$ codewords.

To index a hybrid vector's dense component $x_D$, we first decompose it into $K$ subvectors, $x_D = (x_D^{(1)}, \ldots, x_D^{(K)})$, leading to its product quantization:

$$\varphi_{PQ}(x_D; \{U^{(k)}\}) = \big(\varphi_{VQ}(x_D^{(1)}; U^{(1)}), \ldots, \varphi_{VQ}(x_D^{(K)}; U^{(K)})\big), \quad (2)$$

where $x_D^{(k)}$ denotes the $k$-th subvector of $x_D$, $\{U^{(k)}\}_{k=1}^K$ is a collection of codebooks, and $U^{(k)}$ is the PQ codebook with $c$ sub-quantizers for the $k$-th subspace. The codebooks are learned using $k$-Means in each subspace independently (Guo et al., 2016).

The indexing (quantization) procedure is illustrated graphically in Figure 2, and the dense inner product is then approximated as the inner product of $q_D$ and the product quantization of $x_D$:

$$\langle q_D, x_D \rangle \approx \langle q_D, \varphi_{PQ}(x_D) \rangle = \sum_{k=1}^{K} \langle q_D^{(k)}, \varphi_{VQ}(x_D^{(k)}; U^{(k)}) \rangle. \quad (3)$$
Figure 2. An illustration of indexing the dense component of the dataset using a set of learned quantization codebooks $\{U^{(k)}\}$, one for each subspace. The resulting representation contains $K$ codes, one for each subspace. In the above example, $K = 3$ and $c = 16$ (4 bits/code). The color codes indicate matching dimensionality.
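A minimal sketch of this indexing step follows, assuming equal-width subspaces and k-Means codebooks; the function names are illustrative, not the paper's implementation.

import numpy as np
from sklearn.cluster import KMeans

def train_pq_codebooks(X_D: np.ndarray, K: int = 8, c: int = 16, seed: int = 0):
    # Learn one codebook of c codewords per subspace; assumes d_D is divisible by K.
    n, d_D = X_D.shape
    width = d_D // K
    codebooks = []
    for k in range(K):
        sub = X_D[:, k * width:(k + 1) * width]          # k-th subspace
        km = KMeans(n_clusters=c, n_init=4, random_state=seed).fit(sub)
        codebooks.append(km.cluster_centers_)            # shape (c, width)
    return codebooks

def pq_encode(x_D: np.ndarray, codebooks) -> np.ndarray:
    # Assign each subvector to its nearest codeword; returns K integer codes.
    codes = []
    for k, U in enumerate(codebooks):
        width = U.shape[1]
        sub = x_D[k * width:(k + 1) * width]
        codes.append(np.argmin(np.linalg.norm(U - sub, axis=1)))
    return np.array(codes, dtype=np.uint8)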

3. Efficient Data Structures

In addition to asymptotic complexity, the success of an algorithm often depends upon efficient implementation on modern computing architectures, where memory bandwidth, cache access patterns, and utilization of instruction level parallelism (ILP) can have an order-of-magnitude impact on performance. Well-implemented algorithms can sometimes even outperform algorithms with lower asymptotic complexity (a well-known example being matrix multiplication).

We found these factors to be critical to inner product search. Although we have described our formulation for indexing hybrid datasets, in practice an efficient implementation is non-obvious. In this section, we focus on the computational aspects of implementing hybrid inner product search.

3.1. Memory I/O and Inverted Index

Modern x86 processors arrange memory into a sequence of 64-byte “cache-lines”. When a program reads from a memory location that is not already cached, the CPU loads the 64-byte cache-line containing that address. The CPU never reads or writes fewer than 64 bytes at a time, even if the program only utilizes a small portion of that cache-line.

In practice, query performance for sparse inverted indices is significantly more constrained by the memory bandwidth required to access the accumulator than by the rate at which the CPU performs arithmetic operations. Simply counting the expected number of cache-lines touched per query provides an accurate estimation of query time.

Each accumulator cache-line can hold a fixed number of values $B$; on x86, $B$ is 16 for 32-bit accumulators and 32 for 16-bit ones. Within a given dataset, each aligned block of $B$ consecutive datapoints shares an accumulator cache-line. For a particular dimension, if any of these $B$ datapoints is nonzero, all queries active in that dimension will have to access the corresponding cache-line. The cost of processing a second datapoint within the same cache-line is negligible compared to the cost of accessing another cache-line.

The expected query cost can be estimated as follows. For a sparse dataset $X_S$ containing $n$ datapoints, where $x_{i,j}$ is the value of the $i$-th datapoint on the $j$-th dimension and $q_j$ is the probability that the $j$-th dimension is active in a randomly sampled query vector:

$$\mathrm{Cost}(X_S) = \sum_{j=1}^{d_S} q_j \sum_{b=1}^{\lceil n/B \rceil} \mathbb{1}\big[\exists\, i \in \text{block } b : x_{i,j} \neq 0\big].$$

Given a sparse dataset $X_S$, our goal is to find a permutation $\pi$ for the ordering of datapoints which minimizes $\mathrm{Cost}(\pi(X_S))$, a process that we call cache sorting.
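A small sketch of this cost model (under assumed names, with the sparse components stored column-wise) simply counts, for every dimension, the aligned accumulator blocks that contain at least one nonzero and weights them by the query activity probability:

import numpy as np
from scipy.sparse import csc_matrix

def expected_cachelines(X_S: csc_matrix, query_prob: np.ndarray, B: int = 16) -> float:
    # Cost(X_S): for each dimension j, count the aligned blocks of B consecutive
    # datapoints that contain a nonzero, weighted by the query probability q_j.
    cost = 0.0
    for j in range(X_S.shape[1]):
        rows = X_S.indices[X_S.indptr[j]:X_S.indptr[j + 1]]  # datapoints nonzero in dim j
        touched_blocks = np.unique(rows // B)                # accumulator cache-lines hit
        cost += query_prob[j] * len(touched_blocks)
    return cost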

Figure 3. Motivation of cache sorted inverted index. The white color indicates that a cache-line can be skipped altogether. Note that after sorting, many more cache-lines can be skipped and those accessed are also more sequential.

3.2. Cache Sorting

CacheSort(X_S) begin
       input : A sparse dataset X_S with n datapoints and d_S dimensions
       output : A permutation π of the datapoint indices
       for j ← 1 to d_S do
             n_j ← |{i : x_{i,j} ≠ 0}|
       σ ← Argsort(n_1, …, n_{d_S}) // Sort dimensions by number of nonzeros, descending.
       π ← (1, 2, …, n) // Initialize π to be the identity.
       return PartitionByDim (X_S, start=1, end=n, k=1, σ, π)
PartitionByDim(X_S, start, end, k, σ, π) begin
       input : Dataset X_S, index range to be sorted [start, end],
       rank k of the dimension to perform the partition on,
       sorted ordering σ of the dimensions
       output : Permutation π in range [start, end]
       if end − start ≤ 1 or k > d_S then
             return π[start … end]
       // Partition the range by dimension σ(k).
       for i ← start to end do
             key_i ← 1[x_{π(i), σ(k)} ≠ 0]
       π[start … end], pivot ← Argpartition (key, descending) // Recursively partition the subpartitions
       // by the next most active dimension.
       PartitionByDim (X_S, start, pivot − 1, k + 1, σ, π)
       PartitionByDim (X_S, pivot, end, k + 1, σ, π)
       return π
Algorithm 1 Cache sorting algorithm. This computes a permutation of the datapoint indices which reduces the number of main memory accesses and optimizes the access pattern. Here Argsort performs an indirect sort and returns the indices that index the data in sorted order. Argpartition performs an indirect partitioning and returns the permuted indices and the pivot position.

The general form of this optimization problem is known to be hard to solve. We thus propose a greedy algorithm that produces a highly efficient accumulator memory layout, even when compared with state-of-the-art approximation algorithms using LP relaxation. Moreover, the greedy approach takes only a few seconds even with millions of datapoints, while LP relaxation approaches are much more expensive.

The goal in cache sorting is to reorder the datapoint indices of shared active dimensions into long consecutive blocks as much as possible. This is done by greedily selecting the most active dimension (i.e., the dimension that is nonzero in most data vectors), and partitioning the dataset into two subsets, where the indices of the datapoints that have nonzero values for that dimension are contiguous. This is performed recursively in the order of the dimensions, from the most active to the least active dimension. The details of the algorithm are given in Algorithm 1.

Define $\sigma$ as the permutation on dimensions that sorts them from the most active to the least, i.e., $n_{\sigma(k)} \ge n_{\sigma(k+1)}$ for all $k$. Consider indicator variables $b_i^{(k)} = \mathbb{1}[x_{i,\sigma(k)} \neq 0]$ for a datapoint $i$, which indicate whether the $k$-th most popular dimension is nonzero. Then the above algorithm is conceptually the same as sorting the datapoints by their indicator vectors $(b_i^{(1)}, b_i^{(2)}, \ldots)$ in decreasing lexicographic order. In practice, we do not need to explicitly construct these indicator variables and the sorting is performed by recursive partitioning. Thus the average complexity of cache sorting is essentially that of a single sort over the datapoints, i.e., $O(n \log n)$.

There are many conceivable modifications to this basic approach (Algorithm 1). For example, one can imagine sorting in a Gray-code order (Gray, [n. d.]), by arranging the permuted indices of datapoints whose indicator patterns differ in only one bit to be consecutive. In practice, such modified variants often do not make a big difference in performance, and we used the simple decreasing order in our experiments. In addition, we use an optimized implementation of prefix sorting which, instead of sorting the original dataset, takes advantage of the inverted index layout, requiring just 16 bytes of temporary memory per datapoint to efficiently compute the datapoint index permutation vector.
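As a concrete illustration, the sketch below (a simplification under assumed names, not the production implementation) realizes the lexicographic view above, restricted to the most active dimensions:

import numpy as np
from scipy.sparse import csc_matrix

def cache_sort(X_S: csc_matrix, top_dims: int = 32) -> np.ndarray:
    # Order datapoints so that, for the most active dimensions, nonzero
    # entries form long contiguous runs; returns a permutation of indices.
    nnz_per_dim = np.diff(X_S.indptr)                  # nonzeros per dimension
    order = np.argsort(-nnz_per_dim)[:top_dims]        # most active dimensions first
    # keys[i, k] = 1 if datapoint i is nonzero in the k-th most active dimension.
    # Sorting these rows in decreasing lexicographic order matches the recursive
    # partitioning of Algorithm 1 restricted to the top_dims dimensions.
    keys = (X_S[:, order].toarray() != 0).astype(np.int8)
    # np.lexsort treats the last key as primary, so reverse and negate the keys.
    return np.lexsort(-keys.T[::-1])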

3.3. Efficiency of Cache Sorting

Let us view the unsorted sparse component $X_S$ as a sparse matrix with $n$ datapoints and $d_S$ dimensions, and suppose each entry of the matrix is independent. Let the $j$-th dimension of the $i$-th datapoint, denoted as $x_{i,j}$, be nonzero with probability $p_j$. Similarly, the $j$-th dimension of the query vector is nonzero with probability $q_j$.

Without loss of generality, let us assume that the dimensions are numbered as $1, \ldots, d_S$ and that the probabilities that the entries of a datapoint are nonzero are sorted in decreasing order, i.e., $p_1 \ge p_2 \ge \cdots \ge p_{d_S}$. For most real-world data, it is commonly observed that these probabilities tend to be distributed according to a power law, i.e., $p_j \propto j^{-\alpha}$ for some exponent $\alpha > 0$.

If a block of $B$ values is completely zero, we can simply skip it. Hence, the expected number of cache-lines we need to access for a given dimension $j$ is $\frac{n}{B}\big(1 - (1 - p_j)^B\big)$, and the expected number of cache-line accesses over a random query, $E[\mathrm{Cost}(X_S)]$, is:

$$E[\mathrm{Cost}(X_S)] = \sum_{j=1}^{d_S} q_j \, \frac{n}{B} \Big(1 - (1 - p_j)^B\Big). \quad (4)$$

After learning the permutation $\pi$ using the cache-sorting procedure from Section 3.2, let us denote the matrix after permutation as $\tilde{X}_S$. For dimension $j$, the recursion on the $j-1$ more active dimensions splits the datapoints into at most $2^{j-1}$ contiguous groups, within each of which the nonzeros of dimension $j$ are contiguous; so the number of contiguous blocks in dimension $j$ is at most $m_j = \min(2^{j-1}, n p_j)$, and the average number of cache-lines each such block occupies is $n p_j / (B\, m_j)$. In the worst case, no two blocks can be fit in one cache-line. Then, an upper bound on the expected number of cache-lines accessed is:

$$E[\mathrm{Cost}(\tilde{X}_S)] \le \sum_{j=1}^{d_S} q_j \Big(\frac{n p_j}{B} + m_j\Big). \quad (5)$$
Figure 4. Savings in cache-line accesses from cache sorting. (a) Fraction of accumulator cache-lines accessed at each dimension when computing inner products without and with cache sorting, given by the per-dimension terms of (4) and (5), respectively. (b) The factor by which cache sorting reduces cache-line accesses relative to the unsorted layout, as a function of $n$ and $\alpha$; here $B$ in (4) and (5) is fixed to 16.

For simplicity, we assume that the query activity follows the same distribution as the data, i.e., $q_j = p_j$. In Figure 4(a), we fix the number of datapoints $n$ and set $B = 16$ (64-byte cache-lines holding 32-bit floats), and plot the expected fraction of cache-lines accessed in the unsorted inverted index and the upper bound on this fraction in the cache-sorted inverted index at each dimension $j$. With cache sorting, the expected number of cache-line accesses, as well as the total expected number of accesses (the area under the curve), drops significantly across all dimensions.

In Figure 4(b), we study the effect of the parameters $n$, $\alpha$ and $B$. As $n$ increases, the "active" dimensions will contain more datapoints, resulting in a larger cache-sorting impact. The amount of saving also increases with the cache-line capacity $B$, which motivates us to quantize values in the inverted index to reduce the memory footprint.

The analysis above is somewhat pessimistic as it assumes zero correlation between activity patterns of different dimensions. Most real-world datasets have significant correlations among dimensions e.g., several sets of dimensions tend to be active together. Empirically, we have observed over 10x improvement in throughput on several real-world datasets from proposed cache-sorting. In addition to the benefits of reducing the total number of cache-lines read from main memory, for active dimensions, cache-sorting creates long sequential runs of nonzeros that can be encoded efficiently, and computed with vectorized SIMD instructions, further increasing the throughput.

4. Indexing and Search

4.1. Dense Component

As described in Section 2.3, we use a product codes based approach for fast inner product approximation in the dense space. Given Product Quantization (PQ) codebooks $\{U^{(k)}\}_{k=1}^K$ and a dense vector $x_D$, the indices of the nearest centers that each of the $K$ subvectors is assigned to are concatenated together into a sequence of $K$ integer codes. If there are $c$ centers in each subspace codebook, the overall code length is $K \log_2 c$ bits.

4.1.1. Asymmetric Distance Computation (ADC)

ADC allows us to compute the inner product between a quantized dense component and an unquantized query with high efficiency and accuracy:

$$\langle q_D, \varphi_{PQ}(x_D) \rangle = \sum_{k=1}^{K} \langle q_D^{(k)}, U^{(k)}_{\mathrm{code}_k(x_D)} \rangle.$$

Observe that in each subspace $k$, $\varphi_{VQ}(x_D^{(k)}; U^{(k)})$ is essentially the index $\mathrm{code}_k(x_D)$ of one of the $c$ vectors from codebook $U^{(k)}$. Thus, to approximate the subproduct $\langle q_D^{(k)}, \varphi_{VQ}(x_D^{(k)}; U^{(k)}) \rangle$, one simply precomputes a lookup table $T^{(k)}[j] = \langle q_D^{(k)}, U^{(k)}_j \rangle$, $j = 1, \ldots, c$, for each subspace based on the input query $q_D$. The subproduct in subspace $k$ is then a simple lookup in $T^{(k)}$. The overall inner product is the sum of all subproducts.
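A sketch of ADC using the codebooks and codes from the earlier PQ sketch (names are illustrative): one table per subspace is built from the query, and each datapoint's dense inner product is then approximated by $K$ table lookups.

import numpy as np

def build_lookup_tables(q_D: np.ndarray, codebooks) -> np.ndarray:
    # T^(k)[j] = <q_D^(k), U^(k)_j>, one row per subspace.
    tables, offset = [], 0
    for U in codebooks:                        # U has shape (c, width)
        width = U.shape[1]
        tables.append(U @ q_D[offset:offset + width])
        offset += width
    return np.stack(tables)                    # shape (K, c)

def adc_inner_product(codes: np.ndarray, tables: np.ndarray) -> float:
    # One lookup per subspace; the sum approximates <q_D, x_D>.
    return float(tables[np.arange(len(codes)), codes].sum())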

4.1.2. AVX2 In-Register Lookup Table

An in-memory lookup table with 256 codewords (LUT256) was proposed in (Jegou et al., 2011; Ge et al., 2014; Martinez et al., 2014). Our implementation uses 16 codewords (LUT16), which is also reported in (André et al., 2015, 2017; Blalock and Guttag, 2017). While this results in slightly lower per-bit database encoding efficiency, LUT16 can be implemented via an in-register lookup table using the PSHUFB x86 instruction. This instruction was introduced in the SSSE3 instruction set, operating on 128-bit registers, and was later extended to 256-bit registers in the AVX2 instruction set. Each AVX2 PSHUFB instruction performs 32 parallel 16-way lookups of 8-bit values. To perform accumulation without integer overflow, each register of 32 8-bit values must be extended to 2 registers of 16 16-bit values, followed by PADDW instructions that perform parallel addition.

In this work, we present a significant optimization of the 8-bit to 16-bit width-extension operation, described as follows. First, we observe that we can implement width-extension of unsigned integers more efficiently than for signed integers. When constructing lookup tables, we bias the quantized lookup values from [-128,+127] to [0, 255], perform unsigned accumulation, then from the final values, we subtract the net effect of this bias.

Unsigned width extension (aka "zero-extension") from 1 register of 8-bit values into 2 registers of 16-bit values can be accomplished using 2 instructions as follows. Each 16-bit slot in the source register contains two of our 8-bit values (a high byte and a low byte). The PSRLW instruction bit-shifts the source register right by 8 bits to extract the high bytes, and the PAND instruction zeros out the upper 8 bits of each slot to extract the low bytes.

We eliminate the PAND instruction, using the original register of 8-bit values as-is. This produces a result that is wrong by a large but known value; then we restore the correct value during a post-processing step. This works despite the fact that the unzeroed bits cause our 16-bit accumulators to overflow many times. Overflows during addition are perfectly matched by a corresponding underflow during subtraction.
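The following scalar model illustrates only the biasing idea above; it is not the AVX2 kernel nor the PAND-elimination trick, and the sizes are assumptions chosen for the example:

import numpy as np

K, C = 8, 16                                   # subspaces, codewords (LUT16)
rng = np.random.default_rng(0)
luts = rng.integers(-128, 128, size=(K, C))    # signed 8-bit lookup values
codes = rng.integers(0, C, size=K)             # one 4-bit code per subspace

reference = luts[np.arange(K), codes].sum()    # direct signed accumulation

biased = (luts + 128).astype(np.uint8)         # shift into the unsigned byte range
acc = np.zeros((), dtype=np.uint16)
for k in range(K):
    acc += biased[k, codes[k]]                 # unsigned 16-bit accumulation
result = int(acc) - 128 * K                    # subtract the net effect of the bias

assert result == reference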

When operating on batches of 3 or more queries, our AVX2 implementation is able to sustain 16 lookup-accumulate operations per clock-cycle on Intel Haswell CPUs, including the overhead for post-processing the final inner products. This is 8x better than a LUT256 implementation's architectural upper bound of two scalar loads per clock-cycle. When LUT16 operates on 1 query at a time, the CPU can compute inner products faster than it can stream the PQ codes from main memory. This constrains single-query performance to the maximum memory bandwidth available.

4.1.3. Approximation Error from Product Quantization

We first recap the rate distortion bound of parallel Gaussian channels (Cover and Thomas, 2006) from information theory:

Proposition 4.1.

The lowest expected squared error (MSE) achievable with $b$ bits of quantization for $d$ i.i.d. Gaussian dimensions, each of which has variance $\sigma^2$, is:

$$\mathrm{MSE} \ge d\, \sigma^2\, 2^{-2b/d}.$$

In practice, with Lloyd's $k$-Means algorithm on Gaussian white noise (Ge et al., 2014), we can achieve an error within a few percent of this theoretical lower bound. For inner product computation, we can always whiten the dense component by multiplying with the matrix $\Sigma^{-1/2}$, where $\Sigma$ is the covariance of the dense components. At query time, $q_D$ is multiplied by $\Sigma^{1/2}$, which leaves inner products unchanged. This is a special case of QUIPS (Guo et al., 2016), where the query distribution is the same as the datapoint distribution. So from this point on, we assume that there exists a small constant $\epsilon$ such that, with suitable seeding, we can always achieve an MSE of at most $(1 + \epsilon)$ times this lower bound. We then present the following bound using Azuma's inequality:

Proposition 4.2.

The difference between the exact inner product and its approximation is bounded by $t$ with probability:

$$\Pr\Big[\big|\langle q_D, x_D \rangle - \langle q_D, \varphi_{PQ}(x_D) \rangle\big| \le t\Big] \ge 1 - 2\exp\!\Big(\frac{-t^2}{2 K c_{\max}^2}\Big),$$

where $c_{\max}$ is an upper bound on $\big|\langle q_D^{(k)}, x_D^{(k)} - \varphi_{VQ}(x_D^{(k)}; U^{(k)}) \rangle\big|$, and $q_D^{(k)}$, $x_D^{(k)}$ and $\varphi_{VQ}(x_D^{(k)}; U^{(k)})$ are the $k$-th subvectors of $q_D$, $x_D$ and $\varphi_{PQ}(x_D)$, respectively, under product quantization. The probability is defined over the randomness in the query.

We also note that the upper bound $c_{\max}$ is taken over all datapoints $x$ and all subspaces $k$.

4.2. Sparse Component

We can generate an even sparser representation of an already sparse dataset by pruning out small entries while maintaining an accurate approximation of inner products. Pruning reduces the size of the index and the number of arithmetic operations during search. A sparser dataset also helps cache sorting generate an even more contiguous memory layout of the index.

We first observe that on average only a small number of dimensions contribute nonzero products to the overall inner product between two sparse vectors. Also, in each dimension, the occurrence of large absolute values is rare across a dataset. These two phenomena combined make the contribution from small entries to the overall inner product much less significant than that from large entries, especially if our goal is to identify the largest inner products. Mathematically, we represent a sparse vector as $x_S = \{(j, x_j) : x_j \neq 0\}$. Given a set of thresholds $\{t_j\}$, one for each dimension, pruning outputs a sparser representation as: $\hat{x}_S = \{(j, x_j) : |x_j| > t_j\}$.
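A minimal sketch of this pruning step over a column-compressed sparse matrix, assuming the per-dimension thresholds $t_j$ are given (e.g., chosen from per-dimension value percentiles):

import numpy as np
from scipy.sparse import csc_matrix

def prune(X_S: csc_matrix, thresholds: np.ndarray) -> csc_matrix:
    # Keep only entries whose magnitude exceeds the per-dimension threshold t_j.
    X_pruned = X_S.copy().tocsc()
    for j in range(X_pruned.shape[1]):
        lo, hi = X_pruned.indptr[j], X_pruned.indptr[j + 1]
        vals = X_pruned.data[lo:hi]
        vals[np.abs(vals) <= thresholds[j]] = 0.0        # drop small entries
    X_pruned.eliminate_zeros()                           # shrink the stored index
    return X_pruned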

4.2.1. Approximation Error from Pruning

We now investigate the error in inner product approximation due to pruning. We proceed with the assumption that both $q_S$ and $x_S$ are generated by the following process:

  • Each dimension $j$ is independently nonzero with probability $p_j$ and zero with probability $1 - p_j$.

  • If nonzero, the value for dimension $j$ is sampled from a distribution $D_j$, and the support of $D_j$ is bounded.

The error of pruning can be characterized with a Chernoff bound:

Proposition 4.3.

The difference between the exact inner product and its approximation is bounded by $t$ with probability:

$$\Pr\Big[\big|\langle q_S, x_S \rangle - \langle q_S, \hat{x}_S \rangle\big| \le t\Big] \ge 1 - \Big(\frac{e\mu}{m}\Big)^{m}, \qquad \mu = \sum_j p_j^2,$$

where $\hat{x}_S$ is the pruning output of $x_S$ and $m$ is the minimum number of pruned entries that must be simultaneously active in both vectors to cause an error larger than $t$.

The probability of two vectors having nonzero values for the same dimension is typically low, i.e., $p_j^2 \ll 1$ and hence $\mu \ll m$. So the above probability approaches 1 asymptotically at the rate of $1 - (e\mu/m)^m$, where $m$ is the minimum number of entries that need to be removed from the sparse component to cause an error bigger than $t$.

5. Residual Reordering for Improved Search

The index that we build provides a lossy approximation of the original component, and we define the residual as the difference between the original component and its index approximation. When retrieving datapoints from the dataset, we can overfetch candidates from the data index and then use the residuals to reorder these overfetched candidates. Due to the simple decomposition $\langle q, x \rangle = \langle q, \hat{x} \rangle + \langle q, x - \hat{x} \rangle$, where $\hat{x}$ denotes the index approximation of $x$, we can obtain the exact inner product by adding the inner product between the query and the residual to the approximate inner product from the index. With the exact inner product, we can refine the candidates and improve our accuracy for the top candidates significantly. This allows us to reap the benefit of the fast index scanning described in the previous sections while maintaining high recall, at a small cost due to residual reordering. To further reduce the cost of this residual reordering step, we can also build another index on the residual, which is much more precise than the data index. We apply this paradigm multiple times to achieve the best trade-off between speed and accuracy:

  1. Overfetch from both sparse and dense data indices: we take the sum of the approximate inner products from the sparse and dense components and retain the largest $N_1$ candidates for further reordering. The parameter $N_1$ can be tuned to balance recall and search time.

  2. Reorder with the dense residual index and retain $N_2$: we perform residual reordering with the dense residual index for the remaining $N_1$ datapoints and only retain the top $N_2$ for the last reordering. The parameter $N_2$ also needs to be tuned.

  3. Reorder with the sparse residual index and return the top $h$: we perform residual reordering with the sparse residual index for the remaining $N_2$ datapoints and return the largest $h$ as the final search result.

Reordering is always performed on a small subset of residuals, and its run time in practice is much shorter than the time spent scanning the sparse and dense data indices. In our similarity search benchmarks, the residual reordering logic consumes only a small fraction of the overall search time. On the other hand, it improves the final retrieval recall significantly.
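A minimal sketch of this three-stage retrieval; the scoring callables and the default values of $N_1$, $N_2$ and $h$ are assumptions for illustration.

import numpy as np

def hybrid_search(approx_scores, residual_dense_scores, residual_sparse_scores,
                  N1=20000, N2=2000, h=20):
    # approx_scores: approximate sparse + dense inner products for all datapoints.
    # residual_*_scores(ids): residual corrections for the given datapoint ids.
    # Assumes h <= N2 <= N1 < len(approx_scores).
    # Stage 1: overfetch the N1 best candidates by the approximate scores.
    top1 = np.argpartition(-approx_scores, N1)[:N1]
    # Stage 2: add the dense residual correction and keep the best N2.
    s1 = approx_scores[top1] + residual_dense_scores(top1)
    keep = np.argpartition(-s1, N2)[:N2]
    top2, s2 = top1[keep], s1[keep]
    # Stage 3: add the sparse residual correction and return the top h.
    s3 = s2 + residual_sparse_scores(top2)
    return top2[np.argsort(-s3)[:h]]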

5.1. Retrieval Performance

In the context of similarity search, rank preservation is of great interest, and our hybrid approach with residual reordering gives the following retrieval performance guarantee:

Proposition 5.1.

Given a fixed query $q$, define the gap as $\Delta = \langle q, x^{(h)} \rangle - \langle q, x^{(h+1)} \rangle$. The proposed hybrid search approach, with reordering sizes $N_1 \ge N_2 \ge h$, achieves a recall@$h$ of at least

$$\Pr\Big[\big|\langle q, x \rangle - \widehat{\langle q, x \rangle}\big| < \Delta / 2 \ \text{ for all } x \in X\Big],$$

where $x^{(h)}$ is the $h$-th best datapoint in terms of inner product, $x^{(h+1)}$ is the $(h+1)$-th best, and $\widehat{\langle q, x \rangle}$ denotes the approximate inner product used for candidate selection. The probability is defined over the randomness in the datapoint generation.

This proposition states that recall is at least the probability of the event that the inner product approximation error is less than half the gap between the $h$-th and the $(h+1)$-th largest inner products. Based on the bounds from Propositions 4.2 and 4.3, we can almost always choose a quantization bitrate and pruning thresholds that make this probability close to 1, so long as the inner product distribution exhibits a quickly vanishing upper tail. For most real-world datasets, under the setting of $h = 20$, this is empirically sufficient to achieve recall in the 90s.

6. Overall Indexing Algorithm

To efficiently approximate inner products between the datapoints in the dataset $X$ and a query $q$, we index the dataset as follows:

  1. Build an index for the sparse component by pruning with per-dimension thresholds $t_j^{(1)}$:

    (6) $\hat{x}_S = \{(j, x_j) : |x_j| > t_j^{(1)}\}$

    The residual is thus $r_S = x_S - \hat{x}_S$. To efficiently compute the inner product between the query and the residuals, we also build another pruned index of the residuals with thresholds $t_j^{(2)}$:

    (7) $\hat{r}_S = \{(j, (r_S)_j) : |(r_S)_j| > t_j^{(2)}\}$

  2. Build an index for the dense component by applying product quantization (PQ) with codebooks $\{U^{(k)}\}$ for $K$ subspaces, i.e., $\hat{x}_D = \varphi_{PQ}(x_D; \{U^{(k)}\})$.

    Similar to sparse indexing, we build another PQ index of the residuals with codebooks $\{V^{(k)}\}$ for $K'$ subspaces as:

    $\hat{r}_D = \varphi_{PQ}\big(x_D - \hat{x}_D;\ \{V^{(k)}\}\big).$
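Putting the pieces together, a minimal sketch of the two-level index construction follows, reusing the prune, train_pq_codebooks and pq_encode helpers sketched earlier; the specific codebook sizes and the threshold vectors t1, t2 here are illustrative assumptions.

import numpy as np

def build_hybrid_index(X_S, X_D, t1, t2):
    # Sparse side: aggressive pruning for the data index, milder pruning of the
    # residual for the residual index (Eq. (6) and (7)).
    X_S_hat = prune(X_S, t1)
    R_S = (X_S - X_S_hat).tocsc()                # sparse residual
    R_S_hat = prune(R_S, t2)

    # Dense side: low-bitrate PQ (16 codewords, ~2 dims per subspace) for the
    # data index; per-dimension codebooks for the residual index.
    # Assumes the dense dimensionality is even.
    K = X_D.shape[1] // 2
    codebooks = train_pq_codebooks(X_D, K=K, c=16)
    codes = np.array([pq_encode(x, codebooks) for x in X_D])
    X_D_hat = np.hstack([codebooks[k][codes[:, k]] for k in range(K)])
    R_D = X_D - X_D_hat                          # dense residual
    res_codebooks = train_pq_codebooks(R_D, K=X_D.shape[1], c=256)
    res_codes = np.array([pq_encode(r, res_codebooks) for r in R_D])
    return X_S_hat, R_S_hat, codes, codebooks, res_codes, res_codebooks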

6.1. Parameter Selection

6.1.1. Dense Index Parameters

The first dense data index is built at a relatively low bit rate: 16 codewords per codebook and 2 dense dimensions per subspace, which means we store 4 bits for on average 2 dense dimensions. This immediately achieves a data index size reduction of 16x compared to the original data, which requires 32 bits per dimension for a floating-point number. It also enables us to apply the in-register table lookup (LUT16), which outperforms alternative arithmetic or in-memory table lookup operations by at least 4x. The second, residual index is built with one dimension per subspace and 256 codewords per codebook. Since we now treat each dimension as a subspace, we can directly apply scalar quantization with a distortion of at most 1/256 of the dynamic range. This residual index is a quarter of the size of the original dataset, and its error in inner product approximation is unnoticeable for our tasks.

6.1.2. Sparse Index Parameters

For fast approximation, we normally set the first threshold $t_j^{(1)}$ from Equation 6 to a relatively high value, so that only the largest nonzero values in each dimension $j$ are kept in the data index. This high threshold reduces the size of the data index to about two orders of magnitude smaller than the original dataset, and hence achieves a significant speedup in index scan. Empirically, this hyper-sparse index induces little loss in retrieval performance. The second threshold $t_j^{(2)}$ from Equation 7 is set to a relatively low value, so that most of the nonzero values are kept in the residual index. The reason is that we only need to access the residual index for a moderate number of datapoints, so its size does not affect computation speed as directly as the size of the data index. With this relatively accurate residual index, the sum of the data index approximation and the residual index approximation is almost exact, which helps lift recall into the high 90s (in percent).

7. Experiments

7.1. Evaluation Datasets

In this work, we use two public datasets and one large scale industrial dataset for experiments.

7.1.1. Public Datasets

We use hybrid versions of two public datasets, Netflix (Bennett et al., 2007) and Movielens (Harper and Konstan, 2015). Each dataset contains a sparse collection of user-movie-rating triplets. Our goal is to look for users in the dataset that have similar movie preferences to the users in the query set. The sparse component of a user comes directly from the set of ratings that the user provides. For the dense components, we first assemble the user-movie-rating matrix $R$, whose rows represent users, whose columns represent movies, and whose values are the ratings from 1 to 5. We follow the classic collaborative filtering approach (Billsus and Pazzani, 1998) and perform Singular Value Decomposition $R \approx U \Sigma V^{\top}$ on the sparse matrix $R$. The dense components of all users are given by $U$ weighted by $\Sigma$, i.e., $U\Sigma$. So the combined hybrid vector representation of the users is $[U\Sigma, R]$, in which each row is a user embedding. The dense component dimensionality is fixed to 300. Other details can be found in Table 2. We randomly sample 10k embeddings as the query set and use the rest as the dataset.

7.1.2. QuerySim Dataset

We introduce a hybrid dataset that is built from a large set of most frequent queries in web search for the task of identifying similar queries. Each query is processed through two pipelines to generate a sparse and a dense component of its vector representation. The sparse pipeline generates a sparse vector, representing the bag of unigrams and bigrams in the query and each gram’s weight is set to a customized value. The dense pipeline generates a dense vector representation using a neural network similar to (Le and Mikolov, 2014). We fine tuned the relative weights for the sparse and dense inner products to optimize a more comprehensive metric such as the area under ROC. Finally, all datapoints are scaled with the learned relative weights and stored in a database.

The overall dataset contained one billion points and the dimensionality of the hybrid vector was over 1 billion (Table 1). Zooming into the sparse component, we first demonstrate the power-law distribution of the number of nonzeros across dimensions in Figure 5(a). We then visualize the long tail in the distribution of the values of nonzero entries in the sparse component in Figure 5(b).

#datapoints: ~1 billion    #Dense Dims: 203    #Active Sparse Dims: ~1 billion
#Avg Sparse nonzeros: 134    On-disk Size: 5.8TB    Update Frequency: Weekly
Table 1. QuerySim dataset used for the sparse-dense hybrid similarity search benchmark.
Figure 5. Statistics on the sparse component of the QuerySim dataset: (a) the power-law distribution of the number of nonzeros for sorted dimensions in log-scale, and (b) the histogram of the nonzero values with median 0.054, 75% percentile 0.12, and 99% percentile 0.69, which justifies the pruning strategy for sparse index.

7.2. Recall and Speedup

Our goal is to compare the speed and recall of top 20 items from the entire dataset for each technique. All single machine benchmarks are conducted on a workstation with an Intel CPU clocked at 3.5GHz, with 64GB main memory. Distributed benchmarks are performed on a cluster with 200 servers that are equipped with an Intel CPU clocked at 2GHz and 128GB main memory. Both CPUs support the AVX2 instruction set.

For comparison we use the following well-known baselines:

Exact methods: with Dense Brute Force, we pad 0's to the sparse component to make the dataset completely dense; with Sparse Brute Force, we append the sparse representation of the dense component to the end of the sparse component to make the dataset completely sparse. With Sparse Inverted Index, we perform the same conversion as sparse brute force and search with an inverted index instead.

Hashing methods: with Hamming (512 Bits), we first project each datapoint onto 512 Rademacher vectors and binarize the projected values with median thresholding. We use these 512-bit binary codes to perform Hamming distance search and retrieve the top 5k points, from which the required top 20 are retrieved via exact search.

Dense only methods: with Dense PQ, Reordering 10k, we apply PQ indexing to only the dense component. From the index, we fetch top 10k datapoints and then return top 20 via exact search.

Sparse only methods: with Sparse Inverted Index, No Reordering, we simply retrieve the top 20 from the sparse component with an inverted index and return them. With Sparse Inverted Index, Reordering 20k, we retrieve the top 20k from the sparse component with an inverted index, then compute exact inner products for those 20k and return the top 20.

Table 2 shows the comparison of different methods on public datasets. It is clear that the proposed method gives very fast search with high recall. The results on our large scale QuerySim dataset are summarized in Table 3. The latency and recall are measured on a random sample of 5M datapoints from the original 1B datapoints in order to compare with other more expensive techniques. The proposed hybrid approach is over 20x faster than the closest competition in the high recall regime.

Scalability: Extrapolating these results to the all-pair search scenario (i.e., both the query and database sets are the same 1B datapoints), with the same number of CPU cores, the exact sparse brute force method would take about 9 years to complete the 1B x 1B search on the QuerySim dataset, while the sparse inverted index would take about 3 months. Our proposed hybrid approach is able to complete it in under 1 week, matching the weekly update requirement from production.

Online Search: To support online retrieval applications with stringent latency requirements, we have also developed a distributed similarity search system. We divide the dataset randomly into 200 shards and allocate 200 servers, each of which loads a single shard into memory. For a single query, our system achieves recall@20 in the 90s at an average latency of 79ms.

Dataset                                Netflix Hybrid        Movielens Hybrid
#datapoints                            -                     -
#Dims (Dense / Sparse)                 300 / -               300 / -
Algorithm                              Time     Recall       Time     Recall
Dense Brute Force                      3464     100%         1242     100%
Sparse Brute Force                     905      100%         205      100%
Sparse Inverted Index                  63.9     100%         15.7     100%
Hamming (512 bits)                     16.0     9%           11.5     20%
Dense PQ, Reordering 10k               52.2     98%          29.4     100%
Sparse Inverted Index, No Reordering   22.8     29%          5.1      98%
Sparse Inverted Index, Reordering 20k  96.8     70%          49.0     100%
Hybrid (ours)                          18.8     91%          2.6      92%
Table 2. Hybrid search performance on the public datasets. All timings are reported in ms (per query) and recall is measured at top 20.
Dataset                                QuerySim
#datapoints (sampled)                  5M
#Dims (Dense / Sparse)                 203 / ~1 billion
Algorithm                              Time (ms)    Recall@20
Dense Brute Force                      OOM          OOM
Sparse Brute Force                     9655         100%
Sparse Inverted Index                  406          100%
Hamming (512 bits)                     59.5         0%
Dense PQ, Reordering 10k               39.8         45%
Sparse Inverted Index, No Reordering   58.6         0%
Sparse Inverted Index, Reordering 20k  102          30%
Hybrid (ours)                          20.0         91%
Table 3. Hybrid search performance with 5M datapoints sampled from the QuerySim dataset. The dense brute force method runs out of memory due to the billion-dimensional dense representation.

8. Conclusion

We have introduced a novel yet challenging search scenario: performing efficient search in high dimensional sparse-dense hybrid spaces. To achieve high performance and accuracy, we proposed a fast technique based on approximating the inner product similarity. We also described extensive optimizations to the in-memory index structures for both sparse and dense vectors that take advantage of modern computer architectures. We have shown the advantages of our approach in a large-scale real-world search scenario, achieving more than 10x speedup with recall in the 90s.

With this fast and accurate approximation method, we hope that novel information retrieval systems that operate directly on sparse and dense features can be developed to unlock the value of hybrid data at massive scale.

References

  • Andoni et al. (2015) Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems. 1225–1233.
  • André et al. (2015) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2015. Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan. Proceedings of the VLDB Endowment 9, 4 (2015), 288–299.
  • André et al. (2017) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2017. Accelerated Nearest Neighbor Search with Quick ADC. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR ’17). 159–166.
  • Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. 2014. Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 931–938.
  • Babenko and Lempitsky (2016) A. Babenko and V. Lempitsky. 2016. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2055–2063. https://doi.org/10.1109/CVPR.2016.226
  • Bayardo et al. (2007) Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web. 131–140.
  • Bennett et al. (2007) James Bennett and Stan Lanning. 2007. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD.
  • Bentley (1975) Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
  • Billsus and Pazzani (1998) Daniel Billsus and Michael J. Pazzani. 1998. Learning Collaborative Information Filters. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98). 46–54.
  • Blalock and Guttag (2017) Davis W. Blalock and John V. Guttag. 2017. Bolt: Accelerated Data Mining with Fast Vector Compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 727–735.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). 7–10.
  • Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory, 2nd Edition. Wiley-Interscience.
  • Dasgupta and Freund (2008) Sanjoy Dasgupta and Yoav Freund. 2008. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, 537–546.
  • Ge et al. (2014) T. Ge, K. He, Q. Ke, and J. Sun. 2014. Optimized Product Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 4 (April 2014), 744–755. https://doi.org/10.1109/TPAMI.2013.240
  • Gersho and Gray (2012) Allen Gersho and Robert M Gray. 2012. Vector quantization and signal compression. Vol. 159. Springer Science & Business Media.
  • Gray ([n. d.]) Frank Gray. [n. d.]. Pulse code communication. ([n. d.]). US Patent 2,632,058.
  • Guo et al. (2016) Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based Fast Inner Product Search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. 482–490. http://jmlr.org/proceedings/papers/v51/guo16a.html
  • Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, 19:1–19:19.
  • Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 604–613.
  • Jegou et al. (2011) Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33, 1 (2011), 117–128.
  • Kalantidis and Avrithis (2014) Yannis Kalantidis and Yannis Avrithis. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2329–2336.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Vol. 32. 1188–1196.
  • Ley (2002) Michael Ley. 2002. The DBLP computer science bibliography: Evolution, research issues, perspectives. In String Processing and Information Retrieval. 1–10.
  • Liu et al. (2005) Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems. 825–832.
  • Lynch (2008) Clifford Lynch. 2008. Big Data: How do your data grow?. In Nature, Vol. 455. 28–9.
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
  • Martinez et al. (2014) Julieta Martinez, Holger H. Hoos, and James J. Little. 2014. Stacked Quantizers for Compositional Vector Compression. CoRR abs/1411.2173 (2014). http://arxiv.org/abs/1411.2173
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. 3111–3119.
  • Moffat and Zobel (1996) Alistair Moffat and Justin Zobel. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems (TOIS) 14, 4 (1996), 349–379.
  • Norouzi and Fleet (2013) Mohammad Norouzi and David J Fleet. 2013. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3017–3024.
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep Sentence Embedding Using Long Short-term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24, 4, 694–707.
  • Shan et al. (2016) Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-Scale Modeling Without Manually Crafted Combinatorial Features. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 255–262.
  • Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems. 2321–2329.
  • Wang et al. (2016) Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. 2016. Learning to hash for indexing big data survey. Proc. IEEE 104, 1 (2016), 34–57.
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. CoRR abs/1708.05123 (2017). http://arxiv.org/abs/1708.05123
  • Wong and Lee (1993) Wai Yee Peter Wong and Dik Lun Lee. 1993. Implementations of partial document ranking using inverted files. Information Processing & Management 29, 5 (1993), 647–669.
  • Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 3119–3125.
  • Yianilos (1993) Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, Vol. 93. 311–321.
  • Zhang et al. (2014) Ting Zhang, Chao Du, and Jingdong Wang. 2014. Composite Quantization for Approximate Nearest Neighbor Search.. In ICML. 838–846.
  • Zobel and Moffat (2006) Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM computing surveys (CSUR) 38, 2 (2006), 6.