1. Introduction
An astronomical amount of data has been produced by the human race in recent years, compared with all of past history (Lynch, 2008). It is valuable to understand and process this data using data mining and machine learning. Notably, in practice the data is generated from diverse sources: for example, it may come from web services, mobile devices, location services or digital imagery. The feature vectors representing these records have varying properties, e.g., dimensionality (high or low), sparsity (sparse or dense) and modality (categorical or continuous).
We are interested in developing effective methods to process, index and search large-scale datasets of a heterogeneous nature. Specifically, we focus on datasets consisting of feature vectors that combine sparse and dense features. Such representations are increasingly popular in several applications. For instance, documents have traditionally been represented as vectors of normalized n-grams (e.g., unigrams and bigrams). It is easy to have millions or billions of n-grams, leading to high-dimensional representations. For each document only a few such n-grams are present, yielding a very sparse vector representation. In contrast, with the widespread use of deep learning techniques, it is common to represent text as embeddings, e.g., word2vec
(Mikolov et al., 2013) or LSTM hidden states (Palangi et al., 2016). Such embeddings are typically low-dimensional (in the hundreds) but dense, i.e., all features are nonzero. Both n-gram (sparse) and embedding (dense) representations have pros and cons. Embeddings capture correlations among n-gram features, which is not possible with direct n-gram representations. On the other hand, direct n-grams can "memorize" specific rare features that are helpful in document representation. To get the best of both worlds, concatenations of sparse and dense features are increasingly used in learning tasks. For instance, in the recent "Wide-and-Deep" framework (Cheng et al., 2016), data has both a dense component (from the "deep" neural network) and a sparse component (from the "wide" sparse logistic regression). Fast computation of inner products (or equivalently cosine similarity^1) in sparse-dense hybrid spaces is useful in several applications:
^1 Cosine similarity between two vectors is equivalent to their inner product after normalizing the vectors to unit L2 norm.
Finding similar items in a hybrid dataset such as DBLP (Ley, 2002);

Extreme classification, where the classifier makes use of both dense and sparse features while the number of classes is large, as commonly seen in recommendation tasks (Shan et al., 2016; Wang et al., 2017).
1.1. Insufficiency of the existing methods
Historically, disjoint approaches to efficient inner product search have been applied to sparse and dense datasets. The most popular way to compute inner products for high-dimensional sparse vectors is based on variants of inverted indices (Manning et al., 2008), which exploit the sparsity of nonzero ("active") dimensions in the data. However, the performance of inverted indices degrades quickly when even a few dimensions are dense. In contrast, search in dense data is usually based on either tree-based or hashing-based techniques. For example, KD-Trees (Bentley, 1975), VP-Trees (Yianilos, 1993) and Spill-Trees (Liu et al., 2005) have been widely used for lower data dimensionalities (typically less than one hundred). For higher-dimensional data, hashing methods (Wang et al., 2016; Shrivastava and Li, 2014; Andoni et al., 2015) or quantization methods (Jegou et al., 2011; Guo et al., 2016; Babenko and Lempitsky, 2014; Kalantidis and Avrithis, 2014; Zhang et al., 2014; Norouzi and Fleet, 2013) are more popular. But none of these techniques scales to billions of dimensions without significantly degrading search recall.
Another approach is to search the sparse and dense spaces separately, using appropriate data structures, and then combine the results. However, such heuristics can easily fail when the most query-similar items in the combined space are only middling in the dense and sparse spaces individually. We also note that projection-based methods such as LSH (Indyk and Motwani, 1998) and RP-Tree (Dasgupta and Freund, 2008) can be applied to hybrid data because they first project the data onto a low-dimensional space, but such techniques often result in a sizable recall loss. Currently, no good solutions exist for efficient, high-recall search in the hybrid spaces that are the focus of this paper. We highlight the insufficiency of the existing methods with extensive experiments (Section 7). For example, for the large-scale QuerySim dataset consisting of a billion datapoints with 200 dense and 1 billion sparse dimensions (with about 200 nonzeros per vector), treating all the dimensions as dense is infeasible. Moreover, treating all the dimensions as sparse leads to abysmally slow search (a few minutes per query). This is because the dense dimensions of the dataset are active in all vectors, leading to full inverted lists for these dimensions. Similarly, projection-related methods like LSH give very poor performance even with a large number of bits.
1.2. Contributions
This paper makes the following main contributions:

introduces the challenging problem of efficient search in sparse-dense hybrid vector spaces and addresses it by developing a fast approximation of inner products in such spaces,

identifies the bottleneck of fast inner product approximation in the sparse component, and designs a novel data structure called a cache-sorted inverted index to overcome it. This data structure streamlines inner product approximation with SIMD operations on modern CPU architectures and avoids expensive cache misses (Section 3.1),

approximates the inner product in the dense component by learning data-dependent quantization in subspaces, indexed with cache-friendly Look-Up Tables (LUT16) and coupled with fast SIMD computations (Section 4.1),

proposes residual reordering, which reduces the impact of approximation error in inner products, improving the accuracy substantially (Section 5),

demonstrates the effectiveness of the proposed method on two public datasets and one very large industrial dataset, significantly outperforming competitive baselines (Section 7).
2. Our Approach
2.1. Overview
We start with a high-level overview of our approach. As mentioned before, we address the challenge of efficient search in hybrid spaces via fast approximation of inner products. Due to the decomposability of the inner product measure, one can approximate the inner product in the sparse and dense spaces separately. Formally, let us consider a hybrid vector $x = [x_S; x_D]$, with sparse subvector $x_S$ and dense subvector $x_D$, where $x_S \in \mathbb{R}^{|S|}$ and $x_D \in \mathbb{R}^{|D|}$. Given a query $q = [q_S; q_D]$, the inner product of $q$ and $x$ can be written as:

(1) $\langle q, x \rangle = \langle q_S, x_S \rangle + \langle q_D, x_D \rangle,$

where we refer to $\langle q_S, x_S \rangle$ and $\langle q_D, x_D \rangle$ as the sparse and dense inner products, respectively. Similarity search in this paper refers to finding the item(s) $x^*$ from a hybrid dataset $X$ with the highest inner product(s) with a query $q$, i.e., $x^* = \arg\max_{x \in X} \langle q, x \rangle$.
In this work, we approximate the sparse and dense inner products independently. Specifically, we propose a novel cache-sorted inverted index for the sparse inner product, and a product-code-based quantization method with a cache-efficient Look-Up Table (LUT16) for the dense inner product. The errors induced by approximate inner product computation are alleviated by a second-pass reordering with residual indices (Section 5), leading to a substantial increase in recall. We start with a description of inverted-index and product-code based inner product computation below. The proposed efficient data structures are described in Section 3.
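For concreteness, the decomposed computation of Equation 1 can be sketched in a few lines of Python (a toy illustration with our own function names, storing the sparse component as a dict and the dense component as a list; this is not the optimized implementation described later):

```python
def sparse_ip(q_s, x_s):
    """Inner product of two sparse vectors stored as {dim: value} dicts.

    Iterating over the smaller dict keeps the cost proportional to its
    number of nonzeros."""
    if len(q_s) > len(x_s):
        q_s, x_s = x_s, q_s
    return sum(v * x_s[d] for d, v in q_s.items() if d in x_s)

def dense_ip(q_d, x_d):
    """Inner product of two dense vectors stored as equal-length lists."""
    return sum(a * b for a, b in zip(q_d, x_d))

def hybrid_ip(q, x):
    """Equation 1: <q, x> = <q_S, x_S> + <q_D, x_D>."""
    return sparse_ip(q[0], x[0]) + dense_ip(q[1], x[1])

q = ({7: 1.0, 42: 2.0}, [0.5, -1.0])
x = ({42: 3.0, 99: 5.0}, [2.0, 1.0])
# sparse part: 2.0 * 3.0 = 6.0; dense part: 0.5*2.0 + (-1.0)*1.0 = 0.0
```

The rest of the paper is concerned with computing each of the two terms approximately but much faster than these exact loops.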
2.2. Inverted Index for Sparse Inner Product
Given a set of sparse vectors $X_S$ consisting of just the sparse dimensions of the hybrid set $X$, an inverted index constructs $|S|$ inverted lists, each corresponding to a dimension $s$ from the sparse dimension set $S$. The list $L_s$ consists of the indices of all the datapoints whose dimension $s$ is nonzero, together with their corresponding values for this dimension, i.e., $L_s = \{(i, x_S^{(i)}[s]) : x_S^{(i)}[s] \neq 0\}$, where $x^{(i)}$ denotes the $i$-th vector in $X$ and $x[s]$ the $s$-th element of a vector $x$. The complete inverted index is defined by $\mathcal{I} = \{L_s : s \in S\}$. The inner product of the sparse component $q_S$ of the query with the sparse component $x_S^{(i)}$ of the $i$-th dataset vector can be computed by summing over the nonzero dimensions of $q_S$:

$\langle q_S, x_S^{(i)} \rangle = \sum_{s:\, q_S[s] \neq 0} q_S[s]\, x_S^{(i)}[s].$
While iterating over each nonzero dimension $s$ of $q_S$, one can accumulate the partial inner products for all points that are active in the inverted list $L_s$, as illustrated in Figure 1. This accumulation-based approach was first suggested in the information retrieval literature (Wong and Lee, 1993; Moffat and Zobel, 1996; Zobel and Moffat, 2006), and was also employed by (Bayardo et al., 2007).
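A minimal Python sketch of this accumulation-based scoring (our own simplified naming, without any of the cache optimizations discussed in Section 3):

```python
from collections import defaultdict

def build_inverted_index(sparse_vectors):
    """Build one inverted list per dimension: dim -> [(datapoint id, value)]."""
    index = defaultdict(list)
    for i, vec in enumerate(sparse_vectors):
        for dim, val in vec.items():
            index[dim].append((i, val))
    return index

def score_query(index, q_s, num_points):
    """Accumulate partial inner products over the query's nonzero dims."""
    acc = [0.0] * num_points  # one accumulator per datapoint
    for dim, q_val in q_s.items():
        for i, x_val in index.get(dim, ()):
            acc[i] += q_val * x_val
    return acc

data = [{0: 1.0, 3: 2.0}, {3: 4.0}, {1: 5.0}]
index = build_inverted_index(data)
scores = score_query(index, {3: 1.0, 1: 2.0}, len(data))
```

Note that the accumulator array `acc` is touched in datapoint-id order within each list; the memory-access pattern of exactly this array is what Section 3 optimizes.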
Although the idea of an inverted index is fairly simple, it is nontrivial to implement a practically efficient solution. Several implementation heuristics have been proposed in the literature (Wong and Lee, 1993; Moffat and Zobel, 1996; Zobel and Moffat, 2006), but they are mostly based on assumptions about old computer architectures, such as the slowness of floating-point operations, which is no longer an issue with modern CPUs. Another line of heuristics was suggested for making disk-based access more efficient, which is again not relevant to modern distributed systems that tend to use in-memory inverted indices. In this paper, we claim that the major bottleneck in a practical inverted index implementation is memory access, and propose a new technique called cache-sorting which can give several-fold gains in sparse inner product computation. More details are given in Section 3.2.
2.3. Product Codes for Dense Inner Product
For indexing high-dimensional dense data, product codes based on quantization in subspaces (Gersho and Gray, 2012) have become very popular. Product codes were used for fast Euclidean distance based search by (Jegou et al., 2011), and together with its variants have regularly claimed top spots on public benchmarks such as GIST1M, SIFT1B (Jegou et al., 2011) and DEEP1B (Babenko and Lempitsky, 2016). Product codes were later extended to the inner-product similarity measure by (Guo et al., 2016), which showed dramatic improvements in performance over LSH-based techniques. Recently, product codes have also been used for general matrix compression (Blalock and Guttag, 2017).
We approximate the inner products for the dense component of hybrid vectors via product quantization, similar to (Guo et al., 2016). The dense dimensions are split into $K$ blocks of contiguous subvectors and vector quantization is applied to each subspace separately. Here, Vector Quantization (VQ) approximates a $d$-dimensional vector $x$ by finding its closest quantizer in a codebook $C$:

$\varphi_{VQ}(x; C) = \arg\min_{c \in C} \|x - c\|_2,$

where $C \in \mathbb{R}^{d \times m}$ is the quantization codebook with $m$ codewords.

To index a hybrid vector's dense component $x_D$, we first decompose it into $K$ subvectors, leading to its product quantization:

(2) $\varphi_{PQ}(x_D; \mathcal{C}) = [\varphi_{VQ}(x_D^{(1)}; C^{(1)}); \cdots; \varphi_{VQ}(x_D^{(K)}; C^{(K)})],$

where $x_D^{(k)}$ denotes the $k$-th subvector of $x_D$, $\mathcal{C} = \{C^{(1)}, \ldots, C^{(K)}\}$ is a collection of codebooks, and $C^{(k)}$ is the $k$-th PQ codebook with $m$ subquantizers. The codebooks are learned using k-Means in each subspace independently (Guo et al., 2016).

The indexing (quantization) procedure is illustrated graphically in Figure 2, and the dense inner product is then approximated as the inner product of $q_D$ and the product quantization of $x_D$:

(3) $\langle q_D, x_D \rangle \approx \langle q_D, \varphi_{PQ}(x_D; \mathcal{C}) \rangle = \sum_{k=1}^{K} \langle q_D^{(k)}, \varphi_{VQ}(x_D^{(k)}; C^{(k)}) \rangle.$
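The encoding and approximation steps can be sketched as follows (a toy illustration with hand-picked codebooks and our own names; real codebooks are learned with k-Means):

```python
def vq_encode(subvec, codebook):
    """Return the index of the codeword closest (in L2) to subvec."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(subvec, c))
    return min(range(len(codebook)), key=lambda j: dist2(codebook[j]))

def pq_encode(x_d, codebooks):
    """Split x_d into K contiguous subvectors and encode each one (Eq. 2)."""
    k = len(codebooks)
    d = len(x_d) // k
    return [vq_encode(x_d[i * d:(i + 1) * d], codebooks[i]) for i in range(k)]

def pq_ip(q_d, codes, codebooks):
    """Approximate <q_d, x_d> by summing subspace inner products (Eq. 3)."""
    k = len(codebooks)
    d = len(q_d) // k
    total = 0.0
    for i, code in enumerate(codes):
        sub_q = q_d[i * d:(i + 1) * d]
        total += sum(a * b for a, b in zip(sub_q, codebooks[i][code]))
    return total

# Two subspaces of 2 dims each, two codewords per subspace (toy codebooks).
books = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (0.0, 2.0)]]
codes = pq_encode([0.9, 1.1, 0.1, 1.8], books)
```

Only the integer codes are stored per datapoint; the query-dependent part of the computation is deferred to search time, which is what makes the lookup-table trick of Section 4.1 possible.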
3. Efficient Data Structures
In addition to asymptotic complexity, the success of an algorithm often depends on an efficient implementation on modern computing architectures, where memory bandwidth, cache access patterns, and utilization of instruction-level parallelism (ILP) can have an order-of-magnitude impact on performance. Well-implemented algorithms can sometimes even outperform algorithms with lower asymptotic complexity (the well-known example being matrix multiplication).
We found these factors to be critical to inner product search. Although we have described our formulation for indexing hybrid datasets, in practice an efficient implementation is nonobvious. In this section, we focus on the computational aspects of implementing hybrid inner product search.
3.1. Memory I/O and Inverted Index
Modern x86 processors arrange memory into a sequence of 64byte “cachelines”. When a program reads from a memory location that is not already cached, the CPU loads the 64byte cacheline containing that address. The CPU never reads or writes fewer than 64 bytes at a time, even if the program only utilizes a small portion of that cacheline.
In practice, query performance for sparse inverted indices is significantly more constrained by the memory bandwidth required to access the accumulator than by the rate at which the CPU performs arithmetic operations. Simply counting the expected number of cachelines touched per query provides an accurate estimate of query time.
Each accumulator cacheline can hold a fixed number of values $B$; on x86, $B$ is 16 for 32-bit accumulators and 32 for 16-bit ones. Within a given dataset, each aligned block of $B$ consecutive datapoints shares an accumulator cacheline. For a particular dimension, if any of these $B$ datapoints is nonzero, all queries active in that dimension will have to access the corresponding cacheline. The cost of processing a second datapoint within the same cacheline is negligible compared to the cost of accessing another cacheline.
The expected query cost can be estimated as follows. For a sparse dataset $X_S$ containing $N$ datapoints, let $x_S^{(i)}[s]$ be the value of the $i$-th datapoint on the $s$-th dimension, and let $p'_s$ be the probability that the $s$-th dimension is active in a randomly sampled query vector. Then

$\mathrm{Cost}(X_S) = \sum_{s \in S} p'_s \cdot \big|\{ j : \exists\, i \in [jB, (j+1)B) \text{ with } x_S^{(i)}[s] \neq 0 \}\big|,$

i.e., for each dimension, the number of accumulator cachelines containing at least one active datapoint, weighted by the probability that a query is active in that dimension. Given a sparse dataset $X_S$, our goal is to find a permutation $\pi$ for the ordering of datapoints which minimizes $\mathrm{Cost}(\pi(X_S))$, a process that we call cache sorting.
3.2. Cache Sorting
The general form of this optimization problem is known to be hard to solve. We thus propose a greedy algorithm that produces a highly efficient accumulator memory layout, even when compared with state-of-the-art approximation algorithms using LP relaxation. Moreover, the greedy approach takes only a few seconds even with millions of datapoints, while LP relaxation approaches are much more expensive.
The goal in cache sorting is to reorder the datapoint indices of shared active dimensions into long consecutive blocks as much as possible. This is done by greedily selecting the most active dimension (i.e., the dimension that is nonzero in most data vectors), and partitioning the dataset into two subsets, where the indices of the datapoints that have nonzero values for that dimension are contiguous. This is performed recursively in the order of the dimensions, from the most active to the least active dimension. The details of the algorithm are given in Algorithm 1.
Define $\pi$ as the permutation on dimensions that sorts them from the most active to the least, i.e., $|L_{\pi(k)}| \ge |L_{\pi(k+1)}|$ for all $k$. Consider indicator variables $b_k(x) = \mathbb{1}[x_S[\pi(k)] \neq 0]$ for a datapoint $x$, which indicate whether the $k$-th most popular dimension is nonzero. Then the above algorithm is conceptually the same as sorting the binary strings $(b_1(x), b_2(x), \ldots)$ in decreasing lexicographic order. In practice, we do not need to explicitly construct these indicator variables, and sorting is performed by recursive partitioning. Thus the average complexity of cache sorting is essentially that of sorting $N$ keys, $O(N \log N)$.
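Conceptually, the recursive partitioning can be emulated by sorting datapoints on their indicator patterns, as the following sketch shows (our own simplified naming; the actual implementation uses in-place recursive partitioning rather than materializing the keys):

```python
from collections import Counter

def cache_sort_permutation(sparse_vectors):
    """Return a datapoint ordering that groups shared active dimensions.

    Dimensions are ranked from most to least active; each datapoint is keyed
    by the indicator tuple of its nonzeros on that ranking, and datapoints
    are sorted in decreasing lexicographic order of those indicators.
    """
    counts = Counter(d for v in sparse_vectors for d in v)
    # Most active dimension first; ties broken by dimension id for determinism.
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    def key(i):
        return tuple(1 if d in sparse_vectors[i] else 0 for d in ranked)
    return sorted(range(len(sparse_vectors)), key=key, reverse=True)

data = [{5: 1.0}, {2: 1.0, 5: 2.0}, {2: 3.0}, {2: 1.0, 5: 1.0}]
order = cache_sort_permutation(data)
```

After permutation, all datapoints that are active in the most popular dimension occupy one contiguous prefix, so their accumulator entries share as few cachelines as possible.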
There are many conceivable modifications to this basic approach (Algorithm 1). For example, one can imagine sorting in Gray-code order (Gray, [n. d.]), arranging the permuted indices so that datapoints agreeing on the current indicator bit remain consecutive across partition boundaries. In practice, such modified variants often do not make a big difference in performance, and we used the simple decreasing order in our experiments. In addition, we use an optimized implementation of prefix sorting which, instead of sorting the original dataset, takes advantage of the inverted index layout, requiring just 16 bytes per datapoint of temporary memory to efficiently compute the datapoint index permutation vector.
3.3. Efficiency of Cache Sorting
Let us view the unsorted sparse component as a sparse matrix $X_S$ with $N$ datapoints and $|S|$ dimensions, and suppose each entry of the matrix is independent. Let the $s$-th dimension of the $i$-th datapoint, denoted $x_S^{(i)}[s]$, be nonzero with probability $p_s$. Similarly, let the $s$-th dimension of a query vector be nonzero with probability $p'_s$.
Without loss of generality, let us assume that the dimensions are numbered $1, 2, \ldots, |S|$ and that the probabilities of a datapoint's entries being nonzero are sorted in decreasing order, i.e., $p_1 \ge p_2 \ge \cdots \ge p_{|S|}$. For most real-world data, it is commonly observed that these probabilities tend to be distributed according to a power law, i.e., $p_s \propto s^{-\alpha}$ for some $\alpha > 0$.
If a block of $B$ values is completely zero, we can simply skip it. Hence, the expected number of cachelines we need to access for a given dimension $s$ is $\frac{N}{B}\big(1 - (1 - p_s)^B\big)$. The expected number of cacheline accesses over a random query, $\mathbb{E}[\mathrm{Cost}(X_S)]$, is:

(4) $\mathbb{E}[\mathrm{Cost}(X_S)] = \sum_{s \in S} p'_s \cdot \frac{N}{B}\big(1 - (1 - p_s)^B\big).$
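As a quick numeric illustration (our own, with hypothetical parameters), the per-dimension expectation $\frac{N}{B}(1 - (1-p_s)^B)$, weighted by query activity, can be evaluated directly:

```python
def expected_cachelines(n_points, block, p_data):
    """Expected number of cachelines holding at least one nonzero in a
    dimension where each of n_points entries is nonzero with prob p_data."""
    n_blocks = n_points / block
    return n_blocks * (1.0 - (1.0 - p_data) ** block)

def expected_cost(n_points, block, p_data_dims, p_query_dims):
    """Sum the per-dimension expectation weighted by query activity."""
    return sum(pq * expected_cachelines(n_points, block, pd)
               for pd, pq in zip(p_data_dims, p_query_dims))

# A fully dense dimension touches every cacheline of the accumulator:
full = expected_cachelines(1_000_000, 16, 1.0)
# A sparse dimension (0.1% nonzero) touches roughly 1.6% of them:
sparse = expected_cachelines(1_000_000, 16, 0.001)
```

This makes concrete why even a handful of dense dimensions dominates query cost: one dense dimension costs as much as dozens of moderately sparse ones.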
After learning the permutation $\pi$ using the cache-sorting procedure on $X_S$, let us denote the permuted matrix as $\pi(X_S)$. The number of contiguous nonzero blocks in dimension $s$ is at most $2^{s-1}$, and the average number of cachelines each such block occupies is $\frac{N p_s}{2^{s-1} B}$. In the worst case, no two blocks can fit in one cacheline. Then, an upper bound on the expected number of cachelines accessed is:

(5) $\mathbb{E}[\mathrm{Cost}(\pi(X_S))] \le \sum_{s \in S} p'_s \cdot \min\Big(\frac{N p_s}{B} + 2^{s-1},\ \frac{N}{B}\big(1 - (1 - p_s)^B\big)\Big).$
For simplicity, we assume $p'_s = p_s$. In Figure 3(a), we fix the number of datapoints $N$ and the power-law exponent $\alpha$, set $B = 16$ (64-byte cachelines with 32-bit floats), and plot the expected fraction of cachelines accessed in the unsorted inverted index and the upper bound on the fraction in the cache-sorted inverted index at each dimension $s$. With cache-sorting, the expected number of cacheline accesses at each dimension, as well as the total expected number of accesses (the area under the curve), drop significantly.
In Figure 3(b), we study the effect of the parameters $N$, $\alpha$ and $B$. As $N$ increases, the "active" dimensions will contain more datapoints, resulting in a larger cache-sorting impact. The amount of saving also increases with the cacheline capacity $B$, which motivates us to quantize values in the inverted index to reduce the memory footprint.
The analysis above is somewhat pessimistic, as it assumes zero correlation between the activity patterns of different dimensions. Most real-world datasets have significant correlations among dimensions, e.g., several sets of dimensions tend to be active together. Empirically, we have observed over 10x improvement in throughput on several real-world datasets from the proposed cache-sorting. In addition to reducing the total number of cachelines read from main memory, cache-sorting creates, for active dimensions, long sequential runs of nonzeros that can be encoded efficiently and computed with vectorized SIMD instructions, further increasing the throughput.
4. Indexing and Search
4.1. Dense Component
As described in Section 2.3, we use a product-code-based approach for fast inner product approximation in the dense space. Given Product Quantization (PQ) codebooks $\{C^{(k)}\}_{k=1}^{K}$ and a dense vector $x_D$, the indices of the nearest centers that each of the $K$ subvectors is assigned to are concatenated together into a sequence of $K$ integer codes. If there are $m$ centers in each subspace codebook, the overall code length is $K \log_2 m$ bits.
4.1.1. Asymmetric Distance Computation (ADC)
ADC allows us to compute the inner product between a quantized dense component and an unquantized query with high efficiency and accuracy. Observe that in each subspace $k$, $\varphi_{VQ}(x_D^{(k)}; C^{(k)})$ is essentially the index of one of the $m$ vectors from codebook $C^{(k)}$. Thus, to approximate the subproduct $\langle q_D^{(k)}, \varphi_{VQ}(x_D^{(k)}; C^{(k)}) \rangle$, one simply precomputes a lookup table $T_k$ of the $m$ inner products between $q_D^{(k)}$ and the codewords of $C^{(k)}$, for each subspace, based on the input query $q_D$. The subproduct in subspace $k$ is then a single lookup in $T_k$. The overall inner product is the sum of all $K$ subproducts.
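A sketch of the table construction and lookup (illustrative Python with our own names; the production version uses the SIMD implementation described next):

```python
def build_luts(q_d, codebooks):
    """Precompute, per subspace, the inner product of the query subvector
    with every codeword (the tables T_k described above)."""
    k = len(codebooks)
    d = len(q_d) // k
    luts = []
    for i, book in enumerate(codebooks):
        sub_q = q_d[i * d:(i + 1) * d]
        luts.append([sum(a * b for a, b in zip(sub_q, c)) for c in book])
    return luts

def adc_ip(codes, luts):
    """Asymmetric distance computation: one table lookup per subspace."""
    return sum(lut[c] for c, lut in zip(codes, luts))

# Two subspaces of 2 dims, two codewords each (toy codebooks).
books = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (0.0, 2.0)]]
luts = build_luts([1.0, 2.0, 3.0, 4.0], books)
```

The tables are built once per query; scoring each datapoint then costs only $K$ lookups and additions regardless of the original dense dimensionality.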
4.1.2. AVX2 InRegister Lookup Table
An in-memory lookup table with 256 codewords (LUT256) was proposed in (Jegou et al., 2011; Ge et al., 2014; Martinez et al., 2014). Our implementation uses 16 codewords (LUT16), as also reported in (André et al., 2015, 2017; Blalock and Guttag, 2017). While this results in slightly lower per-bit database encoding efficiency, LUT16 can be implemented via an in-register lookup table using the PSHUFB x86 instruction. This instruction was introduced in the SSSE3 instruction set, operating on 128-bit registers, and later extended to 256-bit registers in the AVX2 instruction set. Each AVX2 PSHUFB instruction performs 32 parallel 16-way lookups of 8-bit values. To perform accumulation without integer overflow, each register of 32 8-bit values must be extended to two registers of 16 16-bit values, followed by PADDW instructions that perform parallel addition.
In this work, we present a significant optimization of the 8-bit to 16-bit width-extension operation, described as follows. First, we observe that we can implement width-extension of unsigned integers more efficiently than of signed integers. When constructing lookup tables, we bias the quantized lookup values from [-128, +127] to [0, 255], perform unsigned accumulation, and then subtract the net effect of this bias from the final values.
Unsigned width extension (aka "zero-extension") from 1 register of 8-bit values into 2 registers of 16-bit values can be accomplished using 2 instructions, as follows. Each 16-bit slot in the source register contains two of our 8-bit values. The PSRLW instruction shifts the source register right by 8 bits, isolating the high byte of each slot, and the PAND instruction zeros out the upper 8 bits, isolating the low byte.
We eliminate the PAND instruction, using the original register of 8bit values asis. This produces a result that is wrong by a large but known value; then we restore the correct value during a postprocessing step. This works despite the fact that the unzeroed bits cause our 16bit accumulators to overflow many times. Overflows during addition are perfectly matched by a corresponding underflow during subtraction.
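The cancellation argument can be verified with a toy model of 16-bit modular arithmetic (our own illustration, not the SIMD code): adding a constant garbage term on every accumulation and subtracting its net effect afterwards recovers the correct sum, despite many intermediate overflows:

```python
MASK16 = 0xFFFF  # a 16-bit accumulator wraps modulo 2**16

def accumulate_with_garbage(values, garbage):
    """Add each value together with a constant garbage term in the upper
    bits, letting the 16-bit accumulator overflow freely."""
    acc = 0
    for v in values:
        acc = (acc + v + garbage) & MASK16
    return acc

values = [200, 7, 65, 190] * 100        # 400 additions, plenty of overflow
garbage = 0x5500                        # junk carried in the upper byte
acc = accumulate_with_garbage(values, garbage)
# Subtract the net effect of the garbage modulo 2**16 to recover the sum.
# Recovery is exact as long as the true sum itself fits in 16 bits.
corrected = (acc - len(values) * garbage) & MASK16
```

Because addition modulo $2^{16}$ forms a group, every overflow caused by the garbage term during accumulation is perfectly matched by the corresponding underflow during subtraction.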
When operating on batches of 3 or more queries, our AVX2 implementation is able to sustain 16 lookup-accumulate operations per clock-cycle on Intel Haswell CPUs, including the overhead for post-processing the final inner products. This is 8x better than a LUT256 implementation's architectural upper bound of two scalar loads per clock-cycle. When LUT16 operates on 1 query at a time, the CPU can compute inner products faster than it can stream the PQ codes from main memory, which constrains single-query performance to the maximum memory bandwidth available.
4.1.3. Approximation Error from Product Quantization
We first recap the rate distortion bound of parallel Gaussian channels (Cover and Thomas, 2006) from information theory:
Proposition 4.1 ().
The lower bound of the expected squared error (MSE) achievable with $b$ bits of quantization for $d$ i.i.d. Gaussian dimensions, each of which has variance $\sigma^2$, is: $D(b) = d\,\sigma^2\, 2^{-2b/d}$.

In practice, with Lloyd's k-Means algorithm on Gaussian white noise (Ge et al., 2014), we can achieve an error within a few percent of this theoretical lower bound. For inner product computation, we can always whiten the dense component by multiplying with a whitening matrix; at query time, $q_D$ is also multiplied by the same matrix. This is a special case of QUIPS (Guo et al., 2016), where the query distribution is the same as the datapoint distribution. So from this point on, we assume that, with suitable seeding, we can always achieve an MSE within a small constant factor of this lower bound. We then present the following bound using Azuma's inequality:

Proposition 4.2 ().
The difference between the exact inner product $\langle q_D, x_D \rangle$ and its approximation is bounded by $t$ with probability:

$\Pr\Big[\big|\langle q_D, x_D \rangle - \langle q_D, \varphi_{PQ}(x_D; \mathcal{C}) \rangle\big| \le t\Big] \ge 1 - 2\exp\Big(-\frac{t^2}{2\,U^2 \sum_{k=1}^{K} \|q_D^{(k)}\|^2}\Big),$

where $x_D^{(k)}$, $q_D^{(k)}$ and $\varphi_{PQ}(x_D)^{(k)}$ are the $k$-th subvectors of $x_D$, $q_D$ and $\varphi_{PQ}(x_D)$, respectively, under product quantization, and $U$ is an upper bound on the subspace quantization residual norm $\|x_D^{(k)} - \varphi_{VQ}(x_D^{(k)}; C^{(k)})\|$. The probability is defined over the randomness in the query.
We also note that the upper bound $U$ is taken over all datapoints and all subspaces $k$.
4.2. Sparse Component
We can generate an even sparser representation of an already sparse dataset by pruning small entries while maintaining an accurate approximation to inner products. Pruning reduces the size of the index and the number of arithmetic operations during search. Sparser data also helps cache sorting generate a more contiguous memory layout of the index.
We first observe that, on average, only a small number of dimensions contribute nonzero products to the overall inner product between two sparse vectors. Also, in each dimension, the occurrence of large absolute values is rare across a dataset. These two phenomena combined make the contribution from small entries to the overall inner product much less significant than that from large entries, especially if our goal is to identify the largest inner products. Mathematically, we represent a sparse vector as $x_S = \{(s, x_S[s]) : x_S[s] \neq 0\}$. Given a set of thresholds $\{T_s\}$, one for each dimension, pruning outputs a sparser representation: $\hat{x}_S = \{(s, x_S[s]) : |x_S[s]| \ge T_s\}$.
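A sketch of per-dimension pruning and its residual (illustrative Python with our own names):

```python
def prune(x_s, thresholds, default=0.0):
    """Keep only entries whose magnitude reaches the per-dimension threshold."""
    return {d: v for d, v in x_s.items()
            if abs(v) >= thresholds.get(d, default)}

def residual(x_s, pruned):
    """Entries removed by pruning; pruned + residual reproduces x_s exactly."""
    return {d: v for d, v in x_s.items() if d not in pruned}

x = {1: 0.05, 4: 2.0, 9: -1.5, 12: 0.01}
kept = prune(x, {1: 0.1, 4: 0.1, 9: 0.1, 12: 0.1})
```

Note that pruning is lossless in combination with its residual, which is what the reordering step of Section 5 exploits.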
4.2.1. Approximation Error from Pruning
We now investigate the error in inner product approximation due to pruning. We proceed with the assumption that both $q_S$ and $x_S$ are generated by the following process:

Each dimension $s$ is independently nonzero with probability $p_s$ and zero with probability $1 - p_s$.

If nonzero, the value for dimension $s$ is sampled from a distribution $D_s$ with bounded support.
The error of pruning can be characterized with a Chernoff bound:
Proposition 4.3 ().
The difference between the exact inner product $\langle q_S, x_S \rangle$ and its approximation $\langle q_S, \hat{x}_S \rangle$ is bounded by $t$ with high probability, where $\hat{x}_S$ is the pruning output of $x_S$.
The probability of two vectors having nonzero values in the same dimension is typically low, i.e., $p_s$ is small for every $s$. So the above probability approaches 1 asymptotically at the rate of $1 - e^{-\Omega(m)}$, where $m$ is the minimum number of entries that need to be removed from the sparse component to cause an error bigger than $t$.
5. Residual Reordering for Improved Search
The index that we build provides a lossy approximation of the original component, and we define the residual as the difference between the original component and its index approximation. When retrieving datapoints from the dataset, we can overfetch candidates using the data index and then use the residuals to reorder these overfetched candidates. Due to the simple decomposition $x = \hat{x} + r$, where $\hat{x}$ is the index approximation and $r$ is the residual, we can obtain the exact inner product by adding the inner product between the query and the residual to the approximate inner product from the index. With the exact inner product, we can refine the candidates and improve our accuracy for the top candidates significantly. This allows us to reap the benefit of the fast index scanning described in previous sections while maintaining high recall at a small cost due to residual reordering. To further reduce the cost of this residual reordering step, we can also build another index on the residuals, which is much more precise than the data index. We apply this paradigm multiple times to achieve the best tradeoff between speed and accuracy:

Overfetch from both sparse and dense data indices: we take the sum of the approximate inner products from the sparse and dense components and retain the $N_1$ largest for further reordering. The parameter $N_1$ can be tuned to balance recall and search time.

Reorder with the dense residual index and retain $N_2$: we perform residual reordering with the dense residual index for the remaining $N_1$ datapoints and only retain the top $N_2$ for the last reordering. The parameter $N_2$ also needs to be tuned.

Reorder with the sparse residual index and return $h$: we perform residual reordering with the sparse residual index for the remaining $N_2$ datapoints and return the $h$ largest as the final search result.
Reordering is always performed on a small subset of residuals, and in practice its run time is much shorter than that of scanning through the sparse and dense data indices. In our similarity search benchmarks, the residual reordering logic consumes only a small fraction of the overall search time, while improving the final retrieval recall significantly.
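The overfetch-and-reorder paradigm can be sketched as follows (a simplified illustration with our own names; for brevity it reorders once with exact inner products rather than with the two residual indices):

```python
def search(approx_scores, exact_ip, n1, h):
    """Overfetch the n1 best candidates by approximate score, then reorder
    them by exact inner product and return the top h.

    approx_scores: list of approximate inner products, one per datapoint.
    exact_ip(i): exact inner product with datapoint i, i.e., the approximate
                 score plus the query-residual inner product.
    """
    candidates = sorted(range(len(approx_scores)),
                        key=lambda i: approx_scores[i], reverse=True)[:n1]
    reordered = sorted(candidates, key=exact_ip, reverse=True)
    return reordered[:h]

exact = [1.0, 9.0, 8.5, 2.0, 8.9]
approx = [1.2, 8.4, 8.6, 1.9, 9.1]   # approximation swaps the top ranks
top = search(approx, lambda i: exact[i], n1=3, h=2)
```

In this toy example, the approximate scores rank datapoint 4 first, but reordering the overfetched candidates by exact inner product recovers the true top-2 (datapoints 1 and 4) despite the approximation error.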
5.1. Retrieval Performance
In the context of similarity search, rank preservation is of great interest, and our hybrid approach with residual reordering gives the following retrieval performance guarantee:
Proposition 5.1 ().
Given a fixed query $q$, define the gap as $\Delta = \langle q, x^{(h)} \rangle - \langle q, x^{(N_1)} \rangle$. The proposed hybrid search approach achieves a recall@h of at least $1 - \delta$ as long as the following condition holds:

$\Pr\Big[\big|\langle q, x \rangle - \langle q, \hat{x} \rangle\big| < \Delta/2 \ \text{ for every datapoint } x\Big] \ge 1 - \delta,$

where $x^{(h)}$ is the $h$-th best datapoint in terms of inner product with $q$, $x^{(N_1)}$ is the $N_1$-th best datapoint, and $\hat{x}$ is the index approximation of $x$. The probability is defined over the randomness in datapoint generation.
This proposition states that recall is at least the probability of the event that the inner product approximation error is less than half the gap between the $h$-th and the $N_1$-th largest inner products. Based on the bounds from Propositions 4.2 and 4.3, we can almost always find a large enough $N_1$ that satisfies this condition, so long as the inner product distribution exhibits a quickly vanishing upper tail. For most real-world datasets, a moderate $N_1$ is empirically sufficient to achieve high recall.
6. Overall Indexing Algorithm
To efficiently approximate inner products between the datapoints $x \in X$ and a query $q$, we index the dataset as follows:

Build an index for the sparse component by pruning with per-dimension thresholds $T_s^{(1)}$:
(6) $\hat{x}_S = \{(s, x_S[s]) : |x_S[s]| \ge T_s^{(1)}\}.$
The residual is thus $r_S = x_S - \hat{x}_S$. To efficiently compute the inner product between the query and the residuals, we also build another pruned index of the residuals with thresholds $T_s^{(2)}$:
(7) $\hat{r}_S = \{(s, r_S[s]) : |r_S[s]| \ge T_s^{(2)}\}.$
Build an index for the dense component by applying product quantization (PQ) with codebooks $\mathcal{C}$ for $K$ subspaces.
Similar to sparse indexing, we build another PQ index of the residuals with codebooks $\mathcal{C}'$ for $K'$ subspaces as:
$\hat{r}_D = \varphi_{PQ}\big(x_D - \varphi_{PQ}(x_D; \mathcal{C});\ \mathcal{C}'\big).$
6.1. Parameter Selection
6.1.1. Dense Index Parameters
The first dense data index is built with codebooks designed for a relatively low bit rate, i.e., $m = 16$ codewords per subspace and 2 dimensions per subspace, which means we store 4 bits for, on average, 2 dense dimensions. This immediately achieves a data index size reduction of 16x when compared to the original data, which requires 32 bits per dimension for a floating-point number. It also enables us to apply in-register table lookup (LUT16), which outperforms alternative arithmetic or in-memory table lookup operations by at least 4x. The second residual index is built with one dimension per subspace and 256 codewords. Since we now treat each dimension as a subspace, we can directly apply scalar quantization with a distortion of at most 1/512 of the dynamic range. This residual index is a quarter of the size of the original dataset, and its error in inner product approximation is unnoticeable for our tasks.
6.1.2. Sparse Index Parameters
For fast approximation, we normally set the first threshold $T_s^{(1)}$ from Equation 6 to a relatively high value, so that only the largest nonzero values in each dimension $s$ are kept in the data index. This high threshold reduces the size of the data index to about 2 orders of magnitude smaller than the original dataset, and hence achieves a significant speedup in index scan. Empirically, this hyper-sparse index induces little loss in retrieval performance. The second threshold $T_s^{(2)}$ from Equation 7 is set to a relatively low value, so that most of the nonzero values are kept in the residual index. The reason is that we only need to access the residual index for a moderate number of datapoints, so its size does not directly affect computation speed the way the size of the data index does. With this relatively accurate residual index, the sum of the data index approximation and the residual index approximation is almost exact, which helps lift recall into the high 90s (in percent).
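One simple way to realize the "keep only the largest values per dimension" policy is to derive each threshold from an order statistic of the corresponding inverted list (our own illustration; the actual threshold-selection procedure may differ):

```python
from collections import defaultdict

def thresholds_for_fraction(sparse_vectors, keep_frac):
    """For each dimension, choose T_s so that roughly keep_frac of its
    nonzero entries (the largest in magnitude) survive pruning."""
    by_dim = defaultdict(list)
    for v in sparse_vectors:
        for d, val in v.items():
            by_dim[d].append(abs(val))
    thresholds = {}
    for d, mags in by_dim.items():
        mags.sort(reverse=True)
        keep = max(1, int(len(mags) * keep_frac))
        thresholds[d] = mags[keep - 1]  # keep entries >= this magnitude
    return thresholds

data = [{0: 5.0}, {0: -3.0}, {0: 1.0}, {0: 0.5}]
t = thresholds_for_fraction(data, 0.5)   # keep the top half per dimension
```

A high `keep_frac` would be used for the accurate residual index and a very low one for the hyper-sparse data index.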
7. Experiments
7.1. Evaluation Datasets
In this work, we use two public datasets and one large-scale industrial dataset for experiments.
7.1.1. Public Datasets
We use hybrid versions of two public datasets, Netflix (Bennett et al., 2007) and Movielens (Harper and Konstan, 2015). Each dataset contains a sparse collection of user-movie-rating triplets. Our goal is to look for users in the dataset that have similar movie preferences to the users in the query set. The sparse component of a user comes directly from the set of ratings that the user provides. For the dense components, we first assemble the user-movie rating matrix $R$, whose rows represent users, whose columns represent movies, and whose values are the ratings from 1 to 5. We follow the classic collaborative filtering approach (Billsus and Pazzani, 1998) and perform Singular Value Decomposition on the sparse matrix $R \approx U \Sigma V^T$. The dense components of all users are given by $U$ weighted by the singular values $\Sigma$. So the combined hybrid vector representation of the users is $[R, U\Sigma]$, in which each row is a user embedding. The dense component dimensionality is fixed at 300. Other details can be found in Table 2. We randomly sample 10k embeddings as the query set and use the rest as the dataset.
7.1.2. QuerySim Dataset
We introduce a hybrid dataset built from a large set of the most frequent web search queries, for the task of identifying similar queries. Each query is processed through two pipelines to generate the sparse and dense components of its vector representation. The sparse pipeline generates a sparse vector representing the bag of unigrams and bigrams in the query, where each gram's weight is set to a customized value. The dense pipeline generates a dense vector representation using a neural network similar to (Le and Mikolov, 2014). We fine-tuned the relative weights of the sparse and dense inner products to optimize a more comprehensive metric, the area under the ROC curve. Finally, all datapoints are scaled with the learned relative weights and stored in a database.
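A minimal sketch of the sparse pipeline described above: hashing the unigrams and bigrams of a query into a sparse {dimension: weight} vector. The hashing scheme, dimensionality, and uniform weight are illustrative assumptions; the production system assigns customized per-gram weights.

```python
def sparse_query_vector(query, num_dims=2**30, weight=1.0):
    """Map a query string to a sparse bag-of-ngrams vector (dict of dim -> weight)."""
    tokens = query.lower().split()
    # unigrams plus adjacent-word bigrams
    grams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    vec = {}
    for g in grams:
        vec[hash(g) % num_dims] = weight  # hashed gram id -> its weight
    return vec

v = sparse_query_vector("cheap flight tickets")  # 3 unigrams + 2 bigrams
```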
The overall dataset contained one billion points, and the dimensionality of the hybrid vectors was over 1 billion (Table 1). Zooming into the sparse component, Figure 4(a) demonstrates the power-law distribution of the number of nonzeros across dimensions, and Figure 4(b) visualizes the long tail in the distribution of values of the nonzero entries.
Table 1. QuerySim dataset statistics.
#datapoints | #Dense Dims | #Active Sparse Dims
~1B | 203 | over 1B
#Avg Sparse nonzeros | On-disk Size | Update Frequency
134 | 5.8TB | Weekly
7.2. Recall and Speedup
Our goal is to compare the speed and recall of retrieving the top 20 items from the entire dataset for each technique. All single-machine benchmarks are conducted on a workstation with an Intel CPU clocked at 3.5GHz and 64GB of main memory. Distributed benchmarks are performed on a cluster of 200 servers, each equipped with an Intel CPU clocked at 2GHz and 128GB of main memory. Both CPUs support the AVX2 instruction set.
For comparison we use the following well-known baselines:
Exact methods: with Dense Brute Force, we pad 0's to the sparse component to make the dataset completely dense; with Sparse Brute Force, we append the sparse representation of the dense component to the end of the sparse component to make the dataset completely sparse; with Sparse Inverted Index, we perform the same conversion as Sparse Brute Force and search with an inverted index instead.
Hashing methods: with Hamming (512 Bits), we first project each datapoint onto 512 Rademacher vectors and binarize the projected values with median thresholding. We use these 512-bit binary codes to perform Hamming distance search and retrieve the top 5k points, from which the required top 20 are retrieved via exact search.
Dense-only methods: with Dense PQ, Reordering 10k, we apply PQ indexing to only the dense component. From the index, we fetch the top 10k datapoints and then return the top 20 via exact search.
Sparse-only methods: with Sparse Inverted Index, No Reordering, we simply retrieve the top 20 from the sparse component with an inverted index and return them; with Sparse Inverted Index, Reordering 20k, we retrieve the top 20k from the sparse component with an inverted index, compute exact inner products of the 20k, and return the top 20.
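The Hamming baseline above can be sketched as sign projections onto random Rademacher (+1/-1) vectors, binarized at the per-bit median. Function and variable names are illustrative, and the sketch keeps the codes as boolean arrays rather than packed bits.

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_codes(X, num_bits=512):
    """X: (n, d) matrix of datapoints. Returns (n, num_bits) boolean codes."""
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, num_bits))  # Rademacher projection directions
    P = X @ R                                        # projected values
    thresholds = np.median(P, axis=0)                # median thresholding per bit
    return P > thresholds

X = rng.standard_normal((100, 16))
codes = hamming_codes(X, num_bits=64)
# Hamming distance between two codes = number of differing bits
dist01 = np.count_nonzero(codes[0] ^ codes[1])
```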
Table 2 compares the different methods on the public datasets. It is clear that the proposed method gives very fast search with high recall. The results on our large-scale QuerySim dataset are summarized in Table 3. Latency and recall are measured on a random sample of 5M datapoints from the original 1B in order to allow comparison with the more expensive techniques. The proposed hybrid approach is over 20x faster than the closest competitor in the high-recall regime.
Scalability: Extrapolating these results to the all-pair search scenario (i.e., the query and database sets are both the full 1B datapoints), with a fixed number of CPU cores, exact sparse brute force would take about 9 years to complete the 1B-vs-1B search on the QuerySim dataset, while the sparse inverted index would take about 3 months. In contrast, our proposed hybrid approach completes it in under 1 week, matching the weekly update requirement from production.
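A back-of-envelope version of this extrapolation, assuming per-query cost scales linearly with database size and perfect parallelism. The core count below is purely an assumption for the sketch (the actual number used in the source is not preserved); the per-query latencies are the Table 3 measurements on the 5M sample.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def allpair_years(ms_per_query_5m, db_points=1e9, sample=5e6,
                  queries=1e9, cores=7000):
    """Estimate wall-clock years for all-pair search, scaling linearly from the 5M sample."""
    sec_per_query = ms_per_query_5m / 1000 * (db_points / sample)  # scale 5M -> 1B
    return sec_per_query * queries / cores / SECONDS_PER_YEAR

brute = allpair_years(9655)   # sparse brute force latency from Table 3
inv   = allpair_years(406)    # sparse inverted index latency from Table 3
```

Under these assumptions the estimates land in the same ballpark as the figures quoted above: several years for brute force and a few months for the inverted index.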
Online Search: To support online retrieval applications with stringent latency requirements, we have also developed a distributed similarity search system. We divide the dataset randomly into 200 shards and allocate 200 servers, each of which loads a single shard into memory. For a single query, our system achieves recall@20 at an average latency of 79ms.
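The scatter-gather step of such a sharded service can be sketched as follows: each shard returns its local top-k (score, id) pairs and the frontend merges them into the global top-k. Function and variable names are illustrative, not the system's actual API.

```python
import heapq

def merge_topk(shard_results, k=20):
    """shard_results: iterable of per-shard result lists of (score, doc_id),
    each sorted by score descending. Returns the global top-k."""
    merged = heapq.merge(*shard_results, key=lambda x: x[0], reverse=True)
    return list(merged)[:k]

shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.8, "b1"), (0.7, "b2")]
top = merge_topk([shard_a, shard_b], k=3)
```

Because each shard's list is already sorted, the merge touches only as many elements as needed to produce the global top-k.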
Table 2. Results on the public hybrid datasets.
Dataset | Netflix Hybrid | Movielens Hybrid
#datapoints |  |
#Dense Dims | 300 | 300
#Sparse Dims |  |
Algorithm | Netflix Time | Netflix Recall | Movielens Time | Movielens Recall
Dense Brute Force | 3464 | 100% | 1242 | 100%
Sparse Brute Force | 905 | 100% | 205 | 100%
Sparse Inverted Index | 63.9 | 100% | 15.7 | 100%
Hamming (512 bits) | 16.0 | 9% | 11.5 | 20%
Dense PQ, Reordering 10k | 52.2 | 98% | 29.4 | 100%
Sparse Inverted Index, No Reordering | 22.8 | 29% | 5.1 | 98%
Sparse Inverted Index, Reordering 20k | 96.8 | 70% | 49.0 | 100%
Hybrid (ours) | 18.8 | 91% | 2.6 | 92%
Table 3. Results on the QuerySim dataset.
Dataset | QuerySim
#datapoints | 5M (sampled)
#Dense Dims | 203
#Sparse Dims |
Algorithm | Time (ms) | Recall@20
Dense Brute Force | OOM | OOM
Sparse Brute Force | 9655 | 100%
Sparse Inverted Index | 406 | 100%
Hamming (512 bits) | 59.5 | 0%
Dense PQ, Reordering 10k | 39.8 | 45%
Sparse Inverted Index, No Reordering | 58.6 | 0%
Sparse Inverted Index, Reordering 20k | 102 | 30%
Hybrid (ours) | 20.0 | 91%
8. Conclusion
We have introduced a novel yet challenging search scenario: performing efficient search in high-dimensional sparse-dense hybrid spaces. To achieve high performance and accuracy, we proposed a fast technique based on approximating inner product similarity. We also described extensive optimizations to in-memory index structures for both sparse and dense vectors that take advantage of modern computer architecture. We have shown the advantages of our approach in a large-scale real-world search scenario, achieving more than 10x speedup with recall in the 90s.
With this fast and accurate approximation method, we hope that novel information retrieval systems that operate directly on sparse and dense features can be developed to unlock the value of hybrid data at massive scale.
References
 Andoni et al. (2015) Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems. 1225–1233.
 André et al. (2015) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2015. Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan. Proceedings of the VLDB Endowment 9, 4 (2015), 288–299.
 André et al. (2017) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2017. Accelerated Nearest Neighbor Search with Quick ADC. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR ’17). 159–166.
 Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. 2014. Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 931–938.
 Babenko and Lempitsky (2016) A. Babenko and V. Lempitsky. 2016. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2055–2063. https://doi.org/10.1109/CVPR.2016.226
 Bayardo et al. (2007) Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web. 131–140.
 Bennett et al. (2007) James Bennett and Stan Lanning. 2007. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD.
 Bentley (1975) Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
 Billsus and Pazzani (1998) Daniel Billsus and Michael J. Pazzani. 1998. Learning Collaborative Information Filters. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98). 46–54.
 Blalock and Guttag (2017) Davis W. Blalock and John V. Guttag. 2017. Bolt: Accelerated Data Mining with Fast Vector Compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 727–735.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). 7–10.
 Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory, 2nd Edition. Wiley-Interscience.
 Dasgupta and Freund (2008) Sanjoy Dasgupta and Yoav Freund. 2008. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, 537–546.
 Ge et al. (2014) T. Ge, K. He, Q. Ke, and J. Sun. 2014. Optimized Product Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 4 (April 2014), 744–755. https://doi.org/10.1109/TPAMI.2013.240
 Gersho and Gray (2012) Allen Gersho and Robert M Gray. 2012. Vector quantization and signal compression. Vol. 159. Springer Science & Business Media.
 Gray ([n. d.]) Frank Gray. [n. d.]. Pulse code communication. US Patent 2,632,058.
 Guo et al. (2016) Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based Fast Inner Product Search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9–11, 2016. 482–490. http://jmlr.org/proceedings/papers/v51/guo16a.html
 Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4 (2015), 19:1–19:19.
 Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 604–613.
 Jegou et al. (2011) Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33, 1 (2011), 117–128.
 Kalantidis and Avrithis (2014) Yannis Kalantidis and Yannis Avrithis. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2329–2336.
 Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, Vol. 32. 1188–1196.
 Ley (2002) Michael Ley. 2002. The DBLP computer science bibliography: Evolution, research issues, perspectives. In String Processing and Information Retrieval. 1–10.
 Liu et al. (2005) Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems. 825–832.
 Lynch (2008) Clifford Lynch. 2008. Big Data: How do your data grow?. In Nature, Vol. 455. 28–9.
 Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
 Martinez et al. (2014) Julieta Martinez, Holger H. Hoos, and James J. Little. 2014. Stacked Quantizers for Compositional Vector Compression. CoRR abs/1411.2173 (2014). http://arxiv.org/abs/1411.2173
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. 3111–3119.
 Moffat and Zobel (1996) Alistair Moffat and Justin Zobel. 1996. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems (TOIS) 14, 4 (1996), 349–379.
 Norouzi and Fleet (2013) Mohammad Norouzi and David J Fleet. 2013. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3017–3024.
 Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24, 4 (2016), 694–707.
 Shan et al. (2016) Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-Scale Modeling Without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 255–262.
 Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems. 2321–2329.
 Wang et al. (2016) Jun Wang, Wei Liu, Sanjiv Kumar, and ShihFu Chang. 2016. Learning to hash for indexing big data survey. Proc. IEEE 104, 1 (2016), 34–57.
 Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. CoRR abs/1708.05123 (2017). http://arxiv.org/abs/1708.05123
 Wong and Lee (1993) Wai Yee Peter Wong and Dik Lun Lee. 1993. Implementations of partial document ranking using inverted files. Information Processing & Management 29, 5 (1993), 647–669.
 Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 3119–3125.
 Yianilos (1993) Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, Vol. 93. 311–321.
 Zhang et al. (2014) Ting Zhang, Chao Du, and Jingdong Wang. 2014. Composite Quantization for Approximate Nearest Neighbor Search.. In ICML. 838–846.
 Zobel and Moffat (2006) Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM computing surveys (CSUR) 38, 2 (2006), 6.