Scalable Image Retrieval by Sparse Product Quantization

03/15/2016 ∙ by Qingqun Ning, et al. ∙ Singapore Management University

Fast approximate nearest neighbor (ANN) search for high-dimensional feature indexing and retrieval is the crux of large-scale image retrieval. A recent promising technique is Product Quantization, which indexes high-dimensional image features by decomposing the feature space into a Cartesian product of low-dimensional subspaces and quantizing each of them separately. Despite the promising results reported, this quantization approach follows the typical hard assignment of traditional quantization methods, which may result in large quantization errors and thus inferior search performance. Unlike the existing approaches, in this paper we propose a novel approach called Sparse Product Quantization (SPQ) to encode high-dimensional feature vectors into a sparse representation. We optimize the sparse representations of the feature vectors by minimizing their quantization errors, making the resulting representation essentially close to the original data in practice. Experiments show that the proposed SPQ technique is not only able to compress data but is also an effective encoding technique. We obtain state-of-the-art results for ANN search on four public image datasets, and the promising results of content-based image retrieval further validate the efficacy of our proposed method.




I Introduction

Image retrieval is an important technique for many multimedia applications, such as face retrieval [19], object retrieval [7], and landmark identification [6]. For large-scale image retrieval tasks, one of the key components is an effective indexing method for similarity search [47, 5], particularly on high-dimensional feature space [24, 48, 36, 42]. Similarity search, a.k.a. nearest neighbor (NN) search, is a fundamental problem. Due to the curse of dimensionality, exact NN search for high-dimensional data is extremely challenging and expensive. To overcome the issue, extensive research efforts have been devoted to approximate nearest neighbor (ANN) search methods, such as hashing [46, 22, 8], tree-based methods [39, 27, 28], and vector quantization [13, 15], which attempt to find the nearest neighbor with high probability using much less searching time and memory cost.

In this paper, we focus on developing a vector quantization (VQ) method for similarity search, which is a typical approach to effectively encoding the data for ANN search. A codebook is learnt, and every feature vector in the database is represented by its most similar vector in the codebook, typically called a "codeword". VQ then directly employs the index of the codeword to represent the original data vector, which typically takes only a few bits. In addition, the similarity between a query and a data vector in the database can be approximated by calculating the distance between the query and the corresponding codeword. This greatly reduces the computational cost and searching time.

Fig. 1: The system view of Sparse Product Quantization for ANN search.

In general, VQ requires more bits in order to reduce quantization distortion. Since the size of the codebook increases exponentially with the total number of encoded bits, VQ-based methods are ineffective for data with high dimensionality. To tackle this issue, Product Quantization (PQ) [15] has recently been shown to be a promising paradigm for efficiently indexing high-dimensional image features. Different from hashing-based methods, it decomposes the high-dimensional space into a Cartesian product of low-dimensional subspaces and quantizes each of them separately. Since the dimensionality of each subspace is relatively small, a small codebook is sufficient to obtain satisfactory search performance.

Although computational cost can be effectively reduced by dividing the long vector into small segments, PQ may fail to retrieve the exact nearest neighbor of a query with high probability due to its high quantization distortion. As discussed in [10], this will eventually yield lower search accuracy compared to VQ. To deal with this problem, several remedies have recently been proposed. Gong and Lazebnik [12] presented an iterative quantization approach which maps data onto binary codes for fast retrieval. Cartesian K-means [31] and Optimized Product Quantization [9] share the same idea of rotating the original data to minimize the quantization error. These methods, including PQ, essentially follow the same framework of vector quantization, and all suffer from an inevitable, nontrivial quantization distortion.

To address the above limitations, in this paper we propose a novel approach called Sparse Product Quantization (SPQ) to encode the high-dimensional vectors of image features, where the sparse coding technique is introduced into approximate nearest neighbor search. Motivated by soft assignment [35], we find a sparse representation for each segment of the feature vector rather than the hard assignment used in PQ. Specifically, a feature vector is decomposed into a Cartesian product of low-dimensional subspaces, where the short vector in each subspace is approximated by a linear combination of several vectors from the codebook. Fig. 1 illustrates the overview of SPQ for ANN search. We formulate the encoding stage as a sparse optimization problem and solve it by employing a popular greedy algorithm. The Euclidean distance between two vectors can be efficiently estimated from their sparse product quantization through simple table lookups. Moreover, the proposed method is able to take advantage of the very efficient SSE implementation using SIMD instructions, which can greatly reduce the computational overhead. Thus, the computational time of our method is comparable to that of PQ, while the precision of SPQ exceeds that of PQ by a very large margin. In contrast to the computationally intensive clustering algorithm used in all the VQ-based paradigms, we employ the sparse structure along with the fast stochastic online algorithm [25, 26] to efficiently generate the codebook, which optimizes the sparse representation of data vectors according to their quantization errors. Consequently, the proposed representation is essentially close to the original data in practice even with a few basis vectors. The empirical evaluation demonstrates that the presented method yields state-of-the-art ANN search results and outperforms the popular approaches on the application of image retrieval.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the basics of VQ and proposes our sparse vector quantization. Section 4 presents the proposed sparse product quantization for ANN search. Section 5 discusses our experimental results in detail, and finally Section 6 concludes this work.

II Related Work

Fast NN search is a fundamental research topic that has been extensively studied in areas such as multimedia applications, image classification, and machine learning. Our work is related to approximate NN search methods, which can be roughly grouped into three categories: hashing-based methods [46, 4, 44], KD-tree [3], and vector quantization [15, 9].

Hashing-based ANN search approaches have received a lot of attention. Most of them employ either random projection or learning-based methods to generate compact binary codes. As a consequence, the similarity between two data vectors is approximately represented by the Hamming distance of their hashed codes. Random projection is an effective approach which preserves pairwise distances for data points. The most representative example is Locality Sensitive Hashing (LSH) [8, 38]. According to the Johnson-Lindenstrauss Theorem [17], LSH needs $O(\ln n / \epsilon^2)$ random projections to preserve the pairwise distances of $n$ points, where $\epsilon$ is the relative error. Hence, LSH needs to employ codes with long bit length in order to boost the projection performance, which leads to both high computational cost and huge storage requirements. On the other hand, learning-based hashing methods [46, 4, 44] try to learn the structure of the input data. Most of these algorithms generate the binary codes by employing the spectral properties of the data affinity matrix, i.e., item-item similarity. Some other hashing methods also employ multi-modal data [47] or semantic information [22]. Despite achieving promising gains with relatively short codes, these methods often fail to make significant improvement as the code length increases [18].

Fig. 2: 2D toy example of sparse vector quantization. $x$ and $y$ denote the query and a vector in the gallery. $q(x)$ and $q(y)$ represent their quantization vectors. Instead of using hard assignment to the nearest center, we employ a sparse representation over the codebook with few words. Thus, $q(y)$ is the projection of $y$ on the line spanned by $c_i$ and $c_j$. As $\|y - q(y)\| \le \|y - c_i\|$, the distortion of our method is never larger than that of VQ. (a) Asymmetric Distance Computation (ADC); and (b) Symmetric Distance Computation (SDC).

The second group of research aims at speeding up ANN search with the KD-tree [3]. The expected complexity of KD-tree search is $O(\log n)$, while that of brute-force search is $O(n)$. Unfortunately, for high-dimensional data KD-trees are not much more efficient than brute-force exhaustive search [45] due to the curse of dimensionality. Nevertheless, both randomized KD-trees [40, 21] and hierarchical K-means [29] improve the performance of the KD-tree. In particular, these two methods are included in FLANN [27, 28], which automatically selects the best algorithm and optimal parameters depending on the dataset. FLANN is much faster than other publicly available ANN search software. However, KD-tree approaches need full access to the data and thus cost much more memory in the searching stage.

The third group of related work is about Vector Quantization based approaches, which try to approximate data vectors with codewords in a codebook. Jégou et al. [15] recently proposed an efficient product quantization (PQ) method. The key of PQ is to decompose the feature space into a Cartesian product of low-dimensional subspaces and quantize each one separately using its corresponding predefined codebook. Then, the distance between the query and a vector in the gallery set can be computed by either symmetric distance computation (SDC) or asymmetric distance computation (ADC). Also, an inverted file system is employed to conduct non-exhaustive search efficiently. Empirically, PQ has been shown to significantly outperform various hashing-based methods in terms of accuracy. As discussed in [15], prior knowledge of the underlying structure of the input data is essential to VQ. Most recently, Ge et al. [9] treated PQ as an optimization problem that minimizes the quantization distortion by searching for the optimal codebooks and space decomposition. Due to the inherent nature of VQ [13], it is hard for these methods to evaluate the impact of quantization error on ANN search performance. We should mention that a work called Product Sparse Coding [11] was published recently. Although both works relate to the product method and sparse coding, it differs substantially from ours in that it proposes a strategy for sparse coding itself.

Finally, our work is closely related to soft-assignment [35], which has been introduced into the context of object retrieval [34] in order to reduce the quantization error. The key idea of soft-assignment is to map the original high-dimensional descriptor to a weighted combination of multiple visual words rather than hard-assigning it to a single word as in previous work [41, 34]. Still, this representation is just incorporated into a standard tf-idf architecture. Despite requiring extra storage and computational cost, soft-assignment results in lower quantization distortion and thus yields a significant improvement of retrieval performance in practice.

III Sparse Vector Quantization

In this section, we first briefly review basics of Vector Quantization, and then introduce the proposed Sparse Vector Quantization (SVQ), followed by discussing the codebook training method for SVQ.

III-A Vector Quantization

Vector Quantization (VQ) [13] is a classical technique for data compression. It divides a dataset into groups, where each vector is represented by the centroid of its group. More formally, given a vector $x \in \mathbb{R}^D$, VQ maps $x$ to the nearest codeword of a pre-trained codebook $\mathcal{C} = \{c_1, \dots, c_K\}$ as follows:

$$q(x) = \arg\min_{c \in \mathcal{C}} d(x, c) \qquad (1)$$

where $d(\cdot, \cdot)$ is a distance metric. In particular, the distance used in this paper is the Euclidean distance: $d(x, y) = \|x - y\|_2$. The encoding map $q(\cdot)$ is called the quantizer, which is the most important component of VQ. The quantization distortion, or reconstruction error, of $x$ is defined as:

$$E(x) = \|x - q(x)\|_2^2 \qquad (2)$$

Given the codebook $\mathcal{C}$, the quantization of $x$ is computed by solving the minimization problem in Eqn. (1). Typically, the distance to a database vector can then be simply represented by the Euclidean distance between the query and its corresponding codeword in $\mathcal{C}$.
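The hard assignment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `vq_encode` and `vq_distortion`, the row-wise codebook layout, and the toy values are all assumptions made for the example.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codeword (a row of `codebook`) nearest to x."""
    sq_dists = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(sq_dists))

def vq_distortion(x, codebook):
    """Squared Euclidean reconstruction error of x under hard assignment."""
    idx = vq_encode(x, codebook)
    return float(np.sum((x - codebook[idx]) ** 2))

# Toy codebook with three 2-D codewords (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.9, 0.2])

idx = vq_encode(x, codebook)      # only this index needs to be stored
err = vq_distortion(x, codebook)  # squared distance to the chosen codeword
```

Only the index is stored per database vector, which is why the code length grows with the logarithm of the codebook size.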

In general, there are two kinds of ANN search methods according to different forms of queries. One is called Symmetric Distance Computation (SDC), in which both query and database vectors are quantized into codes. The other is called Asymmetric Distance Computation (ADC), where only the database vectors are quantized.

III-B Sparse Vector Quantization

One key limitation of VQ is that it assigns the original vector to the single nearest codeword in the codebook. This hard assignment strategy can lead to relatively large quantization distortion which limits the performance of VQ.

Motivated by the success of soft assignment [35], instead of using the hard assignment as in VQ, we employ the sparse representation of multiple codewords to represent the original feature vector.

Fig. 2 shows a 2D toy example to illustrate the key idea of our proposed method. Let $x$ and $y$ denote a query and a vector in the gallery set, respectively, and let $q(x)$ and $q(y)$ represent the quantization vectors of $x$ and $y$. VQ simply sets $q(y)$ to its nearest codeword $c_i$ by hard assignment, and similarly sets $q(x)$ to its own nearest codeword. Thus, the quantization distortion for $y$ is $\|y - c_i\|$. In this work, we employ the linear combination of two words $c_i$ and $c_j$ to represent $y$. Therefore, $q(y)$ is the projection of $y$ on the line spanned by $c_i$ and $c_j$. It is clear that the quantization distortion of VQ is never smaller than that of the sparse quantization, since $\|y - q(y)\| \le \|y - c_i\|$.

As $q(y)$ lies on the line ($c_i c_j$), we assume $q(y) = \alpha c_i + \beta c_j$. Note that the coefficients $\alpha$ and $\beta$ can be easily computed by solving the linear equation. As illustrated in Fig. 2, we can compute the distance as follows:

$$d(x, q(y))^2 = \|x\|^2 + \|q(y)\|^2 - 2\left(\alpha \langle x, c_i \rangle + \beta \langle x, c_j \rangle\right) \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product. The above equation calculates the ADC distance. We can also calculate the SDC distance using a similar approximation.
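The toy example can be checked numerically. The sketch below assumes the two-word representation is a least-squares linear combination (as in the greedy solver used later) and uses a 3-D example so that two codewords do not span the whole space; all vectors and variable names are hypothetical. It verifies that the two-word error does not exceed the hard-assignment error, and that the query-to-code distance can be obtained from dot products with the two codewords alone.

```python
import numpy as np

# Two codewords, a gallery vector y and a query x (hypothetical values).
c1 = np.array([1.0, 0.0, 0.0])
c2 = np.array([0.0, 1.0, 0.0])
y = np.array([0.6, 0.5, 0.1])
x = np.array([0.2, 0.9, 0.3])

# Least-squares coefficients of y on the two codewords.
A = np.stack([c1, c2], axis=1)                 # shape (3, 2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha, beta = coef
q_y = alpha * c1 + beta * c2                   # sparse quantization of y

# Hard assignment picks the single nearest codeword.
hard_err = min(np.sum((y - c1) ** 2), np.sum((y - c2) ** 2))
sparse_err = np.sum((y - q_y) ** 2)

# ADC-style distance expanded into norms and two dot products.
adc_expanded = x @ x + q_y @ q_y - 2.0 * (alpha * (x @ c1) + beta * (x @ c2))
adc_direct = np.sum((x - q_y) ** 2)
```

The expanded form is what makes table lookups possible: the dot products between the query and every codeword can be computed once per query and reused for all database vectors.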

Before introducing SVQ, we first give an equivalent formulation of VQ. We stack the codebook $\mathcal{C}$ into a matrix $C \in \mathbb{R}^{D \times K}$, in which each column is a word. Letting $K$ denote the size of codebook $\mathcal{C}$, we can rewrite Eqn. (1) as the following optimization problem:

$$\min_{b} \|x - C b\|_2^2 \quad \text{s.t.} \quad b \in \{0, 1\}^K,\ \|b\|_0 = 1 \qquad (4)$$

Here $b$ is a $K$-dimensional column vector, in which the value of each element is either zero or one. Obviously, the above optimization in Eqn. (4) is equivalent to hard assignment: it imposes very strict constraints on the variable $b$ so as to choose the nearest word from matrix $C$ given the input vector $x$.

As in the above discussion, it can be observed that the search accuracy for ANN is directly related to the bound of Eqn. (4) rather than its solution. To this end, we relax the constraints in Eqn. (4) so as to obtain a lower bound. This will implicitly yield better ANN search performance. Specifically, we relax the constraint in Eqn. (4) as follows:

$$\min_{b} \|x - C b\|_2^2 \quad \text{s.t.} \quad b \in \mathbb{R}^K,\ \|b\|_0 \le L \qquad (5)$$

where $L$, named the sparse level, denotes the number of codewords selected to encode $x$. It can be seen that such relaxation not only increases the $\ell_0$ norm of the sparse representation but also expands the space of $b$. Obviously, Eqn. (4) can be viewed as a special case of Eqn. (5) when $L = 1$. Therefore, we can obtain a lower bound for the quantization distortion. Intuitively, the above formulation employs a linear combination of words in the codebook, rather than only a single word as in VQ, to approximate the original input vector. Our empirical study shows that using just two words is sufficient to yield a significant gain over hard assignment.

Eqn. (5) is well known to be an NP-hard problem. To tackle this issue, we take advantage of an effective greedy algorithm called Orthogonal Matching Pursuit (OMP) [25, 26]. OMP updates all the extracted coefficients by computing the orthogonal projection of the vector residual onto the set of codewords selected so far. As $L$ is usually set to two, there are at most two non-zero elements in the coefficient vector $b$. As the sparse property of the representation is essential to fast NN search, we name our method Sparse Vector Quantization (SVQ).
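A compact version of Orthogonal Matching Pursuit helps make the greedy encoding concrete. This is a textbook-style sketch rather than the authors' code: the function name `omp`, the column-wise codebook layout, and the unit-norm columns are assumptions for the example.

```python
import numpy as np

def omp(x, C, sparse_level=2):
    """Greedy sparse coding of x over the columns of C.

    Returns the selected column indices and their least-squares coefficients.
    Columns of C are assumed to have unit norm.
    """
    residual = x.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparse_level):
        corr = np.abs(C.T @ residual)
        if support:
            corr[support] = -np.inf          # never reselect a codeword
        support.append(int(np.argmax(corr)))
        # Orthogonal projection of x onto the codewords selected so far.
        coef, *_ = np.linalg.lstsq(C[:, support], x, rcond=None)
        residual = x - C[:, support] @ coef
    return support, coef

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))
C /= np.linalg.norm(C, axis=0)               # unit-norm codewords
x = rng.normal(size=8)

s1, b1 = omp(x, C, sparse_level=1)
s2, b2 = omp(x, C, sparse_level=2)
err1 = np.sum((x - C[:, s1] @ b1) ** 2)
err2 = np.sum((x - C[:, s2] @ b2) ** 2)
```

Because the coefficients are refit by least squares over a growing set of codewords, the reconstruction error cannot increase when the sparse level goes from one to two.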

III-C Codebook Training

Recall that the previous analysis assumed the codebook of each approach to be given. In this part, we show how to obtain the codebook.

The first common and straightforward method is to find the codebook by directly minimizing the quantization error on the training set $\mathcal{T} = \{x_1, \dots, x_N\}$. In the case of VQ, the codebook is obtained by solving Eqn. (1) or Eqn. (4) on $\mathcal{T}$, which is equivalent to running an iterative k-means clustering algorithm where the centroids of the resulting clusters are treated as the codebook.

For SVQ, minimizing the quantization error is equivalent to the following problem:

$$\min_{C, B} \sum_{i=1}^{N} \|x_i - C b_i\|_2^2 \quad \text{s.t.} \quad \|b_i\|_0 \le L,\ i = 1, \dots, N \qquad (6)$$

where $B = [b_1, \dots, b_N]$. This problem is NP-hard. We can alternate between $C$ and $B$ to solve it. When $C$ is fixed, we have shown how to solve for $B$ in the previous section. Notice that what we care about here is the codebook $C$. We can therefore further relax the constraints by using an $\ell_1$-norm constraint, which also yields sparse solutions. In Section V-B, we will see that both constraints are applicable to our method. When $B$ is fixed, the problem becomes an unconstrained least squares problem. In our implementation we employ the stochastic/online optimization algorithm [25, 26] to solve the above optimization problem for learning the codebook, so that the learned codebook is well fitted to the sparse coding task. Since the algorithm is based on stochastic optimization, it is even faster than the conventional k-means clustering method.
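The alternating scheme can be illustrated with a small batch sketch. Note that this is not the stochastic online algorithm of [25, 26] used in the paper: for brevity it swaps in a plain batch alternation (greedy sparse coding, then a closed-form least-squares codebook update). The names `omp_code`, `train_codebook`, and all sizes are assumptions, and codewords are left unnormalized to keep the example short.

```python
import numpy as np

def omp_code(x, C, L):
    """Dense K-vector with at most L non-zeros, found greedily."""
    support, residual = [], x.copy()
    coef = np.zeros(0)
    for _ in range(L):
        corr = np.abs(C.T @ residual)
        if support:
            corr[support] = -np.inf
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(C[:, support], x, rcond=None)
        residual = x - C[:, support] @ coef
    b = np.zeros(C.shape[1])
    b[support] = coef
    return b

def train_codebook(X, K, L=2, iters=5, seed=0):
    """Batch alternation: X is D x N, the returned codebook is D x K."""
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(X.shape[1], K, replace=False)].copy()
    history = []
    for _ in range(iters):
        # Step 1: codebook fixed, sparse-code every training vector.
        B = np.stack([omp_code(X[:, i], C, L) for i in range(X.shape[1])], axis=1)
        before = float(np.sum((X - C @ B) ** 2))
        # Step 2: codes fixed, unconstrained least squares for the codebook.
        C = X @ np.linalg.pinv(B)
        after = float(np.sum((X - C @ B) ** 2))
        history.append((before, after))
    return C, history

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 60))
C, history = train_codebook(X, K=10, L=2, iters=3)
```

Within each iteration the least-squares step cannot increase the error for the current codes; the greedy coding step carries no such guarantee, which is one reason practical solvers differ from this sketch.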

In general, VQ-based methods rely heavily on a good codebook, which is important for reducing the quantization distortion. Due to its intrinsic limitations, k-means often has difficulty generating a good one. In the experiments, we will show that, in contrast to other VQ-based methods such as PQ and OPQ, our proposed SPQ method is not limited to any specific codebook learning method.

IV Sparse Product Quantization

To facilitate the practical ANN search, we propose an efficient Sparse Product Quantization approach by extending the product quantization with the proposed SVQ technique in order to further reduce the computational overhead.

IV-A Product Quantization

Following the idea of Product Quantization (PQ) [15], we decompose the high-dimensional space into a Cartesian product of low-dimensional subspaces and then perform sparse vector quantization in each subspace separately. Specifically, a vector $x \in \mathbb{R}^D$ is viewed as the concatenation of $M$ subvectors, $x = [u_1(x), \dots, u_M(x)]$, and the codebook is defined as $\mathcal{C} = \mathcal{C}_1 \times \dots \times \mathcal{C}_M$.

For PQ, each subvector is mapped onto a sub-codeword from its corresponding codebook:

$$x \mapsto q(x) = [q_1(u_1(x)), \dots, q_M(u_M(x))] \qquad (7)$$

where $q_m$ is a quantizer for the $m$-th subvector of $x$. Practically, $x$ is equally partitioned so that all subvectors have $D/M$ dimensions and $D$ is a multiple of $M$. Note that each subvector is encoded according to a different codebook. In this case, any word $c$ in codebook $\mathcal{C}$ is the concatenation of $M$ sub-codewords: $c = [c_1, \dots, c_M]$, with each $c_m \in \mathcal{C}_m$.

Let $q(x)$ denote the PQ of $x$. Then, the quantization distortion of $x$ by PQ is defined as follows:

$$E(x) = \|x - q(x)\|_2^2 = \sum_{m=1}^{M} \|u_m(x) - q_m(u_m(x))\|_2^2 \qquad (8)$$

Usually, we need to quantize a set of vectors rather than a single one. Hence, the quantization distortion of a set of vectors is defined by aggregating $E(x)$ over all of its members.

As in Eqn. (7), it can be easily observed that PQ divides Eqn. (1) into $M$ sub-VQ problems and addresses them separately. Therefore, PQ enjoys the merit of providing a compact coding scheme for high-dimensional data while yielding accurate results for fast approximate nearest neighbor search. However, the unavoidable quantization error limits its search accuracy due to the inherent nature of vector quantization [13].
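A minimal sketch of PQ encoding and decoding, assuming equally sized subvectors and one small codebook per subspace. The helper names `pq_encode` and `pq_decode` and the toy codebooks are hypothetical.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Hard-assign each subvector of x to its nearest sub-codeword.

    codebooks: list of M arrays, each of shape (K, D / M).
    """
    subvectors = np.split(x, len(codebooks))
    return [int(np.argmin(np.sum((cb - s) ** 2, axis=1)))
            for s, cb in zip(subvectors, codebooks)]

def pq_decode(codes, codebooks):
    """Concatenate the selected sub-codewords into the reconstruction."""
    return np.concatenate([cb[k] for k, cb in zip(codes, codebooks)])

# D = 4, M = 2 subspaces, K = 2 sub-codewords each (toy values).
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]),
             np.array([[0.0, 1.0], [1.0, 0.0]])]
x = np.array([0.9, 1.1, 0.1, 0.8])

codes = pq_encode(x, codebooks)
recon = pq_decode(codes, codebooks)
distortion = float(np.sum((x - recon) ** 2))   # sum of per-subspace errors
```

With $M$ codebooks of $K$ words each, the implicit full codebook contains $K^M$ words while only $M \cdot K$ sub-codewords are stored.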

Intuitively, better reconstruction, that is, lower quantization distortion, indicates better search accuracy. In the next section, we introduce an approach which can effectively reduce the quantization distortion.

IV-B Sparse Product Quantization

In the proposed sparse vector quantization, we represent each item $y$ in the database as follows:

$$y \approx C b \qquad (9)$$

where $b$ is a sparse vector with a few non-zero elements.

Motivated by product quantization, in this paper we employ the proposed SVQ scheme with a slight modification by replacing $y$ with its subvector $u_m(y)$. Therefore, we can approximate $y$ through the following equation:

$$y \approx q(y) = [C_1 b_1, \dots, C_M b_M] \qquad (10)$$

We can prove that the quantization distortion of SPQ is upper bounded by that of PQ. Recall that SVQ is a relaxed version of VQ. Thus, the bound for the quantization distortion of SVQ is lower than that of VQ. In the case of PQ and SPQ, the distortions are the sums of the distortions of the individual subvectors. With respect to each subvector, the situation is the same as that of VQ and SVQ. Thus, we can conclude that the quantization distortion of SPQ is less than or equal to that of PQ.
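The argument can be written out as a short chain of inequalities. The notation here is assumed for illustration: $u_m(x)$ is the $m$-th subvector, $C_m$ its codebook matrix with $K$ columns, $q_m$ the hard quantizer of PQ, and $L \ge 1$ the sparse level. Both methods are assumed to share the same codebooks.

```latex
% Every one-hot vector is feasible for the relaxed problem, so the relaxed
% minimum in each subspace cannot exceed the hard-assignment error.
\begin{align*}
E_{\mathrm{SPQ}}(x)
  &= \sum_{m=1}^{M} \min_{\|b\|_0 \le L} \|u_m(x) - C_m b\|_2^2 \\
  &\le \sum_{m=1}^{M} \min_{b \in \{0,1\}^K,\ \|b\|_0 = 1} \|u_m(x) - C_m b\|_2^2 \\
  &= \sum_{m=1}^{M} \|u_m(x) - q_m(u_m(x))\|_2^2 = E_{\mathrm{PQ}}(x).
\end{align*}
```

The inequality is stated for the exact minimizer of the relaxed problem; a greedy solver only approximates that minimizer.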

Input: The database $\mathcal{Y}$, the codebook size $K$, the subspace number $M$, the sparse level $L$, the query set $\mathcal{X}$, the number of NNs $k$.
Output: The top-$k$ ANN indices, the top-$k$ ANN distances.
Encoding Stage
1:  Sample a subset $\mathcal{T}$ of $\mathcal{Y}$.
2:  for each subspace $m$ of $M$ do
3:     Use the fast stochastic online algorithm [25] to train a codebook $C_m$ of size $K$ on $\mathcal{T}$.
4:  end for
5:  for each subspace $m$ of $M$ do
6:     Compute the sparse coefficients $b_m$ on $\mathcal{Y}$ using the Orthogonal Matching Pursuit algorithm, such that $\|b_m\|_0 \le L$.
7:  end for
Query Stage
8:  for each query $x$ of $\mathcal{X}$ do
9:     for each subspace $m$ of $M$ do
10:        Precompute the lookup table with $u_m(x)$ and $C_m$.
11:        Use the lookup table and $b_m$ to compute the approximate distances to the database on this subspace.
12:     end for
13:     Sum up the approximate distances over all subspaces.
14:     Search the top-$k$ NNs based on the summed distances and save their indices and distances to the outputs.
15:  end for
Algorithm 1 ANN Search with Sparse Product Quantization

IV-C Approximate Nearest Neighbor Search

In the following, we discuss how to apply the proposed SPQ method to conduct ANN search towards large-scale image retrieval tasks. The whole framework of our proposed SPQ approach of ADC version is summarized into Algorithm 1.

In particular, to facilitate ANN search, we encode all the data vectors in the gallery using the proposed SPQ method. Then, we compute the distance between a query and the data in the gallery using two kinds of distance measures: ADC and SDC.

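According to the definition, ADC can be formulated as:

$$d(x, y)^2 \approx \|x - q(y)\|_2^2 = \|x\|^2 + \|q(y)\|^2 - 2 \sum_{m=1}^{M} \langle u_m(x), C_m b_m \rangle \qquad (11)$$

To reduce the computational cost of the ADC distance computation, we can either normalize $y$ or precompute $\|q(y)\|^2$. Since $b_m$ is an essentially sparse vector, it only requires several floating point operations to compute $\langle u_m(x), C_m b_m \rangle$.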
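The table-lookup computation can be sketched as follows: the squared distance is expanded into the query norm, a precomputed code norm, and a dot product that touches only the few selected codewords per subspace. This is an illustrative sketch under assumed data layouts, not the authors' SSE implementation; `adc_distances`, the array shapes, and the random test data are hypothetical.

```python
import numpy as np

def adc_distances(query, codebooks, indices, coeffs, sq_norms):
    """Approximate squared distances from one query to N encoded vectors.

    codebooks: list of M arrays of shape (K, d), one per subspace
    indices:   (N, M, L) selected codeword ids per vector and subspace
    coeffs:    (N, M, L) matching sparse coefficients
    sq_norms:  (N,) precomputed squared norms of the reconstructions
    """
    M = len(codebooks)
    subqueries = np.split(query, M)
    # table[m, k] = <m-th subquery, k-th codeword of subspace m>
    table = np.stack([cb @ s for cb, s in zip(codebooks, subqueries)])
    m_idx = np.arange(M)[None, :, None]
    dots = np.sum(coeffs * table[m_idx, indices], axis=(1, 2))
    return query @ query + sq_norms - 2.0 * dots

rng = np.random.default_rng(0)
N, M, K, d, L = 5, 4, 8, 3, 2
codebooks = [rng.normal(size=(K, d)) for _ in range(M)]
indices = rng.integers(0, K, size=(N, M, L))
coeffs = rng.normal(size=(N, M, L))

# Explicit reconstructions, used here only to check the lookup result.
recon = np.stack([
    np.concatenate([coeffs[n, m] @ codebooks[m][indices[n, m]] for m in range(M)])
    for n in range(N)
])
sq_norms = np.sum(recon ** 2, axis=1)
query = rng.normal(size=M * d)

approx = adc_distances(query, codebooks, indices, coeffs, sq_norms)
direct = np.sum((recon - query) ** 2, axis=1)
```

Per database vector, the online work is $M \cdot L$ multiply-adds on table entries, independent of the original dimensionality.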

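In the case of SDC computation, we employ sparse product quantization to approximate the query vector as $x \approx q(x) = [C_1 a_1, \dots, C_M a_M]$, where $a_m$ is the sparse coefficient vector of the $m$-th query subvector. Similarly, SDC is computed as:

$$d(x, y)^2 \approx \|q(x) - q(y)\|_2^2 = \|q(x)\|^2 + \|q(y)\|^2 - 2 \sum_{m=1}^{M} \langle C_m a_m, C_m b_m \rangle \qquad (12)$$

For better illustration, Fig. 2 shows a 2D example of distance computation for both ADC and SDC.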

IV-D Complexity Analysis

In the following, we give the detailed analysis on the complexity of our proposed SPQ scheme.

Let $D$ denote the dimensionality of each feature vector, $N$ the total number of items in the whole database $\mathcal{Y}$, $M$ the number of subvectors in each vector, and $K$ the size of each codebook. For a given query, the cost of searching its approximate nearest neighbor in the database, measured in floating point multiplications, consists of three parts. First, the norm term of each database vector must be computed; this is required only if the database is not normalized. Second, it takes $O(KD)$ operations to compute the dot products between the query vector and the vocabulary matrix $C$. Third, we need $O(NML)$ multiplications, where $L$ is the sparse level, to compute the dot product with the database vectors in sparse representation, which is the third term in Eqn. (11).

If all the database vectors have been normalized to unit norm offline, the norm term no longer needs to be computed online. Therefore, the overall online time complexity of computing the ADC distance can be reduced to $O(KD + NML)$. In the task of multimedia information retrieval, the dimensionality of each feature vector is far less than the total number of entries in the database: $D \ll N$. Thus, the computational complexity of Eqn. (11) can be approximated by $O(NML)$. On the other hand, the complexity of brute-force NN search is $O(ND)$. Thus, we can obtain a substantial speedup using the proposed SPQ scheme. Moreover, our method is able to take advantage of the efficient SSE instructions to further reduce the time spent on multiplications. Specifically, the search time of our proposed SPQ is comparable to that of the original PQ, while the precision of SPQ exceeds that of PQ by a large margin. Additionally, the empirical study shows that SPQ is even faster than FLANN [28] at the same recall rate.

Due to the inherent nature of soft-assignment, SPQ inevitably consumes more memory than hard-assignment methods. However, the extra memory is worthwhile because SPQ brings a significant gain in precision. According to previous studies, FLANN is one of the most popular ANN search techniques that utilize a tree structure. However, it fails to work for very large-scale datasets since it must load the whole dataset in memory when building the trees. By contrast, SPQ does not need to load the whole dataset in memory, thanks to the efficient inverted file structure, making it potentially more practical than FLANN for large-scale multimedia retrieval.

V Experiment

In this section, we will first introduce our experimental testbed and the background of several state-of-the-art ANN methods we will compare with. Then we discuss the settings of our proposed method and furnish our results comparing with these methods. Finally, we show the application of our method on image retrieval.

V-A Experimental Testbed

To examine the empirical efficacy of the proposed method, we conduct an extensive set of experiments for comprehensive performance evaluation on five datasets, including a synthetic dataset with Gaussian noise and four publicly available image feature collections. Each dataset is partitioned into three parts: training set, gallery set and query set. The details of these testbeds are summarized as follows: 1) The SIFT dataset consists of one million local SIFT features [24] with 128 dimensions, in which 100K samples are employed to learn the codebook. All the one million samples are treated as the gallery set, and 10K samples are used for evaluation. Note that there is no overlap between the training set and the gallery set, since the former is extracted from Flickr images and the latter is from the INRIA Holidays images [14]; 2) GIST [32] is made of 960-dimensional global features. There are 50K samples used to learn the codebook. Similarly, one million samples in the database are viewed as the gallery set, and 1K samples are used for query evaluation. They are extracted from the tiny image set [43], the Holidays image set, and the Holidays with Flickr1M set, respectively; 3) We perform an empirical study on MNIST as used in OPQ [10], which is a 784-dimensional image set of hand-written digits with 70K images in total. In our experiment, we randomly sample 1K images as the queries, and the remaining data are treated as the gallery set. To learn the codebook, we randomly pick 7K from the gallery set; 4) The LabelMe dataset [37] contains 22,019 images, where each item is represented by a 512-dimensional GIST descriptor. Following [44], we randomly sample 2K images to form the query set and use the remaining data to form the gallery set; 5) We also synthesize a set of 128-dimensional vectors from independent Gaussian distributions. We choose 10K vectors to learn the codebook, 1M vectors are employed as the gallery set, and 1K samples are used for query. All the compared methods are evaluated on the same dataset for each setting. Table I summarizes the statistics of the datasets used in our experiments.

Fig. 3: Setting experiments on the SIFT dataset. (a) mAP vs. square distortion under different sparse levels. The square distortion decreases and mAP increases consistently when the sparse level increases. (b) Performance comparison of different codebook learning methods. Random denotes the random sampling method, and L0-learning and L1-learning are the online dictionary learning algorithm with the corresponding constraint [25]. Our method performs very similarly on real data with the various codebook learning methods, except for the random sampling method, which contains no information about the data. (c) The quantization distortion of the different codebook methods. It is easy to see the relationship between accuracy and distortion.

We compare our proposed Sparse Product Quantization (SPQ) approach with the following state-of-the-art methods.

  • Product Quantization (PQ [15]) builds the codebook on a Cartesian product space and is treated as the baseline. IVFPQ refers to PQ with the inverted file structure. All the results of PQ in the experiments are reproduced with the original implementation.

  • Optimized Product Quantization (OPQ [9]) aims at finding an optimal space decomposition for PQ, and introduces two different solutions. Due to its superior performance, we only compare with the non-parametric solution, using the parametric one as a warm start. Similarly, we adopt the authors' own implementation with default settings.

  • Cartesian K-means (CK-means [31]) is yet another method to find the optimal space decomposition for PQ. It is equivalent to OPQ when using the same initialization. The results of CK-means are produced with the publicly available implementation with default setup.

  • Iterative Quantization (ITQ [12]) is an effective binary embedding technique that can also be viewed as a vector quantization method.

  • Order Preserving Hashing (OPH [44]) is a state-of-the-art hashing method that learns similarity-preserving hashing functions.

  • FLANN [28] is the most popular open-source ANN search toolbox based on the framework of searching tree. It is able to automatically select the best algorithm and parameters for a given dataset.

The above methods can be roughly categorized into three groups: (i) VQ-based methods, including PQ, OPQ, and CK-means; (ii) hashing-based methods, including ITQ and OPH; and finally (iii) FLANN that is a searching tree-based method. In the following, we make the comparisons for each group separately.


Dataset             SIFT   GIST   Random   MNIST   LabelMe
Dimension           128    960    128      784     512
Training set size   100K   50K    10K      10K     10K
Gallery set size    1M     1M     1M       60K     20,019
Query set size      10K    1K     10K      1K      2K


TABLE I: Summary of our experimental testbeds

In our empirical study, distortion is employed to measure the reconstruction performance of vector quantization. To evaluate the efficacy of ANN search methods, we employ the conventional performance metrics for multimedia information retrieval, including precision, recall and mAP. Precision is the average proportion of true NNs ranked first among the returned candidates, and recall denotes the proportion of true NNs, over all queries, that appear among the returned candidates. Moreover, mAP is the mean of the average precision over all the queries, which indicates the overall performance. All of our experiments were carried out on a PC with an Intel Core i7-3770 3.4GHz processor and 16GB RAM using a single thread.
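As a concrete reading of recall at R, the sketch below uses one common definition from the ANN literature: the fraction of queries whose true nearest neighbor appears among the first R returned candidates. This definition, the function name `recall_at_r`, and the toy rankings are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
def recall_at_r(ranked_ids, true_nn, r):
    """Fraction of queries whose true NN is within the first r candidates.

    ranked_ids: one ranked list of candidate ids per query
    true_nn:    the id of the true nearest neighbor of each query
    """
    hits = sum(1 for ranked, truth in zip(ranked_ids, true_nn)
               if truth in ranked[:r])
    return hits / len(true_nn)

# Three queries with hypothetical rankings and ground truth.
ranked_ids = [[3, 1, 2], [0, 4, 5], [7, 8, 9]]
true_nn = [1, 5, 6]

recalls = [recall_at_r(ranked_ids, true_nn, r) for r in (1, 2, 3)]
```

Recall at R is non-decreasing in R, which is why the curves of recall vs. R rise toward the right.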

V-B Settings

We discuss the experimental settings for the proposed SPQ approach in the following.

Sparse Level. The sparse level denotes the number of words used to encode a feature vector and is critical to our method. Fig. 3(a) shows mAP with respect to the squared quantization distortion under different sparse levels on the SIFT dataset. Clearly, the squared distortion decreases consistently when the sparse level increases, and at the same time mAP increases. Moreover, we found that the distortion drops significantly from level one to level two. Since the computational time and memory consumption grow with the sparse level, we set the sparse level to 2 in the following experiments as a tradeoff between efficiency and accuracy. In Section V-C, we will see that this sparse level is good enough to outperform the state-of-the-art methods.

Fig. 4: Performance comparison on different datasets. (a)-(d) are the results on the Random, SIFT, GIST, and MNIST datasets, respectively. The first row presents the performance in terms of Distortion vs. Code Length. The second and third rows compare performance in terms of Recall vs. R and mAP vs. Code Length, respectively, when finding the top 50 nearest neighbors. Note that the poor performance of OPQ on MNIST is because we fixed its initialization.

Codebook Training. In the following, we study four different codebook generation methods: random sampling, K-means, and sparse dictionary learning under each of two norm constraints. The random sampling method simply generates the codebook from a Gaussian distribution. Both the random method and K-means are learning-free methods. The sparse dictionary learning methods are each based on a norm constraint, and both can be solved by the online algorithm [25]. We test these codebook methods on the SIFT dataset, and the results are shown in Fig. (b) and Fig. (c).

From the results, we observe that the random sampling method performs much worse than the other methods, since its codebook contains no information about the gallery set. Surprisingly, K-means clustering and the sparse dictionary learning methods perform very similarly, which again implies the robustness of our method to different codebook learning algorithms. In the following experiments, unless explicitly stated otherwise, the codebooks are generated by the norm-constrained sparse dictionary learning method.
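The gap between a data-independent codebook and one fitted to the data can be illustrated with a small sketch. It compares a Gaussian random codebook with a K-means codebook under hard assignment on toy 2-D data. The toy clusters, the farthest-point initialization, and all function names are assumptions chosen for this example, and the data is deliberately placed far from the origin, so the gap is larger than one would expect on real descriptors.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distortion(data, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(min(sqdist(x, w) for w in codebook) for x in data) / len(data)


def random_codebook(k, dim, rng):
    """Codewords drawn from a standard Gaussian, ignoring the data."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def kmeans_codebook(data, k, iters=20):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centers = [list(data[0])]
    while len(centers) < k:
        far = max(data, key=lambda x: min(sqdist(x, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda c: sqdist(x, centers[c]))
            groups[j].append(x)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers


rng = random.Random(0)
means = [(10, 10), (10, -10), (-10, 10), (-10, -10)]
data = [[rng.gauss(m[0], 0.5), rng.gauss(m[1], 0.5)] for m in means for _ in range(50)]

print("random :", distortion(data, random_codebook(4, 2, rng)))
print("k-means:", distortion(data, kmeans_codebook(data, 4)))
```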

V-C Comparisons with other methods

Comparison with Vector Quantization Methods. Our proposed SPQ approach is based on the VQ framework. For a comprehensive evaluation, we compare our method with three state-of-the-art VQ-based methods: PQ, OPQ, and CK-means.

We first examine the quantization distortion of the different methods at various code lengths. As shown in Fig. 4, our proposed SPQ method consistently achieves very low squared distortion on all the datasets compared to the other methods. Then, we evaluate the recall with respect to the total number of returned candidates, i.e., recall vs. R, when searching for different numbers of NNs. Moreover, we measure recall@100, which denotes the proportion of true NNs in the top 100 returned results, at various code lengths. Depending on whether or not the queries are quantized, there are two kinds of distance computation: symmetric distance computation (SDC), which quantizes the query, and asymmetric distance computation (ADC), which does not. Fig. 5 shows the experimental results. From the results, it is clear that our approach generally outperforms the other competing quantization methods.
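The difference between ADC and SDC can be sketched for plain product quantization with hard assignment, as in PQ [15]. The tiny codebooks, the vectors, and the function names below are hypothetical and serve only to show where the query is or is not quantized.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(x, m):
    """Split vector x into m equal-length sub-vectors."""
    d = len(x) // m
    return [x[i * d:(i + 1) * d] for i in range(m)]


def pq_encode(x, codebooks):
    """Index of the nearest codeword in each subspace (hard assignment)."""
    return [min(range(len(cb)), key=lambda j: sqdist(s, cb[j]))
            for s, cb in zip(split(x, len(codebooks)), codebooks)]


def adc(query, code, codebooks):
    """Asymmetric distance: raw query vs. quantized database vector."""
    subs = split(query, len(codebooks))
    # One lookup table per subspace: query part to every codeword.
    tables = [[sqdist(s, w) for w in cb] for s, cb in zip(subs, codebooks)]
    return sum(t[j] for t, j in zip(tables, code))


def sdc(query, code, codebooks):
    """Symmetric distance: quantized query vs. quantized database vector."""
    qcode = pq_encode(query, codebooks)
    # In practice these codeword-to-codeword distances are precomputed once.
    return sum(sqdist(cb[i], cb[j]) for cb, i, j in zip(codebooks, qcode, code))


# Two subspaces with two 2-D codewords each (toy values).
codebooks = [[[0.0, 0.0], [4.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]]
x = [3.5, 4.5, 0.2, -0.1]        # database vector
q = [1.0, 1.0, 2.0, 1.0]         # query vector
code = pq_encode(x, codebooks)

print("exact:", sqdist(q, x))
print("ADC  :", adc(q, code, codebooks))
print("SDC  :", sdc(q, code, codebooks))
```

In this toy case the ADC estimate is much closer to the exact squared distance than the SDC estimate, because only the database vector carries quantization error.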

As presented in [15], PQ slightly improves its search accuracy by combining an inverted file structure with encoding of the residual (IVFPQ). We also apply the inverted file structure to our method (IVFSPQ) and compare it with IVFPQ in Fig. 5. We can see that IVFPQ indeed performs slightly better than PQ on both the SIFT and GIST datasets, while our method still outperforms both of them. In terms of efficiency, our proposed SPQ approach is expected to be slightly more computationally expensive than PQ. However, as shown in Table III, the empirical time costs of PQ and SPQ when using an inverted file structure (IVFPQ vs. IVFSPQ) are fairly comparable.
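A rough sketch of an inverted file structure with residuals is given below. It is a simplification under stated assumptions: the residuals are stored unquantized, whereas IVFPQ and IVFSPQ would encode them, and the coarse centroids, the `n_probe` parameter, and the function names are hypothetical.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(data, coarse):
    """Assign each vector to its nearest coarse centroid and store its residual."""
    lists = [[] for _ in coarse]
    for i, x in enumerate(data):
        j = min(range(len(coarse)), key=lambda c: sqdist(x, coarse[c]))
        residual = [a - b for a, b in zip(x, coarse[j])]
        lists[j].append((i, residual))  # a real system would quantize the residual
    return lists


def search_ivf(query, coarse, lists, n_probe, top):
    """Scan only the n_probe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(coarse)), key=lambda c: sqdist(query, coarse[c]))[:n_probe]
    scored = []
    for j in probe:
        q_res = [a - b for a, b in zip(query, coarse[j])]  # query residual for list j
        scored.extend((sqdist(q_res, res), i) for i, res in lists[j])
    return [i for _, i in sorted(scored)[:top]]


coarse = [[0.0, 0.0], [10.0, 10.0]]
data = [[0.5, 0.0], [1.0, 1.0], [9.0, 10.0], [10.5, 10.5]]
lists = build_ivf(data, coarse)

query = [9.5, 9.5]
print(search_ivf(query, coarse, lists, n_probe=1, top=2))
print(search_ivf(query, coarse, lists, n_probe=2, top=4))
```

Probing fewer lists reduces the number of distance computations, at the risk of missing neighbors that fall into unvisited lists.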

Fig. 5: Performance comparison of ADC and SDC on different datasets. (a)-(d) are the results on the Random, SIFT, GIST, and MNIST datasets, respectively. Performance is compared in terms of Recall vs. R and mAP vs. Code Length when finding the nearest neighbor. The top two rows are under ADC and the bottom two are under SDC. In the first row, we also compare the PQ and SPQ variants that employ the inverted file structure, i.e., IVFPQ and IVFSPQ. It can be observed that our method performs better than, or at least competitively with, the other methods.

Comparison with Hashing-based Methods.

We compare our SPQ approach with several state-of-the-art hashing-based ANN search methods, including minimal loss hashing (MLH) [30], iterative quantization hashing (ITQ) [12], order preserving hashing (OPH) [44], locality sensitive hashing (LSH) [8], kernelized supervised hashing (KSH) [23], isotropic hashing (IsoHash) [20], and spectral hashing (SH) [46]. To make a fair comparison, we follow the evaluation protocol in [44], where mAP is employed as the performance metric with the ground truth being the 50 nearest neighbors. Table II shows the performance evaluation on three datasets: LabelMe, SIFT, and GIST. It can be clearly seen that our proposed SPQ approach outperforms these hashing-based methods by a large margin. We also compare the search time with that of the spectral hashing algorithm (SH [46]), and the results are summarized in Table III. For PQ [15], we have re-implemented the Hamming distance computation in C in order to ensure that all the approaches in our comparisons are optimized appropriately. It can be seen that our proposed method outperforms SH in terms of both efficiency and accuracy.


Dataset   Code Length   SPQ     MLH [30]   ITQ [12]   OPH [44]   LSH [8]   KSH [23]   IsoHash [20]   SH [46]
LabelMe   32            47.97   19.91      20.36      21.11      8.87      16.72      18.51          9.28
          64            62.14   32.48      32.09      33.94      17.57     24.57      28.35          11.18
          128           77.52   45.22      44.66      44.36      32.52     31.45      42.34          13.73
SIFT      32            32.08   3.07       2.69       5.07       1.49      1.26       2.31           4.23
          64            69.47   8.11       8.16       13.58      5.68      2.94       7.24           9.81
          128           86.39   18.01      17.87      26.00      14.05     5.04       16.53          15.56
GIST      32            4.07    1.74       1.68       2.00       0.56      1.15       1.39           0.68
          64            6.40    3.51       3.27       4.12       1.50      2.21       3.25           1.08
          128           11.14   5.96       5.14       6.97       3.12      3.83       5.21           1.45


TABLE II: Comparison with Hashing Methods (mAP). Our approach obtains a significant gain over all of these methods on all the datasets.

Comparison with Searching Tree-based Method.

It is interesting to compare our proposed SPQ approach with FLANN, which is known as the most popular open-source toolbox for ANN search. We select the SIFT dataset as the testbed and evaluate the precision at a given search time. As in PQ [15], we take advantage of an inverted file structure to speed up the SPQ method at the cost of a slight drop in performance. FLANN includes a re-ranking scheme that computes the exact distances for the candidate nearest neighbors. For the sake of comparison with FLANN, we also add a re-ranking stage to our SPQ method. In practice, while obtaining a precision of , SPQ costs seconds or seconds employing SSE in the search stage. FLANN, however, takes seconds to obtain a precision of . Fig. 6 shows the experimental results. The precision of our method is obtained by re-ranking 50 returned candidates, and the extra time cost is less than 0.1 second. It is not difficult to see that our method is faster than FLANN if the precision of the re-ranked results is required to be higher than . This is critical, since we always pursue higher precision within a fixed period of time. More importantly, our method consumes much less memory than the search tree-based FLANN method: our indexing structure occupies less than 100MB, while FLANN requires more than 250MB of RAM. Note that our result could be further improved with a better inverted indexing method, such as the inverted multi-index [2].

Fig. 6: Comparison with FLANN [28]. The precision of our method is obtained by re-ranking 50 returned candidates and the extra time cost is less than 0.1 second. It can be clearly seen that our method is faster than FLANN if the precision of re-ranking results is above .
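The re-ranking stage mentioned above can be sketched in a few lines: a short list of candidates returned by the approximate search is re-sorted by exact distance to the query. The toy database, the candidate list, and the function name `rerank` are assumptions for illustration only.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rerank(query, candidate_ids, database, top):
    """Re-sort candidate ids by exact squared Euclidean distance to the query."""
    return sorted(candidate_ids, key=lambda i: sqdist(query, database[i]))[:top]


database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
query = [1.9, 2.1]
candidates = [1, 3, 2]   # order produced by an approximate search (hypothetical)

print(rerank(query, candidates, database, top=3))
```

Because only the short candidate list is touched, the exact distance computation adds little time compared with scanning the whole database, although it does require access to the original vectors.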


Implementation   Approach     Search time (ms per query)   Recall@1 (%)   Recall@100 (%)
MATLAB/mex       PQ [15]      8.8                          23.0           92.3
                 SPQ          21.9                         51.9           99.8
                 IVFPQ [15]   1.3                          26.6           92.1
                 IVFSPQ       1.4                          51.2           95.7
                 SH [46]      2.2                          9.5            53.0
C/C++            IVFSPQ       0.4                          43.5           94.7
                 FLANN [28]   0.6                          -              84.2


TABLE III: Computational cost and accuracy on the SIFT dataset using 64 bits. For better illustration, we also include a comparison of the MATLAB implementation with the optimized C++/Mex code in the search stage. In terms of efficiency, our proposed SPQ approach is expected to be slightly more computationally expensive than PQ. However, the empirical time costs of PQ and SPQ when using an inverted file structure (IVFPQ vs. IVFSPQ) are fairly comparable. Compared with the other ANN methods, our method outperforms both SH and FLANN in search time and precision. The source code of all the methods is publicly available.


accuracy SPQ FLANN visualindex
77.5 77.1 75.0


TABLE IV: Retrieval results on the Oxford dataset

Comparison on Image Retrieval

Image retrieval [16] is also a popular topic in multimedia applications. It aims at retrieving the items containing the target object from a large image corpus. A typical image retrieval system is based on the bag-of-visual-words (BOW) technique, which matches local features such as SIFT [24], and the BOW encoding strategy relies heavily on ANN search. In this paper, we compare our method with the popular fast ANN approach [28] for image retrieval.

We evaluate on the Oxford 5K dataset. SIFT features are extracted with gravity vector constraints [33], and RootSIFT [1], which uses the square root of each component of a SIFT vector, is also employed. We build a codebook of 1M visual words for BOW encoding. In our method, we assign 8 bits to each subspace (k=256), and the number of subspaces is 8. For the fast ANN approach, we follow the setup in Fast Object Retrieval [49] and visualindex. In the experiment, we use mean Average Precision (mAP) as the performance metric. The results are shown in Table IV. We can see that our method outperforms the other two approaches.
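The RootSIFT mapping mentioned above can be sketched as follows. The text only states that the square root of each component is taken. The preceding L1 normalization follows the way RootSIFT is usually described and is an assumption here, as are the function name and the toy 4-D descriptor.

```python
import math


def root_sift(descriptor):
    """L1-normalize a non-negative descriptor, then take element-wise square roots."""
    total = sum(descriptor)
    if total == 0:
        return [0.0 for _ in descriptor]
    return [math.sqrt(v / total) for v in descriptor]


desc = [4.0, 0.0, 9.0, 3.0]          # toy 4-D stand-in for a 128-D SIFT vector
rooted = root_sift(desc)
print(rooted)
print(sum(v * v for v in rooted))    # squared L2 norm of the result
```

With this normalization the output has unit L2 norm, so Euclidean distances between mapped descriptors stay bounded.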

Fig. 7: Qualitative comparison between PQ and our proposed method. (a) shows the results of PQ and (b) shows the results of SPQ. For each row, the first image is the query image and the remaining images are the ranked results of each method. It is obvious that our method outperforms PQ.

VI Conclusion and Future Work

In this paper, we propose a novel Sparse Product Quantization approach that encodes high-dimensional feature vectors into sparse representations. The Euclidean distance between two vectors can be efficiently estimated from their sparse product quantization using fast table lookups. We optimize the sparse representation of the data vectors by minimizing their quantization errors, so that the resulting representation stays close to the original data in practice. We have conducted extensive experiments to evaluate the proposed Sparse Product Quantization technique for ANN search on four public image datasets. The experimental results show that our method is fast and accurate and outperforms several state-of-the-art approaches by a large margin. Furthermore, the results on image retrieval also demonstrate the efficacy of our proposed method.

Despite these promising results, some limitations remain to be addressed in future work. As with many other soft assignment methods, the performance gain of our approach inevitably comes with extra storage requirements and computational cost. For future work, we will study how to compress the coding coefficients. We will also extend our technique to other tasks, such as object retrieval.


This work was supported in part by the National Natural Science Foundation of China under Grants 61103105 and 91120302.


  • [1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2911–2918. IEEE, 2012.
  • [2] A. Babenko and V. Lempitsky. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3069–3076, June 2012.
  • [3] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 1975.
  • [4] J. Brandt. Transform coding for fast approximate nearest neighbor search in high dimensions. In CVPR, 2010.
  • [5] J. Cai, Q. Liu, F. Chen, D. Joshi, and Q. Tian. Scalable image search with multiple index tables. In Proceedings of International Conference on Multimedia Retrieval, page 407. ACM, 2014.
  • [6] D. Chen, G. Baatz, K. Koser, S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk. City-scale landmark identification on mobile devices. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 737–744, June 2011.
  • [7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
  • [8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 2004.
  • [9] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, 2013.
  • [10] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 2014.
  • [11] T. Ge, K. He, and J. Sun. Product sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 2014.
  • [12] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
  • [13] R. M. Gray. Vector quantization. ASSP Magazine, IEEE, 1984.
  • [14] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
  • [15] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 2011.
  • [16] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
  • [17] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Contemporary mathematics. 1984.
  • [18] A. Joly and O. Buisson. Random maximum margin hashing. In CVPR, 2011.
  • [19] M. Kafai, K. Eshghi, and B. Bhanu. Discrete cosine transform locality-sensitive hashes for face retrieval. Multimedia, IEEE Transactions on, 16(4):1090–1103, 2014.
  • [20] W. Kong and W.-J. Li. Isotropic hashing. In Advances in Neural Information Processing Systems, pages 1646–1654, 2012.
  • [21] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In CVPR, 2005.
  • [22] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu. Spectral hashing with semantically consistent graph for image indexing. Multimedia, IEEE Transactions on, 15(1):141–152, 2013.
  • [23] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE, 2012.
  • [24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [25] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.
  • [26] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 2010.
  • [27] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP, 2009.
  • [28] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36, 2014.
  • [29] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
  • [30] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 353–360, 2011.
  • [31] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, 2013.
  • [32] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
  • [33] M. Perďoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 9–16. IEEE, 2009.
  • [34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
  • [35] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
  • [36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.
  • [37] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 2008.
  • [38] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, 2006.
  • [39] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [40] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008.
  • [41] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • [42] E. Spyromitros-Xioufis, S. Papadopoulos, I. Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. Multimedia, IEEE Transactions on, 16(6):1713–1728, 2014.
  • [43] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2008.
  • [44] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In ACM Multimedia, 2013.
  • [45] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.
  • [46] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in neural information processing systems, pages 1753–1760, 2009.
  • [47] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang. Sparse multi-modal hashing. Multimedia, IEEE Transactions on, 16(2):427–439, 2014.
  • [48] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui. USB: ultrashort binary descriptor for fast visual matching and retrieval. Image Processing, IEEE Transactions on, 23(8):3671–3683, 2014.
  • [49] Z. Zhong, J. Zhu, and S. Hoi. Fast object retrieval using direct spatial matching. Multimedia, IEEE Transactions on, 17(8):1391–1397, Aug 2015.