I. Introduction
Image retrieval is an important technique for many multimedia applications, such as face retrieval [19], object retrieval [7], and landmark identification [6]. For large-scale image retrieval tasks, one of the key components is an effective indexing method for similarity search [47, 5], particularly in high-dimensional feature spaces [24, 48, 36, 42]. Similarity search, a.k.a. nearest neighbor (NN) search, is a fundamental problem. Due to the curse of dimensionality, exact NN search for high-dimensional data is extremely challenging and expensive. To overcome this issue, extensive research efforts have been devoted to approximate nearest neighbor (ANN) search methods, such as hashing [46, 22, 8], tree-based methods [39, 27, 28], and vector quantization [13, 15], which attempt to find the nearest neighbor with high probability using much less search time and memory.
In this paper, we focus on developing a vector quantization (VQ) method for similarity search, a typical approach to effectively encoding data for ANN search. A codebook is learned, and every feature vector in the database is represented by its most similar vector in the codebook, typically called a "codeword". VQ then directly employs the index of the codeword, which typically takes only a few bits, to represent the original data vector. In addition, the similarity between a query and a database vector can be approximated by the distance between the query and the codeword. This greatly reduces the computational cost and search time.
In general, VQ requires more bits in order to reduce the quantization distortion. Since the size of the codebook increases exponentially with the total number of encoded bits, VQ-based methods are ineffective for high-dimensional data. To tackle this issue, Product Quantization (PQ) [15] has recently been shown to be a promising paradigm for efficiently indexing high-dimensional image features. Unlike hashing-based methods, it decomposes the high-dimensional space into a Cartesian product of low-dimensional subspaces and quantizes each of them separately. Since the dimensionality of each subspace is relatively small, a small codebook is sufficient to obtain satisfactory search performance.
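As a concrete illustration of the PQ encoding just described, the following sketch (Python/NumPy; the function name `pq_encode` and the toy sizes M = 3, K = 4 are illustrative assumptions, not from the paper) splits a vector into subvectors and assigns each to its nearest sub-codeword:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Assign each subvector of x to its nearest codeword; return M indices."""
    M = len(codebooks)
    subvectors = np.split(x, M)
    return [int(np.argmin(((cb - s) ** 2).sum(axis=1)))
            for cb, s in zip(codebooks, subvectors)]

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 2)) for _ in range(3)]  # M=3 subspaces, K=4 codewords each
x = rng.normal(size=6)                                   # a 6-dimensional vector
codes = pq_encode(x, codebooks)                          # 3 small indices: M*log2(K) = 6 bits
```

Note how the code stores only M small indices per vector, which is the source of PQ's compactness.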
Although the computational cost can be effectively reduced by dividing the long vector into small segments, PQ may fail to retrieve the exact nearest neighbor of a query with high probability due to the high quantization distortion. As discussed in [10], this eventually yields lower search accuracy compared to VQ. To deal with this problem, several remedies have recently been proposed. Gong and Lazebnik [12]
presented an iterative quantization approach which maps data onto binary codes for fast retrieval. Cartesian Kmeans
[31] and Optimized Product Quantization [9] share the same idea of rotating the original data to minimize the quantization error. These methods, including PQ, essentially follow the same vector quantization framework and all suffer from inevitable, non-trivial quantization distortion.

To address the above limitations, in this paper we propose a novel approach called Sparse Product Quantization (SPQ) for encoding high-dimensional image feature vectors, which introduces sparse coding into approximate nearest neighbor search. Motivated by soft assignment [35], we find a sparse representation for each segment of the feature vector rather than using the hard assignment of PQ. Specifically, a feature vector is decomposed into a Cartesian product of low-dimensional subspaces, where the short vector in each subspace is approximated by a linear combination of several vectors from the codebook. Fig. 1
illustrates the overview of SPQ for ANN search. We formulate the encoding stage as a sparse optimization problem and solve it with a popular greedy algorithm. The Euclidean distance between two vectors can be efficiently estimated from their sparse product quantizations through simple table lookups. Moreover, the proposed method can take advantage of a very efficient SSE implementation using SIMD instructions, which greatly reduces the computational overhead. Thus, the computational time of our method is comparable to PQ's, while the precision of SPQ outperforms that of PQ by a large margin. In contrast to the computationally intensive clustering algorithms used in existing VQ-based paradigms, we employ the sparse structure along with a fast stochastic online algorithm
[25, 26] to efficiently generate the codebook, which optimizes the sparse representation of data vectors according to their quantization errors. Consequently, the proposed representation stays close to the original data in practice even with only a few basis vectors. The empirical evaluation demonstrates that the presented method yields state-of-the-art ANN search results and outperforms popular approaches on image retrieval applications.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the basics of VQ and proposes our sparse vector quantization. Section 4 presents the proposed sparse product quantization for ANN search. Section 5 discusses our experimental results in detail, and Section 6 concludes this work.
II. Related Work
Fast NN search is a fundamental research topic that has been extensively studied in areas such as multimedia applications, image classification, and machine learning. Our work is related to approximate NN search methods, which can be roughly grouped into three categories: hashing-based methods [46, 4, 44], KD-trees [3], and vector quantization [15, 9].

Hashing-based ANN search has received much attention. Most such methods employ either random projections or learning-based methods to generate compact binary codes. As a consequence, the similarity between two data vectors is approximately represented by the Hamming distance of their hashed codes. Random projection is an effective approach that preserves pairwise distances between data points. The most representative example is Locality Sensitive Hashing (LSH) [8, 38]. According to the Johnson-Lindenstrauss theorem [17], LSH needs $O(\ln N / \epsilon^2)$ random projections to preserve the pairwise distances, where $\epsilon$ is the relative error. Hence, LSH must employ codes with long bit lengths to achieve good performance, which leads to both high computational cost and a huge storage requirement. On the other hand, learning-based hashing methods [46, 4, 44]
try to learn the structure of the input data. Most of these algorithms generate binary codes by exploiting the spectral properties of the data affinity matrix, i.e., item-item similarity. Some other hashing methods also employ multimodal data [47] or semantic information [22]. Despite achieving promising gains with relatively short codes, these methods often fail to improve significantly as the code length increases [18].

The second group of research aims at speeding up ANN search with KD-trees [3]. The expected complexity of KD-tree search is $O(\log N)$, while brute-force search is $O(N)$. Unfortunately, for high-dimensional data KD-trees are not much more efficient than brute-force exhaustive search [45] due to the curse of dimensionality. Nevertheless, both randomized KD-trees [40, 21] and hierarchical k-means [29] improve the performance of the KD-tree. In particular, these two methods are included in FLANN [27, 28], which automatically selects the best algorithm and optimal parameters for a given dataset. FLANN is much faster than other publicly available ANN search software. However, KD-tree approaches need full access to the data and thus cost much more memory in the search stage.
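To make the KD-tree idea concrete, below is a minimal, illustrative sketch in pure Python (not FLANN's implementation; all names are hypothetical). The tree splits points on alternating coordinates; exact NN search descends toward the query and prunes the far subtree only when the splitting plane could still hide a closer point:

```python
def build(points, depth=0):
    """Build a KD-tree node: (point, left, right, split_axis)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1),
            axis)

def nearest(node, q, best=None):
    """Return (point, squared_distance) of the exact nearest neighbor of q."""
    if node is None:
        return best
    point, left, right, axis = node
    d = sum((a - b) ** 2 for a, b in zip(point, q))
    if best is None or d < best[1]:
        best = (point, d)
    near, far = (left, right) if q[axis] < point[axis] else (right, left)
    best = nearest(near, q, best)
    # Descend the far side only if the splitting plane is closer than the best so far.
    if (q[axis] - point[axis]) ** 2 < best[1]:
        best = nearest(far, q, best)
    return best

pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build(pts)
nn, dist2 = nearest(tree, (9.0, 2.0))
```

The pruning test is what degrades in high dimensions: the plane is almost never farther than the best candidate, so most of the tree is visited anyway.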
The third group of related work comprises vector quantization based approaches, which approximate data vectors with codewords from a codebook. Jégou et al. [15] recently proposed the efficient product quantization (PQ). The key of PQ is to decompose the feature space into a Cartesian product of low-dimensional subspaces and quantize each one separately with its own codebook. Then, the distance between the query and a vector in the gallery set can be computed by either symmetric distance computation (SDC) or asymmetric distance computation (ADC). An inverted file system is also employed to conduct non-exhaustive search efficiently. Empirically, PQ has been shown to significantly outperform various hashing-based methods in terms of accuracy. As discussed in [15], prior knowledge of the underlying structure of the input data is essential to VQ. Most recently, Ge et al. [9] formulated PQ as an optimization problem that minimizes the quantization distortion by searching for the optimal codebooks and space decomposition. Due to the inherent nature of VQ [13], it is hard for these methods to evaluate the impact of quantization error on ANN search performance. We should mention the recently published Product Sparse Coding [11]. Although both works involve product decomposition and sparse coding, it substantially differs from ours: it brings a product strategy to sparse coding, whereas we bring sparse coding to product quantization.
Finally, our work is closely related to soft-assignment [35], which has been introduced into object retrieval [34] to reduce quantization error. The key idea of soft-assignment is to map the original high-dimensional descriptor to a weighted combination of multiple visual words rather than hard-assigning it to a single word as in previous work [41, 34]. Still, this representation is simply incorporated into a standard tf-idf architecture. Despite requiring extra storage and computational cost, soft-assignment consistently yields lower quantization distortion and thus a significant improvement in retrieval performance in practice.
III. Sparse Vector Quantization
In this section, we first briefly review the basics of Vector Quantization, then introduce the proposed Sparse Vector Quantization (SVQ), and finally discuss the codebook training method for SVQ.
III-A Vector Quantization
Vector Quantization (VQ) [13] is a classical technique for data compression. It divides a dataset into groups, where each vector is represented by the centroid of its corresponding group. More formally, given a vector $x \in \mathbb{R}^D$, VQ maps $x$ to the nearest codeword in a pre-trained codebook $\mathcal{C} = \{c_1, \dots, c_K\}$ as follows:
$q(x) = \arg\min_{c \in \mathcal{C}} d(x, c)$   (1)
where $d(\cdot, \cdot)$ is a distance metric; in this paper we use the Euclidean distance $d(x, c) = \|x - c\|_2$. The encoding map $q$ is called the quantizer, which is the most important component of VQ. The quantization distortion, or reconstruction error, of $x$ is defined as:
$E(x) = \|x - q(x)\|_2^2$   (2)
Given the codebook $\mathcal{C}$, the quantization of $x$ is computed by solving the minimization problem in Eqn. (1). The distortion can then be simply expressed as the Euclidean distance between the vector and its corresponding codeword in $\mathcal{C}$.
In general, there are two kinds of ANN search methods according to different forms of queries. One is called Symmetric Distance Computation (SDC), in which both query and database vectors are quantized into codes. The other is called Asymmetric Distance Computation (ADC), where only the database vectors are quantized.
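The two distance computations can be sketched as follows (illustrative Python/NumPy for plain VQ; the names `quantize`, `sdc`, `adc` and the toy data are assumptions, not from the paper). SDC quantizes both sides, while ADC quantizes only the database vector, so ADC's estimate is usually tighter:

```python
import numpy as np

def quantize(v, C):
    """Map v to its nearest codeword; C is a (K, D) codebook matrix."""
    return C[np.argmin(((C - v) ** 2).sum(axis=1))]

def sdc(q, x, C):
    """Symmetric: both the query q and the database vector x are quantized."""
    return float(np.linalg.norm(quantize(q, C) - quantize(x, C)))

def adc(q, x, C):
    """Asymmetric: only the database vector x is quantized."""
    return float(np.linalg.norm(q - quantize(x, C)))

C = np.array([[0.0, 0.0], [1.0, 1.0]])                  # K=2 codewords in 2-D
q_vec, x_vec = np.array([0.1, 0.1]), np.array([0.9, 0.9])
d_sdc, d_adc = sdc(q_vec, x_vec, C), adc(q_vec, x_vec, C)
```

In this toy setup the ADC estimate is closer to the true distance because the query keeps its exact coordinates.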
III-B Sparse Vector Quantization
One key limitation of VQ is that it assigns the original vector to the single nearest codeword in the codebook. This hard assignment strategy can lead to relatively large quantization distortion which limits the performance of VQ.
Motivated by the success of soft assignment [35], instead of using the hard assignment as in VQ, we employ the sparse representation of multiple codewords to represent the original feature vector.
Fig. 2 shows a 2D toy example to illustrate the key idea of our proposed method. Let $y$ and $x$ denote a query and a vector in the gallery set, respectively, and let $\hat{y}$ and $\hat{x}$ represent their quantization vectors. VQ simply sets $\hat{x}$ to a codeword, say $c_1$, by hard assignment, and similarly sets $\hat{y}$ to a codeword. Thus, the quantization distortion for $x$ is $\|x - c_1\|$. In this work, we instead employ a linear combination of two codewords $c_1$ and $c_2$ to represent $x$. Therefore, $\hat{x}$ is the projection of $x$ onto the line spanned by $c_1$ and $c_2$. It is clear that the quantization distortion of VQ is never smaller than that of the sparse quantization, since $\|x - \hat{x}\| \le \|x - c_1\|$.
As $\hat{x}$ lies on the line through $c_1$ and $c_2$, we may write $\hat{x} = \alpha c_1 + \beta c_2$. Note that the coefficients $\alpha$ and $\beta$ can be easily computed by solving a linear system. As illustrated in Fig. 2, we can compute the distance as follows:
$d(y, \hat{x})^2 = \|y\|^2 - 2\langle y, \hat{x} \rangle + \|\hat{x}\|^2 = \|y\|^2 - 2\alpha \langle y, c_1 \rangle - 2\beta \langle y, c_2 \rangle + \|\hat{x}\|^2$   (3)
where $\langle \cdot, \cdot \rangle$ denotes the dot product. The above equation calculates the ADC distance; the SDC distance can be calculated with a similar approximation.
Before introducing SVQ, we first give an equivalent formulation of VQ. We stack the codebook into a matrix $C \in \mathbb{R}^{D \times K}$, each column of which is a codeword, where $K$ denotes the size of the codebook $\mathcal{C}$. We can then rewrite Eqn. (1) as the following optimization problem:
$\min_{b} \|x - Cb\|_2^2 \quad \text{s.t.} \; b \in \{0, 1\}^K, \; \|b\|_0 = 1$   (4)
Here $b$ is a $K$-dimensional column vector in which each element is either zero or one. Obviously, the optimization in Eqn. (4) is equivalent to hard assignment: the strict constraints on $b$ force it to choose the single nearest codeword from $C$ for the input vector $x$.
From the above discussion, it can be observed that ANN search accuracy is directly related to the attainable objective value of Eqn. (4) rather than the solution itself. To this end, we relax the constraints in Eqn. (4) so as to obtain a lower bound, which implicitly yields better ANN search performance. Specifically, we relax the constraints in Eqn. (4) as follows:
$\min_{b} \|x - Cb\|_2^2 \quad \text{s.t.} \; \|b\|_0 \le s$   (5)
where $s$, named the sparse level, denotes the maximum number of codewords selected to encode $x$. Such relaxation not only lifts the binary restriction on the coefficients but also expands the feasible space of $b$. Obviously, Eqn. (4) can be viewed as a special case of Eqn. (5) with $s = 1$ and the nonzero coefficient fixed to one. Therefore, we obtain a lower bound on the quantization distortion. Intuitively, the above formulation employs a linear combination of codewords rather than the single codeword of VQ to approximate the original input vector. Our empirical study shows that using just two codewords is sufficient to yield significant gains over hard assignment.
Eqn. (5) is well known to be an NP-hard problem. To tackle this issue, we take advantage of an effective greedy algorithm called Orthogonal Matching Pursuit (OMP) [25, 26]. At each step, OMP updates all the extracted coefficients by computing the orthogonal projection of the vector onto the set of codewords selected so far. As $s$ is usually set to two, there are at most two nonzero elements in the coefficient vector $b$. As the sparsity of the representation is essential for fast NN search, we name our method Sparse Vector Quantization (SVQ).
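A minimal sketch of the greedy OMP loop for Eqn. (5) follows (illustrative Python/NumPy under stated assumptions: the name `omp`, the column layout of $C$, and the toy identity codebook are ours; [25, 26] provide more refined implementations). Each iteration picks the codeword most correlated with the residual, then re-fits all selected coefficients by least squares:

```python
import numpy as np

def omp(x, C, s=2):
    """Greedily select up to s codewords; re-fit coefficients by least squares.
    C is (D, K) with columns as codewords; returns a K-dim sparse coefficient vector."""
    residual, support = x.copy(), []
    b = np.zeros(C.shape[1])
    for _ in range(s):
        # Pick the codeword most correlated with the current residual.
        support.append(int(np.argmax(np.abs(C.T @ residual))))
        # Orthogonal projection: refit ALL selected coefficients jointly.
        coef, *_ = np.linalg.lstsq(C[:, support], x, rcond=None)
        residual = x - C[:, support] @ coef
    b[support] = coef
    return b

C = np.eye(3)                     # toy codebook: columns are codewords
x = np.array([1.0, 2.0, 0.0])
b = omp(x, C, s=2)                # here two nonzeros reconstruct x exactly
```

The joint refit is what distinguishes OMP from plain matching pursuit: previously chosen coefficients are corrected as new codewords enter the support.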
III-C Codebook Training
Recall that the preceding analysis assumed the codebook of each approach was given. In this part, we show how to obtain the codebook.
The first, common and straightforward, method is to find the codebook by directly minimizing the quantization error on a training set $X$. In the case of VQ, the codebook is obtained by solving Eqn. (1) (equivalently, Eqn. (4)) on $X$, which amounts to running the iterative k-means clustering algorithm and treating the centroids of the resulting clusters as the codebook.
For SVQ, minimizing the quantization error amounts to the following problem:
$\min_{C, \{b_i\}} \sum_{i=1}^{N} \|x_i - C b_i\|_2^2 \quad \text{s.t.} \; \|b_i\|_0 \le s, \; \forall i$   (6)
This problem is NP-hard. We can solve it by alternating between the coefficients $B = [b_1, \dots, b_N]$ and the codebook $C$. When $C$ is fixed, we have shown how to solve for the coefficients in the previous section; note that what we care about here is the codebook $C$. We can further relax the constraints by using an $\ell_1$-norm constraint, which also yields sparse solutions; in Section V-B we will see that both constraints are applicable to our method. When $B$ is fixed, the problem becomes an unconstrained least-squares problem. In our implementation, we employ the stochastic online optimization algorithm [25, 26] to solve the above problem for learning the codebook, and the learned codebook fits the sparse coding task well. Since the algorithm is based on stochastic optimization, it is even faster than the conventional k-means clustering method.
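The alternating scheme can be sketched as below (illustrative Python/NumPy; this is a simplified batch version for clarity, not the online solver of [25, 26], and the name `train_codebook` is an assumption). It alternates a greedy sparse-coding step with an unconstrained least-squares codebook update:

```python
import numpy as np

def train_codebook(X, K, s=2, iters=10, seed=0):
    """Alternating sketch of Eqn. (6). X is (N, D); returns C of shape (D, K)."""
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(X.shape[1], K))
    C /= np.linalg.norm(C, axis=0)
    for _ in range(iters):
        # Sparse-coding step: greedy selection of <= s codewords per sample.
        B = np.zeros((K, X.shape[0]))
        for i, x in enumerate(X):
            r, sup = x.copy(), []
            for _ in range(s):
                sup.append(int(np.argmax(np.abs(C.T @ r))))
                coef, *_ = np.linalg.lstsq(C[:, sup], x, rcond=None)
                r = x - C[:, sup] @ coef
            B[sup, i] = coef
        # Codebook update: unconstrained least squares for C given B.
        C = np.linalg.lstsq(B.T, X, rcond=None)[0].T
    return C

X = np.random.default_rng(1).normal(size=(20, 4))   # toy training set
C = train_codebook(X, 6, iters=3)
```

The least-squares update is the closed-form solution of the $B$-fixed subproblem; the online algorithm replaces the batch loop with stochastic updates over mini-batches.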
In general, VQ-based methods rely heavily on a good codebook, which is important for reducing the quantization distortion. Due to its intrinsic limitations, k-means often struggles to generate a good one. In the experiments, we will show that, in contrast to other VQ-based methods such as PQ and OPQ, our proposed SPQ method is not tied to any specific codebook learning method.
IV. Sparse Product Quantization
To facilitate the practical ANN search, we propose an efficient Sparse Product Quantization approach by extending the product quantization with the proposed SVQ technique in order to further reduce the computational overhead.
IV-A Product Quantization
Following the idea of Product Quantization (PQ) [15], we decompose the high-dimensional space into a Cartesian product of low-dimensional subspaces and then perform sparse vector quantization in each subspace separately. Specifically, a vector $x \in \mathbb{R}^D$ is viewed as the concatenation of $M$ subvectors, $x = (x^1, \dots, x^M)$, and the codebook is defined as the Cartesian product $\mathcal{C} = \mathcal{C}^1 \times \cdots \times \mathcal{C}^M$.
For PQ, each subvector $x^m$ is mapped onto a sub-codeword from its corresponding codebook $\mathcal{C}^m$:
$q(x) = \big(q^1(x^1), \dots, q^M(x^M)\big)$   (7)
where $q^m$ is the quantizer for the $m$-th subvector of $x$. In practice, $x$ is equally partitioned so that all subvectors have the same length, with $D$ a multiple of $M$. Note that each subvector is encoded with a different codebook. In this case, any codeword of the codebook $\mathcal{C}$ is the concatenation of $M$ sub-codewords, $c = (c^1, \dots, c^M)$, with each $c^m \in \mathcal{C}^m$.
Let $q(x)$ denote the PQ of $x$. Then the quantization distortion of $x$ under PQ is defined as follows:
$E(x) = \sum_{m=1}^{M} \|x^m - q^m(x^m)\|_2^2$   (8)
Usually, we need to quantize a set of vectors rather than a single one. Hence, the quantization distortion of a set $X = \{x_1, \dots, x_N\}$ is defined as $E(X) = \frac{1}{N} \sum_{i=1}^{N} E(x_i)$.
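The set-level PQ distortion just defined can be computed as follows (illustrative Python/NumPy; the name `pq_distortion` and the toy single-codeword codebooks are assumptions):

```python
import numpy as np

def pq_distortion(X, codebooks):
    """Mean over vectors of the summed per-subspace squared errors (Eqn. (8) style).
    X is (N, D); codebooks[m] is (K, D/M) with rows as codewords."""
    M = len(codebooks)
    err = 0.0
    for x in X:
        for s, cb in zip(np.split(x, M), codebooks):
            err += ((cb - s) ** 2).sum(axis=1).min()   # nearest-codeword error
    return err / len(X)

# Toy check: a vector that is exactly a concatenation of codewords has zero distortion.
codebooks = [np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])]
X = np.array([[1.0, 2.0, 3.0, 4.0]])
d = pq_distortion(X, codebooks)
```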
As in Eqn. (7), it can be easily observed that PQ divides Eqn. (1) into $M$ sub-VQ problems and solves them separately. The PQ method therefore enjoys the merit of providing a compact coding scheme for high-dimensional data while yielding accurate results for fast approximate nearest neighbor search. However, the unavoidable quantization error limits its search accuracy due to the inherent nature of vector quantization [13].
Intuitively, better reconstruction, i.e., lower quantization distortion, indicates better search accuracy. In the next section, we introduce an approach that effectively reduces the quantization distortion.
IV-B Sparse Product Quantization
In the proposed sparse vector quantization, we represent each item $x$ in the database as follows:
$x \approx C b$   (9)
where $b$ is a sparse vector with only a few nonzero elements.
Motivated by product quantization, we employ the proposed SVQ scheme with a slight modification, replacing $x$ with its subvectors $x^m$. Therefore, we can approximate $x$ through the following equation:
$x \approx \big(C^1 b^1, \dots, C^M b^M\big), \quad \|b^m\|_0 \le s$   (10)
We can prove that the quantization distortion of SPQ is upper bounded by that of PQ. Recall that SVQ is a relaxed version of VQ, so the quantization distortion bound of SVQ is lower than that of VQ. In the case of PQ and SPQ, their distortions are the sums of the distortions of the subvectors, and for each subvector the situation is exactly the VQ versus SVQ case. Thus, the quantization distortion of SPQ is less than or equal to PQ's.
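Putting the pieces together, SPQ encoding per Eqn. (10) can be sketched as below (illustrative Python/NumPy; `spq_encode` and the toy identity codebooks are assumptions, and the inner loop is the simplified greedy step rather than a production OMP):

```python
import numpy as np

def spq_encode(x, codebooks, s=2):
    """Approximate each subvector by a sparse (<= s codewords) combination.
    codebooks[m] is (D_m, K) with columns as codewords; returns one
    K-dim coefficient vector per subspace."""
    coeffs = []
    for sub, C in zip(np.split(x, len(codebooks)), codebooks):
        r, sup = sub.copy(), []
        b = np.zeros(C.shape[1])
        for _ in range(s):
            sup.append(int(np.argmax(np.abs(C.T @ r))))
            coef, *_ = np.linalg.lstsq(C[:, sup], sub, rcond=None)
            r = sub - C[:, sup] @ coef
        b[sup] = coef
        coeffs.append(b)
    return coeffs

codebooks = [np.eye(2), np.eye(2)]          # toy: M=2 subspaces, identity codebooks
x = np.array([1.0, 2.0, 3.0, 4.0])
coeffs = spq_encode(x, codebooks)           # exact reconstruction in this toy case
```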
IV-C Approximate Nearest Neighbor Search
In the following, we discuss how to apply the proposed SPQ method to ANN search for large-scale image retrieval tasks. The ADC version of our proposed SPQ framework is summarized in Algorithm 1.
In particular, to facilitate ANN search, we encode all the data vectors in the gallery using the proposed SPQ method. Then, we compute the distance between a query and the data in the gallery using two kinds of distance measures: ADC and SDC.
According to the definition, ADC can be formulated as:
$d_{\text{ADC}}(y, x)^2 = \|y\|^2 + \|\hat{x}\|^2 - 2\sum_{m=1}^{M} \langle y^m, C^m b^m \rangle$   (11)
To reduce the computational cost of the ADC distance, we can either normalize the database vectors or precompute $\|\hat{x}\|^2$. Since each $b^m$ is a sparse vector, only a few floating point operations are required to compute $\langle y^m, C^m b^m \rangle$.
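The table-lookup trick can be sketched as follows (illustrative Python/NumPy; `adc_sparse` and the toy data are assumptions). The query-to-codeword dot products are computed once per subspace and shared by all database items; each item then costs only a few lookups:

```python
import numpy as np

def adc_sparse(q, coeffs, codebooks, x_norm2):
    """ADC distance in Eqn. (11) style. codebooks[m] is (D_m, K) with columns
    as codewords; coeffs[m] is the sparse coefficient vector b^m; x_norm2 is
    the precomputed squared norm of the reconstructed database vector."""
    M = len(codebooks)
    cross = 0.0
    for qs, b, C in zip(np.split(q, M), coeffs, codebooks):
        table = C.T @ qs                   # K dot products, shared by all items
        nz = np.nonzero(b)[0]
        cross += float(table[nz] @ b[nz])  # only the few nonzero coefficients matter
    return float(q @ q) + x_norm2 - 2.0 * cross

q = np.array([1.0, 0.0, 0.0, 1.0])
codebooks = [np.eye(2), np.eye(2)]
coeffs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]   # reconstructs [0, 1, 1, 0]
d2 = adc_sparse(q, coeffs, codebooks, x_norm2=2.0)
```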
In the case of SDC computation, we employ sparse product quantization to approximate the query vector as $\hat{y} = (C^1 b_y^1, \dots, C^M b_y^M)$. Similarly, SDC is computed as $d_{\text{SDC}}(y, x)^2 = \|\hat{y} - \hat{x}\|^2$.
For better illustration, Fig. 2 shows a 2D example of distance computation for both ADC and SDC.
IV-D Complexity Analysis
In the following, we give the detailed analysis on the complexity of our proposed SPQ scheme.
Let $D$ denote the dimensionality of each feature vector, $N$ the total number of items in the database, $M$ the number of subvectors, and $K$ the size of each sub-codebook. For a given query, it takes $O(DK + sMN)$ floating point multiplications to search for its approximate nearest neighbor in the database, where $s$ is the sparse level. Specifically, $O(sMN)$ multiplications are required for the squared norms $\|\hat{x}\|^2$ of the database vectors; this is needed only if the database is not normalized. Also, it takes $O(DK)$ operations to compute the dot products between the query subvectors and the codebook matrices $C^m$. Finally, we need $O(sMN)$ multiplications to compute the dot products with the database vectors in sparse representation, which is the third term in Eqn. (11).
If all the database vectors have been normalized to unit norm offline, then $\|\hat{x}\|^2$ is constant and can be dropped. Therefore, the overall online time complexity of computing the ADC distance reduces to $O(DK + sMN)$. In multimedia information retrieval tasks, the dimensionality of each feature vector is far less than the total number of entries in the database, $D \ll N$, so the computational complexity of Eqn. (11) can be approximated by $O(sMN)$. On the other hand, the complexity of brute-force NN search is $O(DN)$; since $sM \ll D$, we obtain a substantial speedup with the proposed SPQ scheme. Moreover, our method can take advantage of efficient SSE instructions to further reduce the multiplication time. Specifically, the search time of our proposed SPQ is comparable to that of the original PQ, while the precision of SPQ outperforms that of PQ by a large margin. Additionally, our empirical study shows that SPQ is even faster than FLANN [28] at the same recall rate.
Due to the inherent nature of soft-assignment, SPQ inevitably consumes more memory than hard-assignment methods. However, this cost is worthwhile given the significant precision gain SPQ brings. According to previous studies, FLANN is one of the most popular ANN search techniques that utilize tree structures. However, it fails to work for very large-scale datasets since it must load the whole dataset into memory when building the trees. By contrast, SPQ does not need to load the whole dataset into memory thanks to efficient inverted file structures, making it potentially more practical than FLANN for large-scale multimedia retrieval.
V. Experiments
In this section, we first introduce our experimental testbed and the background of several state-of-the-art ANN methods we compare with. Then we discuss the settings of our proposed method and present our results in comparison with these methods. Finally, we show an application of our method to image retrieval.
V-A Experimental Testbed
To examine the empirical efficacy of the proposed method, we conduct an extensive set of experiments for comprehensive performance evaluation on five datasets, including a synthetic dataset with Gaussian noise and four publicly available image feature collections. Each dataset is partitioned into three parts: training set, gallery set, and query set. The details of these testbeds are summarized as follows: 1) SIFT consists of one million local SIFT features [24] with 128 dimensions, in which 100K samples are employed to learn the codebook. All one million samples are treated as the gallery set, and 10K samples are used for evaluation. Note that there is no overlap between the training set and the gallery set, since the former is extracted from Flickr images and the latter from the INRIA Holidays images [14]; 2) GIST [32] is made of 960-dimensional global features. 50K samples are used to learn the codebook; one million samples are used as the gallery set, and 1K samples for query evaluation. They are extracted from the tiny image set [43], the Holidays image set, and the Holidays with Flickr1M set, respectively; 3) We perform an empirical study on MNIST (http://yann.lecun.com/exdb/mnist/) as used in OPQ [10], a 784-dimensional image set of handwritten digits with 70K images in total. In our experiment, we randomly sample 1K images as queries, and the remaining data are treated as the gallery set. To learn the codebook, we randomly pick 7K images from the gallery set; 4) LabelMe [37] contains 22,019 images, where each item is represented by a 512-dimensional GIST descriptor. Following [44], we randomly sample 2K images to form the query set and use the remaining data as the gallery set; 5) We also synthesize a set of 128-dimensional vectors from independent Gaussian distributions. We choose 10K samples to learn the codebook, 1M samples as the gallery set, and 1K samples as queries. All compared methods are evaluated on the same data for each setting. For clarity, Table
I summarizes the statistics of the datasets used in our experiments. We compare our proposed Sparse Product Quantization (SPQ) approach with the following state-of-the-art methods.

Product Quantization (PQ [15]) builds codebooks on a Cartesian product space and is treated as the baseline. IVFPQ refers to PQ with an inverted file structure. All PQ results in the experiments are reproduced with the original implementation (http://people.rennes.inria.fr/Herve.Jegou/projects/ann.html).

Optimized Product Quantization (OPQ [9]) aims at finding an optimal space decomposition for PQ and introduces two different solutions. Due to its superior performance, we only compare with the non-parametric solution, using the parametric one as a warm start. Similarly, we adopt the authors' implementation (http://research.microsoft.com/enus/um/people/kahe/cvpr13/index.html) with default settings.

Cartesian K-means (CKmeans [31]) is yet another method to find the optimal space decomposition for PQ. It is equivalent to OPQ when using the same initialization. The results of CKmeans are produced with the publicly available implementation (https://github.com/norouzi/ckmeans/tree/) with the default setup.

Iterative Quantization (ITQ [12]) is an effective binary embedding technique that can also be viewed as a vector quantization method.

Order Preserving Hashing (OPH [44]) is a stateoftheart hashing method that learns similaritypreserving hashing functions.

FLANN [28] is the most popular open-source ANN search toolbox based on tree-based search. It automatically selects the best algorithm and parameters for a given dataset.
The above methods can be roughly categorized into three groups: (i) VQ-based methods, including PQ, OPQ, and CKmeans; (ii) hashing-based methods, including ITQ and OPH; and (iii) FLANN, a tree-based search method. In the following, we make the comparisons for each group separately.


TABLE I: Statistics of the datasets used in our experiments.

Dataset        SIFT    GIST    Random   MNIST   LabelMe
Dimension      128     960     128      784     512
Training set   100K    50K     10K      10K     10K
Gallery set    1M      1M      1M       60K     20,019
Query set      10K     1K      10K      1K      2K
In our empirical study, distortion is employed to measure the reconstruction performance of vector quantization. To evaluate the efficacy of ANN search methods, we employ the conventional performance metrics for multimedia information retrieval: precision, recall, and mAP. Precision is the average proportion of true NNs ranked first among the returned candidates, and recall is the proportion of true NNs retrieved within the returned list. Moreover, mAP is the mean of average precision over all queries, which indicates overall performance. All of our experiments were carried out on a PC with an Intel Core i7-3770 3.4GHz processor and 16GB of RAM, using a single thread.
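The metrics just defined can be sketched as follows (illustrative Python; the function names and the toy ranked lists are assumptions, not from the paper's evaluation code):

```python
def recall_at_R(ranked_ids, true_nn, R):
    """Fraction of queries whose true NN appears in the top R returned candidates."""
    hits = sum(1 for ids, t in zip(ranked_ids, true_nn) if t in ids[:R])
    return hits / len(true_nn)

def average_precision(ranked_ids, relevant):
    """Average of precision values at each rank where a relevant item is retrieved."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranked = [[3, 1, 2], [0, 2, 1]]                 # toy ranked results for two queries
r2 = recall_at_R(ranked, [1, 5], R=2)           # query 1 hits, query 2 misses
ap = average_precision([1, 3, 2], {1, 2})       # hits at ranks 1 and 3
```

mAP is then simply the mean of `average_precision` over all queries.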
V-B Settings
We discuss the experimental settings for the proposed SPQ approach in the following.
Sparse Level. The sparse level $s$ denotes the number of codewords used to encode a feature vector with our method and is critical to its performance. Fig. 3(a) shows mAP with respect to the squared quantization distortion under different sparse levels on the SIFT dataset. Clearly, the squared distortion decreases consistently as the sparse level increases, and mAP increases at the same time. Moreover, we found that the distortion drops most significantly from level one to level two. Since the computational time and memory consumption grow with the sparse level, we set $s$ to 2 in the following experiments as a tradeoff between efficiency and accuracy. In Section V-C, we will see that this sparse level is sufficient to outperform the state-of-the-art methods.
Codebook Training. In what follows, we study four different codebook generation methods: random sampling, k-means, sparse dictionary learning with an $\ell_0$-norm constraint, and sparse dictionary learning with an $\ell_1$-norm constraint. The random sampling method simply generates the codebook from a Gaussian distribution; neither it nor k-means involves sparse dictionary learning. The sparse dictionary learning methods are based on an $\ell_0$ or $\ell_1$ constraint, both of which can be solved by the online algorithm [25]. We test these codebook methods on the SIFT dataset, and the results are shown in Fig. 3(b) and Fig. 3(c).
From the results, we observe that the random sampling method performs much worse than the others, since its codebook contains no information about the gallery set. Surprisingly, k-means clustering and the sparse dictionary learning methods perform very similarly, which again indicates the robustness of our method to different codebook learning algorithms. In the following experiments, unless explicitly stated, the codebooks are generated by the sparse dictionary learning method with the $\ell_1$ constraint.
V-C Comparisons with Other Methods
Comparison with Vector Quantization Methods. Note that our proposed SPQ approach is based on the framework of VQ. To facilitate the comprehensive evaluation, we compare our method with three stateoftheart VQbased methods, including PQ, OPQ and CKmeans.
We first examine the quantization distortion of the different methods at various code lengths. As shown in Fig. 4, our proposed SPQ method consistently achieves much lower squared distortion than the other methods on all datasets. We then evaluate performance in terms of recall when searching for different numbers of NNs, measuring recall against the total number R of returned candidates (recall vs. R). Moreover, we also measure recall@100, the proportion of true NNs among the top 100 returned results, at various code lengths. Depending on whether the queries are quantized, there are two distance computation methods: ADC and SDC. Fig. 5 shows the experimental results. From the results, it is clear that our approach generally outperforms the other competing quantization methods.
As presented in [15], PQ slightly improves search accuracy by combining an inverted file structure with residual encoding. We also utilize the inverted file structure (IVFSPQ) and compare with IVFPQ in Fig. 5. We can see that IVFPQ indeed performs slightly better than PQ on both the SIFT and GIST datasets, while our method still outperforms both. Regarding efficiency, our proposed SPQ approach is expected to be slightly more computationally expensive than PQ. However, as shown in Table III, the empirical time costs of PQ and SPQ with an inverted file structure (IVFPQ vs. IVFSPQ) are fairly comparable.
Comparison with Hashing-based Methods.
We compare our SPQ approach with several state-of-the-art hashing-based ANN search methods, including minimal loss hashing (MLH) [30], iterative quantization hashing (ITQ) [12], order preserving hashing (OPH) [44], locality sensitive hashing (LSH) [8], kernelized supervised hashing (KSH) [23], isotropic hashing (IsoHash) [20], and spectral hashing (SH) [46]. To make a fair comparison, we follow the evaluation protocol in [44], where mAP is employed as the performance metric with the ground truth being the 50 nearest neighbors. Table II shows the performance evaluation on three datasets: LabelMe, SIFT, and GIST. It can be clearly seen that our proposed SPQ approach outperforms these hashing-based methods by a large margin. For clarity, we also compare search time with spectral hashing (SH [46]); the results are summarized in Table III. For PQ [15], we reimplemented the Hamming distance computation in C to ensure that all approaches in our comparison are appropriately optimized. It can be seen that our proposed method outperforms SH in terms of both efficiency and accuracy.


TABLE II: mAP (%) of SPQ and hashing-based methods on LabelMe, SIFT and GIST (ground truth: 50 nearest neighbors).

Dataset   Code Length   SPQ     MLH [30]   ITQ [12]   OPH [44]   LSH [8]   KSH [23]   IsoHash [20]   SH [46]
LabelMe   32            47.97   19.91      20.36      21.11      8.87      16.72      18.51          9.28
          64            62.14   32.48      32.09      33.94      17.57     24.57      28.35          11.18
          128           77.52   45.22      44.66      44.36      32.52     31.45      42.34          13.73
SIFT      32            32.08   3.07       2.69       5.07       1.49      1.26       2.31           4.23
          64            69.47   8.11       8.16       13.58      5.68      2.94       7.24           9.81
          128           86.39   18.01      17.87      26.00      14.05     5.04       16.53          15.56
GIST      32            4.07    1.74       1.68       2.00       0.56      1.15       1.39           0.68
          64            6.40    3.51       3.27       4.12       1.50      2.21       3.25           1.08
          128           11.14   5.96       5.14       6.97       3.12      3.83       5.21           1.45

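The rank-based metrics used throughout this section (recall@R and mAP with a fixed set of ground-truth NNs) can be computed with a few lines. A minimal sketch, with function names of our own choosing:

```python
import numpy as np

def recall_at_r(retrieved, true_nn, r):
    """Fraction of queries whose true nearest neighbor appears among
    the first r returned candidates (recall vs. R curves)."""
    return float(np.mean([true_nn[i] in retrieved[i][:r]
                          for i in range(len(retrieved))]))

def average_precision(retrieved, relevant):
    """AP for one query: mean of the precision values at each rank where
    a relevant item (here, one of the ground-truth 50 NNs) is returned.
    mAP is the mean of this value over all queries."""
    hits, precisions = 0, []
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0
```

For example, a ranking [1, 0, 2] against relevant set {0, 2} scores precision 1/2 at rank 2 and 2/3 at rank 3, giving AP = 7/12.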
Comparison with a Search-Tree-based Method.
It is interesting to compare our proposed SPQ approach with FLANN, which is arguably the most popular open-source toolbox for ANN search. We select the SIFT dataset as the testbed and evaluate the precision under a given search time. As in PQ [15], we take advantage of an inverted file structure to speed up the SPQ method, at the cost of a slight performance drop. FLANN includes a reranking scheme that computes the exact distances for the candidate nearest neighbors; for the sake of comparison, we also add a reranking stage to our SPQ method. In practice, while obtaining a precision of , SPQ costs seconds, or seconds when employing SSE, in the search stage; FLANN, however, takes seconds to obtain a precision of . Fig. 6 shows the experimental results. The precision of our method is obtained by reranking 50 returned candidates, with an extra time cost of less than 0.1 second. Our method is faster than FLANN when the precision of the reranked results is required to be higher than . This is important, since we always pursue higher precision within a fixed period of time. More importantly, our method consumes much less memory than the search-tree-based FLANN: our indexing structure occupies less than 100 MB, while FLANN requires more than 250 MB of RAM. Note that our result could be further improved with a better inverted-file method, such as the multi-index [2].
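The reranking stage mentioned above is conceptually simple: exact distances are recomputed only for the short candidate list returned by the approximate search, which is cheap because few candidates survive the first stage. A hedged sketch (names are illustrative, not the paper's code):

```python
import numpy as np

def rerank(query, candidate_ids, database, top=10):
    """Re-rank approximate-search candidates by their exact squared
    Euclidean distances to the query, returning the top ids."""
    cands = database[candidate_ids]                 # gather candidate vectors
    exact = ((cands - query) ** 2).sum(axis=1)      # exact distances
    order = np.argsort(exact)[:top]
    return candidate_ids[order]
```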


TABLE III: Search time and recall comparison.

Implementation   Approach      Search time (ms/query)   Recall@1 (%)   Recall@100 (%)
MATLAB/mex       PQ [15]       8.8                      23.0           92.3
                 SPQ           21.9                     51.9           99.8
                 IVFPQ [15]    1.3                      26.6           92.1
                 IVFSPQ        1.4                      51.2           95.7
                 SH [46]       2.2                      9.5            53.0
C/C++            IVFSPQ        0.4                      43.5           94.7
                 FLANN [28]    0.6                      —              84.2



TABLE IV: Image retrieval accuracy (mAP, %) on Oxford 5K.

Method    SPQ    FLANN   visualindex
mAP (%)   77.5   77.1    75.0

Comparison on Image Retrieval
Image retrieval [16] is also a popular topic in multimedia applications. It aims at retrieving, from a large image corpus, the items containing a target object. A typical image retrieval system is based on the bag-of-visual-words (BOW) technique, which matches local features such as SIFT [24], and ANN search is heavily employed by the BOW encoding strategy. Here we compare our method with the popular fast ANN approach [28] for image retrieval.
We evaluate on the Oxford 5K dataset. SIFT features are extracted with the gravity vector constraint [33], and RootSIFT [1], which takes the square root of each component of a SIFT vector, is also employed. We build a codebook of 1M visual words for BOW encoding. In our method, we assign 8 bits to each subspace (k=256) and set the number of subspaces to 8. For the fast ANN approach, we follow the setup of Fast Object Retrieval [49] and visualindex (https://github.com/vedaldi/visualindex). We use mean Average Precision (mAP) as the performance metric. The results are shown in Table IV; our method outperforms the other two approaches.
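The encoding configuration above (8 subspaces with k=256 codewords each, i.e. 8 bits per subspace) corresponds to standard product-quantization encoding. The following is a generic PQ sketch with hypothetical function names, assuming codebooks already trained (e.g. by per-subspace k-means); it is not the paper's sparse variant:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode a vector into m bytes: split it into m subvectors and store,
    per subspace, the index (0..255) of the nearest codeword."""
    m, k, d_sub = codebooks.shape                    # here m=8, k=256
    chunks = x.reshape(m, d_sub)
    dists = ((codebooks - chunks[:, None, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1).astype(np.uint8)     # 8 bits per subspace

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected codewords."""
    return np.concatenate([codebooks[j, codes[j]] for j in range(len(codes))])
```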
VI Conclusion and Future Work
In this paper, we propose a novel Sparse Product Quantization approach that encodes high-dimensional feature vectors into a sparse representation. The Euclidean distance between two vectors can be efficiently estimated from their sparse product quantizations using fast table lookups. We optimize the sparse representation of the data vectors by minimizing their quantization errors, so that the resulting representation stays essentially close to the original data in practice. We have conducted extensive experiments evaluating the proposed Sparse Product Quantization technique for ANN search on four public image datasets; the promising results show that our method is fast and accurate, significantly outperforming several state-of-the-art approaches by a large margin. Furthermore, the image retrieval results also demonstrate the efficacy of our proposed method.
Despite these promising results, some limitations remain to be addressed in future work. As with many other soft-assignment methods, the performance gain of our approach inevitably comes with extra storage requirements and computational cost. In future work, we will study how to compress the coding coefficients, and we will extend our technique to other tasks, such as object retrieval.
Acknowledgment
The work was supported in part by the National Natural Science Foundation of China under Grants 61103105 and 91120302.
References
 [1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2911–2918. IEEE, 2012.
 [2] A. Babenko and V. Lempitsky. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3069–3076, June 2012.
 [3] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 1975.
 [4] J. Brandt. Transform coding for fast approximate nearest neighbor search in high dimensions. In CVPR, 2010.
 [5] J. Cai, Q. Liu, F. Chen, D. Joshi, and Q. Tian. Scalable image search with multiple index tables. In Proceedings of International Conference on Multimedia Retrieval, page 407. ACM, 2014.
 [6] D. Chen, G. Baatz, K. Koser, S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk. Cityscale landmark identification on mobile devices. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 737–744, June 2011.
 [7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
 [8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 2004.
 [9] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, 2013.
 [10] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 2014.
 [11] T. Ge, K. He, and J. Sun. Product sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 2014.
 [12] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
 [13] R. M. Gray. Vector quantization. ASSP Magazine, IEEE, 1984.
 [14] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
 [15] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 2011.
 [16] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
 [17] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Contemporary mathematics. 1984.
 [18] A. Joly and O. Buisson. Random maximum margin hashing. In CVPR, 2011.
 [19] M. Kafai, K. Eshghi, and B. Bhanu. Discrete cosine transform locality-sensitive hashes for face retrieval. Multimedia, IEEE Transactions on, 16(4):1090–1103, 2014.
 [20] W. Kong and W.J. Li. Isotropic hashing. In Advances in Neural Information Processing Systems, pages 1646–1654, 2012.
 [21] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In CVPR, 2005.
 [22] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu. Spectral hashing with semantically consistent graph for image indexing. Multimedia, IEEE Transactions on, 15(1):141–152, 2013.
 [23] W. Liu, J. Wang, R. Ji, Y.G. Jiang, and S.F. Chang. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE, 2012.
 [24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
 [25] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.
 [26] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 2010.
 [27] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP, 2009.
 [28] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36, 2014.
 [29] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
 [30] M. Norouzi and D. M. Blei. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 353–360, 2011.
 [31] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, 2013.
 [32] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
 [33] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large-scale object retrieval. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 9–16. IEEE, 2009.
 [34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
 [35] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
 [36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.
 [37] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 2008.
 [38] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, 2006.
 [39] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
 [40] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008.
 [41] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
 [42] E. Spyromitros-Xioufis, S. Papadopoulos, I. Y. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. Multimedia, IEEE Transactions on, 16(6):1713–1728, 2014.

 [43] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2008.
 [44] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In ACM Multimedia, 2013.
 [45] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.
 [46] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in neural information processing systems, pages 1753–1760, 2009.
 [47] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang. Sparse multimodal hashing. Multimedia, IEEE Transactions on, 16(2):427–439, 2014.
 [48] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui. USB: ultra-short binary descriptor for fast visual matching and retrieval. Image Processing, IEEE Transactions on, 23(8):3671–3683, 2014.
 [49] Z. Zhong, J. Zhu, and S. Hoi. Fast object retrieval using direct spatial matching. Multimedia, IEEE Transactions on, 17(8):1391–1397, Aug 2015.