Very accessible code for my MSc thesis. Cheap quantization method for ANN search also known as Enhanced Residual VQ.
Recently, Babenko and Lempitsky introduced Additive Quantization (AQ), a generalization of Product Quantization (PQ) where a non-independent set of codebooks is used to compress vectors into small binary codes. Unfortunately, under this scheme encoding cannot be done independently in each codebook, and optimal encoding is an NP-hard problem. In this paper, we observe that PQ and AQ are both compositional quantizers that lie on the extremes of the codebook dependence-independence assumption, and explore an intermediate approach that exploits a hierarchical structure in the codebooks. This results in a method that achieves quantization error on par with or lower than AQ, while being several orders of magnitude faster. We perform a complexity analysis of PQ, AQ and our method, and evaluate our approach on standard benchmarks of SIFT and GIST descriptors, as well as on new datasets of features obtained from state-of-the-art convolutional neural networks.
Vector quantization has established itself as a default approach for scaling applications such as visual recognition and image retrieval. Quantization is usually performed on large datasets of local descriptors (e.g., SIFT), or global representations (e.g., VLAD or Fisher vectors). Recent work has also explored the performance-vs.-compression trade-off in state-of-the-art features obtained from deep convolutional neural networks.
Vector quantization is usually posed as the search for a set of codewords (i.e., a codebook) that minimizes quantization error. The problem can be solved in a straightforward manner with the k-means algorithm which, unfortunately, scales poorly for large codebooks. While larger codebooks achieve lower quantization error, the downside is that encoding and search times scale linearly with codebook size.
Several algorithms, such as kd-trees and hierarchical k-means, alleviate the search and encoding problems by indexing the codebook with complex data structures, achieving sublinear search time as a trade-off for recall. These approaches, however, have a large memory footprint, since all the uncompressed vectors must be kept in memory.
Another line of research considers approaches with an emphasis on low memory usage, compressing vectors into small binary codes. While for a long time hashing approaches were the dominant trend [10, 19], they were shown to be largely outperformed by Product Quantization (PQ). PQ is a compositional vector compression algorithm that decomposes the data into orthogonal subspaces and quantizes each subspace independently. As a result, vectors can be encoded independently in each subspace, and distances between uncompressed queries and the database can be efficiently computed through a series of table lookups. This combination of small memory footprint, low quantization error and fast search makes PQ a very attractive approach for scaling computer vision applications.
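To make the subspace decomposition concrete, here is a minimal numpy sketch of PQ training, encoding and decoding. The function names (`pq_train`, `pq_encode`, `pq_decode`) are ours, not from any PQ library, and we assume the dimensionality d is divisible by the number of subspaces m:

```python
import numpy as np

def pq_train(X, m, h, iters=10, seed=0):
    """Run a tiny k-means independently on each of the m subspaces of X (n x d)."""
    rng = np.random.default_rng(seed)
    ds = X.shape[1] // m  # subspace dimensionality
    codebooks = []
    for i in range(m):
        sub = X[:, i * ds:(i + 1) * ds]
        C = sub[rng.choice(len(sub), h, replace=False)]  # init from random points
        for _ in range(iters):
            # Assign each subvector to its nearest subcodeword, then update means.
            assign = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
            for k in range(h):
                if (assign == k).any():
                    C[k] = sub[assign == k].mean(0)
        codebooks.append(C)
    return codebooks

def pq_encode(X, codebooks):
    """Encode each subvector independently by its nearest subcodeword index."""
    ds = codebooks[0].shape[1]
    codes = []
    for i, C in enumerate(codebooks):
        sub = X[:, i * ds:(i + 1) * ds]
        codes.append(((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1))
    return np.stack(codes, axis=1)  # n x m integer codes

def pq_decode(codes, codebooks):
    """Concatenate the selected subcodewords to reconstruct the vectors."""
    return np.hstack([C[codes[:, i]] for i, C in enumerate(codebooks)])
```

With h entries per subcodebook, each code occupies only m·log2(h) bits, while the reconstruction spans h^m possible cluster centres.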
Recently, Babenko and Lempitsky introduced Additive Quantization (AQ), a generalization of PQ that retains its compositional nature, but is able to handle subcodebooks of the same dimensionality as the input vectors. With a few caveats, AQ can also be used for fast approximate nearest neighbour search, and consistently achieves lower quantization error than PQ. However, since the codebooks are no longer pairwise orthogonal (i.e., no longer independent), encoding cannot be done independently in each subspace. Beam search was proposed as a solution to this problem, but this results in very slow encoding, which greatly limits the scalability of the proposed solution.
In this paper, we first analyze PQ and AQ as compositional quantizers, under a framework that makes the simplifying assumptions of PQ w.r.t. AQ rather evident. We next investigate the computational complexity implications resulting from the differences between AQ and PQ, and finally derive an intermediate approach that retains the expressive power of AQ, while being only slightly slower than PQ.
Our approach compares favourably to AQ in three ways: (i) it consistently achieves similar or lower quantization error (and therefore, lower error than PQ), (ii) it is several orders of magnitude faster, and (iii) it is simpler to implement.
We introduce some notation mostly following . We review the vector quantization problem, the scalability approaches proposed by PQ and AQ, and discuss their advantages and disadvantages.
Given a set of vectors X = {x_1, x_2, …, x_n} ⊂ R^d, the objective of vector quantization is to minimize the quantization error, i.e., to determine

  min_{C, b_i} Σ_i ||x_i − C b_i||_2^2,  (1)

where C ∈ R^{d×k} contains k cluster centres, and each b_i is subject to the constraints b_i ∈ {0,1}^k and ||b_i||_1 = 1. That is, b_i may only index into one entry of C. C is usually referred to as a codebook, and b_i is called a code.

If we let X ∈ R^{d×n} contain all the x_i, and similarly let B contain all the codes b_i, the problem can be expressed more succinctly as determining

  min_{C, B} ||X − CB||_F^2.  (2)
Without further constraints, one may solve expression 2 using the k-means algorithm, which alternately solves for B (typically by exhaustively computing the distance from each point in X to the centres in C) and for C (finding the mean of each cluster) until convergence. The quantization error of k-means decreases as the codebook size k grows but, unfortunately, the algorithm is infeasible for large codebooks (for example, storing enough clusters to match even a modest binary code budget would far exceed the memory capacity of current machines). The challenge is thus to handle large codebooks that achieve low quantization error while having low memory overhead.
One way of scaling the codebook size looks at compositional models, where smaller subcodebooks can be combined in different ways to potentially represent an exponential number of clusters. Compositional quantization can be formulated similarly to k-means, but restricted to a series of constraints that introduce interesting computational trade-offs. The objective function of compositional quantization can be expressed as

  min_{C_i, b_i} Σ_x ||x − Σ_{i=1}^m C_i b_i||_2^2,  (3)
that is, the vector x can be approximated not only by a single codeword indexed by its code, but by the addition of its encodings in a series of m codebooks. We refer to the C_i as subcodebooks, and similarly call the b_i subcodes. We let each subcodebook contain h cluster centres, C_i ∈ R^{d×h}, and each subcode remains limited to having only one non-zero entry: b_i ∈ {0,1}^h, ||b_i||_1 = 1. Since each b_i may take one of h values, and there are m subcodes, the resulting number of possible cluster combinations is equal to h^m, i.e., superlinear in h. Now we can more succinctly write expression 3 as

  min_{C, B} ||X − CB||_F^2,  (4)

where C = [C_1, C_2, …, C_m] stacks the subcodebooks, and each column of B stacks the corresponding subcodes.

In PQ, the subcodebooks are constrained to be pairwise orthogonal; that is, C is blockwise diagonal:

  C = diag(C̄_1, C̄_2, …, C̄_m),  (5)

where the entries C̄_i ∈ R^{(d/m)×h} are the only non-zero components of C. This constraint assumes that the data in X was generated from a series of mutually independent subspaces (those spanned by the subcodebooks), which is rarely the case in practice. There are, however, some advantages to this formulation.
The subcodebook independence of PQ offers three main advantages:
Under the orthogonality constraint we can efficiently learn the subcodebooks by independently running k-means on each group of d/m dimensions. The complexity of k-means is O(nkdi) for n datapoints, k cluster centres, d dimensions and i iterations. PQ solves m (d/m)-dimensional k-means problems with h cluster centres each, resulting in a complexity of O(nhdi); i.e., training PQ is as complex as solving a single k-means problem with h cluster centres.
Once training is done, the encoding of the database can also be performed efficiently, in O(nhd) (in line with the k-means assignment step), which is essential for very large databases.
Distance computation between a query q and an encoded vector is efficient because the subcodebooks are orthogonal, and therefore the total distance is equal to the sum of the distances in each subspace:

  d(q, x̂)² = Σ_{i=1}^m d(q̄_i, C̄_i b_i)²,  (6)

where q̄_i is the subvector of q lying in the i-th subspace. These subspace distances can be precomputed for each query and quickly evaluated with table lookups. This is called Asymmetric Distance Computation, and it is the mechanism that makes PQ attractive for fast approximate nearest neighbour search.
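The lookup mechanism can be sketched in a few lines of numpy. The names `adc_tables` and `adc_distances` are illustrative (not an established API); because the subspaces are orthogonal, the sum of table entries recovers the exact distance to the reconstructed vector:

```python
import numpy as np

def adc_tables(q, codebooks):
    """For each subspace, squared distances from the query subvector to every subcodeword."""
    ds = codebooks[0].shape[1]
    return [((q[i * ds:(i + 1) * ds][None, :] - C) ** 2).sum(1)
            for i, C in enumerate(codebooks)]

def adc_distances(codes, tables):
    """Approximate squared distance to every encoded vector: m table lookups each."""
    return sum(tables[i][codes[:, i]] for i in range(len(tables)))
```

The tables cost O(hd) per query to build; after that, each database vector is scored with only m lookups and additions.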
One of the main disadvantages of PQ is that the data X is forced to fit a model that assumes it was generated from statistically independent subspaces. Lower quantization error can be achieved if more degrees of freedom are added to the model. In particular, since rotation is a distance-preserving operation, it seems natural to search for codebook rotations that minimize quantization error. In OPQ, the objective function becomes

  min_{R, C, B} ||X − RCB||_F^2,  (7)

where C and B are expanded as in Eq. 4, and R belongs to the Special Orthogonal Group SO(d). In this sense, PQ is a special case of OPQ where R is the d-dimensional identity matrix. Independently, Ge et al. and Norouzi & Fleet proposed an iterative method similar to Iterative Quantization that optimizes R in expression 7. Notice, however, that the orthogonality constraint is maintained from PQ to OPQ.
Lower quantization error can be achieved if the independence assumption is not enforced, at the cost of more complex encoding and distance computation. These trade-offs were first introduced in  and called Additive Quantization (AQ). We briefly review AQ here.
In AQ, the subspaces spanned by the subcodebooks are not mutually orthogonal (i.e., not mutually independent). Formally, and although not explicitly stated in , AQ solves the formulation of Eq. 3 without any further constraints. This makes AQ a strictly more general model than PQ/OPQ. However, this generality comes at a cost.
The subcodebook dependence of AQ comes with three main disadvantages with respect to PQ/OPQ:
The distance between a query q and an encoded vector x̂ = Σ_i C_i b_i cannot be computed with subspace-wise table lookups. However, it can be found using the identity

  ||q − x̂||_2^2 = ||q||_2^2 − 2⟨q, x̂⟩ + ||x̂||_2^2,  (8)

where the first term is a constant and does not affect the query ranking; the second term can be precomputed and stored for fast evaluation with table lookups, and the third term can either be precomputed and quantized for each vector in the database (at an additional memory cost), or can be computed on the fly as

  ||x̂||_2^2 = Σ_i ||C_i b_i||_2^2 + 2 Σ_{i<j} ⟨C_i b_i, C_j b_j⟩,  (9)

where the dot products between codewords can also be precomputed and retrieved with table lookups.
Thus, AQ has either a time (O(m²) vs. O(m) lookups) or a memory overhead (for storing the quantized result of Eq. 9) during distance computation with respect to PQ. Although this may sound like a major problem for AQ, it was shown in  that sometimes the gain in distortion error can be high enough that allocating memory from the code budget to store the result of Eq. 9 results in better recall and faster distance computation compared to PQ/OPQ. This motivates us to look for better solutions to the AQ formulation.
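The three-term distance decomposition can be sketched as follows; the function names are ours, and for clarity the sketch computes ⟨q, x̂⟩ directly rather than from per-query tables, while the codeword dot products are precomputed once as in Eq. 9:

```python
import numpy as np

def codeword_dot_tables(codebooks):
    """Pairwise dot products between all codewords, computed once per codebook set."""
    m = len(codebooks)
    return {(i, j): codebooks[i] @ codebooks[j].T for i in range(m) for j in range(m)}

def aq_distance(q, code, codebooks, tables):
    """||q - x_hat||^2 via Eq. 8, where x_hat = sum_i codebooks[i][code[i]]."""
    m = len(codebooks)
    ip = sum(float(codebooks[i][code[i]] @ q) for i in range(m))     # <q, x_hat>
    sq = sum(float(tables[(i, i)][code[i], code[i]]) for i in range(m))
    sq += 2.0 * sum(float(tables[(i, j)][code[i], code[j]])
                    for i in range(m) for j in range(i + 1, m))      # ||x_hat||^2
    return float(q @ q) - 2.0 * ip + sq
```

Note that the ||x̂||² term involves O(m²) lookups per vector, which is exactly the time overhead discussed above.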
For a given set of subcodebooks and a vector x, encoding amounts to choosing the set of codes b_i that minimizes the quantization error ||x − Σ_i C_i b_i||_2^2. Unfortunately, without the orthogonality constraint the choice of each b_i cannot be made independently in each subcodebook. This means that, in order to guarantee optimality, the search for the best encoding must be done over a combinatorial space of codeword combinations. Moreover, it was shown in  that this problem is equivalent to inference on a fully connected pairwise Markov Random Field, which is well known to be NP-hard.
Since brute-force search is not possible, one must settle for a heuristic search method. Beam search was proposed as a solution, resulting in rather slow encoding. Beam search is done in m iterations. At each iteration, the distance is computed from each of the candidate solutions to the set of plausible candidates in the codebooks that have not yet contributed to that solution. At the end of the iteration, the top candidates are kept as seeds for the next iteration. The complexity of this process is cubic in the number of codebooks m for a fixed search depth. As we will show, this makes the original solution of AQ impractical for very large databases.
Training consists of learning the subcodebooks and subcodebook assignments that minimize expression 3. A typical approach is coordinate descent: fixing the subcodebooks C while updating the codes B (encoding), and later fixing B while updating C (codebook update). As a side effect of slow encoding, training is also very slow in AQ. While this might seem like a minor weakness of AQ (since training is usually done off-line, without tight time constraints), faster training also means that, for a fixed time budget, we can handle larger amounts of training data. In the quantization setting, this means that we can use a larger sample to better capture the underlying distribution of the database.
In , the codebook update is done by solving the over-constrained least-squares problem that arises from Eq. 4 when holding B fixed and solving for C. Fortunately, this decomposes into d independent subproblems, one per dimension, each with n equations over mh variables. This corresponds to an optimal codebook update in the least-squares sense. We find that, compared to encoding, this step is rather fast, and thus focus on speeding up encoding.
Within the subcodebook dependence-independence framework introduced in Section 2, we can see that PQ and OPQ assume subcodebook independence, while AQ embraces the dependence and tries to solve a more complex problem. As we show next, there is a fertile middle ground between these approaches. We propose a hierarchical assumption, which has the advantage of being fast to optimize while maintaining the expressive power of AQ. We now introduce this approach to compositional quantization.
Due to the superior performance of AQ, we want to maintain its key property: subcodebook dependence. However, we look for a representation that can compete with PQ in terms of fast training and good scalability, for which fast encoding is essential. We propose to use a hierarchy of quantizers (see Figure 1, left), where the vector is sequentially compressed in a coarse-to-fine manner.
Fast encoding is at the heart of our approach. We assume that the subcodebooks have a hierarchical structure, where C_1 gives the coarsest quantization and C_m the finest. Encoding is done greedily. In the first step, we choose the code b_1 that minimizes the quantization error ||x − C_1 b_1||_2^2. Since all the subcodebooks are small, the search for b_1 can be done exhaustively (as in k-means).
Next, we compute the first residual r_1 = x − C_1 b_1. We then quantize r_1 using the codewords in C_2, choosing the code b_2 that minimizes the quantization error ||r_1 − C_2 b_2||_2^2. This process is repeated until we run out of codebooks to quantize residuals, with the last residual being equal to the total quantization error (see Figure 1, right). It is now clear that we satisfy our first desired property: the representation is additive in the encodings, x̂ = Σ_i C_i b_i, and the codewords are all d-dimensional (i.e., not independent of each other).
The complexity of this step is O(mhd) for m subcodebooks, each with h subcodewords, and a vector of dimensionality d. This corresponds to a slight increase in computation with respect to PQ, whose encoding costs O(hd), but is much faster than the beam search of AQ. Given that encoding is only slightly more expensive than in PQ, we can say that we have also achieved our second desired property.
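The greedy coarse-to-fine procedure above can be sketched in a few lines; `stacked_encode` and `stacked_decode` are our own illustrative names, not the paper's released code:

```python
import numpy as np

def stacked_encode(x, codebooks):
    """Greedy coarse-to-fine encoding: quantize, subtract, recurse on the residual."""
    residual = x.astype(float).copy()
    code = []
    for C in codebooks:               # codebooks[0] is the coarsest level
        k = int(((residual[None, :] - C) ** 2).sum(1).argmin())
        code.append(k)
        residual -= C[k]              # carried to the next (finer) level
    return code, residual             # final residual == quantization error

def stacked_decode(code, codebooks):
    """The approximation is additive across all m levels."""
    return sum(C[k] for C, k in zip(codebooks, code))
```

By construction, the decoded vector plus the final residual always recovers the input exactly, so the residual norm is precisely the per-vector quantization error.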
The goal of initialization is to create a coarse-to-fine set of codebooks. This can be achieved by simply performing k-means on , obtaining residuals by subtracting the assigned codewords, and then performing k-means on the residuals until we run out of codebooks.
Formally, in the first step we obtain C_1 from the cluster centres computed by k-means on X, and we obtain residuals by subtracting from each vector its assigned codeword. In the second step we obtain C_2 from k-means on those residuals, and the residuals are refined again by subtraction. This process continues until we run out of codebooks (notice how this both is analogous to, and naturally gives rise to, the fast encoding proposed before). By the end of this initialization, we have an initial set of codebooks with a hierarchical structure, with which encoding can be performed in a greedy manner.
The computational cost of this step is that of running k-means m times on n vectors, i.e., O(mnhdi) for m subcodebooks of size h, dimensionality d and i k-means iterations.
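The initialization can be sketched as repeated k-means on residuals; `kmeans` and `stacked_init` are our own minimal, illustrative implementations:

```python
import numpy as np

def kmeans(X, h, iters=10, seed=0):
    """Minimal k-means returning centres and the final hard assignments."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), h, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(h):
            if (assign == k).any():
                C[k] = X[assign == k].mean(0)
    return C, assign

def stacked_init(X, m, h):
    """Learn codebook i by running k-means on the residuals of levels 1..i-1."""
    residual = X.astype(float).copy()
    codebooks = []
    for _ in range(m):
        C, assign = kmeans(residual, h)
        residual = residual - C[assign]   # the next level trains on what is left
        codebooks.append(C)
    return codebooks, residual
```

Each level can only shrink the total residual norm, so the hierarchy is coarse-to-fine by construction.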
The initial set of codebooks can be further optimized with coordinate descent. This step is based on the observation that, during initialization, we assume that in order to learn codebook C_i we only need to know codebooks C_1, …, C_{i−1}. After initialization, however, all the codebooks are fixed. This allows us to fine-tune each codebook given the values of the rest.
Although it is tempting to use the least-squares-optimal codebook update proposed in , we have found that this tends to destroy the hierarchical subcodebook structure resulting from initialization. Without a hierarchical structure encoding cannot be done fast, which is one of the key properties that we wish to maintain. We therefore propose an ad hoc codebook refinement technique that preserves the hierarchical structure in the codebooks.
Let us define x̂ = Σ_{i=1}^m C_i b_i as the approximation of x from its encoding. Now, let us define X̂_{−i} as an approximation to the original dataset obtained using the learned codebooks and codes, except for codebook C_i, i.e.,

  X̂_{−i} = Σ_{j≠i} C_j B_j.

We can now see that the optimal value of C_i given the rest of the codebooks is obtained by running k-means on X − X̂_{−i}, i.e., the residual after removing the contribution of the rest of the codebooks. Since we already know the cluster memberships in C_i (i.e., we know B_i), either from initialization or from the previous iteration, we need to update only the cluster centres instead of restarting k-means (similar to how OPQ updates the codebooks given an updated rotation [9, 15]).
Enforcing the codebook hierarchy is of the essence. Therefore, we run our codebook update in a top-down manner. We first update C_1 and re-encode the data. Next, we update C_2 and update the codes again. We repeat the process until we have updated C_m, followed by a final update of the codes. Updating the codes after each codebook update ensures that the codebook hierarchy is maintained. A round of updates from codebooks 1 to m amounts to one iteration of our codebook refinement.
The algorithm involves encoding using m codebooks in the first pass, m − 1 in the second pass, m − 2 in the third, and so on until only one set of codes is updated. This means that the time complexity of the codebook refinement procedure is quadratic in the number of codebooks. This is a significant increase with respect to PQ/OPQ, which are linear in m during training, but also represents an important reduction against the cubic scaling of AQ. Also, notice that training usually has to be done only once, on a small data sample, and database encoding remains efficient.
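One refinement step can be sketched as follows. The names `refine_codebook` and `reencode` are hypothetical, and for brevity this sketch re-fits one codebook on its exclusive residual and then performs a full greedy re-encoding, whereas the procedure described above interleaves these updates top-down:

```python
import numpy as np

def refine_codebook(X, codes, codebooks, i):
    """Re-fit codebook i on the residual that excludes its own contribution,
    keeping the current memberships (no k-means restart needed)."""
    m = len(codebooks)
    approx_others = sum(codebooks[j][codes[:, j]] for j in range(m) if j != i)
    residual = X - approx_others
    for k in range(len(codebooks[i])):
        mask = codes[:, i] == k
        if mask.any():
            codebooks[i][k] = residual[mask].mean(0)   # optimal centre for fixed codes

def reencode(X, codebooks):
    """Greedy coarse-to-fine re-encoding, which preserves the hierarchy."""
    R = X.astype(float).copy()
    cols = []
    for C in codebooks:
        assign = ((R[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        R = R - C[assign]
        cols.append(assign)
    return np.stack(cols, 1), R
```

For fixed codes, replacing each centre with the mean of its residuals can only decrease the quantization error, which is what makes the coordinate-descent refinement well behaved.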
Our main interest is to reduce quantization error, because lower quantization error has been demonstrated to lead to better retrieval recall, mean average precision and classification performance [3, 9, 11, 15]. We also demonstrate two applications of our method: (i) approximate nearest neighbour search, and (ii) classification with compressed features. In all our experiments we use codebooks with h = 256 entries; this means that 2, 4, 8 and 16 codebooks generate codes of 16, 32, 64 and 128 bits.
We test our method on three datasets. The first two are SIFT1M and GIST1M, introduced in . SIFT1M consists of 128-dimensional SIFT descriptors, and GIST1M consists of 960-dimensional GIST descriptors. Since hand-crafted features are consistently being replaced by features obtained from deep convolutional neural networks, we also consider a dataset of deep features: ConvNet1M-128. We obtained ConvNet1M-128 by computing 128-dimensional deep learning features on the ILSVRC-2012 training dataset using the CNN-M-128 network provided by Chatfield et al. , subsampling equally at random from all classes. This network follows the architecture proposed by Zeiler and Fergus , with the exception that the last fully-connected layer is reduced from 4096 to 128 units. It has been shown that this intra-net compression has a minimal effect on classification performance , and exhibits state-of-the-art accuracy on image retrieval . However, to the best of our knowledge, we are the first to benchmark quantization techniques on deep learning features. We obtained the features from a central image crop without further data augmentation. In all three datasets, 100 000 vectors are given for training, 10 000 for queries and 1 000 000 for the database.
We compare against three baselines. The first is AQ as proposed by Babenko and Lempitsky , which consists of beam search for encoding and a least-squares codebook update, applied in an iterative manner. As in , we set the beam search depth to 16 during training and to 64 for database encoding. Although  does not mention the number of iterations used during training, we found that 10 iterations reproduce the results reported by the authors and, as we will show, this is already several orders of magnitude slower than our approach. Since encoding scales cubically with the number of codebooks, for code lengths of 64 and 128 bits (8 and 16 codebooks, respectively) we use the hybrid APQ algorithm suggested in , where the dataset is first preprocessed with OPQ, and groups of 4 subcodebooks are then refined independently with AQ. APQ was proposed for practical reasons, as otherwise AQ would require several days to complete given more than 4 subcodebooks; the need for this approximation already hints at the poor scalability of AQ. Since no code for AQ is available, we wrote our own implementation and incorporated the optimizations suggested in . We will make all our code available, including this baseline.
The second baseline is Optimized Product Quantization [9, 15], which was briefly introduced in Section 2. We use the publicly available implementation by Norouzi & Fleet (https://github.com/norouzi/ckmeans), and set the number of optimization iterations to 100. The third baseline is Product Quantization . We slightly modified the OPQ code to create this baseline, and also use 100 iterations in PQ.
Our main quantization results are shown in Figure 2. First, we observe that our method, Stacked Quantizers (SQ), performs similarly to AQ on SIFT1M and GIST1M. This is already good news, given the better scalability of our method. Moreover, we note that SQ obtains a large advantage on the deep features of ConvNet1M-128 when using 8 and 16 codebooks. We find this result rather encouraging, as deep features are likely to replace hand-crafted descriptors such as SIFT and GIST in the foreseeable future.
OPQ achieves a large gain over PQ on GIST1M, and this gap is only slightly improved upon by AQ and SQ. Since both SIFT1M and ConvNet1M-128 are low-dimensional (128), while GIST1M has high-dimensional descriptors (960), it remains unclear whether the advantages of AQ and SQ are restricted to low-dimensional descriptors. We investigate this question by benchmarking the methods on 1024-, 2048- and 4096-dimensional deep features obtained in a manner similar to ConvNet1M-128, but using the CNN-M-1024, CNN-M-2048 and CNN-M networks from , respectively. The quantization results on these datasets are shown in Figure 3. While the PQ-to-OPQ gap is still present for high-dimensional features, AQ and SQ maintain a performance gap over OPQ similar to that observed on the 128-dimensional features. Moreover, our method remains the clear winner for 8 and 16 codebooks, and is largely competitive with AQ for 4 codebooks. These results suggest that codebook independence hurts the compression of deep features particularly badly, and motivate more research on compositional quantization methods that follow the formulation of expression 3.
We demonstrate the performance of our method on fast search of nearest neighbours with recall@N curves. These curves represent the probability that the true nearest neighbour of a query is contained in a retrieved list of N neighbours, for varying N; we observed little variability when more ground-truth neighbours were considered. Our main results are shown in Figure 4. As expected, lower quantization error lets us achieve higher recall on SIFT1M and GIST1M, although on GIST1M, OPQ and AQ achieve very competitive performance. On ConvNet1M-128, our method is slightly outperformed by AQ; however, this trend is reversed for longer codes, consistent with the quantization error of Fig. 2. We show results on longer codes in the supplementary material.
We study the trade-off of classification performance vs. compression rate on the ILSVRC-2012 dataset using deep learning features. We trained a linear SVM on the 1.2 million uncompressed examples provided, preprocessing the features with L2 normalization, which was found to improve performance in . The 50 000 images in the validation set were preprocessed similarly, and compressed before evaluation. This scenario is particularly useful when one wants to search for objects in large unlabelled datasets [1, 4], and in retrieval scenarios where classifiers are applied to large collections of images in search of high scores [6, 18]. Notice that in this scenario, the only operation needed between the support vectors and the database descriptors is a dot product; as opposed to distance computation, this can be done with lookups in AQ and SQ, the same as for PQ and OPQ. We report the classification error taking into account the top 5 predictions.
Classification results are shown in Figure 5. We observe a trend similar to that seen in our quantization results, with PQ and OPQ consistently outperformed by AQ and SQ. Using 128-dimensional features, our method performs similarly to AQ with 4 codebooks, but shows better performance for larger code sizes. Using 1024-dimensional features, AQ and SQ are practically equivalent but, curiously, the 128-dimensional features appear more amenable to compression: for all compression rates, the compressed 128-dimensional features achieve lower top-5 error than the compressed 1024-dimensional features, even though the uncompressed 1024-dimensional features perform slightly better. This suggests that, if quantization is planned as part of a large-scale classification pipeline, low-dimensional features should be preferred over high-dimensional ones. It is also noticeable that for extreme compression rates (e.g., 32 bits) PQ and OPQ have error rates in the 35-45% range, while AQ and SQ degrade more gracefully and maintain a 25-30% error rate.
Figure 6 shows the running time for training and database encoding for PQ/OPQ, APQ and SQ on the ConvNet1M-128 dataset using 8 codebooks (64 bits). All measurements were taken on a machine with a 3.20 GHz processor using a single core. We can see that SQ obtains most of its performance advantage out of initialization, but codebook refinement is still responsible for a 20% decrease to the final quantization error (0.12 to 0.10). We also see that APQ largely improves upon its OPQ initialization, but these iterations are extremely expensive compared to PQ/OPQ, and 3 iterations take almost as much computation as the entire SQ optimization. Beyond training (which arguably is not too big of a problem, since it only has to be done once), encoding the database with the learned codebooks is extremely expensive with APQ (9.2 hours), while for PQ/OPQ and SQ it stays in the 5-20 second range. Projecting these numbers to the encoding of a dataset with 1 billion features such as SIFT1B  suggests that PQ/OPQ would need about 1.5 hours to complete, and SQ would need around 6 hours; however, APQ would need around 1.05 years (!). Although all these methods are highly parallelizable, these numbers highlight the importance of fast encoding for good scalability.
We have introduced Stacked Quantizers as an effective and efficient approach to compositional vector compression. After analyzing PQ and AQ in terms of their codebook assumptions, we derived a method that combines the best of both worlds, being only slightly more complex than PQ, while maintaining the representational power of AQ. We have demonstrated state-of-the-art performance on datasets of SIFT, GIST and, perhaps most importantly, deep convolutional features.
We also plan to investigate the use of optimization approaches that have proven useful in network-like architectures, such as stochastic gradient descent and conjugate gradient.
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC, and enabled in part by WestGrid and Compute / Calcul Canada.