1 Introduction
Similarity search, which finds objects in a dataset that are similar to a query object, is a common operation in machine learning. Similarity is often quantified via a straightforward distance computation between embeddings of the objects in a vector representation space. Implementation of similarity search becomes computationally expensive when information explosion and the curse of dimensionality conspire to make exhaustive search impractical. In this case, hashing and Vector Quantization (VQ) are often used to simplify and streamline the computations. As VQ methods generally outperform hashing techniques [18], our focus is on this class of similarity search techniques.
For a dataset $\mathcal{X} = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$, VQ learns a quantizer $C$ of $K$ elements and simplifies the search by assigning each $x \in \mathcal{X}$ to an element of the quantizer, denoted by $c(x)$. A query $q$ needs only to be compared against $C$, rather than to all of $\mathcal{X}$, to approximately find its nearest neighbors. Normally, however, maintaining a tolerable quantization error would require an impractically large value of $K$. In practice, VQ methods learn $M$ quantizers, denoted by $C_m$ for $m = 1, \dots, M$, and combine them to reconstruct an approximation of the dataset. A dataset element is usually quantized to a sum of elements of the quantizers, that is, $x \approx \bar{x} = \sum_{m=1}^{M} c_m(x)$, where $c_m(x) \in C_m$. This approach allows for reduction of the quantization error with smaller quantizer sets and compression of the dataset through a lossy encoding of embeddings, and given certain properties of the quantizers, we have
$\|q - \bar{x}\|^2 = \sum_{m=1}^{M} \|q - c_m(x)\|^2 + \text{const}$ (1)
This principle is often used for fast similarity search. Since the quantizers contain only $MK$ codewords in total, we can precompute all values of $\|q - c\|^2$ and, during search time, perform only $M$ summations for each dataset element. Faster searches in existing works are often achieved by reducing $M$. However, reducing $M$ can increase quantization error, so techniques are proposed to minimize this effect, for example, jointly learning the embedding method and the quantizers (Section 2). Nevertheless, as aggressive quantization necessarily sacrifices information, requirements of search accuracy place an implicit lower bound on $M$, thus limiting the maximum attainable search speed.
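As an illustration of this table-based search, the sketch below encodes a vector as a sum of codewords and evaluates distances through precomputed lookup tables. The greedy residual encoder and the helper names (`encode_additive`, `build_luts`, `adc_distance`) are illustrative assumptions, not the exact procedures of any cited method:

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def encode_additive(x, codebooks):
    # Greedy residual assignment: pick one codeword per quantizer so that
    # their sum approximates x (an illustrative encoder, not necessarily
    # the assignment procedure used by the methods in the text).
    residual = list(x)
    codes = []
    for C in codebooks:
        idx = min(range(len(C)), key=lambda k: sq_dist(C[k], residual))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, C[idx])]
    return codes

def build_luts(q, codebooks):
    # One lookup table per quantizer: squared distance from q to each codeword.
    return [[sq_dist(q, c) for c in C] for C in codebooks]

def adc_distance(codes, luts):
    # M table lookups and additions per database element; for additive
    # codebooks this matches the true distance only up to constant/cross terms.
    return sum(lut[k] for lut, k in zip(luts, codes))
```

For orthogonal, PQ-style codebooks the summed lookups recover the distance to the reconstruction exactly; for general additive codebooks the cross terms must be constant (as in CQ) for the sum to be used directly.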
In this work, we propose Interleaved Composite Quantization (ICQ), a scheme that enables fast searches without the need to reduce $M$. ICQ dedicates a small subset of the quantizers, say $\mathcal{S}$, to fast, approximate distance comparisons. This is done by learning a probabilistic model of the distribution of the dataset, which guides clustering of the quantizers such that the following inequality implies that $x$ is potentially closer to $q$ than $y$ is.
$\sum_{C_m \in \mathcal{S}} \|q - c_m(x)\|^2 < \sum_{C_m \in \mathcal{S}} \|q - c_m(y)\|^2 + \delta$ (2)
In this comparison, the margin $\delta$ accounts for the variability of the remaining quantizers. We refine distance comparisons using equation 1 only when necessary. Since, in many real-world cases, most dataset elements are far more distant from a random query than its nearest neighbors, the comparison (2) suffices to prune neighbor candidates, reducing the number of operations to $|\mathcal{S}|$ for most dataset elements, compared to $M$ in previous works.
ICQ is the first similarity search method based on crude distance comparisons. The search speed improvement of ICQ over previous works depends critically on two key contributions. First, we learn a probabilistic model of the dataset distribution that is used to cluster the quantizers for fast distance comparisons. Both the learning of this model and the clustering can be performed jointly with the learning of the embedding method and the quantizers. As such, ICQ can benefit from joint methods proposed in the literature for improving quantization error (Section 2). Second, we propose a new two-step similarity search method, in which we first perform fast, crude distance comparisons and then refine these comparisons only when necessary. In this way, we perform fast searches without increasing quantization error.
2 Related Works
Before gaining prevalence in similarity search, VQ methods were studied as a form of lossy encoding [5, 4] for compression. [7] first introduced Product Quantization (PQ) for nearest neighbor search with asymmetric distance comparison (equation 1). In this method, each of the $M$ quantizers quantized one of $M$ predetermined, orthogonal subspaces of $\mathbb{R}^D$, comprising $D/M$ consecutive dimensions; i.e., its elements had exactly $D/M$ consecutive nonzero elements. The quantizer for each subspace was learned using k-means on the projection of the dataset onto it.
Since search speed in PQ methods has often been determined by the size of the quantizers, extensions of this method have tried to reduce the quantization error using fewer quantizers. [3] proposed Optimized PQ (OPQ), which learns a rotation matrix that reorders the dimensions and better aligns the dataset with the PQ subspaces. Concurrently, [15] proposed orthogonal k-means, which combines a similar rotation matrix with a translation vector. Locally Optimized PQ [9] further reduced quantization error by learning several locally optimized rotation matrices jointly with the quantizers. Norouzi et al. [15] achieved a similar effect as the rotation methods by learning quantizers orthogonal to each other but not necessarily aligned with the dimensions of the embedding space.
Additive methods further relaxed the constraints on learning quantizers by allowing all quantizers to span the whole embedding space. [21] proposed Composite Quantization (CQ), which required quantizers only to have a constant inner product instead of being orthogonal. Supervised Quantization [17] jointly learned CQ quantizers and a linear mapping for embedding the dataset, combining the embedding and quantization steps.
Recently, there have been several methods that jointly learn embedding models and quantizers. [10] proposed jointly learning a DNN model and PQ quantizers in an end-to-end manner by adding a fully connected layer that learns quantizer assignments. PQN [19] similarly learns an end-to-end network but removes the need for a fully connected layer.
Unlike previous works, which trade quantization error for speed, ICQ clusters the quantizers and performs fast, crude distance comparisons based on equation 2, avoiding any reduction in the number of quantizers and the attendant increase in quantization error. Furthermore, since the key components of ICQ, namely the probabilistic model of the dataset distribution and the clustering of the quantizers (Section 1), can be seamlessly integrated into the quantization process, ICQ can benefit from many existing methods for further reducing quantization error, e.g., jointly learning the embeddings and the quantizers.
3 Methodology
Fast similarity search through quantization requires two steps. First, the raw dataset is mapped into the semantically meaningful set of embeddings $\mathcal{X}$ using a model $f_\theta$. This is done by minimizing an error term $L_{\mathrm{acc}}(\theta)$, which can be defined as the classification loss, the triplet loss, or a similar measure of accuracy. Second, the quantizers are learned such that the quantization error $L_{\mathrm{quant}}$ is minimized. Still, as quantization can work against search accuracy, it is often advantageous to combine the two steps [10, 18, 19, 17] and perform quantization-aware embedding. This is equivalent to solving the problem of equation 3 below. We note that, while the overall loss in these methods is a weighted sum of an accuracy loss and a quantization loss, since these two losses may comprise multiple, differently weighted terms themselves, here we assume the weights are included in the definitions.
$\min_{\theta,\, C_1, \dots, C_M} \; L_{\mathrm{acc}}(\theta) + L_{\mathrm{quant}}(\theta, C_1, \dots, C_M)$ (3)
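A minimal sketch of the quantization term in this joint objective, assuming additive reconstruction and a hypothetical array layout in which `codes[i][m]` holds the index chosen from codebook `m` for element `i`:

```python
def quantization_loss(embeddings, codebooks, codes):
    # L_quant sketch: squared error between each embedding and its additive
    # reconstruction (the sum of one chosen codeword per codebook).
    total = 0.0
    for x, assign in zip(embeddings, codes):
        recon = [0.0] * len(x)
        for m, k in enumerate(assign):
            recon = [r + c for r, c in zip(recon, codebooks[m][k])]
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return total
```

In joint training this term would be minimized together with the accuracy loss, with the weighting between the two folded into their definitions as noted above.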
In order for a subset of the quantizers ($\mathcal{S}$) to reliably compare proximities, they need to capture a substantial portion of the variability of the data. We realize this in two steps. First, we identify a subspace of the embedding space with high dataset variance. Then, we cluster the quantizers such that a small subset of them is dedicated to quantizing that subspace. To identify the high-variance subspace, we incorporate the per-dimension variances of the embeddings, $\sigma_i^2$ for $i = 1, \dots, D$, into our model training and quantization. Normally, in real-world data, there is high variance in the distribution of the $\sigma_i^2$ themselves [9]. We use the largest values of $\sigma_i^2$ to identify the high-variance subspace and allocate a few of the quantizers to it. In the rest of this section, we define an augmented loss function for this purpose, describe the learning procedure using this loss function, and discuss its features.
3.1 Augmented Loss function
We define a loss function based on the distribution of the variances $\sigma_i^2$. Variances in real-world data often follow a multimodal distribution, which we approximate using the prior defined below:
$p(\sigma^2; \Theta) = w\, \mathcal{N}(\sigma^2; 0, s_1) + (1 - w)\, \mathcal{SN}(\sigma^2; \xi, s_2, \alpha)$
This prior comprises a bimodal mixture model. The first, major mode is represented using the normal distribution $\mathcal{N}$, centered around zero to allow for pruning of redundant features. We rely on the classification loss to prevent aggressive pruning of informative features. The second, minor mode is represented using the skew normal distribution $\mathcal{SN}$, with a fixed, negative shape parameter $\alpha$, since its asymmetry attracts variances towards higher values. By choosing a higher mixing proportion for the major mode ($w > 0.5$), we encourage only a few high-value variances. The parameter $\Theta$ denotes the set of trainable parameters of this distribution. Then, we only need to minimize the negative log-likelihood of this distribution:
$L_{\mathrm{var}}(\Theta) = -\sum_{i=1}^{D} \log p(\sigma_i^2; \Theta)$ (4)
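The prior and its negative log-likelihood can be sketched as follows; the parameter values in the signature (`w=0.9`, `s1=0.1`, and so on) are illustrative placeholders rather than values used in any experiment:

```python
import math

def norm_pdf(x, loc=0.0, scale=1.0):
    # Normal density.
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

def skewnorm_pdf(x, alpha, loc, scale):
    # Skew-normal density: (2/scale) * phi(z) * Phi(alpha * z).
    z = (x - loc) / scale
    return (2.0 / scale) * norm_pdf(z) * 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))

def variance_prior_nll(variances, w=0.9, s1=0.1, loc2=1.0, s2=0.5, alpha=-4.0):
    # Negative log-likelihood of per-dimension variances under a bimodal
    # mixture: a zero-centered normal (major mode, weight w) plus a skew
    # normal (minor mode, fixed negative alpha).
    nll = 0.0
    for v in variances:
        p = w * norm_pdf(v, 0.0, s1) + (1.0 - w) * skewnorm_pdf(v, alpha, loc2, s2)
        nll -= math.log(p + 1e-12)
    return nll
```

With these placeholder settings, variances near zero score much better than stray large values, which is the pressure that concentrates most dimensions in the major mode.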
Minimizing this loss identifies a small subspace $\mathcal{V}$ corresponding to the most high-variance dimensions. Specifically, for $e_i$ being the one-hot base vector of $\mathbb{R}^D$ in the $i$th direction, we define
$\mathcal{V} = \operatorname{span}\{\, e_i \mid \sigma_i^2 \text{ is among the } d' \text{ largest variances} \,\}$ (5)
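Selecting the high-variance subspace then reduces to picking the top-variance dimensions. In this sketch, `high_variance_mask` is a hypothetical helper and `d_prime` stands in for the subspace dimension:

```python
def high_variance_mask(variances, d_prime):
    # Indicator vector over dimensions: 1 for the d_prime highest-variance
    # dimensions (the selected subspace), 0 elsewhere.
    top = sorted(range(len(variances)), key=lambda i: variances[i])[-d_prime:]
    return [1 if i in top else 0 for i in range(len(variances))]
```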
We further wish to similarly cluster the quantizers into two groups. A few quantizers, the set $\mathcal{S}$, should be allocated to quantize the embeddings projected onto $\mathcal{V}$. These will be used for fast distance comparisons. The remaining quantizers are used for the complementary subspace $\bar{\mathcal{V}}$ to reduce quantization error. Consequently, quantizers in the two groups are orthogonal. We formalize this condition below:
$c \in \mathcal{V} \;\; \forall c \in C_m, C_m \in \mathcal{S}, \qquad c \in \bar{\mathcal{V}} \;\; \forall c \in C_m, C_m \notin \mathcal{S}$
This condition is similar to the PQ method, where orthogonality of quantizers is achieved by zeroing out elements. However, unlike in PQ, where the zero elements are consecutive, here their locations are chosen flexibly by the algorithm, interleaved among the nonzero elements. Because of this, we refer to this method of quantization as Interleaved Composite Quantization (ICQ). For simplicity of the optimization, we sum the above set of equations into the condition below:
$\sum_{C_m \in \mathcal{S}} \sum_{c \in C_m} \| c \odot (\mathbf{1} - v) \|^2 \; + \sum_{C_m \notin \mathcal{S}} \sum_{c \in C_m} \| c \odot v \|^2 = 0$ (6)
where $\odot$ indicates elementwise multiplication, $\mathbf{1}$ is a vector of size $D$ with all elements equal to one, and $v$ is defined as:
$v_i = \begin{cases} 1 & e_i \in \mathcal{V} \\ 0 & \text{otherwise} \end{cases}$ (7)
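A simple way to check whether a set of codebooks satisfies this interleaved-orthogonality condition is to accumulate the energy each codeword places outside its assigned subspace; a total of zero means the condition holds. The function name and argument layout below are illustrative:

```python
def interleaving_violation(codebooks, fast_set, v):
    # Sum of squared codeword entries falling outside each quantizer's
    # assigned subspace. fast_set holds the indices of the quantizers
    # dedicated to the high-variance subspace; v is its indicator vector.
    total = 0.0
    for m, C in enumerate(codebooks):
        for c in C:
            for ci, vi in zip(c, v):
                mask = (1 - vi) if m in fast_set else vi
                total += (ci * mask) ** 2
    return total
```

Used as a penalty rather than a hard constraint, this quantity plays the role of the soft-constraint term discussed next.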
We use equations 4 and 6 as soft constraints in learning the quantizers. While this might not fully satisfy the original constraint, it is sufficient, because we ultimately need only crude estimations of distance. Based on this, we learn the ICQ quantizers through the optimization below:
$\min_{\theta,\, \Theta,\, C_1, \dots, C_M} \; L_{\mathrm{acc}}(\theta) + L_{\mathrm{quant}}(\theta, C_1, \dots, C_M) + L_{\mathrm{var}}(\Theta) + L_{\mathrm{ICQ}}(C_1, \dots, C_M)$
where $L_{\mathrm{ICQ}}$ denotes the left-hand side of equation 6 used as a penalty. The two terms $L_{\mathrm{acc}}$ and $L_{\mathrm{quant}}$ are still necessary to ensure accuracy and low quantization error. Furthermore, the full trainable parameter set here is the concatenation of $\theta$, the trainable parameters of $f_\theta$, with $\Theta$ and the quantizers. After the quantization, the set of fast quantizers $\mathcal{S}$ is obtained according to the definition below.
$\mathcal{S} = \{\, C_m \mid c \odot v = c \;\text{ for all } c \in C_m \,\}$ (8)
3.2 Optimization
Optimizing the embedding model together with the quantizers can be a difficult task due to the complex relationship of the parameters over which we optimize. As such, early approaches opted to perform these tasks separately [17]. Recent methods have facilitated this process by allowing optimization of the loss through gradient descent based on batch learning [10, 19]. ICQ can use both approaches with respect to the model and quantizer parameters. The main differences in optimization for ICQ are the added optimization of the prior parameters and the computation of the variances.
Optimization of the prior parameters $\Theta$ can be performed using either Expectation Maximization (EM) or gradient descent; EM methods exist that can be applied to skew normal distributions [14]. For simplicity of the optimization, we choose to use gradient methods.
Computation of the variances $\sigma_i^2$ normally requires computing all embeddings using the model parameters $\theta$. This can be impractically expensive, since $\theta$, and as a result the embeddings, change constantly during training, and it is incompatible with batch learning. To get around this issue, for each batch in an epoch, we estimate the dataset variance using the variances of all the batches observed thus far. We compute this estimate using online variance computation, as shown below:
$\mu_b = \mu_{b-1} + \frac{n_b}{N_b}\,\Delta, \qquad \sigma_b^2 = \frac{N_{b-1}\,\sigma_{b-1}^2 + n_b\, s_b^2}{N_b} + \frac{N_{b-1}\, n_b}{N_b^2}\,\Delta^2$ (9)
where $n_b$ is the size of batch $b$, $N_b = N_{b-1} + n_b$, and $\Delta = \bar{x}_b - \mu_{b-1}$.
Here, $\sigma_b^2$ and $\mu_b$ are estimates of the variance and mean of the dataset up to batch number $b$, and $s_b^2$ and $\bar{x}_b$ are the sample variance and mean of this batch. This online computation of the variance requires only a small sample-variance computation each time, and is thus very fast and requires almost no additional memory. Furthermore, as we gather more information about the dataset in each epoch, we improve our estimate of the dataset variance.
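The batch-wise update can be sketched with the standard two-group combination rule for means and (population) variances, shown here for a scalar stream; the exact update used in training may differ in detail:

```python
def update_running_stats(n_prev, mean_prev, var_prev, batch):
    # Fold one batch's sample mean/variance into running dataset estimates
    # (the parallel combination rule of Chan et al., population variances).
    n_b = len(batch)
    mean_b = sum(batch) / n_b
    var_b = sum((x - mean_b) ** 2 for x in batch) / n_b
    n = n_prev + n_b
    delta = mean_b - mean_prev
    mean = mean_prev + delta * n_b / n
    var = (n_prev * var_prev + n_b * var_b) / n + (n_prev * n_b / n ** 2) * delta ** 2
    return n, mean, var
```

Applied per embedding dimension, this keeps the cost per batch proportional to the batch size rather than the dataset size.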
3.3 Learning Robustness
While most parameters in the proposed loss are learned, there are three key parameters that we have chosen to fix, and our choice of these values can affect the outcome of the optimization: the skewness $\alpha$ of the minor mode and the weights of the two modes.
The parameter $\alpha$ controls the skewness of the skew normal density; large absolute values can essentially create a half-normal distribution. However, our only requirement on $\alpha$ is that the resulting mode be sufficiently asymmetrical to attract variances towards higher values, which holds for a range of fixed negative values.
Conversely, the loss can be sensitive to the weights of the two modes. That is because the case where all variances $\sigma_i^2$ belong to only one of the modes lies in the feasible set. If all variances fall into one mode, the method tries to minimize or maximize all of them, introducing noise into the data. We get around this issue by adding the probability of the minor mode to the loss with a small weight $\lambda$:
$L'_{\mathrm{var}}(\Theta) = L_{\mathrm{var}}(\Theta) - \lambda \sum_{i=1}^{D} \log \mathcal{SN}(\sigma_i^2; \xi, s_2, \alpha)$ (10)
The new component in $L'_{\mathrm{var}}$ is the probability of the second mode and is added for robustness. Adding this component guarantees that the second mode is not emptied out, which would delete useful information.
3.4 Search Operation
The search process is a key differentiator between the proposed method and existing techniques. The search operation divides the normal search process into two steps: approximate distance comparisons and accurate comparisons. In a conventional search, we normally maintain a list of the nearest neighbors to a query and update this list while sifting through the dataset. If a new dataset element is closer to the query than the furthest element in the list, as computed via equation 1, it replaces that element in the list. While in concept this process is serial, it can be parallelized efficiently [20, 8].
We also maintain a list of the nearest neighbors, but when testing a new dataset element, we first perform a fast distance comparison against the furthest element according to equation 2. Only if this comparison is satisfied do we perform the exact comparison according to equation 1. During the crude distance comparisons, the margin $\delta$ in equation 2 accounts for the distance uncertainty of the two dataset elements in the complementary subspace $\bar{\mathcal{V}}$. Thus, we use the variance of the dataset in this subspace.
$\delta = \sum_{i : v_i = 0} \sigma_i^2$ (11)
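Putting the two steps together, a sketch of the search loop might look as follows, with `fast_dist` standing in for the cheap comparison over the fast quantizer subset and `exact_dist` for the full asymmetric distance (both hypothetical callables):

```python
import heapq

def two_step_knn(n_items, fast_dist, exact_dist, margin, k):
    # Two-step k-NN sketch: a cheap bound prunes most candidates, and the
    # exact distance is computed only for the survivors.
    heap = []  # max-heap via negated distances: (-exact_distance, index)
    for i in range(n_items):
        if len(heap) < k:
            heapq.heappush(heap, (-exact_dist(i), i))
        elif fast_dist(i) < -heap[0][0] + margin:
            # Candidate survives the crude test: refine with the exact distance.
            d = exact_dist(i)
            if d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i))
    return sorted((-d, i) for d, i in heap)  # (distance, index), ascending
```

When the crude and exact distances coincide and the margin is zero, this degenerates to the conventional exhaustive update, so correctness hinges on choosing a margin that covers the variability of the pruned quantizers.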
4 Evaluations
We evaluate the performance of our ICQ method on both synthetic and real-world datasets, and compare with prominent methods proposed in the literature, specifically Supervised Quantization (SQ) [17] and Product Quantization Network (PQN) [19]. SQ uses a linear mapping for embedding and Composite Quantization (CQ) for quantization, while PQN, the current state of the art, uses a CNN for embedding and PQ for quantization. For comparison with each method, we learn a similar embedding method and replace the quantization step with ICQ. For these comparisons, we use the widely used Mean Average Precision (MAP) metric and show that the proposed method is generally able to outperform existing methods.
4.1 Setup
We use both synthetic and real-world datasets to verify the performance of the proposed method. The synthetic datasets were generated using the method of [6], which gives us control over the number of informative and redundant features in the generated dataset (Table 1). We expect better performance from our approach when the number of informative features is smaller than the chosen subspace dimension $d'$. We also use the MNIST [2] and CIFAR10 [11] datasets to evaluate the performance of the proposed method on real-world data.
Table 1: Synthetic dataset specifications.

Dataset    # training  # test  # features  # informative
Dataset 1  10000       1000    64          32
Dataset 2  10000       1000    64          16
Dataset 3  10000       1000    64          8
In experiments on the synthetic datasets, we compare our approach with SQ [17]. We use the same quantizer size as SQ and fix the subspace dimension $d'$. Other hyperparameters are chosen in the manner described by [17]. Each comparison is performed for the same code length and quantizer size.
In the experiments on real-world datasets, we compare our method with both SQ and PQN. For each experiment, we use the same code length as the baseline. We perform two sets of comparisons. First, we train the model on all of the training set and use all of the test set for evaluations. These experiments show the advantage of using the proposed method under the same evaluation conditions as the baselines. Recently, however, [16] showed that it is important to evaluate supervised encoding methods under conditions in which not all classes are known. As such, our second set of tests makes use of the unknown-classes setup suggested by [16], choosing a subset of the classes at random for training on the training set, and testing over the remaining classes. We show that under this condition, we can also outperform existing methods.
4.2 Experiments
Synthetic Dataset Comparison
For the accuracy tests on the synthetic datasets, we compare our approach with SQ, both when it is combined with PQ and with CQ. Results are shown in Figure 1 and Figure 2, respectively. Each point in these diagrams is obtained by first training the encoding with one code length, and then averaging the number of operations required to perform a search over the test dataset.
Figure 1 shows that, overall, ICQ is able to perform faster searches and, for the same speed, achieves higher precision, due to the efficiency of its approximate distance comparisons. To verify that these improvements are in fact a result of the proposed technique and not of the different quantization method, we also compared our technique with SQ based on CQ. Results are shown in Figure 2. While SQ-CQ produces good results when the number of informative features and the embedding space dimension are close (dataset 3), ICQ performs better when there are many informative features, due to the high effective dimensionality of the proposed technique.
Realworld Dataset Comparison
For the real-world datasets, we compare our results with previous works for different numbers of quantizers ($M$). First, similar to the previous experiments, we compare ICQ combined with a linear mapping against SQ-CQ. We also use the same setup to compare ICQ against methods with deep-learning-based embeddings. Then, we compare ICQ with PQN when both use a CNN for embedding. Finally, we show that ICQ outperforms previous methods for the case where some classes are excluded during training.
We have included the results comparing with SQ on the MNIST and CIFAR10 datasets in Figure 3. As Figures 3(a) and 3(c) show, for $M = 2$ both approaches have the same computation load. That is because the proposed approach has to utilize both quantizers to quantize the whole embedding space and thus skips the crude distance estimation. Looking at Figures 3(b) and 3(d), this case sacrifices significant precision in both ICQ and SQ. Increasing the number of quantizers helps improve precision in both SQ and ICQ, but imposes significantly lower computation costs for ICQ. Figures 3(a) and 3(c) show that with more quantizers, the computation cost gap between the two approaches increases, and with enough quantizers for precision to peak, ICQ is significantly faster.
Next, we compare ICQ combined with the same linear embedding against several recently proposed quantization methods, and show that even without a sophisticated embedding method, the proposed approach provides competitive results, improving search speed without shorter codes. For the sake of a fair comparison between ICQ and existing methods, we compute an effective code length for ICQ as follows: for ICQ using code length $l$, the effective code length $l_{\mathrm{eff}}$ is the code length that SQ would have to use to achieve the same search speed. That is, assuming $O^{\mathrm{ICQ}}_l$ and $O^{\mathrm{SQ}}_l$ are the average operation counts for code length $l$ under ICQ and SQ, $l_{\mathrm{eff}}$ is defined by
$O^{\mathrm{SQ}}_{l_{\mathrm{eff}}} = O^{\mathrm{ICQ}}_{l}$ (12)
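Since SQ's operation counts are measured only at discrete code lengths, one way to realize this definition is to interpolate linearly between measured points; the interpolation scheme here is our assumption for illustration:

```python
def effective_code_length(icq_ops, sq_curve):
    # Effective code length: the (interpolated) code length at which SQ's
    # average op count matches ICQ's. sq_curve is a list of
    # (code_length, ops) pairs sorted by code length, ops increasing.
    for (l0, o0), (l1, o1) in zip(sq_curve, sq_curve[1:]):
        if o0 <= icq_ops <= o1:
            return l0 + (l1 - l0) * (icq_ops - o0) / (o1 - o0)
    raise ValueError("icq_ops outside the measured SQ range")
```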
We compare the MAP over the CIFAR10 dataset achieved by ICQ at different effective code lengths against Deep Quantization Network (DQN) [1] and Deep Product Quantization (DPQ) [10] in Figure 4. ICQ outperforms both SQ and DQN at the same effective code length. Further, it outperforms DPQ at large code lengths while using a simpler embedding method.
Finally, we compare our method with PQN [19]. For this comparison, we use a CNN for embedding, similar to PQN, and replace its quantization method, PQ, with ICQ. We use LeNet [13] for MNIST and AlexNet [12] for CIFAR10, with 512- and 1024-dimensional embeddings, respectively. Since PQN trains on triplets of inputs, we randomly generate 400K triplets in each case. We perform all comparisons for the same code length. Results are shown in Figure 5 for MNIST and CIFAR10, demonstrating the advantage of the proposed method over the state of the art. In contrast to the comparison with SQ, we have an advantage over PQN even for smaller code lengths. That is because ICQ, similar to the CQ used by SQ, allows quantizers to be dense in the embedding space, whereas PQ quantizers are sparse and thus normally incur higher quantization errors. More importantly, as $M$ is increased, we see that for the same code lengths the proposed method performs searches faster, which is the result of the two-step search operation. This advantage persists even if we combine PQN with CQ. In summary, the proposed method consistently outperforms PQN.
Comparison over unseen classes
We use the methodology proposed by [16] to further evaluate ICQ combined with a linear mapping, and compare the results with SQ. In these experiments, we leave out three randomly selected classes during training, and report search accuracy over these three classes. The results of these experiments for different encoding lengths over MNIST and CIFAR10 are shown in Figure 6. We see that ICQ also outperforms the baseline in searching over unseen classes.
5 Conclusion
We proposed Interleaved Composite Quantization (ICQ), a method that enables fast similarity search. The proposed technique learns a prior on the distribution of the variances of the dataset and, by incorporating this prior into the learning of the quantizers, is able to perform fast approximate distance calculations. This approximation is sufficient for similarity search in most cases; the computationally heavy exact comparisons can be reserved for disambiguating corner cases. We tested ICQ over several datasets and showed that it consistently outperforms the current state of the art.
References

[1] (2016) Deep quantization network for efficient image retrieval. In AAAI, pp. 3457–3463.
[2] (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6), pp. 141–142.
[3] (2013) Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953.
[4] (2012) Vector quantization and signal compression. Vol. 159, Springer Science & Business Media.
[5] (1990) Vector quantization. Readings in Speech Recognition 1(2), pp. 75–100.
[6] (2003) Design of experiments of the NIPS 2003 variable selection benchmark.
[7] (2010) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), pp. 117–128.
[8] (2017) Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
[9] (2014) Locally optimized product quantization for approximate nearest neighbor search. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2336.
[10] (2017) In defense of product quantization. arXiv preprint arXiv:1711.08589.
[11] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[12] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[13] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
[14] (2007) Finite mixture modelling using the skew normal distribution. Statistica Sinica 17(3), pp. 909–927.
[15] (2013) Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024.
[16] (2016) How should we evaluate supervised hashing? CoRR abs/1609.06753.
[17] (2016) Supervised quantization for similarity search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2018–2026.
[18] (2017) Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems, pp. 5745–5755.
[19] (2018) Product quantization network for fast image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 186–201.
[20] (2018) Efficient large-scale approximate nearest neighbor search on OpenCL FPGA. In 2018 Conference on Computer Vision and Pattern Recognition (CVPR'18).
[21] (2014) Composite quantization for approximate nearest neighbor search. In ICML, Vol. 2, pp. 3.