Supervised deep learning-based hash and vector quantization are enabling fast and large-scale image retrieval systems. By fully exploiting label annotations, they are achieving outstanding retrieval performances compared to the conventional methods. However, it is painstaking to assign labels precisely for a vast amount of training data, and also, the annotation process is error-prone. To tackle these issues, we propose the first deep unsupervised image retrieval method dubbed Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner. We design a Cross Quantized Contrastive learning strategy that jointly learns codewords and deep visual descriptors by comparing individually transformed images (views). Our method analyzes the image contents to extract descriptive features, allowing us to understand image representations for accurate retrieval. By conducting extensive experiments on benchmarks, we demonstrate that the proposed method yields state-of-the-art results even without supervised pretraining.
Approximate Nearest Neighbor (ANN) search has received much attention in image retrieval research due to its low storage cost and fast search speed. There are two mainstream approaches in ANN research: Hashing and Vector Quantization (VQ). Both aim to transform high-dimensional image data into compact binary codes while preserving semantic similarity; they differ in how the distance between the binary codes is measured.
In the case of hashing methods [7, 43, 17, 15, 32], the distance between binary codes is calculated using the Hamming distance, i.e., a simple XOR operation followed by a bit count. However, this approach is limited in that the distance can take only a few distinct values, so complex distance relationships cannot be represented. To alleviate this issue, VQ-based methods [23, 13, 24, 2, 48, 3, 49] have been proposed, which instead exploit quantized real-valued vectors in distance measurement. Among these, Product Quantization (PQ) is one of the best methods, delivering retrieval results quickly and accurately.
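The limitation of the Hamming distance can be illustrated with a minimal Python sketch. The two 8-bit codes below are made-up values, and `hamming_distance` is a hypothetical helper name, not part of any cited method.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    # XOR leaves a 1 exactly where the codes disagree; count those ones.
    return bin(code_a ^ code_b).count("1")


num_bits = 8
code_a = 0b10110010  # illustrative 8-bit hash code
code_b = 0b10011010  # illustrative 8-bit hash code
distance = hamming_distance(code_a, code_b)

# A pair of B-bit codes can only take the B + 1 integer distances 0..B.
possible_distances = list(range(num_bits + 1))
print(distance, possible_distances)
```

With only `num_bits + 1` possible values, many gallery items tie at the same distance from a query, which is the coarse ranking behavior that real-valued codewords are meant to avoid.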
The essence of PQ is to decompose a high-dimensional space of feature vectors (image descriptors) into a Cartesian product of several subspaces. Each image descriptor is then divided into several subvectors according to the subspaces, and the subvectors are clustered to form centroids. As a result, the codebook of each subspace is composed of the corresponding centroids (codewords), which are regarded as quantized representations of the images. The distance between two binary codes in the PQ scheme is asymmetrically approximated using the real-valued codewords and a look-up table, resulting in richer distance representations than hashing.
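The decomposition described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the codebooks are hand-written toy values rather than centroids obtained by clustering, and the function names are hypothetical.

```python
def split(vec, num_subspaces):
    """Split a D-dimensional vector into equal-length subvectors."""
    sub_dim = len(vec) // num_subspaces
    return [vec[i * sub_dim:(i + 1) * sub_dim] for i in range(num_subspaces)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pq_encode(vec, codebooks):
    """Return, for every subspace, the index of the nearest codeword."""
    subvectors = split(vec, len(codebooks))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(sub, codebook[k]))
        for sub, codebook in zip(subvectors, codebooks)
    ]


# Toy setting (assumed): D = 4, two subspaces, two codewords per codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook of subspace 1
    [[0.0, 1.0], [1.0, 0.0]],  # codebook of subspace 2
]
descriptor = [0.9, 0.8, 0.1, 0.7]
codes = pq_encode(descriptor, codebooks)
print(codes)
```

The descriptor is stored as one codeword index per subspace, so two codebooks of two codewords each can represent four distinct reconstructed vectors.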
Recently, supervised deep hashing methods [44, 5, 22, 29, 47] have shown promising results for large-scale image retrieval systems. However, since binary hash codes cannot be directly applied to learn deep continuous representations, performance degradation is inevitable compared to retrieval using real-valued vectors. To address this problem, quantization-based deep image retrieval approaches have been proposed in [20, 4, 46, 31, 26, 21]. By introducing differentiable quantization methods on continuous deep image feature vectors (deep descriptors), they allow deep representations to be learned directly in the real-valued space.
Although deep supervised image retrieval systems provide outstanding performance, they need expensive training data annotations. Hence, deep unsupervised hashing methods have also been proposed [30, 11, 39, 50, 41, 14, 36, 45, 37], which investigate image similarity to discover semantically distinguishable binary codes without annotations. However, while quantization-based methods have advantages over hashing-based ones, only limited studies adopt quantization for deep unsupervised retrieval. For example, an existing approach employed pre-extracted visual descriptors instead of images for the unsupervised quantization.
In this paper, we propose the first unsupervised end-to-end deep quantization-based image retrieval method, the Self-supervised Product Quantization (SPQ) network, which jointly learns the feature extractor and the codewords. As shown in Figure 1, the main idea of SPQ is based on self-supervised contrastive learning [8, 40, 6]. We regard two different “views” (individually transformed outputs) of a single image as correlated and, conversely, views generated from other images as uncorrelated. To train PQ codewords, we introduce Cross Quantized Contrastive learning, which maximizes the cross-similarity between the correlated deep descriptor and the product quantized descriptor. This strategy leads both deep descriptors and PQ codewords to become discriminative, allowing the SPQ framework to achieve high retrieval accuracy.
To demonstrate the efficiency of our proposal, we conduct experiments under various training conditions. Specifically, unlike previous methods that utilize pretrained model weights learned from a large labeled dataset, we conduct experiments with “truly” unsupervised settings where human supervision is excluded. Despite the absence of label information, SPQ achieves state-of-the-art performance.
The contributions of our work are summarized as:
To the best of our knowledge, SPQ is the first deep unsupervised quantization-based image retrieval scheme, where both feature extraction and quantization are included in a single framework and trained in a self-supervised fashion.
By introducing a cross quantized contrastive learning strategy, the deep descriptors and the PQ codewords are jointly learned from two different views, delivering discriminative representations to obtain high retrieval scores.
Extensive experiments on fast image retrieval protocol datasets verify that our SPQ shows state-of-the-art retrieval performance even for the truly unsupervised settings.
This section categorizes image retrieval algorithms regarding whether or not deep learning is utilized (conventional versus deep methods) and briefly explains the approaches. For a more comprehensive understanding, refer to a survey paper.
Conventional methods. One of the most common strategies for fast image retrieval is hashing. For example, Locality Sensitive Hashing (LSH) employed random linear projections to hash. Spectral Hashing (SH) and Discrete Graph Hashing (DGH) exploited graph-based approaches to preserve data similarity of the original feature space. K-means Hashing (KMH) and Iterative Quantization (ITQ) focused on minimizing the quantization errors that occur when mapping the original feature to discrete binary codes.
Another fast image retrieval strategy is vector quantization. This family includes Product Quantization (PQ) and its improved variants, Optimized PQ (OPQ) and Locally Optimized PQ (LOPQ), as well as methods with different quantizers, such as Additive, Composite, Tree, and Sparse Composite Quantizers. Our SPQ belongs to the PQ family, where the deep feature data space is divided into several disjoint subspaces. The divided deep subvectors are then trained with our proposed loss function to find the optimal codewords.
Deep methods. Supervised deep convolutional neural network (CNN)-based hashing approaches [44, 5, 22, 29, 47] have shown superior performance in many image retrieval tasks. There are also quantization-based deep image retrieval methods [4, 26], which use pretrained CNNs and fine-tune the network to train robust codewords together. For improvement, metric learning schemes are applied in [46, 31, 21] to learn codewords and deep representations together with the pairwise semantic similarity. Note that we also utilize a type of metric learning, i.e., contrastive learning; however, our method requires no label information in learning the codewords.
Regarding unsupervised deep image retrieval, most works are based on hashing. To be specific, generative mechanisms are utilized in [11, 39, 50, 14], and graph-based techniques are employed in [36, 37]. Notably, DeepBit has a similar concept to SPQ in that the distance between the transformed image and the original one is minimized. However, its hash code representation is limited because only a simple rotational transformation is exploited. In terms of deep quantization, only one study exists, dubbed Unsupervised Neural Quantization (UNQ), which uses pre-extracted visual descriptors instead of the images themselves to find the codewords.
To improve the quality of image descriptors and codewords for unsupervised deep PQ-based retrieval, we configure SPQ with a feature extractor to explore the entire image information. Then, we jointly learn every component of SPQ in a self-supervised fashion. Similar to [8, 40, 6], the full knowledge of the dataset is augmented with several transformations such as crop and resize, flip, color distortion, and Gaussian blurring. By cross-contrasting differently augmented images, both image descriptors and codewords become discriminative to achieve a high retrieval score.
The goal of an image retrieval model is to learn a mapping $\mathcal{R}: x \mapsto \hat{b}$, where $\mathcal{R}$ denotes the overall system, $x$ is an image included in a dataset $\mathcal{X} = \{x_n\}_{n=1}^{N}$ of $N$ training samples, and $\hat{b} \in \{0, 1\}^{B}$ is a $B$-bit binary code. As shown in Figure 2, $\mathcal{R}$ of the SPQ contains a deep CNN-based feature extractor $\mathcal{F}$ which outputs a compact deep descriptor (feature vector) $\hat{x} \in \mathbb{R}^{D}$. Any CNN architecture can be exploited as a feature extractor, as long as it can accommodate a fully connected layer, e.g., AlexNet, VGG, or ResNet. We configure the baseline network architecture with ResNet50, which generally shows outstanding performance in image representation learning; details are reported in Section 4.2.
Regarding the quantization for fast image retrieval, the system $\mathcal{R}$ employs $M$ codebooks $\{C_1, \dots, C_M\}$ in its quantization head $\mathcal{Q}$, where each codebook $C_m$ consists of $K$ codewords $c \in \mathbb{R}^{D/M}$ as $C_m = \{c_{m1}, \dots, c_{mK}\}$, with $D$ the dimension of the deep descriptor. PQ is conducted in $\mathcal{Q}$ by dividing the deep feature space into the Cartesian product of multiple subspaces. Every codebook of the corresponding subspace exhibits several distinctive characteristics representing the image dataset $\mathcal{X}$. Each codeword belonging to a codebook represents a clustered centroid of a divided deep descriptor, which aims to hold a local pattern that frequently occurs. During quantization, similar properties between images are shared by being assigned to the same codeword, whereas distinguishable features are assigned to different codewords. As a result, various distance representations for efficient image retrieval are achieved.
For better understanding, we briefly describe the training scheme of SPQ in Algorithm 1, where $\theta_{\mathcal{F}}$ and $\theta_{\mathcal{Q}}$ denote the trainable parameters of the feature extractor and the quantization head, respectively, and $\gamma$ is a learning rate. In this case, $\theta_{\mathcal{Q}}$ represents a collection of codebooks. The training scheme and quantization process of SPQ are detailed below.
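First of all, to train the feature extractor $\mathcal{F}$ and the quantization head $\mathcal{Q}$ in an end-to-end manner, and to make all codewords contribute to training, we need to work around the infeasible derivative calculation of hard assignment quantization. For this, following a previously proposed method, we introduce soft quantization on the quantization head with the soft quantizer $q_m(\cdot)$ as:
$$z_m = \sum_{k=1}^{K} \frac{\exp\left(-\|\hat{x}_m - c_{mk}\|_2^2 / \tau_q\right)}{\sum_{k'=1}^{K} \exp\left(-\|\hat{x}_m - c_{mk'}\|_2^2 / \tau_q\right)} \, c_{mk}$$
where $\hat{x}_m$ is the $m$-th subvector of the deep descriptor, $c_{mk}$ is the $k$-th codeword of the $m$-th codebook $C_m$, $\tau_q$ is a non-negative temperature parameter that scales the input of the softmax, and $\|\cdot\|_2^2$ denotes the squared Euclidean distance to measure the similarity between inputs. In this fashion, the sub-quantized descriptor $z_m = q_m(\hat{x}_m; \tau_q, C_m)$ can be regarded as an exponentially weighted sum of the codewords that belong to $C_m$. Note that all codewords in the codebook are utilized to approximate the quantized output, where the closest codeword contributes the most.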
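The soft quantizer can be sketched as a softmax over negative squared distances. The following pure-Python example is a minimal sketch, assuming that the temperature divides the softmax input; the codebook values and the name `soft_quantize` are illustrative and are not taken from the released implementation.

```python
import math


def soft_quantize(subvector, codebook, tau):
    """Return the softmax-weighted sum of codewords and the weights used."""
    # Softmax input: negative squared Euclidean distance, scaled by tau.
    logits = [
        -sum((a - b) ** 2 for a, b in zip(subvector, codeword)) / tau
        for codeword in codebook
    ]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    quantized = [
        sum(w * codeword[d] for w, codeword in zip(weights, codebook))
        for d in range(len(subvector))
    ]
    return quantized, weights


codebook = [[0.0, 0.0], [1.0, 1.0]]  # toy codebook with two codewords
subvector = [0.9, 0.8]

z_soft, w_soft = soft_quantize(subvector, codebook, tau=1.0)
z_sharp, w_sharp = soft_quantize(subvector, codebook, tau=0.01)
print(z_soft, w_soft)
print(z_sharp, w_sharp)
```

Every codeword receives a non-zero weight, so gradients can reach all of them. As the temperature shrinks, the output approaches the nearest codeword, i.e., hard assignment.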
Besides, unlike previous deep PQ approaches [46, 21], we exclude intra-normalization, which is known to minimize the impact of bursty visual features when concatenating the sub-quantized descriptors to obtain the whole product quantized descriptor $\hat{z} = [z_1, \dots, z_M]$. Since SPQ is trained without the human supervision that would assist in finding distinct features, we focus on catching dominant visual features rather than balancing the influence of every codebook.
To learn the deep descriptors and the codewords together, we propose a cross quantized contrastive learning scheme. Inspired by contrastive learning [8, 40, 6], we compare the deep descriptors and the product quantized descriptors of various views (transformed images). As observed in Figure 2, the deep descriptor and the product quantized descriptor are treated as correlated if the views originate from the same image, and as uncorrelated if the views originate from different images. Note that, to increase the generalization capacity of the codewords, the correlation between a deep descriptor and its own quantized descriptor ($\hat{x}_n$ and $\hat{z}_n$) is ignored. This is because the contribution of other codewords decreases when the agreement between the subvector and the nearest codeword is maximized.
For a given mini-batch of size $N_B$, we randomly sample $N_B$ examples from the database $\mathcal{X}$ and apply a random combination of augmentation techniques to each image twice to generate $2N_B$ data points (views). Inspired by [9, 8, 6], we take into account that two separate views of the same image are correlated, and the other views originating from different images within a mini-batch are uncorrelated. On this assumption, we design a cross quantized contrastive loss function to learn the correlated pair of examples as:
$$\ell_{(i,j)} = -\log \frac{\exp\left(S(i,j)/\tau_{cqc}\right)}{\sum_{n'=1}^{N_B} \mathbb{1}_{[n' \neq j]} \exp\left(S(i,n')/\tau_{cqc}\right)}$$
where $S(i,j)$ denotes the cosine similarity between the deep descriptor $\hat{x}_i$ and the product quantized descriptor $\hat{z}_j$, $\tau_{cqc}$ is a non-negative temperature parameter, and $\mathbb{1}_{[n' \neq j]} \in \{0, 1\}$ is an indicator that evaluates to 1 iff $n' \neq j$. Notably, to reduce redundancy between $\hat{x}$ and $\hat{z}$, which are similar to each other, the loss is computed for half of the uncorrelated samples in the batch. The cosine similarity is used as a distance measure to avoid norm deviations between $\hat{x}$ and $\hat{z}$.
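To make the cross-view comparison concrete, the following pure-Python sketch contrasts the deep descriptors of one view with the quantized descriptors of the other view. It is a simplified illustration, not the authors' loss: it uses the common InfoNCE form, in which the correlated pair also appears in the denominator, and it averages the two cross directions. All function names and toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_contrastive(descriptors, quantized_other_view, tau):
    """Contrast descriptor i with the other view's quantized descriptors."""
    losses = []
    for i, x in enumerate(descriptors):
        logits = [cosine(x, z) / tau for z in quantized_other_view]
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # The correlated pair shares the image index i across the two views.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def cqc_loss(x_a, x_b, z_a, z_b, tau=0.5):
    """Average the two cross directions: (x_a, z_b) and (x_b, z_a)."""
    return 0.5 * (cross_contrastive(x_a, z_b, tau) + cross_contrastive(x_b, z_a, tau))


# Toy batch of two images; a and b are the two views (values are made up).
x_a = [[1.0, 0.0], [0.0, 1.0]]
x_b = [[0.9, 0.1], [0.1, 0.9]]
z_a = [[1.0, 0.1], [0.1, 1.0]]
z_b = [[0.8, 0.2], [0.2, 0.8]]

aligned = cqc_loss(x_a, x_b, z_a, z_b)
shuffled = cqc_loss(x_a, x_b, z_a[::-1], z_b[::-1])
print(aligned, shuffled)
```

The loss is small when each descriptor agrees with the quantized descriptor of its own image in the other view, and it grows when the pairing is shuffled. A descriptor is never compared with its own quantized output from the same view.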
Concerning the data augmentation for generating various views, we employ five popular techniques: (1) resized crop to treat local, global, and adjacent views, (2) horizontal flip to handle mirrored inputs, (3) color jitter to deal with color distortions, (4) grayscale to focus more on intensity, and (5) Gaussian blur to cope with noise in the image. The default setup is taken directly from prior work, where all transformations are randomly applied in sequential order (1-5). As an exception, we set the color jitter strength to 0.5 to fit SPQ, following empirical observation. In the end, SPQ is capable of interpreting the contents of an image by contrasting different views of images in a self-supervised way.
Image retrieval is performed in two steps, similar to PQ. First, the retrieval database consisting of binary codes is configured with a dataset of gallery images. By employing the feature extractor $\mathcal{F}$, the deep descriptor $\hat{x}_g$ is obtained from a gallery image $x_g$ and divided into $M$ equal-length subvectors as $\hat{x}_g = [\hat{x}_{g,1}, \dots, \hat{x}_{g,M}]$. Then, the nearest codeword of each subvector is searched by computing the squared Euclidean distance ($\|\cdot\|_2^2$) between the subvector $\hat{x}_{g,m}$ and every codeword in the codebook $C_m$. Next, the index of the nearest codeword is formatted as a binary code to generate a sub-binary code $b_m$. Finally, all the sub-binary codes are concatenated to generate the $B$-bit binary code $\hat{b} = [b_1, \dots, b_M]$, where $B = M \log_2 K$ and $K$ is the number of codewords per codebook. We repeat this process for all gallery images to build a binary encoded retrieval database.
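The formatting of codeword indices as a binary code can be sketched as follows. The indices and the codebook size are illustrative assumptions, and `indices_to_binary_code` is a hypothetical helper name.

```python
def indices_to_binary_code(indices, num_codewords):
    """Concatenate fixed-width binary representations of codeword indices."""
    # Bits needed to address every codeword; equals log2(K) for powers of two.
    bits_per_index = (num_codewords - 1).bit_length()
    return "".join(format(index, f"0{bits_per_index}b") for index in indices)


# Assumed toy setting: 4 codebooks with 4 codewords each, so 4 * 2 = 8 bits.
nearest_indices = [3, 0, 2, 1]
binary_code = indices_to_binary_code(nearest_indices, num_codewords=4)
print(binary_code, len(binary_code))
```

Each gallery image is therefore stored with one short index per codebook, while the real-valued codewords are kept once in the codebooks.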
Moving on to the retrieval stage, we apply the same feature extractor $\mathcal{F}$ and dividing process to a query image $x_q$ to extract the deep descriptor $\hat{x}_q$ and a set of its subvectors as $\hat{x}_q = [\hat{x}_{q,1}, \dots, \hat{x}_{q,M}]$. The Euclidean distance is utilized to measure the similarity between the subvectors and every codeword of all codebooks to construct a pre-computed look-up table. The distance calculation between the query and the gallery is asymmetrically approximated and accelerated by summing up the look-up results.
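The look-up table search can be sketched in pure Python. This is a minimal sketch under stated assumptions: the table stores squared Euclidean distances, so that the summed entries equal the squared distance between the query and the reconstructed gallery vector, and the codebooks, gallery codes, and function names are made up for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_lookup_table(query, codebooks):
    """table[m][k]: distance from query subvector m to codeword k of codebook m."""
    sub_dim = len(query) // len(codebooks)
    return [
        [sq_dist(query[m * sub_dim:(m + 1) * sub_dim], codeword) for codeword in codebook]
        for m, codebook in enumerate(codebooks)
    ]


def asymmetric_distance(table, code):
    """Sum the table entries selected by the stored codeword indices."""
    return sum(table[m][k] for m, k in enumerate(code))


codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Gallery items are stored only as codeword indices (one per codebook).
gallery = {"img_a": [1, 0], "img_b": [0, 1]}
query = [0.9, 0.8, 0.1, 0.7]

table = build_lookup_table(query, codebooks)
ranking = sorted(gallery, key=lambda name: asymmetric_distance(table, gallery[name]))
print(ranking)
```

The query stays real-valued while the gallery is quantized, which is why the comparison is called asymmetric. The table is built once per query, and ranking each gallery item then costs one look-up and addition per codebook.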
To evaluate the performance of SPQ, we conduct comprehensive experiments on three public benchmark datasets, following experimental protocols in recent unsupervised deep image retrieval methods [50, 45, 37].
CIFAR-10 contains 60,000 images of size $32 \times 32$ in 10 class labels, and each class has 6,000 images. We select 5,000 images per class as a training set and 100 images per class as a query set. The entire training set of 50,000 images is utilized to build a retrieval database.
FLICKR25K consists of 25,000 images with various resolutions collected from the Flickr website. Every image is manually annotated with at least one of the 24 semantic labels. We randomly take 2,000 images as a query set and employ the remaining 23,000 images to build a retrieval database, of which 5,000 images are utilized for training.
NUS-WIDE has nearly 270,000 images with various resolutions in 81 unique labels, where each image belongs to one or more labels. We pick out the images containing the 21 most frequent categories, a total of 169,643 images, to perform experiments. We randomly choose a total of 10,500 images as a training set with at least 500 images per category, a total of 2,100 images as a query set with at least 100 images per category, and the remaining images as a retrieval database.
Evaluation Metrics. We employ mean Average Precision (mAP) to evaluate the retrieval performance. Specifically, in the case of multi-label image retrieval on the FLICKR25K and NUS-WIDE datasets, a retrieved image is considered relevant even if only one of its labels matches. We vary the number of bits allocated to the binary code to measure the mAP scores of the retrieval approaches, using mAP@1,000 for the CIFAR-10 dataset and mAP@5,000 for the FLICKR25K and NUS-WIDE datasets, following the evaluation method in [37, 45]. In addition, using the 64-bit hash codes of different algorithms, we draw Precision-Recall (PR) curves to compare the precisions at different recall levels and report precision curves with respect to the 1,000 top returned samples (P@1,000) to contrast the ratio of correctly retrieved results.
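The mAP@k metric with the multi-label relevance rule can be sketched as follows. The normalization is an assumption, since the text does not spell it out: here the average precision is divided by the number of relevant items found in the top k. The label sets are made-up examples.

```python
def average_precision_at_k(query_labels, ranked_gallery_labels, k):
    """AP over the top-k results; relevant if at least one label is shared."""
    hits = 0
    precision_sum = 0.0
    for rank, labels in enumerate(ranked_gallery_labels[:k], start=1):
        if query_labels & labels:  # non-empty intersection means relevant
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, k):
    """Average AP@k over (query_labels, ranked_gallery_labels) pairs."""
    return sum(average_precision_at_k(q, ranked, k) for q, ranked in queries) / len(queries)


queries = [
    ({"cat"}, [{"cat"}, {"dog"}, {"cat", "sky"}]),
    ({"dog"}, [{"cat"}, {"dog"}]),
]
ap_first = average_precision_at_k(*queries[0], k=3)
map_score = mean_average_precision(queries, k=3)
print(ap_first, map_score)
```

In the first query, the third result counts as relevant because it shares the label `cat`, even though it also carries a label the query does not have.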
Implementation details. We compare against three categories of baseline approaches: (1) shallow methods without deep learning, based on hashing (LSH, SH, ITQ) and on product quantization (PQ, OPQ, LOPQ); (2) deep semi-unsupervised methods (DeepBit, GreedyHash, DVB, DistillHash, TBH); and (3) deep truly unsupervised methods (SGH, HashGAN, BinGAN, BGAN). The terms “semi” and “truly” indicate whether pretrained model weights are utilized or not. Both training conditions can be applied to SPQ; however, we take the truly unsupervised model, which has the advantage of not requiring human supervision, as the baseline.
To evaluate the shallow and deep semi-unsupervised methods, we employ ImageNet pretrained model weights of AlexNet or VGG16 to extract features, following the experimental settings of [45, 36, 37]. Since those models take only fixed-size inputs, we resize all images to the required input size, upscaling the small images and downsampling the large ones. In the case of evaluating deep truly unsupervised methods, including SPQ, the same resized images of the FLICKR25K and NUS-WIDE datasets are used for simplicity, and the original resolution images of CIFAR-10 are used to reduce the computational load.
For network training, we adopt Adam, decay the learning rate with cosine scheduling without restarts, and set the batch size to 256. We fix the dimension of the subvector and the codeword, as well as the number of codewords $K$. Consequently, the number of codebooks $M$ changes with the code length, because $M \log_2 K$ bits are needed to obtain a $B$-bit binary code. The temperature parameters $\tau_q$ and $\tau_{cqc}$ are set to 5 and 0.5, respectively. Data augmentation is performed with the Kornia library, and each transformation is applied with the same probability as in the settings adopted from prior work.
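Cosine scheduling without restarts can be sketched as a single half-cosine decay over the whole run. The base learning rate and the number of steps below are hypothetical values chosen for illustration; the text does not state them.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate after `step` updates of one half-cosine decay."""
    progress = step / total_steps  # 0.0 at the start, 1.0 at the end
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


base_lr = 1e-3      # hypothetical value
total_steps = 100   # hypothetical value
schedule = [cosine_lr(step, total_steps, base_lr) for step in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the end, without the periodic resets used by warm-restart schedules.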
The mAP results on three different image retrieval datasets are listed in Table 2, showing that SPQ substantially outperforms all the compared methods in every bit-length. Additionally, referring to Figures 3 and 4, SPQ is demonstrated to be the most desirable retrieval system.
First of all, compared to the best shallow method, LOPQ, SPQ shows a performance improvement of more than 46%p, 13%p, and 11.6%p in the average mAP on CIFAR-10, FLICKR25K, and NUS-WIDE, respectively. The difference is more pronounced for CIFAR-10 because the shallow methods involve an unnecessary upscaling process to utilize the ImageNet pretrained deep features. SPQ has an advantage over shallow methods in that various and suitable neural architectures can be accommodated for feature extraction and end-to-end learning.
Second, compared to the best deep semi-unsupervised method, TBH, SPQ yields 23%p, 4.6%p, and 3.9%p higher average mAP scores on CIFAR-10, FLICKR25K, and NUS-WIDE, respectively. Even in the absence of prior information such as pretrained model weights, SPQ distinguishes the contents within the image well by comparing multiple views of training samples.
Lastly, even with the truly unsupervised setup, SPQ achieves state-of-the-art retrieval accuracy. Specifically, unlike previous hashing-based truly unsupervised methods, SPQ introduces differentiable product quantization to the unsupervised image retrieval system for the first time. By considering cross-similarity between different views in a self-supervised way, deep descriptors and codewords are allowed to be discriminative.
We configure five variants of SPQ to investigate: (1) SPQ-C, which replaces cross quantized contrastive learning with contrastive learning that compares the product quantized descriptors of the two views, (2) SPQ-H, which employs hard quantization instead of soft quantization, (3) SPQ-Q, which employs standard vector quantization, i.e., does not divide the feature space and directly utilizes the entire feature vector to build the codebook, (4) SPQ-S, which exploits pretrained model weights to conduct deep semi-unsupervised image retrieval, and (5) SPQ-V, which utilizes the VGG16 network architecture as the baseline.
As reported in Table 3, we can observe that each component of SPQ contributes sufficiently to performance improvement. Comparison with SPQ-C confirms that considering cross-similarity rather than comparing quantized outputs delivers more efficient image retrieval results. From the results of SPQ-H, we find that soft quantization is more suitable for learning codewords. The retrieval outcomes with SPQ-Q, which shows the biggest performance gap with SPQ, explain that product quantization leads to accomplishing precise search results by increasing the amount of distance representation. Notably, SPQ-S, which utilizes ImageNet pretrained model weights for network initialization, outperforms truly unsupervised SPQ. In this observation, we can see that although SPQ demonstrates the best retrieval accuracy without any human guidance, better results can be obtained with some label information. Although SPQ-V is inferior to ResNet-based SPQ, its performance still surpasses existing state-of-the-art retrieval algorithms, which proves the excellence of the PQ-based self-supervised learning scheme.
Besides, we explore the hyper-parameter ($\tau_q$ and $\tau_{cqc}$) sensitivity according to the color jitter strength in Figure 6. In general, the difference in performance due to changes in the hyper-parameters is insignificant; however, the effect of the color jitter strength is pronounced. As a result, we confirm that SPQ is robust to hyper-parameters and that input data preparation is an important factor.
As illustrated in Figure 5, we employ t-SNE to examine the distribution of deep representations of BinGAN, TBH, and our SPQ, where BinGAN and SPQ are trained under the truly unsupervised setting. Even so, SPQ separates the data samples most distinctly, where each color denotes a different class label. Furthermore, we show the actual returned images in Figure 7. Interestingly, not only images of the same category but also images with visually similar contents are retrieved; for example, a cat appears in the dog retrieval results.
In this paper, we have proposed a novel deep self-supervised learning-based fast image retrieval method, the Self-supervised Product Quantization (SPQ) network. By employing a product quantization scheme, we built the first end-to-end unsupervised learning framework for image retrieval. We introduced a cross quantized contrastive learning strategy to learn the deep representations and codewords to discriminate the image contents while clustering local patterns at the same time. Despite the absence of any supervised label information, our SPQ yields state-of-the-art retrieval results on three large-scale benchmark datasets. As future research, we expect a performance gain from contrasting more views within a batch, which requires a better computing environment. Our code is publicly available at https://github.com/youngkyunJang/SPQ.
This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2021R1A2C2007220) and in part by IITP grant funded by the Korea government [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)].
Similarity estimation techniques from rounding algorithms. In STOC, pp. 380–388.
On sampling strategies for neural network-based collaborative filtering. In ACM SIGKDD, pp. 767–776.
SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
Binary generative adversarial networks for image retrieval. In AAAI.
CURL: contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136.