With the rapid growth of multimedia content, image retrieval has become an active research topic. Image retrieval aims to find similar images in a large-scale database. The direct approach is to use exact kNN (k-nearest neighbor) techniques, which usually perform a brute-force search over the database. ANN (approximate nearest neighbor) search is a practical alternative to exact kNN search at scale. The main idea of ANN search is to find a compact representation of raw features, i.e. a binary code of fixed length, which retains the structure of the raw feature space and dramatically improves computation speed.
Recently, hashing methods have been widely used in ANN search. They usually learn a Hamming space that is refined to maintain the similarity between features [Liu et al.2016, Zhu et al.2016, Song et al.2018c, Song et al.2018b, Song et al.2018a]. Since computing the Hamming distance is very fast, hashing methods have a large advantage in ANN search. However, hashing methods lack accuracy in feature restoration. Methods based on quantization instead require a codebook to store some representative features. Therefore, the main goal of quantization is to preserve as much information about the feature space as possible in the codebook. These methods then find a combination of codewords to approximate each raw feature and store only the indexes of these codewords.
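To make the speed argument concrete, the following minimal sketch ranks a toy database by Hamming distance. It assumes binary codes are stored as Python integers; the function name `hamming_distance` and the example codes are hypothetical and only serve as an illustration.

```python
def hamming_distance(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(a ^ b).count("1")

# Toy 8-bit codes (illustrative values, not taken from any dataset).
query = 0b10110010
database = [0b10110011, 0b00001111, 0b10110010]

# Rank database items by their Hamming distance to the query.
ranked = sorted(range(len(database)),
                key=lambda i: hamming_distance(query, database[i]))
```

Each comparison is one XOR followed by a bit count, which is why search in a Hamming space is cheap compared with distances on real-valued features.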
Quantization originates from vector quantization [Gersho and Gray2012], which first clusters features using the k-means algorithm and uses the set of cluster centers as the codebook. Each data point is represented by the index of its corresponding center. To decrease the computation cost of k-means, product quantization (PQ) [Jegou et al.2011] and optimized product quantization (OPQ) [Ge et al.2013] split the whole feature space into a set of subspaces and perform a similar algorithm on each subspace. These early quantization methods impose restrictions and use well-designed codebooks to accelerate calculation. Composite quantization (CQ) [Zhang et al.2014b] and additive quantization [Babenko and Lempitsky2014] remove the subspace limitation and use the sum of a few codewords to represent a raw feature.
In the deep learning era, end-to-end deep neural networks were proposed to perform image feature learning and quantization together. Deep quantization network [Cao et al.2016] uses AlexNet to learn well-separated image features and uses OPQ to quantize them. Deep visual-semantic quantization [Cao et al.2017] and deep triplet quantization [Liu et al.2018] quantize features with CQ. Different from these works, deep product quantization [Klein and Wolf2017], product quantization network [Yu et al.2018b] and generative adversarial product quantization [Yu et al.2018a] propose differentiable formulations that represent quantization as neural network operations, so that gradient descent can be applied to quantization.
Despite their successes, PQ and its variants have several issues. First, generating binary codes of different lengths usually requires retraining. Second, decomposing the high-dimensional vector space is tricky: different decomposition strategies may result in large performance differences. To tackle these issues, we propose a deep quantization method called deep recurrent quantization, which constructs a codebook that can be used recurrently to generate sequential binary codes. Extensive experiments show that our method outperforms state-of-the-art methods even though they use larger codebooks.
Quantization-based image retrieval is defined as follows. We are given a set of $N$ images, each of height $H$ and width $W$ with a fixed number of channels. We first use a CNN, e.g. AlexNet or VGG, to learn a high-level representation of the images, where $D$ is the dimension of the feature vectors. Then we apply quantization to these feature vectors to learn a codebook that contains $K$ codewords, each of $D$ dimensions. Feature vectors are then compressed to compact binary codes, where $L$ indicates the code length.
2.1 Integrating Quantization into Deep Learning Architectures
During quantization, picking the closest codeword for a feature amounts to computing the distance between the feature and every codeword and taking the minimum, which can be described as:
$$q_h = Q_h(x) = \arg\min_{c_i} \| x - c_i \|_2, \quad i = 1, \dots, K,$$
where x is the feature of a data point, $c_i$ is the $i$-th of $K$ codewords, $Q_h$ is the quantization function and $q_h$ is the quantized feature. Therefore, $q_h$ is the approximation of x. Meanwhile, we collect the index of the chosen codeword as the quantized code, which is described as:
$$b = \arg\min_{i} \| x - c_i \|_2 .$$
Since b takes one of $K$ values, every code can be binarized to a length of $\log_2 K$ bits. The original feature x can thus be compressed to an extremely short binary code.
However, the selection of a codeword is non-differentiable, i.e., the gradient of $Q_h$ does not exist. It cannot be directly integrated into deep learning architectures. To tackle this issue, inspired by NetVLAD [Arandjelovic et al.2016], we use a convex combination of codewords to approximate features, which is defined as follows:
$$q_s = Q_s(x) = \sum_{i=1}^{K} p_i c_i, \qquad p_i \ge 0, \quad \sum_{i=1}^{K} p_i = 1.$$
Here, $p_i$ indicates the confidence of each codeword w.r.t. x, i.e. the closer a codeword is to a feature x, the higher $p_i$ will be. Then, $q_h$ is approximated by $q_s$, which is the weighted sum of all codewords. We define $q_h$ as hard quantization and $q_s$ as soft quantization.
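The difference between hard and soft quantization can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in this work: the confidence weights are assumed to be a softmax over negative squared Euclidean distances (a NetVLAD-style choice), and `gamma`, `hard_quantize`, `soft_quantize` and the toy codebook are hypothetical names and values.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hard_quantize(x, codebook):
    """Return (index, codeword) of the codeword nearest to x."""
    b = min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
    return b, codebook[b]

def soft_quantize(x, codebook, gamma=1.0):
    """Return (weights, convex combination of codewords).

    Assumption: weights are softmax(-gamma * squared distance), so closer
    codewords receive higher confidence and the weights sum to one.
    """
    logits = [-gamma * sq_dist(x, c) for c in codebook]
    top = max(logits)                      # subtract max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    q = [sum(p[i] * codebook[i][d] for i in range(len(codebook)))
         for d in range(len(x))]
    return p, q

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [0.9, 0.2]
b, q_h = hard_quantize(x, codebook)
p, q_s = soft_quantize(x, codebook, gamma=10.0)
bits = math.ceil(math.log2(len(codebook)))  # bits needed to store index b
```

With a large `gamma` the soft quantization approaches the hard one, while remaining differentiable with respect to the codewords.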
3 Proposed Method
The whole network architecture of our deep recurrent quantization (DRQ) is shown in Fig. 1. DRQ contains two main parts: a feature extraction module and a quantization module. In the feature extraction module, we apply intermediate supervision on top of the CNN to guide the learning of semantic-embedded visual features. In the quantization module, we design a recurrent quantization block and integrate it into the deep learning architecture, which can be trained end-to-end.
3.1 Intermediate Supervision for Features
To get the feature representation of images, we use AlexNet and extract features from its last linear layer. To improve clustering, i.e., to let images with the same label have higher similarity and vice versa, we apply two losses as intermediate supervision. Specifically, we first collect a triplet from the dataset that contains an anchor image, a positive sample and a negative sample w.r.t. the anchor (for multi-label images, we define a positive image as one that shares at least one label with the anchor, and a negative image as one that does not share any label with the anchor), and feed them into AlexNet to obtain 4096-d features. Then we add two linear layers, one of 1748-d and one of 300-d. We concatenate their outputs to get a final feature x of 2048-d. Since we feed the triplet into the network, the output features also form a triplet, one per input image.
We apply two supervised objective functions on these layers: 1) an adaptive margin loss, which is taken from DVSQ [Cao et al.2017] and applied to the triplet's intermediate outputs, and 2) a triplet loss defined on the final feature x, which is the concatenation of the two added layers' outputs. The adaptive margin loss is defined as:
The triplet loss, which uses the triplet of final features, adjusts the features to suit clustering. It is defined as:
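As a rough illustration only, the sketch below uses the widely used hinge form of a triplet loss on squared Euclidean distances, together with the multi-label rule for positives described above. The margin value, the distance and the function names are assumptions and may differ from the formulation used in this work.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    return max(0.0, sq_dist(anchor, positive)
               - sq_dist(anchor, negative) + margin)

def is_positive(labels_a, labels_b):
    """Multi-label rule: a pair is positive if it shares at least one label."""
    return len(set(labels_a) & set(labels_b)) > 0

satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])  # negative far away
violated = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])   # negative too close
```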
3.2 Recurrent Quantization Block
In the recurrent quantization model, we adopt a shared codebook that contains $K$ codewords. We denote the level of the quantization code as $M$, which indicates how many iterations the codebook is reused. For each level, we pick a proper codeword as the approximation of the feature vector, and we take the index of the picked codeword as the quantization code. With $K$ codewords, the index at each level lies in $\{0, \dots, K-1\}$ and is represented as a binary code of $\log_2 K$ bits. The total length of the quantization code is $M \log_2 K$ bits. The first $\log_2 K$ bits hold the index of the first-level codeword, the next $\log_2 K$ bits hold the index of the second level, etc. Therefore, the feature vector can be approximated by a combination of a few codewords in the codebook.
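The code-length arithmetic can be checked with a small sketch. The codebook size of 256 and the per-level indices below are illustrative assumptions, and `bits_per_level` and `pack_indices` are hypothetical helper names.

```python
import math

def bits_per_level(num_codewords):
    """Bits needed to store one codeword index."""
    return math.ceil(math.log2(num_codewords))

def pack_indices(indices, num_codewords):
    """Concatenate per-level indices into one bit string, first level first."""
    width = bits_per_level(num_codewords)
    return "".join(format(i, "0{}b".format(width)) for i in indices)

# Assumed example: 256 codewords and three levels give a 24-bit code.
code = pack_indices([3, 0, 255], 256)
```

Because each level occupies a fixed-width slot, any prefix of the bit string that ends on a slot boundary is itself a valid shorter code.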
As described in Sec. 2.1, a quantization takes a feature x and a codebook C as input and outputs the code b. Inspired by the hierarchical codebooks in stacked quantizers [Martinez et al.2014], we observe that the residual of x can be used as the input to the next quantizer. Therefore, a basic idea is to perform quantization step by step:
$$q_1 = Q(x, C_1), \quad r_1 = x - q_1; \qquad q_2 = Q(r_1, C_2), \quad r_2 = r_1 - q_2; \qquad \dots \tag{7}$$
Specifically, $q_1$ is the quantized feature explained above, and $r_1, r_2, \dots$ are residuals of x. We feed $r_1$ to the next quantization to get $q_2$, which approximates $r_1$. Therefore, $q_1 + q_2$ is an approximation of x that is much more precise than $q_1$ alone. The processing of the following levels is similar. If we use a shared codebook, the computation in Eq. 7 can be rewritten recurrently:
$$q_t = Q(x_t, C_t), \quad x_{t+1} = x_t - q_t, \quad C_{t+1} = \alpha C_t, \qquad x_1 = x, \; C_1 = C.$$
The soft and hard quantizations of $x_t$, denoted $q_{s,t}$ and $q_{h,t}$, are defined as in Sec. 2.1, using the codewords of the scaled codebook $C_t$.
The unfolded structure of recurrent quantization is depicted in Fig. 2. Here, $x_1 = x$ and $C_1 = C$ are the raw features and the initial codebook. $\alpha$ is a shared learnable parameter with random initialization. In iteration $t$, we compute $q_t$ to find the best-fitted codeword, then use it to compute the residual of $x_t$ and treat the residual as the next input $x_{t+1}$. Since the residual is one or more orders of magnitude smaller than $x_t$, the next input is much smaller than the codewords in the codebook, so we use $\alpha$ as a scale factor to adjust the norm of the codebook to fit the new input. In the next iteration $t+1$, we use the scaled codebook to perform a similar computation. Finally, we learn a codebook C, a scale factor $\alpha$ and sequential binary codes $b_1, \dots, b_M$ over the $M$ levels. The hard and soft quantization of x can be computed as:
$$q_h(x) = \sum_{t=1}^{M} q_{h,t}, \qquad q_s(x) = \sum_{t=1}^{M} q_{s,t}.$$
By reusing the codebook C, we reduce the number of codebook parameters by a factor of $M$ compared with using a separate codebook per level.
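The recurrent procedure can be sketched in plain Python as follows. This is a simplified, inference-only illustration under assumptions: hard quantization is used at every level, the codebook is rescaled by a fixed scalar `alpha` after each level, and the toy codebook, `alpha = 0.5` and the input are made-up values rather than learned parameters.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recurrent_quantize(x, codebook, alpha, levels):
    """Quantize x with one shared codebook reused `levels` times.

    Returns the per-level codeword indices and the reconstruction,
    i.e. the sum of the selected (scaled) codewords.
    """
    codes = []
    approx = [0.0] * len(x)
    residual = list(x)
    scale = 1.0
    for _ in range(levels):
        # Scale the shared codebook to match the shrinking residual.
        scaled = [[scale * v for v in c] for c in codebook]
        b = min(range(len(scaled)),
                key=lambda i: sq_dist(residual, scaled[i]))
        codes.append(b)
        approx = [a + v for a, v in zip(approx, scaled[b])]
        residual = [r - v for r, v in zip(residual, scaled[b])]
        scale *= alpha
    return codes, approx

codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [1.4, 0.3]
errors = []
for levels in (1, 2, 3):
    codes, approx = recurrent_quantize(x, codebook, alpha=0.5, levels=levels)
    errors.append(sq_dist(x, approx))
```

In this toy run the reconstruction error shrinks as levels are added, and the codes for fewer levels are prefixes of the codes for more levels. In terms of storage, separate codebooks for several levels would need one full set of codewords per level, whereas the shared design needs a single set plus one scalar.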
3.2.1 Objective Function
Since the hard and soft quantizations are both approximations of the feature x, we define a distortion error as:
where the two terms are, at a given iteration, the distortion errors between each of the two quantizations and x. We sum the distortions over all levels, and the total distortion error is:
We also design a joint central error to align the hard and soft quantizations:
In DRQ, there are two main groups of losses: (1) the adaptive margin loss and the triplet loss, which refine features, and (2) the distortion error and the joint central error, which control the quantization effectiveness. We split the training procedure into three stages. First, we minimize the feature losses together to pre-train the preceding neural network. Then, we add the recurrent quantization block to the network but perform only one recurrent iteration, i.e. use a single level, and optimize the losses at this single level. This gives an initial codebook that is optimized for short binary codes. Finally, we set the number of levels to a specified value and optimize the whole network with all losses, until it converges or we reach the maximum number of training iterations.
To validate the effectiveness and efficiency of our deep recurrent quantization, we perform extensive experiments on three public datasets: CIFAR-10, NUS-WIDE and ImageNet. Since existing methods use different settings, to make a thorough comparison we follow these works and compare with them under separate settings.
We implement our model in TensorFlow, using a pre-trained AlexNet, and construct the intermediate layers on top of its 4096-d feature layer. Meanwhile, we randomly initialize the codebook with the sizes described below. We use the Adam optimizer with learning rate for training.
4.1 Comparison Results Using Setting 1
We first report results and make comparisons with state-of-the-art methods on two benchmark datasets: CIFAR-10 and NUS-WIDE. CIFAR-10 [Krizhevsky and Hinton2009] is a public dataset labeled with 10 classes. It consists of 50,000 images for training and 10,000 images for validation. We follow [Yu et al.2018b] to combine the training and validation sets, and randomly sample 5,000 images per class as the database. The remaining 10,000 images are used as queries. Meanwhile, we use the whole database to train the network. NUS-WIDE [Chua et al.2009] is a public dataset consisting of 81 concepts, and each image is annotated with one or more concepts. We follow [Yu et al.2018b] to use the subset of 195,834 images from the 21 most frequent concepts. We randomly sample 1,000 images per concept as the query set, and use the remaining images as the database. Furthermore, we randomly sample 5,000 images per concept from the database as the training set. We use mean Average Precision (mAP@5000) as the evaluation metric.
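For readers unfamiliar with the metric, the sketch below computes mAP@k from ranked relevance lists. It assumes one common convention, in which the average precision of a query is normalized by the number of relevant items found in its top k results; the convention used in the experiments is not spelled out in the text, and the function names are hypothetical.

```python
def average_precision_at_k(relevance, k):
    """AP@k for one query.

    `relevance` is the ranked result list, with 1 for a relevant item
    (e.g. sharing at least one label with the query) and 0 otherwise.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(relevance_lists, k):
    """mAP@k: mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, k)
               for r in relevance_lists) / len(relevance_lists)

ap = average_precision_at_k([1, 0, 1, 0], k=4)
m_ap = mean_average_precision_at_k([[1, 0, 1, 0], [0, 0, 0, 0]], k=4)
```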
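Table 1: mAP on CIFAR-10 under Setting 1. Columns: Method, 16 bits, 24 bits, 36 bits, 48 bits.
Table 2: mAP on NUS-WIDE under Setting 1. Columns: Method, 12 bits, 24 bits, 36 bits, 48 bits.
Table 3: mAP under Setting 2 on CIFAR-10, NUS-WIDE and ImageNet. Columns per dataset: 8 bits, 16 bits, 24 bits, 32 bits.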
On CIFAR-10, we compare our DRQ with a few state-of-the-art methods, including DRSCH [Zhang et al.2015], DSCH [Zhang et al.2015], DSRH [Zhao et al.2015], VDSH [Zhang et al.2016], DPSH [Li et al.2015], DTSH [Wang et al.2016], DSDH [Li et al.2017] and PQNet [Yu et al.2018b], using 16, 24, 36 and 48 bits. We set and . The results on CIFAR-10 are shown in Table 1. They show that our network achieves mAP comparable to the state-of-the-art method, i.e., PQNet. Our mAP is only 0.3%-0.5% lower than that of PQNet. The results also show that our performance is stable across bit lengths: our method loses only 0.1% when the bit length shrinks to 16 bits. Note that our recurrent quantization uses only a single shared codebook, and therefore requires far fewer parameters than the other methods, as shown in Tab. 4.
On the NUS-WIDE dataset, we compare our method with a few shallow and deep methods. Shallow methods include SH [Salakhutdinov and Hinton2009], ITQ [Gong et al.2013], LFH [Zhang et al.2014a], KSH [Liu et al.2012], SDH [Shen et al.2015] and FASTH [Lin et al.2014]. Deep methods include CNNH [Xia et al.2014], NINH [Lai et al.2015a], DHN [Zhu et al.2016], DQN [Cao et al.2016], DPSH [Li et al.2015], DTSH [Wang et al.2016], DSDH [Li et al.2017] and PQNet [Yu et al.2018b]. The results are generated with 12, 24, 36 and 48 bits. We fix the number of levels to generate 48-bit codes, and then slice these codes to get shorter binary codes. The results are shown in Table 2. On NUS-WIDE, our method achieves the highest mAP among the compared state-of-the-art methods when the code length is longer than 12 bits. Note that our method uses a shared codebook for all code lengths and is trained only once.
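Slicing a sequential code amounts to keeping a prefix of its per-level indices. The sketch below assumes, purely for illustration, four levels of 12 bits each (which would match the 12, 24, 36 and 48-bit lengths above); the index values and helper names are hypothetical.

```python
def slice_code(level_indices, levels):
    """Keep only the first `levels` per-level indices of a sequential code."""
    return level_indices[:levels]

def code_length_bits(levels, bits_per_level):
    """Total code length for a given number of levels."""
    return levels * bits_per_level

# Assumed example: a 4-level code with 12 bits per level (indices < 4096).
full_code = [17, 4, 250, 99]
lengths = [code_length_bits(m, 12) for m in (1, 2, 3, 4)]
shorter = slice_code(full_code, 2)  # a 24-bit code from the same model
```

No retraining is involved: shorter codes simply discard the later, finer-grained levels.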
The comparison of codebook size w.r.t. code length across methods is shown in Tab. 4. Our method has the smallest codebook among the compared methods. Also, our method is trained only once to generate binary codes of different lengths.
Table 4: Codebook size w.r.t. code length for different methods. Columns: Methods, 8 bits, 16 bits, 24 bits, 32 bits, 40 bits, 48 bits.
4.2 Comparison Results Using Setting 2
Following DTQ [Liu et al.2018], on CIFAR-10 we combine the training and validation sets, and randomly select 500 images per class as the training set and 100 images per class as the query set. The remaining images are used as the database. On NUS-WIDE, we use the subset of 195,834 images from the 21 most frequent concepts. We randomly sample 5,000 images as the query set, and use the remaining images as the database. Furthermore, we randomly select 10,000 images from the database as the training set. On ImageNet, we follow [Cao et al.2017] to randomly choose 100 classes. We use all the images of these classes in the training set as the database, and all the images of these classes in the validation set as the queries. Furthermore, we randomly select 100 images per class from the database for training. We compare our method with 11 classical hashing or quantization methods, including 5 shallow methods: ITQ-CCA [Gong et al.2013], BRE [Kulis and Darrell2009], KSH [Liu et al.2012], SDH [Shen et al.2015] and SQ [Martinez et al.2014], and 6 deep architectures: CNNH [Xia et al.2014], DNNH [Lai et al.2015b], DHN [Zhu et al.2016], DSH [Liu et al.2016], DVSQ [Cao et al.2017] and DTQ [Liu et al.2018].
We use mAP@54000 on CIFAR-10 and mAP@5000 on NUS-WIDE and ImageNet. We use 8-, 16-, 24- and 32-bit codes, obtained by varying the number of quantization levels. We also use the precision-recall curve and the precision@R (number of returned results) curve to evaluate the retrieval quality. The results are shown in Table 3, Fig. 3 and Fig. 4. It can be observed that: 1) Our DRQ significantly outperforms the other methods on the CIFAR-10 and NUS-WIDE datasets. Specifically, it outperforms the best counterpart (DTQ) by 1.8%, 3.5%, 4.2% and 4.5% on CIFAR-10, and by 1.2%, 1.7% and 2.0% on NUS-WIDE. DRQ is outperformed by DTQ on NUS-WIDE for 8-bit codes. A possible reason is that in DRQ the codebook is shared across code lengths, which may cost some accuracy, especially for short binary codes. On ImageNet, our method is outperformed by DTQ, which may be caused by the random selection. Also, our DRQ requires far fewer parameters than DTQ, and our model is trained only once. 2) As the code length increases, the performance of most indexing methods improves accordingly. For our DRQ, the mAP increases by 3.4%, 7.3% and 3.6% on CIFAR-10, NUS-WIDE and ImageNet respectively. This verifies that our DRQ can generate sequential binary codes that gradually improve the search accuracy. 3) The precision-recall curves of the different methods are consistent with their mAP. The precision curves show the retrieval precision with respect to the number of returned results.
4.3 Ablation Study
In this subsection, we study the effect of each part of our architecture using the following settings. (1) Unsupervised quantization: we remove the and their associated losses, and use the raw output. (2) Remove : we remove the and , and change the dimension of to 2048-d and apply on it directly, to validate the role of . (3) Remove : we remove the and and change to 300-d with , to validate the role of . (4) only: we remove the construction of and losses associated with it, to validate the role of . (5) Remove the joint central error: we remove the joint central error from the recurrent quantization to validate its effectiveness. (6) Intermediate supervision: we change the construction of x by only using layer’s output, which is 300-d and also apply to , to validate the effectiveness of the concatenation of . We perform the ablation study on NUS-WIDE and show the results in Tab. 5. In general, DRQ performs the best, and ‘Remove ’ ranks second. By removing , the supervision information is still utilized in . That is why the mAP drop is not that significant. The unsupervised architecture also achieves good results, indicating the promising performance of the pretrained AlexNet. Note that removing any of the objective functions from the structure causes a large drop in mAP. This indicates the effectiveness of each part of our DRQ. We get the worst results when we remove the concatenation, possibly because the 300-d features cause significant information loss.
Table 5: Ablation study on NUS-WIDE. Columns: Structure, 8 bits, 16 bits, 24 bits, 32 bits.
4.4 Qualitative Results
To qualitatively validate the performance of the quantization methods, we also perform t-SNE visualization on DTQ, DVSQ and DRQ, and show the results in Fig. 5. The visualizations are created on the CIFAR-10 dataset: we randomly sample 5,000 images from the database and collect their 32-bit quantized features. Our DRQ performs similarly to DTQ, and both show distinct clusters in their visualizations, which is much better than DVSQ. Our unsupervised structure also performs promisingly, and its data points are tightly clustered. However, some data points with different labels are wrongly clustered together. This indicates the importance of supervision information.
In this paper, we propose a Deep Recurrent Quantization (DRQ) architecture to generate sequential binary codes. Once the model is trained, a sequence of binary codes can be generated, and the code length can be easily controlled by adjusting the number of recurrent iterations. A shared codebook and a scalar factor are designed as the learnable weights of the deep recurrent quantization block, and the whole framework can be trained in an end-to-end manner. Experimental results on the benchmark datasets show that our model achieves comparable or even better performance than the state-of-the-art for image retrieval, with far fewer parameters and less training time.
This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2014J063, No. ZYGX2016J085) and the National Natural Science Foundation of China (Grant No. 61772116, No. 61872064, No. 61632007, No. 61602049).
- [Arandjelovic et al.2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, pages 5297–5307, 2016.
- [Babenko and Lempitsky2014] Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector compression. In CVPR, pages 931–938, 2014.
- [Cao et al.2016] Yue Cao, Mingsheng Long, Jianmin Wang, Han Zhu, and Qingfu Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
- [Cao et al.2017] Yue Cao, Mingsheng Long, Jianmin Wang, and Shichen Liu. Deep visual-semantic quantization for efficient image retrieval. In CVPR, volume 2, 2017.
- [Chua et al.2009] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: a real-world web image database from National University of Singapore. In CIVR, page 48, 2009.
- [Ge et al.2013] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, pages 2946–2953, 2013.
- [Gersho and Gray2012] Allen Gersho and Robert M Gray. Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012.
- [Gong et al.2013] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
- [Jegou et al.2011] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
- [Klein and Wolf2017] Benjamin Klein and Lior Wolf. In defense of product quantization. arXiv preprint arXiv:1711.08589, 2017.
- [Krizhevsky and Hinton2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- [Kulis and Darrell2009] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
- [Lai et al.2015a] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
- [Lai et al.2015b] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
- [Li et al.2015] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855, 2015.
- [Li et al.2017] Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. Deep supervised discrete hashing. In NIPS, pages 2482–2491, 2017.
- [Lin et al.2014] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton Van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, pages 1963–1970, 2014.
- [Liu et al.2012] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
- [Liu et al.2016] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
- [Liu et al.2018] Bin Liu, Yue Cao, Mingsheng Long, Jianmin Wang, and Jingdong Wang. Deep triplet quantization. In ACM MM, pages 755–763. ACM, 2018.
- [Martinez et al.2014] Julieta Martinez, Holger H Hoos, and James J Little. Stacked quantizers for compositional vector compression. arXiv preprint arXiv:1411.2173, 2014.
- [Salakhutdinov and Hinton2009] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
- [Shen et al.2015] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
- [Song et al.2018a] Jingkuan Song, Lianli Gao, Li Liu, Xiaofeng Zhu, and Nicu Sebe. Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recognition, 75:175–187, 2018.
- [Song et al.2018b] Jingkuan Song, Tao He, Lianli Gao, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Binary generative adversarial networks for image retrieval. In AAAI, pages 394–401, 2018.
- [Song et al.2018c] Jingkuan Song, Hanwang Zhang, Xiangpeng Li, Lianli Gao, Meng Wang, and Richang Hong. Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing, 27(7):3210–3221, July 2018.
- [Wang et al.2016] Xiaofang Wang, Yi Shi, and Kris M Kitani. Deep supervised hashing with triplet labels. In ACCV, pages 70–84. Springer, 2016.
- [Xia et al.2014] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, page 2, 2014.
- [Yu et al.2018a] Litao Yu, Yongsheng Gao, and Jun Zhou. Generative adversarial product quantisation. In ACM MM, pages 861–869, 2018.
- [Yu et al.2018b] Tan Yu, Junsong Yuan, Chen Fang, and Hailin Jin. Product quantization network for fast image retrieval. In ECCV, pages 191–206. Springer, 2018.
- [Zhang et al.2014a] Peichao Zhang, Wei Zhang, Wu-Jun Li, and Minyi Guo. Supervised hashing with latent factor models. In ACM SIGIR, pages 173–182. ACM, 2014.
- [Zhang et al.2014b] Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest neighbor search. In ICML, pages 838–846, 2014.
- [Zhang et al.2015] Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing, 24(12):4766–4779, 2015.
- [Zhang et al.2016] Ziming Zhang, Yuting Chen, and Venkatesh Saligrama. Efficient training of very deep neural networks for supervised hashing. In CVPR, pages 1487–1495, 2016.
- [Zhao et al.2015] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.
- [Zhu et al.2016] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In AAAI, pages 2415–2421, 2016.