A Dense-Depth Representation for VLAD descriptors in Content-Based Image Retrieval

08/15/2018 ∙ by Federico Magliani, et al. ∙ 0

Recent advances in deep learning have improved the performance of image retrieval. The many convolutional layers of a Convolutional Neural Network (CNN) yield a hierarchy of features from the evaluated image. At every step, the patches extracted are smaller than the previous levels and more representative. Following this idea, this paper introduces a new detector applied to the feature maps extracted from a pre-trained CNN. Specifically, this approach increases the number of features in order to improve the performance of aggregation algorithms such as the widely used VLAD embedding. The proposed approach is tested on four public datasets: Holidays, Oxford5k, Paris6k and UKB.




1 Introduction

The recent growth in available images and videos has motivated researchers to work on Content-Based Image Retrieval (CBIR). There are many types of tasks in CBIR: the most studied is instance-level image search, which consists in retrieving the images most similar to a query image. This task presents several challenges in terms of accuracy, search time and memory occupancy. Another relevant problem lies in the images themselves, which may present noisy features (e.g., trees, people, cars, …), different lighting conditions, viewpoints and resolutions.

Image retrieval systems are based on a pipeline usually composed of three steps: extraction of local features from the image, aggregation of the extracted features into a compact representation, and retrieval of the most similar images. Initially, the focus was on the feature aggregation step, and hence different types of embeddings were proposed in order to reduce the memory used and obtain a more representative global descriptor. Recently, due to the excellent results obtained in many computer vision tasks, deep learning approaches have become dominant in image retrieval as well. In particular, Convolutional Neural Networks (CNNs) are adopted for the feature detection and description phase. They densely extract features from images, and these features are better than the ones extracted with hand-crafted methods like SIFT [16] or SURF [4], because they can capture more details of the images through the high number of convolutional layers.

Following the recent advances, this paper introduces a dense-depth detector applied to CNN codes extracted from the InceptionV3 [30] network. This strategy increases the number of features in order to reach higher accuracy with a variant of VLAD [13] descriptors, called locVLAD [17]. The resulting descriptor outperforms previous VLAD implementations on several public benchmarks, thanks also to the use of Z-score normalization [32]. Furthermore, a complete comparison with the other variants of VLAD is presented.

This paper is organized as follows. Section 2 introduces the general techniques used in the state of the art. Section 3.1 reviews the methods used for feature extraction. Next, Section 3.2 presents the Dense-Depth Representation, while Section 3.3 describes the VLAD algorithm. Section 4 reports the experimental results on four public datasets: Holidays, Oxford5k, Paris6k and UKB. Finally, concluding remarks are reported.

2 Related work

In recent years, the problem of Content-Based Image Retrieval has been addressed in many different ways. The first technique to be developed was the Bag of Words (BoW) [28]. It is a simple method that reaches good results, but consumes a large amount of memory. After the development of the BoW approach, researchers tried to overcome its weaknesses and implemented several embedding techniques: Hamming embedding [12], Fisher Vector [22] and VLAD [13]. VLAD [13] is the most used embedding technique; it tries to reduce the dimensionality of the features whilst preserving the recognition accuracy. A VLAD vector is a concatenation of the sums of the differences between the features and their closest centers, computed by using K-means clustering. Many variants of VLAD have been presented in the literature in order to address the weaknesses of the VLAD descriptors: CVLAD [35], CEVLAD [36], HVLAD [6], FVLAD [14], gVLAD [32] and locVLAD [17]. CEVLAD [36] applies entropy to the aggregation of the features; FVLAD [14] modifies the aggregation steps using two codebooks, a descriptor one and a residual one; HVLAD [6] introduces a hierarchy of codebooks, which creates a more robust version of the VLAD descriptors. Finally, gVLAD [32] creates different VLAD vectors according to the orientation of the features and concatenates them. This process increases the performance, but requires extra time.

Recently, with the development of new powerful GPUs, the deep learning approach has shown superior performance in many image retrieval tasks. Arandjelovic et al. [1] applied a VLAD layer at the end of a CNN architecture, showing that the CNN-based pipeline reaches excellent results in the retrieval task. Another improvement brought by deep learning techniques is in the feature extraction phase. This process is known as “transfer learning” and consists in tuning the parameters trained in one feature space in order to work in another feature space [21]. Some methods that use transfer learning are: Spatial pooling [25], MOP-CNN [7], Neural codes [3], Ng et al. [34], CCS [33], OC [26], R-MAC [31], Gordo et al. [8] and Magliani et al. [18]. Also, fine-tuning global descriptors [9] on a similar image dataset greatly improves accuracy, but at the cost of extra time for the training phase on the new dataset.

3 System Architecture

The following subsections describe the proposed approach for feature extraction and encoding, which uses CNN features and the locVLAD embedding.

3.1 CNN codes

CNN codes are feature vectors extracted from pre-trained networks, using the knowledge gained while solving one problem and applying it to a different, yet related, one through the process of transfer learning. Several pre-trained CNN architectures (like VGG16 [27], GoogLeNet [29], ResNet [10]) make it easy to extract features from their layers. The choice of the layers depends on the type of problem and the selected network. In general, the deeper the network is, the better the results obtained with the extracted features tend to be.

In this paper, the selected network is the recent Inception V3 [30], because it yields a more suitable representation than VGG16 thanks to the concatenation of different convolutions. The CNN codes in this paper are extracted from its 8th inception pooling layer (called mixed8 in the Keras implementation). Both the network and the layer have been chosen because they achieved the best results in our experiments. An ablation analysis of different networks has been conducted and is reported in Section 4.


3.2 Dense-Depth representation

Since VLAD-based embedding works better with dense representations, we introduced a novel representation scheme.

Figure 1: Dense-depth detector

The features extracted from mixed8 are regrouped into a larger set of features of lower dimensionality in order to improve the descriptive quality of VLAD. Given a feature map of dimension $W \times H \times D$ ($W$ = width, $H$ = height and $D$ = depth), we split it along the depth axis in order to obtain a high number of features of lower dimension, as can be seen in Fig. 1. Splitting along $D$ maintains the geometrical information of the feature maps, because features at different positions on the $W \times H$ plane are not aggregated. Following this method, the number of descriptors changes from $W \times H$ to $W \times H \times D/d$, where $d$ indicates the dimension of every single descriptor obtained after the split along the depth axis. This value needs to be a trade-off between the number of features and their discriminative quality. As an example, a feature map of 8x8x1280, with $d = 128$, will be transformed into a set of 8*8*10 = 640 descriptors of 128D. Hereafter, we will refer to this as Dense-Depth Representation (DDR).
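The split can be illustrated with a minimal pure-Python sketch. It assumes that the depth axis is cut into consecutive, non-overlapping chunks, which the text does not state explicitly; `dense_depth_split` and `descriptor_dim` are illustrative names, and nested lists stand in for the arrays a real pipeline would use.

```python
from typing import List

# A feature map indexed as [row][column][channel].
FeatureMap = List[List[List[float]]]


def dense_depth_split(feature_map: FeatureMap, descriptor_dim: int) -> List[List[float]]:
    """Split the vector at every spatial position into chunks of descriptor_dim values."""
    descriptors: List[List[float]] = []
    for row in feature_map:
        for cell in row:
            depth = len(cell)
            if depth % descriptor_dim != 0:
                raise ValueError("depth must be a multiple of descriptor_dim")
            # Assumption: chunks are consecutive slices along the depth axis.
            # Values from different spatial positions are never mixed.
            for start in range(0, depth, descriptor_dim):
                descriptors.append(cell[start:start + descriptor_dim])
    return descriptors


# Toy 8x8x1280 map mirroring the example in the text (values are arbitrary).
toy_map = [[[float(c) for c in range(1280)] for _ in range(8)] for _ in range(8)]
ddr = dense_depth_split(toy_map, descriptor_dim=128)
```

With the toy map above, each of the 64 spatial positions contributes 10 descriptors, which matches the count given in the text.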

After the extraction, the CNN codes are normalized using the root square normalization described in [2].
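As a rough sketch of this step, the function below follows the recipe of [2]: each descriptor is L1-normalized and then the element-wise square root is taken. Handling negative values by preserving the sign is an assumption made here for generality, and `root_square_normalize` is an illustrative name rather than the authors' code.

```python
import math
from typing import List, Sequence


def root_square_normalize(descriptor: Sequence[float], eps: float = 1e-12) -> List[float]:
    """L1-normalize a descriptor, then take the signed element-wise square root."""
    l1_norm = sum(abs(value) for value in descriptor) + eps
    # Assumption: the sign is preserved so that negative inputs are also handled.
    return [math.copysign(math.sqrt(abs(value) / l1_norm), value) for value in descriptor]


normalized = root_square_normalize([1.0, 3.0, 0.0, 12.0])
```

A useful property of this transform is that the output has unit L2 norm, since the squared entries sum to the L1-normalized input.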

We will demonstrate in the experiments that the newly proposed DDR achieves higher performance.

3.3 VLAD and locVLAD

Starting from a codebook $C = \{u_1, \dots, u_K\}$ of $K$ $d$-dimensional visual words, generated by K-means clustering on the CNN codes, every local descriptor $x$ extracted from the image is assigned to the closest cluster center of the codebook:

$$q(x) = \arg\min_{u_i \in C} \lVert x - u_i \rVert$$

where $\lVert \cdot \rVert$ is a proper $d$-dimensional distance measure and $d$ is the size of the descriptors ($d$ = 128 in the aforementioned example).

The VLAD vector is obtained by computing the sum of residuals, that is, the differences between the feature descriptors and their corresponding cluster center:

$$v_i = \sum_{x : q(x) = u_i} (x - u_i)$$

The last step is the concatenation of the $v_i$, resulting in the unnormalized VLAD vector $v = [v_1, \dots, v_K]$. This vector often includes repeated (duplicated) features that make the recognition more difficult. To solve this problem, the vector is normalized using the Z-score normalization [32], which starts with a residual normalization, that is, an independent $L_2$ normalization of the residuals of every visual word $u_i$:

$$\tilde{v}_i = \sum_{x : q(x) = u_i} \frac{x - u_i}{\lVert x - u_i \rVert}$$

Next, the resulting vector $\tilde{v} = [\tilde{v}_1, \dots, \tilde{v}_K]$ is further normalized as follows:

$$\hat{v} = \frac{\tilde{v} - \mu}{\sigma}$$

where $\mu$ and $\sigma$ represent the mean and the standard deviation of the vector, respectively. Finally, the vector is further $L_2$-normalized.
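The aggregation and normalization steps described above can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the distance is taken to be Euclidean, the standard deviation is the population one, a final L2 normalization closes the pipeline, and the names `nearest_center` and `vlad_zscore` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_center(x: Vector, centers: Sequence[Vector]) -> int:
    """Index of the codebook center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])),
    )


def vlad_zscore(descriptors: Sequence[Vector], centers: Sequence[Vector],
                eps: float = 1e-12) -> List[float]:
    """VLAD with per-residual L2 normalization, Z-score and final L2 normalization."""
    k, d = len(centers), len(centers[0])
    blocks = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        i = nearest_center(x, centers)
        residual = [a - b for a, b in zip(x, centers[i])]
        norm = math.sqrt(sum(r * r for r in residual))
        if norm > eps:  # skip descriptors that coincide with their center
            for j in range(d):
                blocks[i][j] += residual[j] / norm
    # Concatenate the K blocks into one K*d vector.
    v = [value for block in blocks for value in block]
    mean = sum(v) / len(v)
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))  # population std (assumption)
    v = [(a - mean) / (std + eps) for a in v]
    l2 = math.sqrt(sum(a * a for a in v))
    return [a / (l2 + eps) for a in v]


# Toy codebook with K = 2 centers in 2 dimensions and four descriptors.
toy_centers = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 2.0], [11.0, 10.0], [10.0, 7.0]]
toy_vlad = vlad_zscore(toy_descriptors, toy_centers)
```

In the toy example the normalized residuals sum to (1, 1) for the first center and (1, -1) for the second, so the vector before the Z-score step is [1, 1, 1, -1].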


Recently, with the introduction of locVLAD [17], the accuracy in retrieval has been increased in an intuitive way. The VLAD vector is calculated through the mean of two different VLAD descriptors: one computed on the whole image and one computed only on the central portion of the image. The idea behind this method is that the most important and useful features in the images are often in the central region. LocVLAD is applied only to the query images, because the database images contain different views of the same landmark, as well as zoomed views.
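The averaging idea can be sketched as below. The sketch assumes that each descriptor carries a normalized position in the image and that the central portion is selected with a fixed border `margin`; both the selection rule and the value 0.1 are hypothetical, since the text does not specify the size of the central region. The `encode` argument stands for any VLAD encoder.

```python
from typing import Callable, List, Sequence, Tuple

Descriptor = Sequence[float]
# (x, y, descriptor) with x and y normalized to the range [0, 1].
Located = Tuple[float, float, Descriptor]


def loc_vlad(located: Sequence[Located],
             encode: Callable[[List[Descriptor]], List[float]],
             margin: float = 0.1) -> List[float]:
    """Average the encoding of all descriptors with the encoding of the central ones."""
    all_descriptors = [desc for _, _, desc in located]
    # Hypothetical rule: keep descriptors at least `margin` away from every border.
    central = [
        desc for x, y, desc in located
        if margin <= x <= 1.0 - margin and margin <= y <= 1.0 - margin
    ]
    whole_code = encode(all_descriptors)
    central_code = encode(central)
    return [(a + b) / 2.0 for a, b in zip(whole_code, central_code)]


def sum_encode(descriptors: List[Descriptor]) -> List[float]:
    """Toy stand-in for a VLAD encoder: element-wise sum of the descriptors."""
    return [sum(column) for column in zip(*descriptors)]


toy_located = [(0.05, 0.5, [1.0, 0.0]), (0.5, 0.5, [0.0, 2.0]), (0.6, 0.4, [4.0, 4.0])]
toy_loc_code = loc_vlad(toy_located, sum_encode)
```

In line with the text, such a function would be applied to query images only, while database images keep the plain encoding of all their descriptors.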

At the end, VLAD descriptors are PCA-whitened [11] to 128D.

4 Experimental results

The proposed approach has been extensively tested on public datasets in order to evaluate the accuracy against the state of the art.

4.1 Datasets and evaluation metrics

The performance is measured on four public image datasets:

  • Holidays [12] is composed of 1491 holiday photos of different locations, subdivided into 500 classes. There are 991 database images and 500 query images, one for every class.

  • Oxford5k [23] is composed of 5062 images of Oxford landmarks. There are 11 classes and 55 queries (5 for each class).

  • Paris6k [24] is composed of 6412 images of landmarks of Paris, France. There are 11 classes and 55 queries (5 for each class).

  • UKB [19] is composed of 10200 images of diverse categories such as animals, plants, etc., subdivided into 2550 classes. Every class is composed of 4 images. All the images are used as database images and only one per category is used as a query image.

The vocabulary for the VLAD descriptors was created on Paris6k. When testing on Paris6k, the vocabulary was instead created on Oxford5k.

Retrieval performance is evaluated with mean Average Precision (mAP), while the distances between the query VLAD descriptor and the database VLAD descriptors are computed with the $L_2$ distance.
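The evaluation can be sketched as follows, assuming a Euclidean distance for ranking and the plain (non-interpolated) definition of average precision. The official evaluation protocols of the benchmarks may differ in details, for example in how "junk" images are handled, so this is a simplified illustration with hypothetical function names.

```python
import math
from typing import List, Sequence, Set


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_database(query: Sequence[float], database: Sequence[Sequence[float]]) -> List[int]:
    """Database indices sorted from the closest to the farthest descriptor."""
    return sorted(range(len(database)), key=lambda i: l2_distance(query, database[i]))


def average_precision(ranking: Sequence[int], relevant: Set[int]) -> float:
    """Mean of the precision values measured at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, index in enumerate(ranking, start=1):
        if index in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings: Sequence[Sequence[int]],
                           relevants: Sequence[Set[int]]) -> float:
    """Average of the per-query average precision values."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


toy_database = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.0]]
toy_ranking = rank_database([0.0, 0.0], toy_database)
toy_ap = average_precision(toy_ranking, {3, 2})
```

In the toy example the relevant items are found at ranks 2 and 4, giving precisions of 1/2 and 2/4 and therefore an average precision of 0.5.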

In terms of actual implementation, the CNN-based detector and descriptor described in Section 3.1 runs on an NVIDIA GeForce GTX 1070 GPU mounted on a computer with an 8-core 3.40GHz CPU. All experiments were run on 4 threads. The feature extraction system was implemented with the Keras library.

4.2 Results on Holidays

Table 1 reports the results obtained by extracting the features with CNNs pre-trained on ImageNet [5]. Different CNN architectures have been tested: VGG19 [27] and InceptionV3 [30].

All the experiments were executed using a vocabulary of visual words created on the Paris dataset with K-Means [15] clustering, initialized following the K-Means++ [20] strategy. The initial configuration extracted features from the block4_pool layer of VGG19 with an input image of 224x224, the default input size of VGG19. Enlarging the input image improved the results. The choice of the layer from which the feature maps are extracted is also important: moving from block4_pool to block5_pool of VGG19 gave an improvement of about 2%. The breakthrough was the use of InceptionV3. Thanks to the depth of this CNN architecture, the extracted feature maps produced a more representative VLAD descriptor. The first experiment executed on InceptionV3 reached a mAP of 81.55%, almost 4% more than VGG19. After finding the best configuration for the Network and Layer parameters, we focused on the other parameters. Applying locVLAD instead of VLAD in the feature aggregation phase improved the performance by 3%. The root square normalization also slightly improved the retrieval accuracy. Since resizing the images to a square does not respect their aspect ratio, we modified the input of InceptionV3 to accept images of variable size, adapting it to the different dimensions of the images. Finally, the application of PCA-whitening reduced the dimension of the descriptors and removed the noisy values, also improving the performance.

Network Layer Input image DDR Root square norm. Encoding PCA-whit. mAP
VGG19 block4_pool 224x224 VLAD 74.33
VGG19 block4_pool 550x550 VLAD 75.95
VGG19 block5_pool 550x550 VLAD 77.80
InceptionV3 mixed_8 450x450 VLAD 81.55
InceptionV3 mixed_8 450x450 locVLAD 84.55
InceptionV3 mixed_8 562x562 locVLAD 85.34
InceptionV3 mixed_8 562x562 locVLAD 85.98
InceptionV3 mixed_8 562x662* locVLAD 86.70
InceptionV3 mixed_8 562x662* locVLAD 128D 87.38
InceptionV3 mixed_8 562x662* locVLAD 128D 85.63
InceptionV3 mixed_8 562x662* locVLAD 256D 89.93
InceptionV3 mixed_8 562x662* locVLAD 512D 90.46
Table 1: Results on Holidays. * indicates cases where the image aspect ratio is maintained.

It is worth noting that all the VLAD and locVLAD vectors are finally normalized using Z-score normalization. Furthermore, the application of DDR greatly improved the retrieval performance, as reported in the fourth-last row of Table 1.

4.3 Comparison with the state of the art on Holidays, Oxford5k, Paris6k and UKB

Method Dimension Oxford5k Paris6k Holidays UKB
VLAD [13] 4096 37.80 38.60 55.60 3.18
CEVLAD [36] 128 53.00 - 68.10 3.093
FVLAD [14] 128 - - 62.20 3.43
HVLAD [6] 128 - - 64.00 3.40
gVLAD [32] 128 60.00 - 77.90 -
Ng et al. [34] 128 55.80 58.30 83.60 -
DDR locVLAD 128 57.52 64.70 87.38 3.70
NetVLAD [1] 512 59.00 70.20 82.90 -
DDR locVLAD 512 61.46 71.88 90.46 3.76
Table 2: Comparison of state-of-the-art methods on different public CBIR datasets.

Table 2 reports the comparison with VLAD approaches on some public datasets.

All the descriptors of the VLAD-based methods are PCA-whitened, except the first one. At 128D, our method outperforms the others on all the presented datasets except Oxford5k, with particularly good results on the Holidays dataset. On Oxford5k, gVLAD obtains a slightly better accuracy than ours (60.00% against 57.52%) thanks to the concatenation of different VLAD vectors, calculated following the orientation of the local features. On the other hand, when the dimension of the descriptor is increased from 128D to 512D, our method performs better than the others on Oxford5k as well. The high accuracy is the result of the focus on the central features of the images due to locVLAD and of the dense representation of DDR.

5 Conclusions

The paper presents a new Dense-Depth Representation that, combined with locVLAD descriptors, outperforms the state of the art among VLAD-based descriptors on several public benchmarks. The combination of the dense representation (DDR) and the focus of locVLAD on the most important part of the images gives a better representation of each image and therefore improves the retrieval accuracy without the need to increase the dimension of the descriptor, which remains 512D. Future work will investigate a different embedding, such as R-MAC. The application of fine-tuning could also help to improve the final accuracy.

Acknowledgment. This work is partially funded by Regione Emilia Romagna under the “Piano triennale alte competenze per la ricerca, il trasferimento tecnologico e l’imprenditorialità”.


  • [1] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
  • [2] Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 2911–2918
  • [3] Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European conference on computer vision, Springer (2014) 584–599
  • [4] Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: European conference on computer vision, Springer (2006) 404–417
  • [5] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE (2009) 248–255
  • [6] Eggert, C., Romberg, S., Lienhart, R.: Improving VLAD: hierarchical coding and a refined local coordinate system. ICIP (2014) 3018–3022
  • [7] Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: European conference on computer vision, Springer (2014) 392–407
  • [8] Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: European Conference on Computer Vision, Springer (2016) 241–257
  • [9] Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124(2) (2017) 237–254
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
  • [11] Jégou, H., Chum, O.: Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening. Computer Vision–ECCV 2012 (2012) 774–787
  • [12] Jégou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometry consistency for large scale image search-extended version. Proceedings of the 10th European Conference on Computer Vision: Part I (2008) 304–317
  • [13] Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. CVPR (2010) 3304–3311
  • [14] Liu, Z., Wang, S., Tian, Q.: Fine-residual VLAD for image retrieval. Neurocomputing (2016)
  • [15] Lloyd, S.: Least squares quantization in pcm. IEEE transactions on information theory 28(2) (1982) 129–137
  • [16] Lowe, D.G.: Object recognition from local scale-invariant features. In: Computer vision, 1999. The proceedings of the seventh IEEE international conference on. Volume 2., Ieee (1999) 1150–1157
  • [17] Magliani, F., Bidgoli, N.M., Prati, A.: A location-aware embedding technique for accurate landmark recognition. ICDSC (2017) 9–14
  • [18] Magliani, F., Prati, A.: An accurate retrieval through R-MAC+ descriptors for landmark recognition. arXiv preprint arXiv:1806.08565 (2018)
  • [19] Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Computer vision and pattern recognition, 2006 IEEE computer society conference on. Volume 2., Ieee (2006) 2161–2168
  • [20] Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of lloyd-type methods for the k-means problem. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, IEEE (2006) 165–176
  • [21] Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22(10) (2010) 1345–1359
  • [22] Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 3384–3391
  • [23] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2007)
  • [24] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. CVPR (2008)
  • [25] Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications 4(3) (2016) 251–258
  • [26] Reddy Mopuri, K., Venkatesh Babu, R.: Object level deep feature pooling for compact image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2015) 62–70
  • [27] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [28] Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. ICCV 2 (2003) 1470–1477
  • [29] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. CVPR (2015) 1–9
  • [30] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2818–2826
  • [31] Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
  • [32] Wang, Z., Di, W., Bhardwaj, A., Jagadeesh, V., Piramuthu, R.: Geometric VLAD for large scale image search. arXiv preprint arXiv:1403.3829 (2014)
  • [33] Yan, K., Wang, Y., Liang, D., Huang, T., Tian, Y.: CNN vs. SIFT for image retrieval: Alternative or complementary? In: Proceedings of the 2016 ACM on Multimedia Conference, ACM (2016) 407–411
  • [34] Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2015) 53–61
  • [35] Zhao, W.L., Jégou, H., Gravier, G.: Oriented pooling for dense and non-dense rotation-invariant features. BMVC (2013)
  • [36] Zhou, Q., Wang, C., Liu, P., Li, Q., Wang, Y., Chen, S.: Distribution entropy boosted VLAD for image retrieval. Entropy Vol.18, No. 8 (2016)