The recent growth of available images and videos has motivated researchers to work on Content-Based Image Retrieval (CBIR). There are many types of tasks in CBIR: the most studied is instance-level image search, which consists in retrieving the images most similar to a given query image. This task presents several challenges in terms of accuracy, search time and memory occupancy. Another relevant problem lies in the images themselves, which may present noisy elements (e.g., trees, people, cars, …), different lighting conditions, viewpoints and resolutions.
Image retrieval systems are based on a pipeline usually composed of: extraction of local features from the image, aggregation of the extracted features into a compact representation, and retrieval of the most similar images. Initially, the focus was on the feature aggregation step, and hence different types of embeddings were proposed in order to reduce the memory used and obtain a more representative global descriptor. Recently, due to the excellent results obtained in many computer vision tasks, deep learning approaches have become dominant also in image retrieval. In particular, Convolutional Neural Networks (CNNs) are adopted for the feature detection and description phase. They allow features to be densely extracted from images; these features are better than those extracted with hand-crafted methods like SIFT or SURF, because the many convolutional layers can capture more details of the images.
Following these recent advances, this paper introduces a dense-depth detector applied to CNN codes extracted from the InceptionV3 network. This strategy augments the number of features in order to reach higher accuracy with a variant of VLAD descriptors, called locVLAD. It outperforms the previous VLAD implementation on several public benchmarks, thanks also to the use of Z-score normalization. Furthermore, a complete comparison and analysis with the other variants of VLAD is presented.
The paper is organized as follows. Section 2 reviews the methods used for feature extraction. Next, Section 3.2 presents the Dense-Depth Representation, while Section 3.3 describes the VLAD algorithm. Section 4 reports the experimental results on four public datasets: Holidays, Oxford5k, Paris6k and UKB. Finally, concluding remarks are reported.
2 Related work
In recent years, the problem of Content-Based Image Retrieval has been addressed in many different ways. The first technique to be developed was the Bag of Words (BoW). It is a simple method that reaches good results, but consumes a large amount of memory. After the development of the BoW approach, researchers tried to overcome its weaknesses and implemented several embedding techniques: Hamming embedding, Fisher Vector and VLAD. VLAD is the most used embedding technique; it tries to reduce the dimensionality of the features whilst preserving the recognition accuracy. A VLAD vector is the concatenation of the sums of the differences between the features and their closest centers, computed by using K-means clustering. Many variants of VLAD have been presented in the literature in order to address the weaknesses of the original descriptor: CVLAD, CEVLAD, HVLAD, FVLAD, gVLAD and locVLAD. CEVLAD applies entropy for the aggregation of the features; FVLAD modifies the aggregation step using two codebooks, a descriptor one and a residual one; HVLAD introduces a hierarchy of codebooks, which allows a more robust version of the VLAD descriptors to be created. Then, gVLAD creates different VLADs using the orientation of the features, which are then concatenated. This process increases the performance, but requires extra time.
Recently, with the development of new powerful GPUs, the deep learning approach has shown superior performance in many image retrieval tasks. Arandjelovic et al. applied a VLAD layer at the end of a CNN architecture, showing that a CNN-based pipeline reaches excellent results in the retrieval task. Another improvement brought by deep learning techniques is in the feature extraction phase. This process is known as "transfer learning" and consists in adapting the parameters trained in one feature space so that they work in another. Some methods that use transfer learning are: Spatial pooling, MOP-CNN, Neural codes, Ng et al., CCS, OC, R-MAC, Gordo et al. and Magliani et al. Also, fine-tuning global descriptors on a similar image dataset allows the accuracy to be greatly improved, but with an extra time effort due to the training phase on the new dataset.
3 System Architecture
In the following subsections, the proposed approach for feature extraction and encoding, using CNN features and the locVLAD embedding, is described.
3.1 CNN codes
CNN codes are feature vectors extracted from pre-trained networks, exploiting the knowledge gained while solving one problem to address a different, yet related, one through the process of transfer learning. There are several pre-trained CNN architectures (like VGG16, GoogLeNet, ResNet) that allow features to be easily extracted from their layers. The choice of the layer depends on the type of problem and the selected network. Generally, the deeper the network is, the better the results obtained by the extracted features are.
In this paper, the selected network is the recent InceptionV3, because it allows a more proper representation than VGG16 to be obtained, thanks to the concatenation of different convolutions. The CNN codes are extracted from the 8th inception pooling layer (called mixed8 in the Keras implementation). Both the network and the layer have been chosen since they achieved the best results in our experiments. An ablation analysis of different networks has been conducted and is reported in Section 4.
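As a concrete sketch of this extraction step, the snippet below builds a truncated InceptionV3 model ending at mixed8 in Keras. A fixed 299x299 input is used here for simplicity, and `weights=None` keeps the sketch self-contained; in practice `weights='imagenet'` would be loaded for transfer learning.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.models import Model

# Truncated InceptionV3 ending at the mixed8 layer (8x8x1280 for a 299x299 input).
# In practice weights='imagenet' would be used; None keeps the sketch offline.
base = InceptionV3(weights=None, include_top=False, input_shape=(299, 299, 3))
extractor = Model(inputs=base.input, outputs=base.get_layer('mixed8').output)

def extract_cnn_codes(image):
    """image: (299, 299, 3) array; returns the mixed8 feature map (8, 8, 1280)."""
    batch = preprocess_input(np.expand_dims(image.astype('float32'), 0))
    return extractor.predict(batch, verbose=0)[0]
```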
3.2 Dense-Depth representation
Since VLAD-based embeddings work better with dense representations, we introduce a novel representation scheme.
The features extracted from mixed8 are grouped into a larger set of features of lower dimensionality in order to augment the VLAD descriptive quality. Given a feature map of dimension W × H × D (W = width, H = height and D = depth), we split it along the depth axis in order to obtain a higher number of features of lower dimension, as can be seen in Fig. 1. Splitting along D allows the geometrical information of the feature maps to be maintained, because features that have different spatial positions are not aggregated. Following this method, the number of descriptors changes from W · H to W · H · (D/d), where d indicates the dimension of every single descriptor obtained after the split along the depth axis. This value needs to be a trade-off between the number of features and their discriminative quality. As an example, a feature map of 8×8×1280, with d = 128, will be transformed into a set of 8 · 8 · 10 = 640 descriptors of 128D. Hereinafter, we will refer to this as the Dense-Depth Representation (DDR).
After the extraction, the CNN codes are normalized using the root square normalization described in .
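The two steps above can be sketched in NumPy as follows. The DDR reshape follows the description in the text; the exact form of the root square normalization (signed square root followed by L2 normalization) is an assumption, since the text only cites it.

```python
import numpy as np

def dense_depth_representation(feature_map, d=128):
    """Split a (W, H, D) feature map along depth into W*H*(D/d) descriptors of dim d."""
    W, H, D = feature_map.shape
    assert D % d == 0, "depth must be divisible by the descriptor size"
    # Each spatial location contributes D/d descriptors; geometry is preserved
    # because values from different (w, h) positions are never mixed.
    return feature_map.reshape(W * H * (D // d), d)

def root_square_normalize(descriptors, eps=1e-12):
    """Signed square root followed by L2 normalization (assumed form of the
    root square normalization cited in the text)."""
    x = np.sign(descriptors) * np.sqrt(np.abs(descriptors))
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
```

For the 8×8×1280 example above, `dense_depth_representation` returns 640 descriptors of 128D.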
We will demonstrate in the experiments that the newly proposed DDR achieves higher performance.
3.3 VLAD and locVLAD
Starting from a codebook $C = \{c_1, \dots, c_K\}$ of $d$-dimensional visual features, generated by K-means clustering on the CNN codes, every local descriptor $x_i$ extracted from the image is assigned to the closest cluster center of the codebook:

$NN(x_i) = \arg\min_{c_j \in C} \mathrm{dist}(x_i, c_j)$

where $\mathrm{dist}(\cdot, \cdot)$ is a proper $d$-dimensional distance measure and $d$ is the size of the descriptors ($d$ = 128 in the aforementioned example).
The VLAD vector is obtained by computing the sum of residuals, i.e. the differences between the feature descriptors and the corresponding cluster center:

$v_j = \sum_{x_i : NN(x_i) = c_j} (x_i - c_j)$

The last step is the concatenation of the $v_j$ vectors, resulting in the unnormalized VLAD vector $v = [v_1^T, \dots, v_K^T]^T$. This vector often includes repeated (duplicated) features that make the recognition more difficult. To solve this problem, the vector is normalized using the Z-score normalization, which starts with a residual normalization, i.e. an independent normalization of the residual of every visual word $c_j$:

$v_j \leftarrow \frac{v_j}{\|v_j\|}$

Next, the resulting vector is further normalized as follows:

$\hat{v} = \frac{v - \mu}{\sigma}$

where $\mu$ and $\sigma$ represent the mean and the standard deviation of the vector $v$, respectively. Finally, the vector is further L2-normalized.
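The aggregation and normalization steps above can be sketched in NumPy as follows (the small epsilon terms are an implementation detail added to avoid division by zero):

```python
import numpy as np

def vlad(descriptors, centers):
    """VLAD with per-word residual normalization, Z-score normalization and a
    final L2 normalization, following the steps described in the text."""
    K, d = centers.shape
    # Hard-assign each descriptor to its nearest cluster center.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    v = np.zeros((K, d))
    for j in range(K):
        members = descriptors[assign == j]
        if len(members):
            v[j] = (members - centers[j]).sum(axis=0)   # sum of residuals
            v[j] /= (np.linalg.norm(v[j]) + 1e-12)      # per-word residual norm
    v = v.ravel()                                       # concatenation (K*d values)
    v = (v - v.mean()) / (v.std() + 1e-12)              # Z-score normalization
    return v / (np.linalg.norm(v) + 1e-12)              # final L2 normalization
```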
Recently, with the introduction of locVLAD, the retrieval accuracy has been increased in an intuitive way. The locVLAD vector is calculated as the mean of two different VLAD descriptors: one computed on the whole image and one computed only on the central portion of the image. The idea behind this method is that the most important and useful features of an image often lie in its central region. LocVLAD is applied only to the query images, because the database images contain different views of the same landmark, as well as zoomed views.
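A minimal sketch of locVLAD is given below; `extract_fn` and `vlad_fn` stand for the feature extraction and VLAD encoding steps described earlier, and the 70% central crop is a placeholder, since the text does not specify the crop size.

```python
import numpy as np

def locvlad(image, extract_fn, vlad_fn, central_fraction=0.7):
    """locVLAD sketch: mean of the VLAD of the whole image and the VLAD of a
    central crop. central_fraction is an assumed parameter."""
    h, w = image.shape[:2]
    mh = int(h * (1 - central_fraction) / 2)
    mw = int(w * (1 - central_fraction) / 2)
    center = image[mh:h - mh, mw:w - mw]
    v_full = vlad_fn(extract_fn(image))     # VLAD on the whole image
    v_center = vlad_fn(extract_fn(center))  # VLAD on the central portion
    v = (v_full + v_center) / 2.0
    return v / (np.linalg.norm(v) + 1e-12)
```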
Finally, the VLAD descriptors are PCA-whitened to 128D.
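This last step can be sketched with scikit-learn as follows; fitting the projection on a held-out set of descriptors and re-normalizing after projection are assumptions, as the text does not detail them.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_whitening(train_vlads, dim=128):
    """Learn a PCA-whitening projection to `dim` dimensions on training VLADs."""
    pca = PCA(n_components=dim, whiten=True)
    pca.fit(train_vlads)
    return pca

def apply_pca_whitening(pca, vlads, eps=1e-12):
    """Project VLADs and re-L2-normalize so L2 distances remain comparable."""
    reduced = pca.transform(vlads)
    return reduced / (np.linalg.norm(reduced, axis=1, keepdims=True) + eps)
```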
4 Experimental results
The proposed approach has been extensively tested on public datasets in order to evaluate the accuracy against the state of the art.
4.1 Datasets and evaluation metrics
The performance is measured on four public image datasets:
Holidays is composed of 1491 images representing holiday photos of different locations, subdivided into 500 classes. There are 991 database images and 500 query images, one for every class.
Oxford5k is composed of 5062 images of Oxford landmarks. There are 11 classes and 55 queries (5 for each class).
Paris6k is composed of 6412 images of landmarks of Paris, France. There are 11 classes and 55 queries (5 for each class).
UKB is composed of 10200 images of diverse categories such as animals, plants, etc., subdivided into 2550 classes. Every class is composed of 4 images. All the images are used as database images and only one per class is used as a query image.
The vocabulary for the VLAD descriptors was created on Paris6k; when testing on Paris6k, the vocabulary was instead created on Oxford5k.
For the evaluation of the retrieval performance, the mean Average Precision (mAP) was adopted, while for the calculation of the distances between the query VLAD descriptor and the database VLAD descriptors the L2 distance was employed.
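A minimal sketch of this evaluation protocol (L2-distance ranking followed by mAP over binary relevance labels) could look as follows:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP of one query, given a binary relevance list in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_hit = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hit * rel).sum() / rel.sum())

def mean_average_precision(queries, database, labels_q, labels_db):
    """Rank database vectors by L2 distance to each query and average the APs."""
    aps = []
    for q, lq in zip(queries, labels_q):
        order = np.argsort(np.linalg.norm(database - q, axis=1))
        aps.append(average_precision(labels_db[order] == lq))
    return float(np.mean(aps))
```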
In terms of actual implementation, the detector and descriptor used are the CNN-based ones described in Section 3.1, running on an NVIDIA GeForce GTX 1070 GPU mounted on a computer with an 8-core 3.40 GHz CPU. All experiments have been run on 4 threads. The feature extraction system was implemented using the Keras library.
4.2 Results on Holidays
Table 1 reports the results obtained by extracting the features with CNNs pre-trained on ImageNet. Different CNN architectures have been tested: VGG19 and InceptionV3.
All the experiments were executed using a vocabulary of visual words created on the Paris6k dataset through the application of the K-means clustering technique, initialized following the K-means++ strategy. The initial configuration extracted features from the block4_pool layer of VGG19 with an input image of dimensions 224x224, which is the predefined input size of VGG19. Changing the dimension of the input image improved the results. Also, the choice of the layer from which to extract the feature maps is important: moving from block4_pool to block5_pool of VGG19 yielded an improvement of 2%. The breakthrough was the use of InceptionV3: thanks to the depth of this CNN architecture, the extracted feature maps allowed a more representative VLAD descriptor to be created. The first experiment executed on InceptionV3 reached a mAP equal to 81.55%, which is almost 4% more than VGG19. After finding the best configuration for the network and layer, we focused on the other parameters. The application of locVLAD instead of VLAD in the feature aggregation phase improved the performance by 3%. Also, the root square normalization slightly improved the retrieval accuracy. Following the idea that extracting features from an image resized to a square, without respecting the aspect ratio, is not a good choice, we modified the input of InceptionV3 to accept a variable size, adaptable to the different dimensions of the images. Finally, the application of PCA-whitening reduced the dimension of the descriptors and removed noisy values, further improving the performance.
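The vocabulary construction described above can be sketched with scikit-learn; the vocabulary size K is left as a parameter, since the text does not fix it here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, K=64, seed=0):
    """Sketch of codebook creation: K-means with k-means++ initialization run on
    DDR descriptors pooled from an independent dataset (e.g. Paris6k).
    Returns the (K, d) matrix of cluster centers used as the VLAD codebook."""
    km = KMeans(n_clusters=K, init='k-means++', n_init=10, random_state=seed)
    km.fit(descriptors)
    return km.cluster_centers_
```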
[Table 1: Network | Layer | Input image | DDR | Root square norm. | Encoding | PCA-whit. | mAP]
It is worth noting that all the VLAD and locVLAD vectors are finally normalized using the Z-score normalization. Furthermore, the application of DDR allowed the retrieval performance to be highly improved, as reported in the fourth-to-last row of Table 1.
4.3 Comparison with the state of the art on Holidays, Oxford5k, Paris6k and UKB
[Table 2 row, e.g.: Ng et al. | 128 | 55.80 | 58.30 | 83.60 | - ]
Table 2 reports the comparison with VLAD approaches on some public datasets.
All the descriptors of the VLAD-based methods are PCA-whitened, except the first one. Our method outperforms the others on all the presented datasets, reaching good results on the public benchmarks, in particular on the Holidays dataset. On Oxford5k, however, gVLAD obtained a slightly better accuracy than ours, due to the concatenation of different VLADs calculated following the orientation of the local features. On the other hand, by augmenting the dimension of the descriptor from 128D to 512D, our method performs better than the others even on Oxford5k. The high accuracy is the result of the focus on the central features of the images due to locVLAD, and of the dense representation of DDR.
5 Conclusions
The paper presents a new Dense-Depth Representation that, combined with locVLAD descriptors, outperforms the state of the art related to VLAD descriptors on several public benchmarks. The combination of the dense representation (DDR) and locVLAD's focus on the most important part of the image allows a better representation of each image to be obtained and, therefore, the retrieval accuracy to be improved without the need to augment the dimension of the descriptor, which still remains 512D. Future work will address different embeddings such as R-MAC. Also, the application of fine-tuning could help to improve the final accuracy results.
Acknowledgment. This work is partially funded by Regione Emilia Romagna under the “Piano triennale alte competenze per la ricerca, il trasferimento tecnologico e l’imprenditorialità”.
-  Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
-  Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 2911–2918
-  Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European conference on computer vision, Springer (2014) 584–599
-  Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: European conference on computer vision, Springer (2006) 404–417
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE (2009) 248–255
-  Eggert, C., Romberg, S., Lienhart, R.: Improving VLAD: hierarchical coding and a refined local coordinate system. ICIP (2014) 3018—3022
-  Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: European conference on computer vision, Springer (2014) 392–407
-  Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: European Conference on Computer Vision, Springer (2016) 241–257
-  Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124(2) (2017) 237–254
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Jégou, H., Chum, O.: Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening. Computer Vision–ECCV 2012 (2012) 774–787
-  Jégou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometry consistency for large scale image search-extended version. Proceedings of the 10th European Conference on Computer Vision: Part I (2008) 304 – 317
-  Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. CVPR (2010) 3304–3311
-  Liu, Z., Wang, S., Tian, Q.: Fine-residual VLAD for image retrieval. Neurocomputing (2016)
-  Lloyd, S.: Least squares quantization in pcm. IEEE transactions on information theory 28(2) (1982) 129–137
-  Lowe, D.G.: Object recognition from local scale-invariant features. In: Computer vision, 1999. The proceedings of the seventh IEEE international conference on. Volume 2., IEEE (1999) 1150–1157
-  Magliani, F., Bidgoli, N.M., Prati, A.: A location-aware embedding technique for accurate landmark recognition. ICDSC (2017) 9 – 14
-  Magliani, F., Prati, A.: An accurate retrieval through R-MAC+ descriptors for landmark recognition. arXiv preprint arXiv:1806.08565 (2018)
-  Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Computer vision and pattern recognition, 2006 IEEE computer society conference on. Volume 2., IEEE (2006) 2161–2168
-  Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of lloyd-type methods for the k-means problem. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, IEEE (2006) 165–176
-  Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22(10) (2010) 1345–1359
-  Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 3384–3391
-  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2007)
-  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. CVPR (2008)
-  Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications 4(3) (2016) 251–258
-  Reddy Mopuri, K., Venkatesh Babu, R.: Object level deep feature pooling for compact image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2015) 62–70
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. ICCV 2 (2003) 1470–1477
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. CVPR (2015) 1–9
-  Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2818–2826
-  Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
-  Wang, Z., Di, W., Bhardwaj, A., Jagadeesh, V., Piramuthu, R.: Geometric VLAD for large scale image search. arXiv preprint arXiv:1403.3829 (2014)
-  Yan, K., Wang, Y., Liang, D., Huang, T., Tian, Y.: CNN vs. SIFT for image retrieval: Alternative or complementary? In: Proceedings of the 2016 ACM on Multimedia Conference, ACM (2016) 407–411
-  Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2015) 53–61
-  Zhao, W.L., Jégou, H., Gravier, G.: Oriented pooling for dense and non-dense rotation-invariant features. BMVC (2013)
-  Zhou, Q., Wang, C., Liu, P., Li, Q., Wang, Y., Chen, S.: Distribution entropy boosted VLAD for image retrieval. Entropy Vol.18, No. 8 (2016)