In this paper, we are interested in image retrieval, with a special focus on how to integrate spatial information into the process. In image retrieval, a collection of images is ranked by decreasing visual similarity with respect to a query, with the hope that relevant images are ranked first. This kind of visual search engine is known as Content-Based Image Retrieval (CBIR) and has been the subject of many improvements during the last decade. Applications of image retrieval include copyright infringement detection, geo-localization, and user interaction for shopping, among others.
Two competing strategies are popular for solving the image retrieval problem. The first one involves computing local descriptors in both the query and the target images and counting the number of matching descriptors between the query and the target. A geometric consistency check is then applied to remove matches that are incoherent with the layout of both images (e.g., descriptors that are spatially close in the query should also be spatially close in the target). As noted in , this spatial consistency check is crucial in descriptor matching techniques, as it contributes substantially to their good performance. However, despite efficient descriptor hashing techniques [6, 7], matching-based strategies still come at a significant computational and storage cost.
The second strategy addresses these shortcomings by computing a global representation for each image and then using a similarity measure on these representations as a proxy for the visual similarity [8, 9, 10]. Recently, global features have become very competitive when compared to local descriptor matching on challenging datasets. However, global features usually do not allow a geometric consistency check, because all spatial information is lost during the computation of the representation, and consequently they do not benefit from the associated performance gain.
In this paper, we intend to integrate spatial information into global features so as to perform an implicit geometric consistency check when computing the similarity between the resulting features. We propose to integrate this information by modeling correlations between nearby features using a tensor framework, as illustrated by the results presented in Figure 9. Tensor embeddings have recently become popular for computing visual features. In particular, we build on the ideas proposed in the Spatial Tensor Aggregation (STA) of  to propose a new global feature called ISTA. Our contributions are the following:
- We correct all the flaws in the original STA, namely the lack of proper centering, alignment and normalization, leading to the Improved STA (ISTA).
- We propose an adapted two-step dimension reduction method to cope with the high intermediate dimension of the STA.
- Finally, our proposed model obtains state-of-the-art results on well-known image retrieval datasets (Holidays, Oxford and Paris).
The remainder of this paper is organized as follows. First, we present the related work on global feature computation for image retrieval. Then, we develop our Improved STA model, for which we show all the corrections to the original STA as well as the proposed new steps, and give intuition about their goals. Next, we present experiments comparing our model to the state of the art on three well-known datasets, namely Holidays, Oxford5k and Paris6k.
2 Related work
In content-based image retrieval, aggregation algorithms are the current state of the art due to their very favorable trade-off between computational complexity and performance. Efficient aggregation methods such as Fisher Vector [14, 10] and VLAD have been shown to provide very good results, especially when considering their improved versions [16, 17] combined with dimension reduction techniques [18, 19] and re-encoding methods [20, 21].
Recently, convolutional neural networks (CNNs) have been shown to provide a strong baseline image representation for image retrieval, even when the network is trained for a completely different task. When used as feature extractors, CNNs showed excellent improvements. Finally, NetVLAD and Deep VSQ greatly improved the state of the art by converting aggregation algorithms into differentiable neural network layers, which allows the representations to be learned in an end-to-end manner with dedicated loss functions.
However, the main drawback of aggregation methods is that all spatial information is lost during the aggregation. Several methods were proposed in order to retrieve this spatial information, either in absolute coordinates or in relative ones. Spatial Pyramid Matching (SPM) and MOP-CNN integrate spatial information in absolute coordinates by aggregating descriptors in cells of multiple scales following a recursive grid pattern. However, these methods see their dimension increased by a factor where is the number of scales, which becomes prohibitive. Moreover, since these methods model the entire image layout, they are not well suited for comparing images where the position of the regions of interest varies.
In contrast to these global methods, Spatial Tensor Aggregation (STA) is an aggregation method that integrates relative spatial information based on the linearization of the following matching kernel between spatially coupled pairs of descriptors:
with being the set of local descriptors from image (resp. for image ), being the set of descriptors in a spatial neighborhood centered on and a similarity measure between descriptors (e.g., based on a visual codebook). The embedding resulting from the linearization contains an implicit local geometric consistency check, in that the similarity is high only when both the descriptor and its neighbors are matched in the target image with preserved neighborhood properties. STA performs better than similar second-order information aggregations such as Fisher Vector while offering an alternative to the spatial pyramid. While the STA idea is sound, the work of  does not leverage local descriptor centering and processing, which are known to have a significant impact [16, 17]. Furthermore, the tensor framework proposed in STA leads to signatures of too high a dimension to be of any practical use. The lack of proper normalization and dimension reduction therefore leads to non-competitive results in the original paper. Because these steps are non-trivial in the case of STA, we detail in the next section our propositions, which lead to a simple yet very competitive image feature.
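The linearization argument can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes descriptors laid out on a regular grid, an 8-connected neighborhood, and a plain dot product as the similarity between descriptors (the text mentions a codebook-based similarity instead). All function names are hypothetical. Under these assumptions, the sum over spatially coupled pairs equals the inner product between two aggregated second-order tensors.

```python
import numpy as np

def neighbors(idx, h, w):
    """Flat indices of the 8-connected grid neighbors of position idx."""
    r, c = divmod(idx, w)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= rr < h and 0 <= cc < w:
                out.append(rr * w + cc)
    return out

def coupled_kernel(X, Y, h, w):
    """Naive kernel: sum of k(x, y) * k(x', y') over coupled pairs, k = dot product."""
    total = 0.0
    for p in range(h * w):
        for pn in neighbors(p, h, w):
            for q in range(h * w):
                for qn in neighbors(q, h, w):
                    total += (X[p] @ Y[q]) * (X[pn] @ Y[qn])
    return total

def embedding(X, h, w):
    """Linearized form: sum of outer products x x'^T over coupled pairs."""
    d = X.shape[1]
    T = np.zeros((d, d))
    for p in range(h * w):
        for pn in neighbors(p, h, w):
            T += np.outer(X[p], X[pn])
    return T

rng = np.random.default_rng(0)
h, w, d = 3, 3, 4
X = rng.normal(size=(h * w, d))   # toy descriptors of a first image
Y = rng.normal(size=(h * w, d))   # toy descriptors of a second image
k_naive = coupled_kernel(X, Y, h, w)
k_linear = float(np.sum(embedding(X, h, w) * embedding(Y, h, w)))
```

Here `k_naive` and `k_linear` agree up to floating-point error, while the second form needs only one pass over each image and a single inner product at query time.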
3 Method overview
In this section, we detail our propositions to improve the STA aggregation method. In Section 3.1, we propose an adapted descriptor processing. In Section 3.2, we investigate a proper normalization for the resulting feature. In Section 3.3, we present a two-step dimension reduction method.
3.1 Descriptors processing
In improved VLAD representations, descriptors are processed using the following steps: first, the residual between the descriptor and its cluster center is computed; then, this residual is projected into the eigenspace of the cluster (using PCA).
In the case of the STA, we propose a similar approach. Given a clustering of the descriptor space, we first aggregate the residue between (i) a 4th-order tensor computed from a descriptor and a given neighborhood and (ii) the average of the 4th-order tensors for the relevant cluster pair. More precisely, the 4th-order aggregation tensor of the set of descriptors from image is computed with the following equation borrowed from :
where is filled by 0 with a unique 1 at position if the descriptor belongs to the -th cluster, is the set of descriptors in a neighborhood of , and is the tensor (Kronecker) product. In other words, is the STA feature of image and computes the correlation matrices between pairs of nearby descriptors for all ordered pairs of clusters to which they belong. A more convenient form is to define the tensor for a given pair of clusters using the following equation:
where is the cluster . By concatenating these tensors for all ordered pairs of clusters, we recover the original 4th-order tensor .
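The per-pair form described above can be sketched as follows. This is an illustrative sketch only: it assumes descriptors on a regular grid, hard cluster assignments, and an 8-connected neighborhood, and the name `sta_blocks` is hypothetical. Each block `T[i, j]` accumulates the outer products of descriptors from cluster `i` with their neighbors from cluster `j`; stacking all blocks gives the full 4th-order tensor.

```python
import numpy as np

def sta_blocks(desc, assign, h, w, n_clusters):
    """Aggregate outer products of neighboring descriptors per ordered cluster pair.

    desc: (h*w, d) descriptors on an h x w grid (row-major order).
    assign: (h*w,) integer cluster index of each descriptor.
    Returns T with shape (C, C, d, d).
    """
    d = desc.shape[1]
    T = np.zeros((n_clusters, n_clusters, d, d))
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    for r in range(h):
        for c in range(w):
            p = r * w + c
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    q = rr * w + cc
                    # x belongs to cluster assign[p], its neighbor x' to cluster assign[q]
                    T[assign[p], assign[q]] += np.outer(desc[p], desc[q])
    return T

rng = np.random.default_rng(1)
h, w, d, C = 4, 5, 3, 2
desc = rng.normal(size=(h * w, d))
assign = rng.integers(0, C, size=h * w)
T = sta_blocks(desc, assign, h, w, C)   # shape (2, 2, 3, 3)
```

With a symmetric neighborhood such as the one assumed here, the block for the pair (j, i) is the transpose of the block for (i, j), so the blocks for two different clusters are in general not symmetric matrices.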
To perform the centering, we compute the average 4th-order tensor . With the same observation, we define the partial average tensor for the given pair of clusters and estimate it over a set of samples. is the set of pairs where belongs to the -th cluster and belongs to the neighborhood of and the -th cluster:
Finally, we are able to define the partial image representation based on the aggregation of residues between the tensor and its estimation :
Since this average tensor is not symmetric, projecting the descriptors into the eigenspace of their cluster pair is not as straightforward as for VLAD. We propose to use the Singular Value Decomposition (SVD) to perform this projection. For a given pair of clusters, we compute the SVD of :
Using the result from Equation (7) has two advantages. On the one hand, each descriptor is projected into the eigenspace, which leads to decorrelated components in the resulting feature. On the other hand, only the singular values are needed for the centering. Furthermore, we can keep only the largest singular values in order to remove irrelevant correlations between nearby descriptors while also reducing the size of the resulting features. To that end, we propose an adaptive strategy that keeps a variable number of components for each pair of clusters so as to keep a fixed amount of the explained variance (e.g., 80%). The full signature is the concatenation of all for all pairs .
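The centering and projection step for a single cluster pair could look like the sketch below. It is a hedged illustration under several assumptions: the average tensor is estimated as the mean of the outer products over training pairs, descriptors are projected on the left singular vectors and their neighbors on the right singular vectors, and the "explained variance" is measured with squared singular values. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_pair_projection(x, xn, keep=0.8):
    """Estimate the average tensor of one cluster pair and its truncated SVD.

    x, xn: (n, d) arrays holding training descriptors and one neighbor for each.
    keep: fraction of spectral energy to retain (0.8 mirrors the 80% example).
    """
    M = x.T @ xn / len(x)                      # mean of the outer products x x'^T
    U, s, Vt = np.linalg.svd(M)
    # Assumption: explained variance is measured with squared singular values.
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, keep)) + 1, len(s))
    return U[:, :k], s[:k], Vt[:k].T

def centered_block(x, xn, U, s, V):
    """Sum over pairs of (U^T x)(V^T x')^T - diag(s), i.e. the projected residue."""
    return (x @ U).T @ (xn @ V) - len(x) * np.diag(s)

rng = np.random.default_rng(0)
train_x = rng.normal(size=(500, 8))
train_xn = 0.5 * train_x + rng.normal(size=(500, 8))   # correlated toy "neighbors"
U, s, V = fit_pair_projection(train_x, train_xn)
block = centered_block(train_x, train_xn, U, s, V)     # close to zero on the training pairs
```

In the projected space the average tensor becomes diagonal, which is why subtracting `diag(s)` is enough to center each pair; on the data used to estimate it, the aggregated residue vanishes up to rounding.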
3.2 Normalization
For a given number of clusters in the codebook, STA computes one tensor per ordered pair of clusters. Without a specific normalization, each tensor has the same impact on the final representation. To reduce the influence of a potentially noisy cluster, we propose to normalize separately the case where both descriptors of a pair belong to the same cluster and the case where they belong to different clusters. The reason behind the proposed approach is that the two cases correspond to different contexts in the image: the former corresponds to self-similar regions in the image (e.g., textures), while the latter corresponds to transitions between different patterns. As such, a cluster can have a negative impact in one context but not in the other.
We propose to normalize in a cross-cluster way by considering similarly ranked eigen-components. Since descriptor pairs have been projected into the eigenspace of their cluster pair, processing components independently should not lead to any loss of information, with the added benefit of having a signature that is invariant to the number of descriptors while keeping track of the distribution among cluster pairs.
For a component of , the normalization is computed as follows:
This normalization is computed after a power normalization () and is followed by a normalization on the full image representation.
3.3 Dimension reduction
After the centering and normalization, the resulting features have a dimension in the order of where is the number of components retained in the preprocessing. This usually leads to intermediate features of size close to , which we now propose to reduce. Dimension reduction is usually achieved using PCA on a set of signatures by keeping the components associated with the largest eigenvalues. In our case, PCA is tractable neither with the classic approach nor with the Gram matrix decomposition, due to the high dimension of the aggregated features. To cope with such a high dimension, we propose a two-step reduction scheme. First, we perform a sparse reduction that considers components related to a specific cluster pair. We call this step Block Reduction due to the block-diagonal structure of the resulting projection matrix. Then, we perform a full projection on the resulting vectors.
The block reduction is performed independently for each part of the normalized feature. This projection should map the feature block into a target space with fewer dimensions while retaining the inner product. Indeed, since the inner product on the full signature is the sum of the inner products over all blocks , retaining the inner product on blocks is a sufficient condition to retain the full similarity. This is achieved by performing a low-rank approximation of a Gram matrix computed on a large set of sampled blocks.
In practice, we compute for each pair the projection matrix using the SVD of a large set of blocks and retaining a number of components proportional to the original number of dimensions, using the following equations:
Each of these block-wise projections is then zero-padded to match the full size of the input features, and the padded projections are concatenated. This results in a single sparse projection matrix with a block-diagonal structure corresponding to the different pairs of clusters.
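The block reduction can be illustrated with a small sketch. It assumes that each block is flattened into a vector, that the projection of a block is given by the leading right singular vectors of a matrix of sampled blocks, and that the final matrix is stored densely for readability (a sparse format would be preferable at the dimensions discussed here). The function names, the 40% ratio default, and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_block_projection(blocks, ratio=0.4):
    """Projection for one cluster pair from sampled blocks of shape (n, d_block)."""
    _, _, Vt = np.linalg.svd(blocks, full_matrices=False)
    k = max(1, int(round(ratio * blocks.shape[1])))   # keep a fixed share of the dimensions
    return Vt[:k].T                                   # shape (d_block, k)

def block_diagonal(projections):
    """Zero-pad and concatenate block-wise projections into one block-diagonal matrix."""
    rows = sum(P.shape[0] for P in projections)
    cols = sum(P.shape[1] for P in projections)
    W = np.zeros((rows, cols))
    r = c = 0
    for P in projections:
        W[r:r + P.shape[0], c:c + P.shape[1]] = P
        r += P.shape[0]
        c += P.shape[1]
    return W

rng = np.random.default_rng(0)
# Two toy blocks of dimension 10, built with rank 3 so that keeping 4 dimensions loses nothing.
samples = [rng.normal(size=(50, 3)) @ rng.normal(size=(3, 10)) for _ in range(2)]
projections = [fit_block_projection(b) for b in samples]
W = block_diagonal(projections)
full = np.hstack(samples)      # (50, 20) concatenated toy signatures
reduced = full @ W             # (50, 8) signatures after block reduction
```

On this toy data the Gram matrices `full @ full.T` and `reduced @ reduced.T` coincide, which illustrates why preserving the inner product block by block is enough to preserve it on the full signature. On real data the blocks are not exactly low-rank, so the match is only approximate.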
Once the sparse projection is done, we perform the second reduction on the resulting features. Similarly to the first step, we aim at retaining the original inner product, which can be performed by finding a low-rank approximation of the Gram matrix using the SVD of a large set of features obtained by the sparse projection.
As shown in , performing a whitening (i.e., equalizing the variance in the target space) at this stage can lead to a significant improvement, which is what we also observe. Note that performing a whitening at this stage is only possible if no whitening was done during the block reduction stage, as it would lead to all correlations between blocks being of equal importance.
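A possible form of this second step, with whitening, is sketched below. It is an assumption-laden illustration rather than the exact procedure of the paper: the features are mean-centered before the SVD, the projection keeps the leading right singular vectors, and each retained direction is rescaled to unit variance. The function name and the `eps` safeguard are hypothetical.

```python
import numpy as np

def fit_whitened_projection(F, out_dim, eps=1e-8):
    """Fit a projection with whitening from features F of shape (n, d).

    Returns the mean and a (d, out_dim) matrix that maps centered features
    to decorrelated components with unit variance.
    """
    mu = F.mean(axis=0)                                   # assumption: features are centered first
    _, s, Vt = np.linalg.svd(F - mu, full_matrices=False)
    scale = np.sqrt(len(F) - 1) / (s[:out_dim] + eps)     # equalize the variance of each direction
    return mu, Vt[:out_dim].T * scale

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # toy block-reduced features
mu, P = fit_whitened_projection(F, out_dim=4)
Z = (F - mu) @ P                                             # whitened features, shape (200, 4)
```

The sample covariance of `Z` is the identity up to rounding, so every retained direction contributes equally to the similarity.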
4 Experiments
In this section, we compare our proposed ISTA approach with recent approaches in the literature. We start by giving technical details and present the datasets on which the evaluation is performed before we comment on the results.
4.1 Experiments pipeline
(cut after block 11) both pre-trained on ImageNet. All parameters are computed on 20k images taken from the Places365 validation set with a codebook of 32 visual words. In all our experiments, the raw dimension is 670k, which corresponds to keeping 80% of the variance for VGG16 and 95% for MobileNet in the preprocessing. The block projections are computed using 8192 images from Flickr100k, and we keep 40% of the original dimension. The full reduction is computed on 25k images from Flickr100k to reduce the final dimension to 22k. We compute the representations at 2 image resolutions (512px and 1024px) while preserving the image aspect ratio, and we sum the two representations.
We test our model on 3 image retrieval datasets: INRIA Holidays (1491 images, 500 queries), Oxford5k (5062 images, 55 queries) and Paris6k (6412 images, 55 queries). We report the mean average precision (mAP) for each dataset while using the full image as query.
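For reference, the mean average precision can be computed from ranked relevance lists as in the following sketch. It is a generic textbook formulation and assumes that every relevant image appears somewhere in the ranked list. The official evaluation protocols of the datasets may differ in details (for instance in how ignored images are handled), and the function names are hypothetical.

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision of one query from a ranked list of 0/1 relevance labels."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    # Average the precision values measured at the rank of each relevant item.
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(ranked_lists):
    """Mean of the average precision over all queries."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))

ap_example = average_precision([1, 0, 1, 0])             # relevant items at ranks 1 and 3
map_example = mean_average_precision([[1, 0], [0, 1]])   # two toy queries
```

For the first toy query the precision is 1 at rank 1 and 2/3 at rank 3, so the average precision is 5/6.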
In this part, we compare ISTA on Holidays, Oxford5k and Paris6k against the state of the art in Table 1. Results with VGG16 as feature extractor are similar to those of other methods that perform fine-tuning. However, these features were harder to compress than those extracted with MobileNet, which leads to better results: +3% mAP on Holidays and +5.5% mAP on Oxford. We are able to outperform state-of-the-art networks on Holidays with 94.4% mAP versus 91.7% mAP for NetVLAD with poly SLEM. On Oxford5k (resp. Paris6k), ISTA also outperforms the state of the art obtained by the same methods by more than 5% (resp. 10%).
Note that most of the methods reported in Table 1 perform fine-tuning over all parameters, whereas we do not for ISTA.
As an illustration, we show in Figure 22 two queries that were among the most improved by ISTA over NetVLAD. For a given query, we show the associated top 5 ranked images for both methods, as well as masks of the regions that contributed the most to the ranking. As we can see, NetVLAD focuses on regions that may look similar when taken independently from their context (like the sky in the first query), but that are not very distinctive. In contrast, our ISTA method focuses on strongly structured patterns that are much more distinctive.
5 Conclusion
In this paper, we propose the Improved Spatial Tensor Aggregation (ISTA) for aggregating local features into a single representation tailored for image retrieval. ISTA is based on a careful analysis of spatially coupled descriptors, for which we provide essential centering, normalization and dimension reduction operations. Together, our contributions allow ISTA to obtain state-of-the-art results on challenging datasets like Holidays, which shows the soundness of the approach.
In future work, ISTA can easily be adapted into a fully differentiable architecture like , which would allow the fine-tuning of the full model.
-  Zhili Zhou, Yunlong Wang, QM Jonathan Wu, Ching-Nung Yang, and Xingming Sun. Effective and efficient global context verification for image copy detection. IEEE Transactions on Information Forensics and Security, 12(1):48–63, 2017.
-  Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Adriana Kovashka, Devi Parikh, and Kristen Grauman. Whittlesearch: Interactive image search with relative attribute feedback. International Journal of Computer Vision (IJCV), abs/1505.04141, 2015.
-  Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In European Conference On Computer Vision (ECCV), volume 5302 of Lecture Notes in Computer Science, pages 304–317. Springer, October 2008.
-  Xinchao Li, Martha Larson, and Alan Hanjalic. Pairwise geometric matching for large-scale object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5153–5161, 2015.
-  Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
-  Ke Jiang, Qichao Que, and Brian Kulis. Revisiting kernelized locality-sensitive hashing for improved large-scale image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010.
-  David Picard and Philippe-Henri Gosselin. Improving Image Similarity With Vectors of Locally Aggregated Tensors. In IEEE International Conference On Image Processing (ICIP), pages 669 – 672, Brussels, Belgium, September 2011.
-  Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision (IJCV), 105(3):222–245, December 2013.
-  Relja Arandjelovic, Petr Gronát, Akihiko Torii, Tomás Pajdla, and Josef Sivic. Netvlad: CNN architecture for weakly supervised place recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  Thanh-Toan Do, Q. D. Tran, and Ngai-Man Cheung. Faemb: A function approximation-based embedding method for image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3556–3564, June 2015.
-  David Picard. Preserving local spatial information in image similarity using tensor aggregation of local features. In IEEE International Conference On Image Processing (ICIP), pages 201–205, Phoenix, AZ, United States, September 2016.
-  Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed fisher vectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3384–3391. IEEE, 2010.
-  Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Perez, and Cordelia Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, September 2012.
-  R. Arandjelović and A. Zisserman. All about VLAD. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  Jonathan Delhumeau, Philippe-Henri Gosselin, Hervé Jégou, and Patrick Perez. Revisiting the VLAD image representation. In ACM Multimedia, Barcelona, Spain, October 2013.
-  Hervé Jégou and Ondrej Chum. Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. In European Conference On Computer Vision (ECCV), Firenze, Italy, October 2012.
-  Yue Cao, Mingsheng Long, Jianmin Wang, and Shichen Liu. Deep visual-semantic quantization for efficient image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Joaquin Zepeda and Patrick Pérez. Exemplar svms as visual feature encoders. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3052–3060, 2015.
-  Rafael S Rezende, Joaquin Zepeda, Jean S Ponce, Francis S Bach, and Patrick Perez. Kernel Square-Loss Exemplar Machines for Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, United States, July 2017.
-  Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshops, pages 806–813, 2014.
-  Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference On Computer Vision (ECCV), pages 392–407. Springer, 2014.
-  Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178, Washington, DC, USA, 2006. IEEE Computer Society.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
-  Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
-  Yannis Kalantidis, Clayton Mellina, and Simon Osindero. Cross-dimensional weighting for aggregated deep convolutional features. In European Conference On Computer Vision (ECCV), pages 685–701. Springer, 2016.