Image retrieval has long been an attractive research topic in computer vision. By allowing users to search for similar images in a large database of digital images, it provides a natural and flexible interface for image archiving and browsing. Convolutional Neural Networks (CNNs) have shown remarkable accuracy in tasks such as image classification and object detection. Recent research has also shown positive results when using CNNs for image retrieval [1, 2, 3, 4, 5]. However, unlike image classification approaches, which often use global feature vectors produced by fully connected layers, these methods extract local features depicting image patches from the outputs of convolutional layers and aggregate these features into compact (a few hundred dimensions) image-level descriptors. Once meaningful and representative image-level descriptors are defined, visually similar images are retrieved by computing similarities between pre-computed database feature representations and query representations.
Thus, a key step in these image retrieval methods is to compute global representations. In order to generate distinguishable image-level descriptors, one has to avoid over-representing bursty (or repetitive) features during the aggregation process. Inspired by an observation of similar phenomena in textual data, Jegou et al. identified intra-image burstiness as the phenomenon that numerous feature descriptors are almost identical within the same image. (Footnote 1: In the original publication, the authors define two kinds of burstiness: intra- and inter-image burstiness. Here, we only study intra-image burstiness, and in what follows we refer to it simply as burstiness.) This phenomenon, which appears due to repetitive patterns (e.g., window patches in an image of a building’s facade), exists widely in images containing man-made objects. Since bursty descriptors often occur in large numbers, they may contribute heavily to the final image representations under inappropriate aggregation strategies such as sum pooling. However, this is undesirable as such bursty descriptors may correspond to cluttered regions and consequently result in less distinguishable representations. To address visual burstiness, Jegou et al. proposed several re-weighting strategies that penalised descriptors assigned to the same visual word within an image and penalised descriptors matched to many images. Despite their effectiveness, these strategies are customized for the Bag-of-Words (BoW) representation model, which considers descriptors individually, and cannot be combined with compact representation models that aggregate local features into a global vector (i.e., do not consider descriptors individually).
In this manuscript, we develop a method to reduce the contribution of bursty features to final image representations in the aggregation stage. We do so by revealing the relationship among features in an image. Based on the property of burstiness, the sum of the similarity scores between a bursty feature and the whole feature set would be large, as there exist many other features identical (or nearly identical) to the bursty feature. In contrast, the sum of the similarity scores between a distinctive feature and the whole feature set tends to be small, as it should be dissimilar to other features. To formulate this idea, we model the deep features of an image as a heat system, where the sum of the similarity scores is measured as the system temperature. Specifically, we consider a given feature as the unique heat source, and compute the temperature of every other feature with the partial differential equation induced by the heat equation. We then define the system temperature obtained with this feature as the sum of the temperatures of all features. Features leading to high system temperatures tend to be bursty, while features resulting in low system temperatures tend to be distinctive. Thus, in order to balance the contributions of bursty and distinctive features to the final image-level descriptor, we compel the system temperatures derived from all features (heat sources) in one image to be a constant by introducing a set of weighting coefficients.
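The intuition that bursty features accumulate a large total similarity can be illustrated with a minimal NumPy sketch. It only uses plain cosine similarities, not the heat-equation temperatures developed later, and the function name `similarity_sums` and the toy feature values are assumptions made for illustration.

```python
import numpy as np

def similarity_sums(features):
    """Sum of cosine similarities between each feature and the whole set."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)   # guard against zero vectors
    return (unit @ unit.T).sum(axis=1)

# Toy data (hypothetical): three near-identical (bursty) features
# followed by one distinctive feature.
toy_features = np.array([
    [1.0, 0.1],
    [1.0, 0.0],
    [0.9, 0.1],
    [0.1, 1.0],
])
sums = similarity_sums(toy_features)
```

In this toy set the first three rows obtain clearly larger sums than the last row, which is the behaviour the weighting scheme is meant to counteract.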
Heat diffusion, and more specifically anisotropic diffusion, has been used successfully in various image processing and computer vision tasks, ranging from the classical work of Perona and Malik to further applications in image co-segmentation, image denoising, and keypoint detection [8, 9, 10, 11]. Here, we employ the system temperature induced by this well-known theory to measure the similarity between a given deep convolutional feature and the others, for two reasons. First, as is well known, diffusing the similarity information in a weighted graph can measure similarities between different deep features more accurately than pairwise cosine distances. Second, and more importantly, it inspires our second contribution and allows us to obtain considerable performance gains in the image re-ranking stage. Specifically, by considering global similarities that also take into account the relations among the database representations, we propose a method to re-rank a number of top-ranked images for a given query image using the query as the heat source.
Our contributions can be summarized as follows:
Feature weighting: By greedily considering each deep feature as a heat source and enforcing the system temperature to be a constant across heat sources, we propose a novel and efficient feature weighting approach to reduce the undesirable influence of bursty features (Section III).
Image re-ranking: By considering each query image as a heat source, we also produce an image re-ranking technique that leads to significant gains in performance (Section IV).
We conduct extensive quantitative evaluations on commonly used image retrieval benchmarks, and demonstrate substantial performance improvement over existing unsupervised methods for feature aggregation (Section V).
The remainder of this manuscript is organized as follows: We briefly review representative works that are close to ours in Section II. The details of the proposed feature weighting strategy are described in Section III, while the proposed image re-ranking method is presented in Section IV. Experimental results are described and discussed in Section V, and conclusions are drawn in Section VI.
II Related Work
Since both our deep feature aggregation and image re-ranking methods are built on heat diffusion, we first review classical diffusion methods in the computer vision field. We then review representative deep learning based image retrieval methods, as this paper also aims to address the image retrieval problem with convolutional neural networks.
II-A Diffusion for Computer Vision
Anisotropic diffusion has been applied to many computer vision problems, such as image segmentation [8, 9], saliency detection [12, 13], and clustering [14, 15]. In these applications, diffusion is used for finding central points by capturing the intrinsic manifold structure of the data. Our deep feature aggregation method approaches the problem from the opposite direction: we weaken deep features that are densely connected to other features with high similarities.
Diffusion is also popular in the context of retrieval [16, 17, 18, 19, 20]. Among them, the approaches [16, 17, 19] addressed the shape retrieval problem and performed diffusion at the image level, whereas we focus on the instance-level retrieval problem and carry out heat diffusion on deep convolutional features. Donoser and Bischof reviewed a number of diffusion mechanisms for retrieval. They focused on iterative solutions, arguing that closed-form solutions, when they exist, are impractical due to the inversion of large matrices. In contrast, we focus on a closed-form solution without iteration, as in our case the number of features and the number of re-ranked images are small. Recently, Iscen et al. introduced a regional diffusion mechanism on image regions for better measuring similarities between images. Compared with this method, our re-ranking method is much more efficient, as we re-rank images based on global image vectors.
Democratic Diffusion Aggregation (DDA) is probably the closest to our feature aggregation method, as it also handles the burstiness problem by diffusion. However, there is at least one clear difference between our method and DDA. Specifically, we start from the heat equation, and balance the influence of rare and frequent features by enforcing the system temperatures obtained with different heat sources to be a constant, whereas DDA is derived from Generalized Max-Pooling (GMP), which equalizes the contribution of each single descriptor to the aggregated vector. Furthermore, we provide a unified solution to feature aggregation and image re-ranking, which is not possible with DDA.
II-B Deep Learning for Image Retrieval
Early attempts to use deep learning for image retrieval considered the use of the activations of fully connected layers as image-level descriptors [23, 24, 25]. In , a global representation was derived from the output of the penultimate layer. This work was among the first to show better performance than traditional methods based on SIFT-like features at the same dimensionality. Concurrently, Gong et al.  extracted multiple fully-connected activations by partitioning images into fragments, and then used VLAD-embeddings  to aggregate the activations into a single image vector. The work  reported promising results using sets of a few dozen features from the fully-connected layers of a CNN, without aggregating them into a global descriptor. However, observing that neural activations of the lower layers of a CNN capture more spatial details, later works advocated using the outputs of convolutional layers as features [1, 2, 3, 27, 28, 29, 30, 31, 32]. These convolutional features were subsequently used for similarity computation either with individual feature matching  or with further aggregation steps [1, 2, 3, 27]. In this work, we consider convolutional features as local features, and aggregate them into a global image descriptor.
Considerable effort has been dedicated to aggregating the activations of convolutional layers into a distinctive global image vector. For instance, [28, 29] evaluated image-level descriptors obtained using max-pooling over the last convolutional layer, while Babenko and Lempitsky showed that sum-pooling leads to better performance. Kalantidis et al. further proposed a non-parametric method to learn weights for both spatial locations and feature channels. Related to that, Hoang et al. proposed several masking schemes to select a representative subset of local features before aggregation, and achieved satisfactory results by taking advantage of the triangulation embedding. Similarly, we proposed a deep feature selection and weighting method using the replicator equation in our very recent work. In another work, Tolias et al. computed a collection of region vectors with max-pooling on the final convolutional layer, and then combined all region vectors into a final global representation. More recently, Xu et al. independently employed selected part detectors to generate regional representations with weighted sum-pooling, and then concatenated regional vectors as the global descriptor. In this paper, we instead propose heat diffusion to weight and then aggregate deep feature descriptors.
Fine-tuning an off-the-shelf network is also popular for improving retrieval quality. For instance, there are a number of approaches that learn features for the specific task of landmark retrieval [34, 35, 36, 37]. While fine-tuning a pre-trained model is usually preceded by extensive manual annotation, Radenovic et al.  introduced an unsupervised fine-tuning of CNN for image retrieval from a large collection of unordered images in a fully automated manner. Similar to this work, the methods presented in [36, 37] overcame laborious annotation, and collected training data in a weakly supervised manner. More specifically, Arandjelovic et al.  proposed a new network architecture, NetVLAD, that was trained for place recognition in an end-to-end manner from weakly supervised Google Street View Time Machine images. Cao et al.  trained a special architecture Quartet-net by harvesting data automatically from GeoPair . We show that our feature weighting, and image re-ranking approach, while not requiring extra supervision, performs favorably compared to these previous methods.
In a couple of very recent works [20, 39], images were represented by multiple high-dimensional regional vectors. These two approaches achieve strong performance on common benchmarks, but they are demanding in terms of both memory and computation. In contrast, our work uses a single vector representation while achieving similar performance.
III Feature Weighting with the Heat Equation
Given an input image that is fed through a pre-trained or a fine-tuned CNN, the activations (responses) of a convolutional layer form a 3D tensor, where is the spatial resolution of the feature maps, and is the number of feature maps (channels). We denote as a set of local features, where is a -dimensional vector at spatial location in . That is to say, , where and are non-negative.
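As a concrete illustration of this notation, the following minimal NumPy sketch flattens a convolutional activation tensor into a set of local features, one per spatial location. The array layout (height, width, channels), the function name `tensor_to_local_features`, and the random stand-in tensor are assumptions of the sketch; a real pipeline would take the tensor from the chosen network.

```python
import numpy as np

def tensor_to_local_features(activations):
    """Flatten an (H, W, K) activation tensor into an (H*W, K) array.

    Row h * W + w holds the K-dimensional local feature at location (h, w).
    """
    h, w, k = activations.shape
    return activations.reshape(h * w, k)

# Hypothetical stand-in for a convolutional output: non-negative, as after a ReLU.
rng = np.random.default_rng(0)
activations = np.maximum(rng.standard_normal((4, 6, 8)), 0.0)
features = tensor_to_local_features(activations)
```

Each row of `features` can then be treated as one node of the feature graph used in the rest of this section.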
III-A Problem Formulation
We utilize the theory of anisotropic diffusion to compute a weight for each feature in based on its distinctiveness, thus avoiding the burstiness issue. Let us assume that the deep feature point set constitutes an undirected graph. By assigning as the unique heat source, we assume that the graph constitutes a heat transfer system, and the linear heat equation for this system is defined as follows:
where , and its -th entry represents the temperature at node at time . is a positive definite symmetric matrix called the diffusion tensor, and denotes the -th diffusion coefficient reflecting the interactions between the feature pair and .
Our problem is to compute the temperature at each node . That is,
where we use the Dirichlet boundary conditions , and assume that the temperature of the environment node (outside of the system) and the source node is always zero (i.e., ) and one (i.e., ), respectively.
In practice, we compute the temperature at each node with the following simplified assumptions. Specifically, we let and consequently drop in our method as we are interested in the steady state, and define the diffusion tensor by the cosine similarity between deep feature vectors:
Furthermore, we assume that the dissipation heat loss at a node is , which is constant in time. In other words, each node is connected to an environment node with diffusivity of . With these assumptions, the heat diffusion Eq.(2) reduces to the simplified version [40, 41]:
where . Without loss of generality, we assume , where stores the similarities between points in and , and stores the similarities between any two pair points in . Then, Eq.(4) can be rewritten as
is the identity matrix of size, and is the diagonal matrix. Thus, the temperature of the system induced by the heat source is defined as
Two example visualization results of the proposed method with deep features extracted using the SiaMAC CNN model. Left: the original image; Right: relative weights (warmer colors indicate larger weights) for deep convolutional features. For instance, in the top image, our feature weighting method assigns larger weights to features corresponding to the distinctive area of tower, and smaller weights to ones corresponding to the repetitive areas of grass and sky.
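The steady-state computation described in this subsection can be sketched as follows. The sketch assumes one particular discrete form, which is not taken from the equations of this paper: at every non-source node j the balance (d_j + z) u_j = sum_k p_jk u_k holds, where p_jk is the cosine similarity, d_j is the sum of similarities of node j to all other nodes, z is the diffusivity to the environment node (held at temperature 0), and the source is clamped to temperature 1. The names `cosine_similarity` and `steady_state_temperatures`, the value `z=0.1`, the removal of self-similarities, and the choice to include the source in the system temperature are all assumptions of this sketch.

```python
import numpy as np

def cosine_similarity(features):
    """Pairwise cosine similarities between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def steady_state_temperatures(P, source, z=0.1):
    """Temperatures of all nodes when `source` is held at 1 (assumed model)."""
    n = P.shape[0]
    W = P.copy()
    np.fill_diagonal(W, 0.0)             # assumption: no self-loops
    degree = W.sum(axis=1) + z           # z: loss to the environment node
    rest = np.delete(np.arange(n), source)
    A = np.diag(degree[rest]) - W[np.ix_(rest, rest)]
    u = np.ones(n)                       # the source keeps temperature 1
    u[rest] = np.linalg.solve(A, W[rest, source])
    return u

# Toy features (hypothetical): rows 0-2 are near-identical, row 3 is distinctive.
toy = np.array([[1.0, 0.0], [1.0, 0.05], [0.95, 0.0], [0.1, 1.0]])
P = cosine_similarity(toy)
T_bursty = steady_state_temperatures(P, source=0).sum()
T_distinct = steady_state_temperatures(P, source=3).sum()
```

Under these assumptions the bursty source heats the system more than the distinctive one, which is the signal used to down-weight bursty features.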
We greedily consider each point as a heat source, and compute the corresponding system temperature under the linear anisotropic diffusion equation. The value of can indicate whether is a bursty feature or a distinctive one. As described earlier, a bursty feature tends to be identical (or nearly identical) to many other features, whereas a distinctive feature tends to be dissimilar to other features. This means a bursty feature is densely connected to other features with high similarities, which consequently raises the temperature of the system. In contrast, a distinctive feature is only sparsely connected to other features and therefore keeps the system temperature low. Thus, in order to balance the influence of bursty and distinctive features, we compel the system temperatures derived from all features (heat sources) in one image to be a constant by introducing a set of weighting coefficients . That is,
As a result, is used to reduce the burstiness effects, and we accordingly compute the final image representation of each image by
where is a constant, and we typically set . plays the same role as the exponent parameter in the Power-law Normalization (PN) formula. However, it is worth noting that we apply -normalisation on the image vector before PCA whitening, while PN is integrated in the retrieval frameworks [22, 43] after rotating the image representation with a PCA rotation matrix. Fig. 1 visualizes the weights computed for two sample input images, with larger weights shown in warmer colors. As shown, our feature weighting method assigns larger weights to distinctive areas, and smaller weights to repetitive ones.
For convenience, in the following we denote our heat equation based feature weighting method presented in this section as HeW.
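The weighting and aggregation step can be sketched in a few lines, given system temperatures that have already been computed. The sketch assumes that the weight of a feature is a constant divided by its system temperature (so that weight times temperature is the same for every heat source), that the descriptor is the sum of features scaled by the weights raised to an exponent, and that the result is normalised to unit Euclidean norm. The function name `aggregate`, the constant `const=1.0`, and the exponent value `alpha=0.5` are hypothetical choices for illustration only.

```python
import numpy as np

def aggregate(features, temperatures, alpha=0.5, const=1.0):
    """Weighted sum-pooling with weights w_i = const / T_i (assumed form)."""
    weights = const / temperatures
    pooled = (weights[:, None] ** alpha * features).sum(axis=0)
    return weights, pooled / np.linalg.norm(pooled)

# Toy example: two identical bursty features and one distinctive feature,
# with hypothetical system temperatures 4, 4 and 1.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
temperatures = np.array([4.0, 4.0, 1.0])
weights, descriptor = aggregate(features, temperatures)
_, sum_pooled = aggregate(features, temperatures, alpha=0.0)
```

With `alpha=0.0` every feature gets unit weight in this sketch, so the result is plain sum pooling, in which the repeated feature dominates; with the weights applied, both directions contribute equally in this toy case.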
III-B Computing Weights in Practice
It seems that we have to solve Eq.(5) times to get the image representation of , and each time we need to solve a linear equation of size . Thus, the total time cost of HeW is about in . This might be computationally intensive if the selected feature set cardinality is large.
However, the actual time complexity can be reduced to , and we can compute all by inverting the matrix only once, where . Specifically, we take computing as an example to illustrate the practical computational process. We leverage the block structure of , i.e.,
By leveraging the property of the block matrix, we can derive that
Furthermore, one can prove that
The above three equations demonstrate that we can derive by using the first column of . Similarly, we can get using the -th column of :
Thus, the conclusion that the computational cost is in holds.
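The saving from a single matrix inversion can be checked numerically. The sketch below assumes the system matrix L = D - W, where W holds the cosine similarities with a zero diagonal and D is diagonal with entries equal to the row sums of W plus a dissipation constant z; this form and the names `build_system`, `temperatures_naive`, and `temperatures_single_inverse` are assumptions of the sketch rather than a restatement of the equations above. Under this assumption the block-inverse identity gives the temperature vector for source i as column i of the inverse of L divided by its i-th diagonal entry, so all system temperatures follow from one inversion instead of one linear solve per source.

```python
import numpy as np

def build_system(features, z=0.1):
    """Assumed system matrix L = D - W built from cosine similarities."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = unit @ unit.T
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1) + z) - W

def temperatures_naive(L):
    """One linear solve per heat source."""
    n = L.shape[0]
    T = np.empty(n)
    for i in range(n):
        rest = np.delete(np.arange(n), i)
        u_rest = np.linalg.solve(L[np.ix_(rest, rest)], -L[rest, i])
        T[i] = 1.0 + u_rest.sum()        # the source itself has temperature 1
    return T

def temperatures_single_inverse(L):
    """All system temperatures from a single inverse of L."""
    L_inv = np.linalg.inv(L)
    return L_inv.sum(axis=0) / np.diag(L_inv)

rng = np.random.default_rng(0)
feats = np.abs(rng.standard_normal((12, 5))) + 0.1   # non-negative toy features
L = build_system(feats)
T_naive = temperatures_naive(L)
T_fast = temperatures_single_inverse(L)
```

The two routines agree up to floating-point error on this toy input, while the second needs one dense inversion instead of a separate solve for every source.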
For very large-scale image retrieval, the time complexity in might still be demanding, as many images need to be processed. However, it is worth noting that database representations are pre-computed in a retrieval system, and therefore the efficiency of computing image representations is less critical. Furthermore, as each database image can be processed independently, one can take advantage of parallel processing (with a multi-core computer, with multiple computers, or with both) to quickly compute image-level descriptors of database images.
IV Image Re-ranking with the Heat Equation
Inspired by our deep feature aggregation method HeW, we propose a heat equation based re-ranking approach HeR in this section. Given a query image , we denote its image vector produced by HeW or other potential image representation methods as . After querying the database, we get a ranked list of images for the query , where is the -th top ranked image. Similarly, we denote the global image vector of as .
We consider the query as the heat source, and re-rank top-ranked images by computing their temperatures with the linear anisotropic diffusion equation Eq.(2). For simplicity, as performed in the previous section, we also use the following more specific equation:
to compute temperature gains of each image, where
contains the cosine similarity scores among image vectors, and represents the similarity vector with the -th entry storing the similarity between and (). (Footnote 2: In practice, before computing the similarity between any two different vectors, we first center the set of image vectors. That is, , .) Additionally, is the diagonal matrix that is similar to the one appearing in Eq.(5), and with the -th entry denoting the temperature gain of . We then re-rank based on , and images with larger temperature gains are ranked higher.
The additional burden in both memory usage and running time introduced by HeR is negligible. HeR is computed on image-level descriptors, and it only refines a shortlist of the top results. This means we only need to store image vectors (query as well as results), and matrices of size contained in Eq.(14). In practice, both image vector dimensions and are relatively small (a few hundred), which keeps the memory usage of HeR low. Furthermore, computing is very fast with (adopted in our experiments), and we observe that the actual computing time on our platform is less than 30ms.
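The re-ranking procedure of this section can be sketched as follows. The sketch treats the query as the heat source and solves for the steady-state temperatures of the shortlisted images under the same assumed balance used for feature weighting: each image loses heat to an environment node with diffusivity z and exchanges heat with the query and the other shortlisted images in proportion to their cosine similarity. The centring step follows the footnote above; clipping negative similarities to zero, the value `z=0.1`, and the name `heat_rerank` are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def heat_rerank(query_vec, shortlist_vecs, z=0.1):
    """Re-rank a shortlist by temperature gain, with the query as heat source."""
    X = np.vstack([query_vec, shortlist_vecs])
    X = X - X.mean(axis=0)                        # centre the set of image vectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    S = np.clip(X @ X.T, 0.0, None)               # assumption: drop negative similarities
    np.fill_diagonal(S, 0.0)
    W = S[1:, 1:]                                 # similarities among shortlisted images
    s_q = S[1:, 0]                                # similarities to the query
    A = np.diag(W.sum(axis=1) + s_q + z) - W
    gains = np.linalg.solve(A, s_q)               # temperature gain of each image
    order = np.argsort(-gains, kind="stable")     # larger gain ranks higher
    return order, gains

# Toy shortlist (hypothetical 2-D image vectors).
query = np.array([1.0, 0.0])
shortlist = np.array([[1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
order, gains = heat_rerank(query, shortlist)
```

Because only the shortlist and the query enter the linear system, its size equals the number of re-ranked images, which is consistent with the small cost reported above.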
V Experiments and Results
This section describes the implementation of our method, and reports results on public image retrieval benchmarks (our code is available at https://github.com/MaJinWakeUp/HeWR). Throughout the section, we normalize the final image vector to have unit Euclidean norm.
V-A Datasets and Evaluation Protocol
Oxford5k contains a set of 5,062 photographs covering 11 different Oxford landmarks. There are 55 query images, with 5 queries corresponding to each landmark. The ground-truth similar images with respect to each query are provided by the dataset. Following the standard protocol, we crop the query images based on the provided bounding boxes before retrieval. The performance is measured using mean average precision (mAP) over the 55 queries, where junk images are removed from the ranking.
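The evaluation measure can be illustrated with a short pure-Python sketch that removes junk images from the ranking before computing average precision. It uses the simple rank-based definition of average precision; the benchmark's official evaluation script may interpolate precision differently, so the numbers produced here are not guaranteed to match it exactly. The function names and the toy image identifiers are hypothetical.

```python
def average_precision(ranked_ids, positives, junk=frozenset()):
    """Rank-based average precision with junk images removed from the ranking."""
    positives = set(positives)
    hits = 0
    precision_sum = 0.0
    rank = 0
    for image_id in ranked_ids:
        if image_id in junk:
            continue                      # junk images do not occupy a rank
        rank += 1
        if image_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives) if positives else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(*q) for q in queries) / len(queries)

# Toy ranking: "j" is junk, "a" and "c" are the ground-truth positives.
ap_with_junk_removed = average_precision(["a", "j", "b", "c", "d"], {"a", "c"}, {"j"})
ap_without_removal = average_precision(["a", "j", "b", "c", "d"], {"a", "c"})
```

In the toy ranking, dropping the junk image moves the second positive from rank 4 to rank 3, which raises the average precision.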
Paris6k consists of 6,392 high-resolution images of the city of Paris. Similar to Oxford5k, its images were collected from Flickr by querying the associated text tags for famous Paris landmarks. This dataset also provides 55 query images and their corresponding ground-truth relevant images. We also use cropped query images to perform retrieval, and measure the overall retrieval performance using mAP.
Holidays includes 1,491 images in total, and selects 500 images as queries associated with the 500 partitioning groups of the image set. To be directly comparable with recent works [1, 2, 3], we manually fix images in the wrong orientation by rotating them by degrees. The retrieval quality is also measured using mAP over 500 queries, with the query removed from the ranked list.
Flickr100k was crawled from Flickr’s 145 most popular tags and consists of 100,071 images. We combine these 100k distractor images with Oxford5k and Paris6k, and produce the Oxford105k and Paris106k datasets respectively. In this way, we evaluate the behavior of our method at a larger scale.
|Method||Representation time||Query time with varied image vector dimensions|
V-B Implementation Notes
Deep convolutional features. In order to extensively evaluate our method, we use two pre-trained networks and one fine-tuned network to extract deep convolutional features for each image. The adopted pre-trained networks are VGG16 and ResNet50, which are widely used in the literature. The fine-tuned network is siaMAC, a popular fine-tuned model of VGG16.
Following the practice of previous works [2, 34], we choose the last convolutional layer of each network to extract patch-level image features. We use publicly available trained models. Specifically, we use the MatConvNet toolbox for VGG16 and ResNet50, and use the model provided in  for siaMAC. In addition, in order to accelerate feature extraction, we resize the longest side of all images to 1,024 pixels while preserving aspect ratios before feeding them into each deep network.
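The resizing rule amounts to a single scale factor applied to both sides. The helper below is a hypothetical illustration in pure Python; it always scales the longest side to the target, whether that means shrinking or enlarging, which is an assumption of the sketch since the text does not say how smaller images are treated.

```python
def resize_longest_side(width, height, target=1024):
    """Return the (width, height) obtained by scaling the longest side to `target`."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

landscape = resize_longest_side(2048, 1536)
portrait = resize_longest_side(600, 800)
```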
PCA whitening is widely used in many image retrieval systems [50, 2, 3, 1] as it can effectively improve the discriminative ability of image vectors. In order to avoid over-fitting, the PCA matrix is usually learned with the set of aggregated image vectors produced from a held-out dataset. To be directly comparable with related works, we learn PCA parameters on Paris6k for Oxford5k and Oxford105k, and on Oxford5k for Paris6k and Paris106k. As for Holidays, we randomly select a subset of 5,000 images from Flickr100k to learn parameters.
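A minimal NumPy sketch of this step is given below: the PCA whitening parameters are fitted on held-out vectors and then applied to other vectors, optionally keeping only the leading components before re-normalising. The function names, the small regulariser `eps`, and the random toy data are assumptions of the sketch; they stand in for the aggregated image vectors of the held-out dataset.

```python
import numpy as np

def fit_pca_whitening(held_out, eps=1e-9):
    """Learn the mean and whitening projection from held-out vectors."""
    mean = held_out.mean(axis=0)
    centred = held_out - mean
    cov = centred.T @ centred / (len(held_out) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]             # largest variance first
    eigvals = np.maximum(eigvals[order], 0.0)
    eigvecs = eigvecs[:, order]
    projection = eigvecs / np.sqrt(eigvals + eps) # scale each component to unit variance
    return mean, projection

def apply_pca_whitening(vectors, mean, projection, dim=None):
    """Project, keep the first `dim` components, and L2-normalise each row."""
    out = (vectors - mean) @ projection[:, :dim]
    return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-12)

# Toy data standing in for held-out and test image vectors.
rng = np.random.default_rng(0)
held_out = rng.standard_normal((500, 6)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.25])
test_vectors = rng.standard_normal((10, 6))
mean, projection = fit_pca_whitening(held_out)
reduced = apply_pca_whitening(test_vectors, mean, projection, dim=3)
```

Fitting on one dataset and applying to another mirrors the cross-dataset protocol described above, in which the parameters are never learned on the evaluation set.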
Query Expansion (QE) is an effective post-processing technique to increase retrieval performance. Given the ranked list of database images for a query image, we simply calculate the average of the 10 top-ranked image vectors and the query vector, and then use the normalized average vector to re-query the database. After QE, we apply our re-ranking algorithm HeR to further improve retrieval performance. We will show that the combination of QE and HeR is beneficial in practice.
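The query expansion step described here can be sketched as follows, assuming unit-normalised image vectors so that dot products act as similarity scores. The function name `query_expansion` and the three-image toy database are hypothetical; `top_k` would be 10 in the setting described above.

```python
import numpy as np

def query_expansion(query_vec, database, top_k=10):
    """Average the query with its top-ranked results and query again."""
    scores = database @ query_vec
    top = np.argsort(-scores)[:top_k]
    expanded = (query_vec + database[top].sum(axis=0)) / (len(top) + 1)
    expanded = expanded / np.linalg.norm(expanded)   # normalised average vector
    ranking = np.argsort(-(database @ expanded))
    return expanded, ranking

# Toy database of unit-norm vectors (hypothetical).
database = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = np.array([1.0, 0.0])
expanded, ranking = query_expansion(query, database, top_k=2)
```

The returned ranking is the one that a subsequent re-ranking stage would refine.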
V-C Impact of the Parameters
We investigate the impact of the parameter as well as the impact of the final image vector dimensionality on different retrieval frameworks, with deep convolutional features extracted by the network siaMAC on the datasets of Oxford5k and Oxford105k. Specifically, we evaluate their impact on the following four combinations: SumA, HeW, SumA+QE+HeR and HeW+QE+HeR. SumA means we obtain image representations by simply setting in Eq.(8), and perform image search by linearly scanning database vectors. It serves to isolate the contribution of our deep feature aggregation method HeW. SumA+QE+HeR and HeW+QE+HeR indicate that we perform re-ranking QE+HeR with image vectors produced by SumA and HeW respectively. We use them to determine the parameter that is introduced in HeR. Meanwhile, by setting , they can also be used for evaluating the effect of our image re-ranking strategy HeR.
Impact of the number of re-ranking images. We first use the retrieval frameworks SumA+QE+HeR and HeW+QE+HeR to evaluate the impact of on retrieval quality with full image vector dimensions . The mAP performance of the two frameworks on Oxford5k and Oxford105k under different values is shown in Fig. 2, where means HeR is not applied. As we see, HeR consistently improves the retrieval quality, and the margin is around 3% for the two largest values. The best results for HeW+QE+HeR and SumA+QE+HeR on Oxford5k and Oxford105k are achieved at and , respectively. Since gives overall better results than , it is used in the following experiments.
Impact of the final image vector dimensionality. With , we illustrate mAP curves when varying the dimensionality of the final image vectors from 32 to 512 in Fig. 3. Dimensionality reduction is achieved by keeping only the first components of the 512-dimensional image vectors after PCA whitening. As illustrated, the gain of HeW over SumA is nearly 2% in mAP on both Oxford5k and Oxford105k at . Furthermore, the performance gain increases as the number of dimensions is reduced, and the gain is around 4% at . This means our image representation method HeW is less affected by dimensionality reduction than the baseline SumA.
Both HeW and SumA benefit significantly from image re-ranking, especially with image vectors of higher dimensions. As shown, when , the mAP gains of HeW+QE+HeR (SumA+QE+HeR) over HeW (SumA) on Oxford5k and Oxford105k are 9.4% (9.7%) and 11.5% (12.1%), respectively. After incorporating re-ranking, the performance advantage of HeW over SumA is enlarged at low dimensions. For instance, the mAP gains of HeW+QE+HeR over SumA+QE+HeR on Oxford5k and Oxford105k at are 4.7% and 4.6%, respectively.
Computational cost. We now present running times for the considered combinations on Oxford105k. Table I reports timings (excluding time for feature extraction) measured to compute image representations and to perform image querying. We implement both SumA and HeW in Matlab, and benchmarks are obtained on an Intel Xeon E5-2630/2.20GHz with 20 cores. As the table shows, although the baseline method SumA is faster than HeW by about an order of magnitude, HeW is still fast in practice. For an image of high resolution , the number of deep features is , and the time to derive the image vector with HeW is typically 173ms. It is worth noting that the number of features extracted by siaMAC is four times that extracted by VGG16 and ResNet50. This means aggregating features produced by VGG16 and ResNet50 is much faster. In practice, we observe that feature aggregation takes less than 10ms with features extracted by either VGG16 or ResNet50.
HeW does not affect search efficiency, and it has the same search time as SumA. Thus, the online query time for HeW is about 188ms at . The additional time caused by incorporating image re-ranking into the retrieval system is very limited. As shown, the total online processing time for HeW+QE+HeR with is only about 225ms. This means searching with our method is efficient in practice.
|connected||Neural codes ||4,096||54.5||51.2||–||–||79.3|
Deep Conv. layer of VGG16
V-D Impact of the Networks
Table II illustrates the impact of the evaluated networks on the baseline and our methods. Although ResNet50 has demonstrated much better performance than VGG16 on the ILSVRC classification task [48, 49], it does not actually produce better performance than the latter on the image retrieval benchmarks. As shown, while relying on much higher image representation dimensions , ResNet50-based results are still inferior to VGG16-based results in many cases. Even worse, when reducing the dimensionality to 512 components using PCA, ResNet50-based results fall behind the corresponding VGG16-based results by large margins except on Holidays. As expected, SiaMAC-based results outperform VGG16-based results on the Oxford and Paris datasets, as the network SiaMAC is fine-tuned with a large number of landmark building photos. However, it is worth noting that fine-tuning may result in over-fitting. As shown, after fine-tuning, the performance of both SumA and HeW decreases slightly on Holidays.
Both the proposed feature aggregation method HeW and the image re-ranking method HeR boost performance. As shown in Table II, although HeW outperforms SumA by only a small margin in some cases, it outperforms the latter by over 1% in mAP in most cases after incorporating QE. Furthermore, we obtain an additional 3% mAP gain with our re-ranking method HeR. Accordingly, the proposed complete method HeW+QE+HeR typically outperforms the baseline method SumA+QE by 4% with the adopted networks.
|Fisher Vector ||512||81.5||76.6||82.4||–||–|
V-E Comparison with the State-of-the-art
Below, we compare the proposed approach with related unsupervised methods that use off-the-shelf and fine-tuned networks, respectively.
Comparison with methods using SIFT and pre-trained networks. In Table III, we present comparisons of our approach using VGG16 with methods using SIFT and off-the-shelf available networks, which utilize global representations of images. The comparison results are summarized as follows:
Our approach HeW significantly outperforms two state-of-the-art methods [43, 22] using weaker SIFT features, although their dimensions are more than 10 times higher than ours. Furthermore, it also shows clear advantages over [25, 23, 24] that utilize fully connected layers to derive image representations.
Applying HeR after QE results in significant performance gains. As shown, our re-ranking strategy HeR gives a boost of around 3.5% in mAP on all evaluated datasets. Consequently, we outperform the compared methods on two large-scale datasets Oxford105k and Paris106k by more than 5% in mAP.
|Method||Dim.||Oxford5k||Oxford105k||Paris6k||Paris106k||Holidays|
|Mikulik et al. ||16M||84.9||82.4||79.5||77.3||75.8|
|Tolias et al. ||8M||87.9||–||85.4||–||85.0|
|Tolias and Jégou ||8M||89.4||84.0||82.8||–||–|
|Arandjelovic et al. ||4,096||71.6||–||79.7||–||87.5|
|Hoang et al. ||4,096||83.8||80.6||88.3||83.1||92.2|
|Iscen et al. ||5×512||91.5||84.7||95.6||93.0||–|
|Iscen et al. ||5×512||91.6||86.5||95.6||92.4||–|
|Noh et al. ||–||90.0||88.5||95.7||92.8||–|
|Gordo et al. ||512||89.1||87.3||91.2||86.8||89.1|
Comparison with methods using fine-tuned networks. We perform comparisons with recent unsupervised fine-tuned methods [36, 34, 3, 52, 32] in Table IV. The table again demonstrates the superior performance of our approach over related baselines at the same dimensionality:
Although HeW falls slightly behind our very recent work  on Oxford5k and Oxford105k, we establish new state-of-the-art results on the other three evaluated datasets at a dimensionality of 512, and the mAP improvements over [36, 34, 3, 52] are not negligible. For example, the gain on Paris106k is at least 4.7%.
With the same siaMAC features, our approach improves on two related baselines, R-MAC and MAC, presented in  both without and with dimensionality reduction, and the improvement is more significant on the two large datasets Oxford105k and Paris106k.
Finally, we further enlarge the mAP gains over the compared methods by applying HeR after QE. As shown, at , we outperform  by 7.0%, 8.5%, 7.8%, 11.4% on Oxford5k, Oxford105k, Paris6k, Paris106k, respectively.
To better understand our complete method HeW+QE+HeR, we visualize two example query images in Fig. 4. The top query example can be considered a failure case for our method, as its average precision is only 58.8%, far below the mAP value of 92.0%. As displayed, although the landmark in the 24th-ranked image is not exactly the same as the one in the query, it is visually similar to the query. For the bottom query example, there are 24 ground-truth similar images, and only one false positive, ranked 22nd. Its average precision is 97.0%, so it can be seen as a successful example. As shown, the only false positive contains several window patches, so it is understandable that it has a large similarity score with the query region.
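As a reminder of how per-query average precision and mAP relate, here is a minimal sketch of the standard non-interpolated definition. The function names are illustrative. Benchmark evaluation scripts may differ in detail (for example, by ignoring images labelled as junk), so this sketch will not necessarily reproduce the figures quoted above.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated AP for one query.

    ranked_relevance: booleans in ranked order, True where the result is relevant.
    num_relevant: total number of relevant images for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant result.
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query_ap):
    """mAP is the plain mean of the per-query AP values."""
    return sum(per_query_ap) / len(per_query_ap)

# One relevant hit at rank 1, a miss at rank 2, a relevant hit at rank 3.
ap = average_precision([True, False, True], num_relevant=2)
```

Because each relevant result contributes the precision at its own rank, a single false positive placed late in the ranking lowers AP only slightly, whereas early mistakes are penalised heavily.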
Comparison with costly methods. Table V compares our best results with costly methods that focus on spatial verification or matching kernels. Some of them [20, 39, 30] do not necessarily rely on a global representation, and others [53, 54, 55, 36, 3] represent images with much higher-dimensional vectors, so they are not directly comparable. These methods have a larger memory footprint than our approach, and their search efficiency is much lower. For instance, the method of  requires a slow spatial verification taking over 1 second per query (excluding descriptor extraction time). These best results are therefore hardly scalable, as they require substantial storage and search time. Compared with these methods, we still produce the best performance on Oxford5k and Oxford105k. Similar to our approach, the supervised fine-tuned method  also represents images with 512-dimensional vectors. Compared with this method, we produce a slightly inferior result on Holidays and achieve much better performance on the other datasets.
In this manuscript, we proposed an efficient aggregation approach that uses the heat equation to build compact but powerful image representations. We drew on the theory of anisotropic diffusion and assumed that the graph defined by a set of deep features constitutes a heat transfer system. By considering each deep feature as a heat source, our approach avoids over-representing bursty features by enforcing the system temperatures derived from all features to be constant. We provided a practical solution to derive image vectors and demonstrated the effectiveness of our method on the task of instance-level retrieval. Inspired by our aggregation method, we also presented a heat-equation-based image re-ranking method to further increase retrieval performance. Both the feature aggregation and image re-ranking methods are unsupervised and compatible with different CNNs, including pre-trained and fine-tuned networks. Experimental results showed that we have established new state-of-the-art results on public image retrieval benchmarks using 512-dimensional vector representations.
This work was supported by National Key Research and Development Plan 2016YFB1001004, National Natural Science Foundation of China Grant 61603289, China Postdoctoral Science Foundation Grant 2016M602823, and Fundamental Research Funds for the Central Universities xjj2017118.
-  A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in 15th IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, December 11-18, 2015, pp. 1269–1277.
-  Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in 14th European Conference on Computer Vision (ECCV) Workshops, Amsterdam, The Netherlands, October 8–16, 2016, pp. 685–701.
-  T. Hoang, T.-T. Do, D.-K. L. Tan, and N.-M. Cheung, “Selective deep convolutional features for image retrieval,” in ACM Multimedia Conference, Silicon Valley, California, October 23-27, 2017, pp. 1600–1608.
-  A. Chadha and Y. Andreopoulos, “Voronoi-based compact image descriptors: Efficient region-of-interest retrieval with VLAD and deep-learning-based descriptors,” IEEE Transactions on Multimedia (TMM), vol. 19, no. 7, pp. 1596–1608, 2017.
-  J. Zhang and Y. Peng, “Query-adaptive image retrieval by deep weighted hashing,” IEEE Transactions on Multimedia (TMM), 2018.
-  H. Jégou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” in 22nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami Beach, Florida, June 20-26, 2009, pp. 1169–1176.
-  P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 12, no. 7, pp. 629–639, 1990.
-  J. Zhang, J. Zheng, and J. Cai, “A diffusion approach to seeded image segmentation,” in 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, California, June 13-18, 2010, pp. 2125–2132.
-  G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade, “Distributed cosegmentation via submodular optimization on anisotropic diffusion,” in 13th International Conference on Computer Vision (ICCV), Barcelona, Spain, November 6-13, 2011, pp. 169–176.
-  M. Karpushin, G. Valenzise, and F. Dufaux, “Keypoint detection in RGBD images based on an anisotropic scale space,” IEEE Transactions on Multimedia (TMM), vol. 18, no. 9, pp. 1762–1771, 2016.
-  S. I. Cho and S.-J. Kang, “Geodesic path-based diffusion acceleration for image denoising,” IEEE Transactions on Multimedia (TMM), vol. 20, no. 7, pp. 1738–1750, 2018.
-  S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for diffusion-based salient object detection,” in 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, June 23-28, 2014, pp. 2790–2797.
-  S. Chen, L. Zheng, X. Hu, and P. Zhou, “Discriminative saliency propagation with sink points,” Pattern Recognition, vol. 60, pp. 2–12, 2016.
-  M. Donoser, “Replicator graph clustering,” in 24th British Machine Vision Conference (BMVC), Bristol, UK, September 9-13, 2013, pp. 1–11.
-  S. Pang, J. Xue, Z. Gao, L. Zheng, and L. Zhu, “Large-scale vocabularies with local graph diffusion and mode seeking,” Signal Processing: Image Communication, vol. 63, pp. 1–8, 2018.
-  A. Egozi, Y. Keller, and H. Guterman, “Improving shape retrieval by spectral matching and meta similarity,” IEEE Transactions on Image Processing (TIP), vol. 19, no. 5, pp. 1319–1327, 2010.
-  X. Yang, S. Koknar-Tezel, and L. J. Latecki, “Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval,” in 22nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami Beach, Florida, June 20-26, 2009, pp. 357–364.
-  M. Donoser and H. Bischof, “Diffusion processes for retrieval revisited,” in 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, Oregon, June 23-28, 2013, pp. 1320–1327.
-  T. Furuya and R. Ohbuchi, “Diffusion-on-manifold aggregation of local features for shape-based 3D model retrieval,” in ACM International Conference on Multimedia Retrieval (ICMR), Shanghai, China, June 9–12, 2015, pp. 171–178.
-  A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, June 24-30, 2017, pp. 926–935.
-  Z. Gao, J. Xue, W. Zhou, S. Pang, and Q. Tian, “Democratic diffusion aggregation for image retrieval,” IEEE Transactions on Multimedia (TMM), vol. 18, no. 8, pp. 1661–1674, 2016.
-  N. Murray, H. Jégou, F. Perronnin, and A. Zisserman, “Interferences in match kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39, no. 9, pp. 1797–1810, 2017.
-  A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, September 6-12, 2014, pp. 584–599.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Columbus, Ohio, June 23-28, 2014, pp. 512–519.
-  Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, September 6-12, 2014, pp. 392–407.
-  H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, California, June 13-18, 2010, pp. 3304–3311.
-  G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of CNN activations,” in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2-4, 2016, pp. 1–12.
-  H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” in 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR) DeepVision Workshop, Boston, Massachusetts, June 7-12, 2015, pp. 36–45.
-  A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.
-  H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in 16th International Conference on Computer Vision (ICCV), Venice, Italy, October 22-29, 2017, pp. 3456–3465.
-  J. Xu, C. Shi, C. Qi, C. Wang, and B. Xiao, “Unsupervised part-based weighting aggregation of deep convolutional features for image retrieval,” in 32nd AAAI Conference on Artificial Intelligence (AAAI), New Orleans, Louisiana, February 2–7, 2018, pp. 1–8.
-  S. Pang, J. Zhu, J. Wang, V. Ordonez, and J. Xue, “Building discriminative CNN image representations for object retrieval using the replicator equation,” Pattern Recognition, vol. 83, pp. 50–60, 2018.
-  H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, June 23-28, 2014, pp. 3310–3317.
-  F. Radenović, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” in 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 8–16, 2016, pp. 3–20.
-  A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 8–16, 2016, pp. 241–257.
-  R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, June 26-July 1, 2016, pp. 5297–5307.
-  J. Cao, Z. Huang, P. Wang, C. Li, X. Sun, and H. T. Shen, “Quartet-net learning for visual instance retrieval,” in ACM Multimedia Conference, Amsterdam, The Netherlands, October 15–19, 2016, pp. 456–460.
-  B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100M: the new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
-  A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Fast spectral ranking for similarity search,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, June 18–22, 2018.
-  J. Weickert, Anisotropic diffusion in image processing. Teubner Stuttgart, 1998.
-  L. Grady, “Random walks for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 28, no. 11, pp. 1768–1783, 2006.
-  F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in 11th European Conference on Computer Vision (ECCV), Crete, Greece, September 11-15, 2010, pp. 143–156.
-  T.-T. Do and N.-M. Cheung, “Embedding based on function approximation for large scale image search,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 99, pp. 1–12, 2017.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in 20th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, June 18-23, 2007, pp. 1–8.
-  ——, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in 21st IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, June 23-28, 2008, pp. 1–8.
-  H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” International Journal of Computer Vision (IJCV), vol. 87, no. 3, pp. 316–336, 2010.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, June 26-July 1, 2016, pp. 770–778.
-  A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for MATLAB,” in ACM Multimedia Conference, Queensland, Australia, October 26-30, 2015, pp. 689–692.
-  H. Jégou and O. Chum, “Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening,” in 12th European Conference on Computer Vision (ECCV), Florence, Italy, October 7-13, 2012, pp. 774–787.
-  O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in 11th IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, October 14-20, 2007, pp. 1–8.
-  E.-J. Ong, S. Husain, and M. Bober, “Siamese network of deep fisher-vector descriptors for image retrieval,” arXiv preprint arXiv:1702.00338, 2017.
-  A. Mikulik, M. Perdoch, O. Chum, and J. Matas, “Learning vocabularies over a fine quantization,” International Journal of Computer Vision (IJCV), vol. 103, no. 1, pp. 163–175, 2013.
-  G. Tolias, Y. Avrithis, and H. Jégou, “To aggregate or not to aggregate: Selective match kernels for image search,” in 14th IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, December 1-8, 2013, pp. 1401–1408.
-  G. Tolias and H. Jégou, “Visual query expansion with or without geometry: Refining local descriptors by feature aggregation,” Pattern Recognition, vol. 47, no. 10, pp. 3466–3476, 2015.