Adaptive Co-weighting Deep Convolutional Features For Object Retrieval

by   Jiaxing Wang, et al.
Xi'an Jiaotong University

Aggregating deep convolutional features into a global image vector has attracted sustained attention in image retrieval. In this paper, we propose an efficient unsupervised aggregation method that uses an adaptive Gaussian filter and an element-value sensitive vector to co-weight deep features. Specifically, the Gaussian filter assigns large weights to features of regions of interest (RoI) by adaptively determining the RoI's center, while the element-value sensitive channel vector suppresses the burstiness phenomenon by assigning small weights to feature maps whose values sum to large totals over all locations. Experimental results on benchmark datasets validate that the two proposed weighting schemes both effectively improve the discriminative power of image vectors. Furthermore, under the same experimental setting, our method outperforms other very recent aggregation approaches by a considerable margin.





1 Introduction

Given a query image of an object, we are interested in finding images containing the same object in a large-scale database; this task is known as content-based image retrieval (CBIR). The key step of CBIR is to generate an image representation, which involves aggregating patch-level feature descriptors into a single fixed-length image-level vector. Most state-of-the-art representations are based on hand-crafted features (e.g., SIFT [1]) or Convolutional Neural Network (CNN) [2] features.

The pioneering image representation model based on hand-crafted features is bag-of-words (BoW) [3], which maps each feature to a visual word and consequently represents an image as a high-dimensional sparse vector. BoW achieved celebrated success in CBIR, and numerous works [4, 5, 6, 7, 8] adopt this model for searching similar images. Although BoW has attracted great attention, it suffers from two major drawbacks in large-scale retrieval: poor search efficiency and high memory cost [9, 10]. An alternative solution is to aggregate local descriptors into a mid-size vector, e.g., Fisher vectors [11] and the vector of locally aggregated descriptors (VLAD) [9], among others [12, 13].

Recently, the focus of image retrieval has shifted from hand-crafted features to CNN-based ones, since the discriminative power of the latter is much stronger. Early works [14, 15, 16] consider the outputs of the last fully-connected layer as global representations. More recent works with superior performance, such as SPoC [17], R-MAC [18] and CroW [19], first employ the outputs of a deep convolutional layer as local features, and then aggregate them into the global representation. Specifically, SPoC leverages a centering prior to aggregate features output by the last convolutional layer with sum-pooling. It is worth noting that the assumed centering prior, namely that the RoI is located at the geometric center of an image, does not hold for many images. Simply speaking, R-MAC first derives representations for image regions by max-pooling the convolutional activations over the corresponding regions, and then calculates the image representation by aggregating region vectors with sum-pooling. Similar to SPoC, CroW also favors sum-pooling to directly aggregate convolutional layer features. The major difference is that CroW employs a more effective cross-dimensional weighting strategy: besides performing spatial-wise weighting on feature maps, CroW also employs channel-wise weighting to jointly boost the discriminative power of features.

In this paper, we present an unsupervised deep feature weighting method to improve image representations. Our method is somewhat similar to CroW, as both perform spatial- and channel-wise weighting for each feature, but there are at least two distinct differences. First, our spatial weighting strategy, which extends SPoC by adaptively determining the center point of the RoI, assigns larger weights to RoI features with an adaptive Gaussian (aGaussian) filter, while CroW leverages the aggregated spatial response map to compute a spatial weight for each feature. Second, the proposed element-value sensitive channel (eChannel) weighting strategy obtains channel weights by aggregating the products of the feature maps and the aGaussian filter, while CroW derives channel weights from the sparsity of feature maps. To summarize, our contributions are two-fold:

  • We design a strategy to adaptively determine the center point of the RoI, and then incorporate this prior into a Gaussian filter that assigns larger weights to RoI features.

  • We design an eChannel weighting vector by aggregating the products of the feature maps and the aGaussian filter, easing intra-image visual burstiness [20].

The organization of this paper is as follows. This section reviews recent advances in CBIR and outlines our contributions. Section 2 describes our two weighting strategies in detail. Section 3 supports our method with extensive experiments. Finally, Section 4 briefly concludes the paper.

Input: tensor $\mathcal{X} \in \mathbb{R}^{W \times H \times K}$, dimensionality $D$, parameter $p$, whitening parameters $(\mathbf{m}, P)$, aGaussian generation function $g(\cdot)$, eChannel generation function $h(\cdot)$
Output: $D$-dimensional global representation $\boldsymbol{\psi}$
  $\boldsymbol{\alpha} \leftarrow g(\mathcal{X}, p)$   // Adaptive Gaussian filter
  $\mathcal{X}^{(k)} \leftarrow \boldsymbol{\alpha} \odot \mathcal{X}^{(k)}, \; k = 1, \dots, K$   // Spatial weighting
  $\boldsymbol{\beta} \leftarrow h(\mathcal{X})$   // Element-value vector
  $f_k \leftarrow \beta_k \sum_{(x,y)} \mathcal{X}^{(k)}(x,y), \; k = 1, \dots, K$   // Channel weighting
  $\mathbf{f} \leftarrow \mathbf{f} / \|\mathbf{f}\|_2$   // Normalize
  $\boldsymbol{\psi} \leftarrow P(\mathbf{f} - \mathbf{m})$   // Whitening
  $\boldsymbol{\psi} \leftarrow \boldsymbol{\psi} / \|\boldsymbol{\psi}\|_2$   // Normalize again
Algorithm 1: Deep feature aggregation framework

2 Methodology

2.1 Framework

Let $\mathcal{X} \in \mathbb{R}^{W \times H \times K}$ be the feature tensor extracted from the deep convolutional layer, which consists of $K$ feature maps $\mathcal{X}^{(k)}$ (each having width $W$ and height $H$). Let $\boldsymbol{\alpha} \in \mathbb{R}^{W \times H}$ and $\boldsymbol{\beta} \in \mathbb{R}^{K}$ denote the spatial weighting matrix and the channel weight vector, respectively. We summarize the aggregation framework in Algorithm 1, where $\odot$ denotes the element-wise product. In the following, we discuss our strategies for obtaining $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ in detail.
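As an illustration, the co-weighted aggregation of Algorithm 1 can be sketched in NumPy. This is a minimal sketch under our own interface: the spatial weights `alpha` and channel weights `beta` are assumed to be produced by the aGaussian and eChannel schemes described below, and the whitening parameters are passed as an optional `(mean, P)` pair.

```python
import numpy as np

def aggregate(X, alpha, beta, whiten=None):
    """Co-weight and sum-pool a conv feature tensor X of shape (W, H, K).

    alpha: (W, H) spatial weights; beta: (K,) channel weights.
    whiten: optional (mean, P) PCA-whitening parameters, P of shape (D, K).
    """
    # Spatial weighting + sum-pooling over locations, then channel weighting
    f = beta * np.einsum('whk,wh->k', X, alpha)
    f = f / (np.linalg.norm(f) + 1e-12)        # L2-normalize
    if whiten is not None:
        mean, P = whiten
        f = P @ (f - mean)                     # whiten and reduce to D dimensions
        f = f / (np.linalg.norm(f) + 1e-12)    # L2-normalize again
    return f
```

Without whitening the output keeps $K$ dimensions; with whitening it is reduced to the $D$ rows of $P$, matching the $D$-dimensional global representation of Algorithm 1.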

2.2 Adaptive Gaussian filter for spatial weighting

For effective object retrieval, it is better to pay more attention to the RoI. In other words, the aggregation process should distinguish features of the RoI from those of cluttered regions by assigning different weights to them:

$$f_k = \sum_{(x,y)} \boldsymbol{\alpha}(x,y)\, \mathcal{X}^{(k)}(x,y),$$

where $\boldsymbol{\alpha}$ denotes the weighting function. This function should assign large weights to the RoI and small weights to other regions. Accordingly, the probability density of the Gaussian distribution can be selected for this purpose:

$$\boldsymbol{\alpha}(x,y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^{2}}\right),$$

where $\sigma$ and $(x_0, y_0)$ denote the standard deviation and center (mean) of the Gaussian distribution, respectively. For $\sigma$, we can set it to half the distance between the geometric center of the feature map and the farthest boundary. The problem arising here is how to determine $(x_0, y_0)$.

To determine $(x_0, y_0)$, we first introduce the matrix $S$ of responses aggregated over all channels per spatial location:

$$S(x,y) = \sum_{k=1}^{K} \mathcal{X}^{(k)}(x,y).$$

As an important finding, we notice that $S$ contains the semantic information of the raw image. Fig. 1 displays heat maps of $S$ for some randomly selected images, where high and low values are denoted in red and blue, respectively. It clearly shows that the large responses of $S$ correspond to salient regions. We sort all spatial locations by their responses and take the geometric center of the top $p$ largest responses as the center of the Gaussian distribution. Therefore, the center can be adaptively selected through $p$.
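A minimal NumPy sketch of this adaptive center selection, combined with the Gaussian weighting of Section 2.2, might look as follows; the function name and the exact reading of "half the distance to the farthest boundary" for the standard deviation are our own assumptions:

```python
import numpy as np

def adaptive_gaussian(X, p=0.10):
    """Adaptive Gaussian spatial weights for a conv feature tensor X (W, H, K).

    The Gaussian center is the geometric center of the top-p largest
    responses of the aggregated response map S.
    """
    S = X.sum(axis=2)                              # aggregated response map (W, H)
    W, H = S.shape
    n = max(1, int(round(p * W * H)))              # number of top responses kept
    top = np.argsort(S, axis=None)[-n:]            # flat indices of top-p responses
    xs, ys = np.unravel_index(top, S.shape)
    x0, y0 = xs.mean(), ys.mean()                  # adaptive Gaussian center
    # sigma: half the distance from the map's geometric center to the
    # farthest boundary (our reading: half of max(W, H) / 2)
    sigma = 0.25 * max(W, H)
    xx, yy = np.meshgrid(np.arange(W), np.arange(H), indexing='ij')
    return np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))
```

With a small $p$, the weights peak near the most salient location, while $p = 100\%$ recovers a Gaussian centered at the geometric center of the map. The normalization constant of the Gaussian density is dropped here, since only relative weights matter after L2-normalization.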

Fig. 1 shows geometric centers obtained with three different values of $p$. As displayed in Fig. 1, centers determined by the top 10% largest responses agree best with human visual perception. Besides, experimental results will further show that $p = 10\%$ gives promising performance.

To examine the effect of spatial weighting, we visualize the locations boosted by the aGaussian filter for four example images in Fig. 2. As shown in Fig. 2, the RoI is strengthened while other regions are suppressed by our strategy.

Figure 1: Original images and corresponding heat maps with Gaussian centers (yellow points) determined by selecting different values of $p$. Centers in the 2nd, 3rd and 4th rows correspond to three different settings of $p$, respectively.
Figure 2: The responses of some examples activated by the adaptive Gaussian filter. The 1st row shows the original images, the 2nd row displays the Gaussian weights, the 3rd row illustrates the heat maps, and the 4th row demonstrates the weighted responses.

2.3 Element-value sensitive channel weighting

BoW is a classical model for CBIR, where retrieval accuracy can be effectively boosted by inverse document frequency (idf) weighting. In the domain of deep feature based CBIR, different filters usually activate different semantic content and generate corresponding feature maps. Inspired by the success of idf weighting, we want to differentiate channels within an image (note that standard idf weighting in BoW is computed over a database) and expect channels whose feature maps have small aggregated values to be boosted. Accordingly, it is necessary to design channel weighting for deep convolutional features.

Kalantidis et al. [19] proposed a channel weighting strategy based on the sparsity of feature maps. However, a feature map is a real-valued, not a binary, matrix. The element value at each spatial location denotes the activation intensity of the filter, so sparsity does not make full use of the available information. Here, we propose to derive a new channel weighting based on element values, expecting that similar images will have similar Gaussian filter responses for a given feature. For each channel $k$, the term $Q_k$ is calculated as follows:

$$Q_k = \sum_{(x,y)} \boldsymbol{\alpha}(x,y)\, \mathcal{X}^{(k)}(x,y).$$

That is, $Q_k$ is computed by summing the element values of each feature map after Gaussian filtering.

To compare with sparsity-sensitive weighting, we concatenate all $Q_k$ into a $K$-dimensional vector $\mathbf{Q}$. Fig. 3 displays the pair-wise correlations of these vectors for all images in the query set of the Paris6K dataset. This query set contains 55 images, where each group of 5 images corresponds to one landmark of Paris, so the images can be classified into 11 classes. As shown in Fig. 3, both the sparsity-sensitive (Fig. 3a) and our element-value sensitive vectors (Fig. 3b) are highly correlated for images of the same landmark, but vectors computed by our strategy are less correlated than sparsity-based ones for images of different landmarks. Therefore, the element-value sensitive vector is more discriminative than the sparsity-sensitive vector. Besides, a feature with a small average element value can provide important information if, for example, only a small fraction of features have small average element values for images in the same class. Hence, the element-value term $Q_k$ replaces sparsity when computing the channel weight:

$$\beta_k = \log\!\left(\frac{\sum_{j=1}^{K} Q_j}{Q_k + \epsilon}\right),$$

where a small constant $\epsilon$ is added for numerical stability.
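In NumPy, this eChannel weighting can be sketched as follows, assuming an idf-style log ratio in the spirit of CroW's channel weighting [19] with the summed element values $Q_k$ in place of sparsity (the function name is ours):

```python
import numpy as np

def echannel_weights(X, alpha, eps=1e-6):
    """Element-value sensitive channel weights for X of shape (W, H, K).

    Q_k sums each Gaussian-filtered feature map; bursty channels with
    large aggregated values receive small (idf-style) weights.
    """
    Q = np.einsum('whk,wh->k', X, alpha)        # Q_k per channel
    return np.log((Q.sum() + eps) / (Q + eps))  # small eps for stability
```

Channels whose filters fire strongly over many locations receive small weights, which is the intended suppression of intra-image burstiness.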

As in the BoW model, deep convolutional features also suffer from the problem of visual burstiness [20], and the introduced channel weighting can alleviate this issue. Indeed, channels with large aggregated values correspond to CNN filters that respond strongly in many image regions. This implies the presence of spatially recurring visual elements, which can negatively affect retrieval accuracy. In our method, small weights are assigned to the channels of such bursty CNN filters. Experimental results will illustrate its effectiveness for image retrieval.

Figure 3: The pair-wise correlations of (a) sparsity-sensitive and (b) element-value sensitive vectors for the 55 images in the query set of Paris6K.

3 Experiments

3.1 Experimental Setting

Our method is evaluated on five benchmark datasets: Oxford5K [4] (5,062 building photos with 55 queries of 11 landmarks), Paris6K [5] (6,412 building photos with 55 queries of 11 landmarks), Oxford105K, Paris106K, and INRIA Holidays [6] (1,491 holiday photos with 500 queries). Oxford105K and Paris106K extend Oxford5K [4] and Paris6K [5], respectively, by adding about 100K distractor images collected from Flickr. We employ the standard protocol consistent with other methods, i.e., using the cropped queries in Oxford5K(105K) and Paris6K(106K) and the upright version of the images in Holidays. All deep features are extracted from the pool5 layer of VGG16 [21] pre-trained on ImageNet, so the number of channels is $K = 512$. The retrieval performance is measured by the mean Average Precision (mAP), i.e., the mean over all queries of the average precision of the ranked retrieval list. Additionally, for a fair comparison with other methods, we learn whitening parameters on Paris6K when testing on Oxford5K(105K), and learn them on Oxford5K when testing on Paris6K(106K) and INRIA Holidays.

3.2 Impact of the parameter $p$

The proposed method has only one parameter that needs to be evaluated, i.e., $p$, which determines the center of the spatial weighting matrix. In Section 2.2, we mentioned that the large responses of $S$ correspond to salient regions. However, "large" is a vague concept: one reader may regard only the very top responses in $S$ as large, while another may regard the top 50% as large. As different values of $p$ lead to different geometric centers of the spatial weighting matrix, we determine the value of $p$ empirically.

$p$       5%    10%   15%   20%   50%   100%
Dim 128   63.0  63.3  63.2  63.2  62.8  61.2
Dim 256   67.9  68.4  68.1  67.7  66.2  63.4
Dim 512   70.0  70.4  70.6  70.3  69.2  66.1
Table 1: Performance (mAP) of the spatial weighting with different values of $p$, tested on Oxford5K.

Table 1 shows the retrieval performance on Oxford5K [4] for different values of $p$. In this experiment, we apply only the aGaussian weighting, without the eChannel weighting. According to Table 1, excellent retrieval performance is preserved when $p$ is between 10% and 20%, while performance begins to fall once $p$ reaches 50%. Therefore, considering the trade-off between retrieval efficiency and accuracy, we empirically set $p = 10\%$, which is also consistent with the results shown in Fig. 1.

3.3 Impact of different weighting schemes

In our method, the aGaussian and eChannel weighting strategies are proposed to co-weight the deep convolutional features, and their effectiveness should be verified by experiment. For spatial weighting, we compare the aGaussian weighting with the normal Gaussian (nGaussian) weighting, which corresponds to the aGaussian weighting with the center fixed at the geometric center of the feature map (i.e., $p = 100\%$). For channel weighting, we compare the eChannel weighting with the sparsity channel (sChannel) weighting. As the spatial and channel weights are independent, they can be applied to the deep convolutional features individually or simultaneously. Six groups of weighting combinations are tested on Oxford5K under different dimensionalities. Fig. 4 displays the retrieval accuracy for these combinations. As we can see, both aGaussian and eChannel weighting improve retrieval accuracy, and their combination further improves it. Therefore, both of the proposed weightings are validated for the aggregation of convolutional features.

Figure 4: Comparison of retrieval accuracy for different weighting combinations tested on Oxford5K.
Figure 5: Top 10 results for five randomly selected queries of Oxford5K, using a 512-dimensional global vector to represent each image. The query images are displayed leftmost. True and false results are marked with red and yellow dotted borders, respectively.

3.4 Comparison with the state of the art

Table 2 compares the retrieval performance of the proposed method with several relevant methods without fine-tuning, namely CroW [19], Neural codes [14], R-MAC [18], SPoC [17] and Razavian et al. [16], on five benchmark datasets and at different dimensionalities. According to the table, our method achieves the best overall performance among the compared methods. In particular, on Oxford5K(105K) and Holidays, the proposed method is at least 2 points of mAP higher than the other non-fine-tuned methods using 512-dimensional features. Fig. 5 illustrates the results for several randomly selected queries.

Method            Dim  Paris6K  Oxford5K  Paris106K  Oxford105K  Holidays
NetVLAD [22]      512  74.9     67.6      –          –           86.1
CroW [19]         512  79.7     70.8      72.2       65.3        85.1
Neural code [14]  512  –        55.7      –          52.2        78.9
R-MAC [18]        512  83.0     66.9      75.7       61.6        –
Razavian [16]     512  67.4     46.2      –          –           74.6
Our method        512  83.0     72.8      76.3       68.1        87.4
NetVLAD [22]      256  73.5     63.5      –          –           84.3
SPoC [17]         256  –        53.1      –          50.1        80.2
R-MAC [18]        256  72.9     56.1      60.1       47.0        –
Neural code [14]  256  –        55.7      –          52.4        78.9
CroW [19]         256  76.5     68.4      69.1       63.7        85.1
Our method        256  80.5     70.7      74.0       66.5        87.2
NetVLAD [22]      128  69.5     61.4      –          –           82.6
CroW [19]         128  74.6     64.1      67.0       59.0        82.8
Neural code [14]  128  –        55.7      –          52.3        78.9
Our method        128  77.9     65.8      70.5       61.4        85.7
Table 2: Performance comparison (in mAP) with recent deep feature based image retrieval methods; "–" denotes results not reported. Our method consistently outperforms all of them.

We also compare our method with the fine-tuned method NetVLAD [22]. According to Table 2, our method outperforms NetVLAD on all the test datasets at every dimensionality. When compared with the best reported results of two other very recent fine-tuning works [23, 24], our results are not optimal. However, these fine-tuning methods [22, 23, 24] rely heavily on manual annotation and training supervision. Therefore, the advantages of our method are further highlighted for problems where fine-tuning methods cannot work well.

4 Conclusion

In this paper, we propose an unsupervised aggregation method for image retrieval using convolutional features. Its key characteristic is an adaptive Gaussian filter and an element-value sensitive channel vector that co-weight deep convolutional features extracted from a pre-trained CNN. The former highlights the retrieval objects in the image, and the latter eases the burstiness phenomenon in deep local features. We analyze the motivation of these weighting schemes and demonstrate their effectiveness by experiments.

Experiments on five benchmark datasets demonstrate that our method outperforms other very recent aggregation methods based on off-the-shelf deep features, without needing a long vector to represent each image. It is worth noting that our unsupervised aggregation method is particularly suitable when a related training dataset is difficult to collect and the retrieval dataset is large, demanding substantial storage.


Acknowledgments

This work was supported by National Natural Science Foundation of China (NSFC) (Grant Nos. 61603289 and 61573273); China Postdoctoral Science Foundation (Grant No. 2016M602823); Postdoctoral Science Foundation of Shaanxi; and Fundamental Research Funds for the Central Universities (Grant No. xjj2017118).


  • [1] Lowe DG, “Distinctive image features from scale-invariant keypoints,” IJCV, 2004.
  • [2] Krizhevsky A, Sutskever I, and Hinton GE, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [3] Sivic J and Zisserman A, “Video google: A text retrieval approach to object matching in videos,” in ICCV, 2003.
  • [4] Philbin J, Chum O, Isard M, Sivic J, and Zisserman A, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007.
  • [5] Philbin J, Chum O, Isard M, Sivic J, and Zisserman A, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in CVPR, 2008.
  • [6] Jégou H, Douze M, and Schmid C, “Improving bag-of-features for large scale image search,” IJCV, 2010.
  • [7] Pang S, Xue J, Tian Q, and Zheng N, “Exploiting local linear geometric structure for identifying correct matches,” CVIU, 2014.
  • [8] Arandjelović R and Zisserman A, “Three things everyone should know to improve object retrieval,” in CVPR, 2012.
  • [9] Jégou H, Douze M, Schmid C, and Pérez P, “Aggregating local descriptors into a compact image representation,” in CVPR, 2010.
  • [10] Li Z, Zhang X, Müller H, and Zhang S, “Large-scale retrieval for medical image analytics: A comprehensive review,” Medical Image Analysis, 2017.
  • [11] Perronnin F, Liu Y, Sánchez J, and Poirier H, “Large-scale image retrieval with compressed fisher vectors,” in CVPR, 2010.
  • [12] Murray N, Jégou H, Perronnin F, and Zisserman A, “Interferences in match kernels,” TPAMI, 2017.
  • [13] Pang S, Zhang W, Zhu L, Zhu J, and Xue J, “Beyond sum and weighted aggregation: An efficient mixed aggregation method with multiple weights for image search,” in ACM’MM Thematic Workshops, 2017.
  • [14] Babenko A, Slesarev A, Chigorin A, and Lempitsky V, “Neural codes for image retrieval,” in ECCV, 2014.
  • [15] Gong Y, Wang L, Guo R, and Lazebnik S, “Multi-scale orderless pooling of deep convolutional activation features,” in ECCV, 2014.
  • [16] Razavian AS, Azizpour H, Sullivan J, and Carlsson S, “CNN features off-the-shelf: an astounding baseline for recognition,” in CVPR Workshops, 2014.
  • [17] Babenko A and Lempitsky V, “Aggregating local deep features for image retrieval,” in ICCV, 2015.
  • [18] Tolias G, Sicre R, and Jégou H, “Particular object retrieval with integral max-pooling of CNN activations,” in ICLR, 2016.
  • [19] Kalantidis Y, Mellina C, and Osindero S, “Cross-dimensional weighting for aggregated deep convolutional features,” in ECCV workshops, 2016.
  • [20] Jégou H, Douze M, and Schmid C, “On the burstiness of visual elements,” in CVPR, 2009.
  • [21] Simonyan K and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [22] Arandjelović R, Gronat P, Torii A, Pajdla T, and Sivic J, “NetVLAD: CNN architecture for weakly supervised place recognition,” in CVPR, 2016.
  • [23] Radenović F, Tolias G, and Chum O, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” in ECCV, 2016.
  • [24] Gordo A, Almazán J, Revaud J, and Larlus D, “Deep image retrieval: Learning global representations for image search,” in ECCV, 2016.