Can Image Retrieval help Visual Saliency Detection?

09/24/2017 ∙ by Shuang Li, et al. ∙ The University of Adelaide ∙ The Chinese University of Hong Kong

We propose a novel image retrieval framework for visual saliency detection using information about salient objects contained within bounding box annotations for similar images. For each test image, we train a customized SVM from similar example images to predict the saliency values of its object proposals and generate an external saliency map (ES) by aggregating the regional scores. To overcome limitations caused by the size of the training dataset, we also propose an internal optimization module which computes an internal saliency map (IS) by measuring the low-level contrast information of the test image. The two maps, ES and IS, have complementary properties so we take a weighted combination to further improve the detection performance. Experimental results on several challenging datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.




1 Related Work

Significant improvement in saliency detection has been witnessed in the past decade. Numerous unsupervised and supervised saliency detection methods have been proposed under different theoretical models [14, 22, 37, 9, 25]. However, few works address this problem from the perspective of image retrieval.

Most unsupervised algorithms are based on low-level features and perform saliency detection directly on the individual image. Itti et al. [14] propose a saliency model which linearly combines image features including color, intensity and orientation over different scales to detect local conspicuity. However, this method tends to highlight individual salient pixels and loses object information. Zhu et al. [37] propose a background measurement, boundary connectivity, to characterize the spatial layout of image regions. In [9], Cheng et al. address saliency detection based on global region contrast, which simultaneously considers the spatial coherence across regions and the global contrast over the entire image. However, unsupervised algorithms lose object information and are easily affected by complex backgrounds.

Supervised methods typically train on a large dataset and incorporate high-level object information when computing saliency maps. Liu et al. [24] regard saliency detection as a binary labeling task and combine multiple features with a conditional random field (CRF) to generate the saliency maps. Lu et al. [25] search for optimal seeds by combining bottom-up saliency maps and mid-level vision cues. However, training on a large dataset does not guarantee a good classifier, since it is hard to balance a large number of images with various appearances and categories. If the training set is not large enough, the classifier becomes less robust. Unlike most supervised saliency detection methods, we train a customized classifier for each test image by selecting training samples only from similar images instead of the whole training set. Our image retrieval framework accounts for the specificity of each individual image and tailors the training set accordingly, thus generating more accurate saliency maps.

In [26], Marchesotti et al. also proposed to retrieve similar images for saliency detection. However, our approach differs from theirs in three aspects. First, we address saliency detection based on region proposals, which contain a large amount of shape and boundary information of salient regions and keep the consistency of the whole object or part of it. Second, our approach uses a more discriminative SVM, instead of distance-based classification, to better predict the saliency values of object proposals. Our annotation database consists of 50,000 images, which is large enough to contain similar examples for most test images. Third, unlike [26], which relies purely on a retrieved list and thus potentially suffers from retrieval errors for uncommon objects, we combine internal saliency cues with external high-level retrieved information to get the best of both schemes. Our method combines supervised and unsupervised algorithms, considering high-level object concepts and low-level contrast simultaneously, and thus can uniformly highlight the whole salient region with explicit object boundaries and achieves better performance on the PR curves.

2 Internal Optimization Module

The performance of the proposed image retrieval framework relies heavily on the object proposals. Therefore, we first present a novel internal optimization module to generate a relatively accurate saliency map for the subsequent proposal generation and saliency integration. We first decompose the image into superpixels, then jointly optimize superpixel prior, discriminability and similarity terms under a single objective function. The superpixel prior, obtained by the sum of objectness scores within a superpixel, provides an essential saliency estimation of the test image. The superpixel discriminability aims at identifying salient superpixels by exploring the distinctiveness between each pair. The superpixel similarity term tries to cluster superpixels with similar appearances together using the N-cut [31] algorithm. Since multiple saliency estimations using different cues may enhance relevant information, we propose to fuse these three terms together to make the best of their complementary properties and generate more accurate saliency maps.

In this section, we first introduce the superpixel features, then provide a detailed explanation of the superpixel prior, discriminability, and similarity terms. Finally, we jointly optimize an objective function to compute the internal saliency map.

2.1 Superpixel Features

Superpixel segmentation algorithms generate compact and uniform superpixels, thus greatly reducing the complexity of subsequent vision tasks. Lacking the knowledge of size and position of objects, we produce six layers of superpixels using the SLIC algorithm with different parameters [2] and construct a 30-dimensional feature vector that captures color, texture, and position information to describe each superpixel. The detailed feature components are summarized in Table 1. The color features, including RGB, Lab, and HSV, have been widely adopted by previous saliency detection methods and contribute significantly to the algorithm performance. In addition, we use the absolute response of LM filters, proposed by [18], to represent texture features and extract center and boundary coordinates as position information.

2.2 Superpixel Prior

In [3], Alexe et al. present an objectness approach to measure the likelihood of an image window containing an object. We generate a pixel-wise objectness map by adding all the windows together and define the superpixel prior as , where is a pixel within superpixel . The prior vector is formed by stacking . We aim at computing the saliency score of each superpixel, therefore constructing a linear term as follows:


where is the score vector obtained by stacking . The prior score just provides a rough saliency estimation of each superpixel and more attention should be put on the internal structure of the test image by exploring the superpixel discriminability and similarity as described in the following sections.
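As a minimal sketch of the prior described above, the snippet below sums a pixel-wise objectness map over each superpixel. The names `objectness`, `labels`, and `superpixel_prior` are illustrative assumptions, and the toy 2x3 inputs stand in for a real objectness map and a real SLIC label image.

```python
def superpixel_prior(objectness, labels, num_superpixels):
    """Sum pixel-wise objectness inside each superpixel.

    objectness: 2-D list of floats (hypothetical objectness map).
    labels: 2-D list of ints; labels[r][c] is the superpixel index of pixel (r, c).
    """
    prior = [0.0] * num_superpixels
    for obj_row, lab_row in zip(objectness, labels):
        for value, index in zip(obj_row, lab_row):
            prior[index] += value
    return prior


# Toy example: two superpixels on a 2x3 image.
objectness = [[0.1, 0.2, 0.9],
              [0.1, 0.8, 0.9]]
labels = [[0, 0, 1],
          [0, 1, 1]]
prior = superpixel_prior(objectness, labels, 2)
```

Here superpixel 1 covers the high-objectness pixels, so it receives the larger prior.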

2.3 Superpixel Discriminability

To identify the salient superpixels, we adopt a discriminative learning approach [20] and solve a ridge regression objective function:


where and are the weight vector and weight parameter respectively, is the number of superpixels, and is a bias. Following [4], the objective function can be transformed to a quadratic form with a closed solution:


where , is obtained by stacking , is the centering projection matrix, and is a weight parameter. The quadratic function detects salient superpixels by exploring the nonlinearity and discriminability of their features based on positive definite kernels, and assigns distinctive labels to different superpixels.
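For orientation, the linear-kernel case of this kind of ridge regression can be written out explicitly. This is only a sketch in the style of [4]: the symbols ($x_i$ for the feature of superpixel $i$, $z_i$ for its score, $X$ for the $n \times d$ matrix of stacked features, $\kappa$ for the regularization weight, $\Pi_n$ for the centering projection) are assumed names, and the scaling of the regularizer may differ from the one used in the paper.

```latex
\min_{w,\,b}\; \frac{1}{n}\sum_{i=1}^{n}\bigl(z_i - w^{\top}x_i - b\bigr)^2 + \kappa\,\lVert w\rVert^2
\;=\; z^{\top} A\, z,
\qquad
A = \frac{1}{n}\,\Pi_n\bigl(I_n - X\,(X^{\top}\Pi_n X + n\kappa I_d)^{-1}X^{\top}\bigr)\Pi_n,
\qquad
\Pi_n = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}
```

Eliminating $b$ centers the residual (hence $\Pi_n$), and eliminating $w$ leaves a quadratic form in the score vector $z$ alone, which is what allows this term to be added to the other quadratic terms.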

2.4 Superpixel Similarity

Superpixels with similar features are expected to have similar saliency values. In this part, we construct an affinity matrix to measure the similarity of superpixels:


where is a weight parameter to control the strength of distances. In [31], Shi and Malik propose a normalized clustering algorithm to compute the cluster labels by finding the second smallest eigenvector of the normalized Laplacian matrix, where is the diagonal matrix of . However, we find the unnormalized form achieves better performance in our experiments, with the objective function constructed as follows:


In contrast to superpixel discriminability, superpixel similarity focuses on clustering superpixels together based on similarity. We construct the internal saliency map from the superpixel prior, discriminability, and similarity terms.
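The following sketch illustrates how an unnormalized graph Laplacian penalizes score differences between similar superpixels. The Gaussian affinity `exp(-gamma * ||x_i - x_j||^2)`, the parameter name `gamma`, and the toy features are assumptions chosen for illustration; the text only states that a weight parameter controls the strength of distances.

```python
import math


def affinity_matrix(features, gamma):
    """Gaussian affinity W[i][j] = exp(-gamma * ||x_i - x_j||^2) (assumed form)."""
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                W[i][j] = math.exp(-gamma * sq)
    return W


def laplacian_energy(W, z):
    """Evaluate z^T (D - W) z, where D is the diagonal degree matrix of W."""
    n = len(z)
    degree = [sum(row) for row in W]
    quad = sum(degree[i] * z[i] ** 2 for i in range(n))
    quad -= sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return quad


# Superpixels 0 and 1 look alike; superpixel 2 is very different.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
W = affinity_matrix(features, gamma=1.0)
smooth = laplacian_energy(W, [1.0, 1.0, 0.0])  # similar superpixels share a score
rough = laplacian_energy(W, [1.0, 0.0, 0.0])   # similar superpixels disagree
```

Because the quadratic form equals half the sum of `W[i][j] * (z[i] - z[j]) ** 2` over all pairs, the assignment that gives the two similar superpixels the same score has a much lower energy.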

Figure 3: Left to right: Test image and corresponding similar examples.

2.5 Internal Saliency Map

Tang et al. [32] present a joint image-box formulation to localize objects from different images. Inspired by their work, we compute the saliency values of superpixels at each layer by jointly optimizing the above three terms:


To ensure the invertibility of , we add a minimum in this quadratic function, where is the identity matrix. The parameters and control the tradeoff among these three terms. Since and are both positive semi-definite, the objective function is convex and has a unique solution.
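To see why invertibility matters, consider a generic convex quadratic with a linear term. This is a sketch only: $Q$ (a symmetric positive semi-definite matrix collecting the quadratic terms), $c$ (a vector collecting the linear term), and $\epsilon$ (the small multiple of the identity) are assumed symbols, and the way the prior actually enters the paper's objective is not recoverable from the text.

```latex
\min_{z}\; z^{\top}(Q + \epsilon I)\,z - c^{\top}z
\quad\Longrightarrow\quad
2\,(Q + \epsilon I)\,z^{\star} = c
\quad\Longrightarrow\quad
z^{\star} = \tfrac{1}{2}\,(Q + \epsilon I)^{-1}c,
\qquad Q = Q^{\top} \succeq 0,\ \epsilon > 0
```

With $\epsilon > 0$ the matrix $Q + \epsilon I$ is positive definite, so the stationary point exists, is unique, and is the global minimizer.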

We compute a saliency map by summing the superpixel values at each layer, and then take a weighted linear combination as follows:


controls the weight of different layers, and is the final internal saliency map.
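A minimal sketch of this multi-layer combination is given below. It assumes each layer provides a label image and one score per superpixel; the function names, the equal layer weights, and the 2x2 toy layers are illustrative assumptions rather than the settings used in the paper.

```python
def layer_map(labels, scores):
    """Paint each pixel with the saliency score of its superpixel."""
    return [[scores[index] for index in row] for row in labels]


def combine_layers(layer_maps, weights):
    """Weighted linear combination of same-sized per-layer saliency maps."""
    rows, cols = len(layer_maps[0]), len(layer_maps[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for weight, smap in zip(weights, layer_maps):
        for r in range(rows):
            for c in range(cols):
                combined[r][c] += weight * smap[r][c]
    return combined


# Two toy layers on a 2x2 image: a coarse one (2 superpixels) and a fine one (4).
coarse = layer_map([[0, 0], [1, 1]], [0.2, 0.8])
fine = layer_map([[0, 1], [2, 3]], [0.0, 0.4, 0.6, 1.0])
internal = combine_layers([coarse, fine], weights=[0.5, 0.5])
```

Coarse layers contribute region-level consistency while fine layers contribute detail, which is the motivation for weighting several segmentations instead of relying on one.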

3 Image Retrieval Framework

The internal saliency map can locate objects with great accuracy by considering the prior, discriminability and similarity information simultaneously. However, a low-level saliency method loses object concepts and may be sensitive to high-frequency background noise when the scenes are challenging. Since similar images with bounding box annotations provide rich object information for the test image, we design an image retrieval framework that searches for similar examples from the validation set of the CLS-LOC [10] database to further improve the detection performance. The validation set covers 1000 object categories, with 50 annotated validation images for each synset.

The image retrieval framework uses pre-stored object regions extracted from similar examples as training samples for a linear SVM, which predicts the saliency values of object proposals in the test image. The external saliency map is then computed as the sum of the regional values. The detailed procedures are as follows:

3.1 Similar Image Retrieval

For each example image from the dataset, we extract a 4096-dimensional feature vector using the pre-trained Caffe framework [15] and store it. The similarity of each pair of images is measured by the Euclidean distance between their Caffe features:


We sort all the examples by their distance to the test image in ascending order and select the five nearest for subsequent SVM training. Five similar images provide enough object proposals to train a robust classifier, whereas retrieving more images risks introducing examples with large appearance variations that degrade the quality of the training samples. We experimentally find that using five similar images achieves the best performance. Figure 3 shows some retrieval results. For most images there are a sufficient number of similar examples, but exceptions do exist, such as the third image in the last row.
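The retrieval step amounts to a nearest-neighbor search over stored features. The sketch below assumes the features are already extracted and held in a dictionary; the image ids, the 2-dimensional toy vectors (standing in for the 4096-dimensional descriptors), and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_similar(test_feature, database, k=5):
    """Return the ids of the k database images closest to the test image.

    database: dict mapping a (hypothetical) image id to its stored feature.
    """
    ranked = sorted(database,
                    key=lambda image_id: euclidean(test_feature, database[image_id]))
    return ranked[:k]


database = {
    "img_a": [0.0, 1.0],
    "img_b": [0.9, 0.1],
    "img_c": [1.0, 0.0],
    "img_d": [5.0, 5.0],
}
nearest = retrieve_similar([1.0, 0.0], database, k=2)
```

A full sort is fine for a sketch; with tens of thousands of stored features, a partial selection or an index structure would avoid sorting the whole database for every query.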

3.2 Region Selection

For each test image, we produce a set of object segments using the geodesic object proposal (GOP) [21] method. We choose GOP over other segmentation approaches because it achieves significantly higher accuracy and runs substantially faster. To reduce computation, we select candidate regions that could potentially contain an object according to their confidence values:




is the objectness map and is the mask of region . indicates that pixel belongs to , and otherwise.


is a center prior map, where and are the coordinates of pixel , and denote the center of the test image, and and are weight parameters to control the strength of distances. Two example results and PR curves in Figure 4 demonstrate the effectiveness of the center prior map. We empirically set and in all experiments. The external saliency map is constructed based on these selected region proposals.
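One way to picture the region selection is sketched below. It assumes a Gaussian center prior and scores a region by the average of objectness times center prior over its mask; both the functional form and the averaging are assumptions made for illustration, as are the names `center_prior`, `region_confidence`, `sigma_x`, and `sigma_y`.

```python
import math


def center_prior(rows, cols, sigma_x, sigma_y):
    """Gaussian map that peaks at the image center (assumed functional form)."""
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    return [[math.exp(-((c - cx) ** 2) / (2 * sigma_x ** 2)
                      - ((r - cy) ** 2) / (2 * sigma_y ** 2))
             for c in range(cols)] for r in range(rows)]


def region_confidence(objectness, prior, mask):
    """Average of objectness * center prior over the pixels of a region mask."""
    total, area = 0.0, 0
    for obj_row, pri_row, mask_row in zip(objectness, prior, mask):
        for obj, pri, inside in zip(obj_row, pri_row, mask_row):
            if inside:
                total += obj * pri
                area += 1
    return total / area if area else 0.0


# Uniform objectness on a 3x3 image, so only the center prior separates regions.
objectness = [[1.0] * 3 for _ in range(3)]
prior = center_prior(3, 3, sigma_x=1.0, sigma_y=1.0)
center_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
corner_mask = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
center_conf = region_confidence(objectness, prior, center_mask)
corner_conf = region_confidence(objectness, prior, corner_mask)
```

Candidate regions would then be ranked by confidence and the top-scoring ones kept for the external saliency map.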

3.3 Regional Features

Different features affect the performance of vision tasks significantly, so designing discriminative features is essential to our work. In this part, we propose an 81-dimensional feature vector to describe each region. The detailed components of regional features are listed in Table 1. We define the 15-pixel-wide border regions of the test image as background regions. The color histogram and mean color distances are measured by the chi-square and Euclidean distances, respectively, between each candidate proposal and the background regions. We also add the superpixel features, replacing superpixels with regions, and 3-dimensional shape features to the component list to form the 81-dimensional feature vector. Regions extracted from the test image or its similar examples are represented by this feature vector.
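The two background-contrast distances mentioned above can be sketched as follows. The histogram normalization, the small `eps` guard against empty bins, and the toy inputs are assumptions; the text does not specify bin counts or color spaces for these components.

```python
import math


def chi_square_distance(hist_a, hist_b, eps=1e-12):
    """Chi-square distance: 0.5 * sum((a - b)^2 / (a + b)) over histogram bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(hist_a, hist_b))


def mean_color_distance(color_a, color_b):
    """Euclidean distance between two mean color vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(color_a, color_b)))


# A region whose colors never occur in the background gets the maximum distance.
disjoint = chi_square_distance([1.0, 0.0], [0.0, 1.0])
identical = chi_square_distance([0.5, 0.5], [0.5, 0.5])
color_gap = mean_color_distance([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

For normalized histograms the chi-square distance lies between 0 and 1, which keeps this feature on a comparable scale across regions.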

Figure 4: Evaluation of saliency maps: (a) The precision-recall curves of IS, ES (with and without center prior), and EIS on the MSRA-5000 and Pascal-S datasets. (b) Two example results of the ES with and without center prior.
Figure 5: Results of different methods: (a), (b) Precision-recall curves on the ASD dataset. (c), (d) Precision-recall curves on the THUS dataset.
Figure 6: Results of different methods: (a), (b) Precision-recall curves on the MSRA-5000 dataset. (c), (d) Precision-recall curves on the Pascal-S dataset.
Figure 7: Average precision, recall, F-Measure and AUC of methods on different datasets (panels: ASD, THUS, MSRA-5000, Pascal-S).

3.4 External Saliency Map

Instead of utilizing the whole dataset as training samples, we select a subset of similar images to train a customized SVM for each test image. Images from the CLS-LOC database are segmented into object proposals by the GOP, with each proposal assigned a saliency label based on its overlapping area with the ground truth bounding boxes. To save time, we pre-store these segments, with labels and regional features, and load them directly once the corresponding image is selected as one of the similar examples.
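A rough sketch of the labeling step is shown below. It assumes a proposal is labeled positive when most of its pixels fall inside an annotated box; the overlap measure, the 0.8 threshold, and the function names are illustrative assumptions, since the text only says that labels depend on the overlapping area.

```python
def overlap_ratio(mask, box):
    """Fraction of a proposal's pixels that fall inside a bounding box.

    box: (top, left, bottom, right), inclusive pixel coordinates.
    """
    top, left, bottom, right = box
    area = inside = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                area += 1
                if top <= r <= bottom and left <= c <= right:
                    inside += 1
    return inside / area if area else 0.0


def saliency_label(mask, box, threshold=0.8):
    """+1 for proposals mostly inside the box, -1 otherwise (threshold is assumed)."""
    return 1 if overlap_ratio(mask, box) >= threshold else -1


box = (1, 1, 2, 2)
object_mask = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
background_mask = [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
labels = [saliency_label(object_mask, box), saliency_label(background_mask, box)]
```

Because labels and features depend only on the database image, they can be computed once offline, which matches the pre-storing strategy described above.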

We learn parameters, and , by training a linear classifier , and predict the saliency value of each candidate region in the test image as follows:


The external saliency map is generated by adding the regional values of 100 selected proposals:


where , which has the same size as the test image, is the mask map of region . The external saliency map can locate salient objects accurately in most cases, which demonstrates the effectiveness of the proposed image retrieval framework.
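The aggregation can be sketched as scoring each region with a linear decision function and adding that score to every pixel the region covers. The weights and bias below are made-up numbers standing in for a trained linear SVM, and the final rescaling to [0, 1] is an added assumption for display purposes.

```python
def linear_score(weights, bias, feature):
    """Decision value w . x + b of a linear classifier."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def external_map(masks, features, weights, bias):
    """Add each region's predicted score onto the pixels it covers, then rescale."""
    rows, cols = len(masks[0]), len(masks[0][0])
    smap = [[0.0] * cols for _ in range(rows)]
    for mask, feature in zip(masks, features):
        score = linear_score(weights, bias, feature)
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:
                    smap[r][c] += score
    lo = min(min(row) for row in smap)
    hi = max(max(row) for row in smap)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant map
    return [[(v - lo) / span for v in row] for row in smap]


# Two overlapping toy regions on a 2x2 image with hypothetical 2-D features.
masks = [[[1, 1], [0, 0]],
         [[0, 1], [0, 1]]]
features = [[1.0, 0.0], [0.0, 1.0]]
es = external_map(masks, features, weights=[2.0, 1.0], bias=0.0)
```

Pixels covered by several high-scoring regions accumulate the largest values, which is how overlapping proposals reinforce a salient object.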

4 Final Saliency Map (EIS)

The image retrieval framework adopts a supervised learning approach to saliency detection, which incorporates high-level object concepts and achieves good performance in localizing salient objects. The external saliency map can uniformly highlight the whole salient region with explicit object boundaries, except when no similar examples can be found for the test image. As an essential supplement to image retrieval, an internal optimization module is proposed and combined with the external saliency map to construct the final saliency map. The internal saliency map captures low-level feature contrast within an image and performs well in identifying the salient superpixels. However, it is easily affected by background noise, especially in challenging scenes. To make the best use of their respective advantages, we propose to take a weighted sum of the internal and external saliency maps:


where controls the tradeoff between these two maps, and is the final saliency map of our method.

Figure 8: Comparison of our saliency maps with eight state-of-the-art methods. Left to right: (a) Test image. (b) Ground truth. (c) IT [14]. (d) RCJ [7]. (e) SVO [5]. (f) FT [1]. (g) PD [27]. (h) GSSP [33]. (i) LR [30]. (j) RA [29]. (k) EIS.

5 Experiments

We evaluate the proposed method on four benchmark datasets: ASD [1], THUS [7], MSRA-5000 [24], Pascal-S [23]. The ASD dataset is a subset of MSRA-5000, containing 1,000 images with accurate human-labelled masks. The THUS database consists of 10,000 images, with each image having an unambiguous salient object with pixel-wise ground truth. The MSRA-5000 dataset, which includes 5,000 more comprehensive images, has been widely used in previous saliency detection approaches. The Pascal-S dataset is composed of 850 natural images with multiple objects and complex backgrounds.

5.1 Evaluation of Saliency Maps

In this paper, we generate three saliency maps for visual saliency detection: the internal saliency map (IS), the external saliency map (ES), and the final combined saliency map (EIS). To demonstrate the effectiveness of these maps, we select some sample results as shown in Figure 1. The IS can separate salient regions from backgrounds in most cases, but it fails to highlight the whole object. In contrast, the ES can detect the whole object accurately, but it sometimes brightens the background. To overcome their shortcomings, the EIS is constructed by a weighted combination of the IS and ES, and achieves good performance on different datasets. We also provide the PR curves of the above three maps on the MSRA-5000 and Pascal-S datasets in Figure 4. The fused result, EIS, is clearly better than the IS and ES, which demonstrates that combining these two maps works well. We should mention that the ES does not always outperform the IS, since it relies heavily on the image retrieval results. Overall, the IS and ES can both highlight salient objects with great accuracy, and the performance after taking a weighted sum is superior.

5.2 Quantitative Comparisons

We compare the proposed saliency detection model, EIS, with 21 state-of-the-art methods including AMC [16], CA [11], CB [17], FT [1], GB [12], GC [8], GSSP [33], HC [6], HS [35], IT [14], LC [36], LR [30], PD [27], RA [29], RCJ [7], SF [28], SR [13], SVO [5], UFO [19], XL [34] and wCO [37]. We either use source code provided by the authors or implement them based on available code or software.

We conduct several quantitative comparisons of our EIS with some typical saliency detection approaches in this part. Figure 5 shows the PR curves of different methods on the ASD and THUS datasets, and Figure 6 illustrates the comparisons on the MSRA-5000 and Pascal-S datasets. Figure 7 reports the corresponding average precisions, recalls, F-Measures and AUCs on the four datasets. The precision and recall are computed by segmenting a saliency map with a set of thresholds varying from 0 to 255, and comparing each binary map with the ground truth. Our method performs well on precision-recall curves. The highest precision rates on these four datasets are 98.2%, 95.8%, 93.1%, and 79.9%, respectively.
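The threshold sweep behind a precision-recall curve can be sketched as follows. The saliency map is assumed to hold 8-bit values and the ground truth to be binary; treating precision as 1.0 when no pixel is predicted salient is a convention chosen here, not something stated in the text.

```python
def precision_recall(saliency, truth, threshold):
    """Binarize an 8-bit saliency map at `threshold` and compare with the ground truth."""
    tp = fp = fn = 0
    for sal_row, gt_row in zip(saliency, truth):
        for value, positive in zip(sal_row, gt_row):
            predicted = value >= threshold
            if predicted and positive:
                tp += 1
            elif predicted:
                fp += 1
            elif positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(saliency, truth):
    """Precision-recall pairs for every integer threshold from 0 to 255."""
    return [precision_recall(saliency, truth, t) for t in range(256)]


saliency = [[250, 200], [90, 10]]
truth = [[1, 1], [0, 0]]
curve = pr_curve(saliency, truth)
```

Low thresholds mark everything salient (high recall, low precision), while high thresholds keep only the brightest pixels (high precision, low recall); the curve traces the tradeoff between the two.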

In addition, we evaluate the quality of saliency maps using the F-Measure and AUC. Each saliency map is segmented into a binary map using an adaptive threshold set to twice the mean saliency value of the input map. We compute the average precision and recall based on these binary maps and compute the F-Measure as follows:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^2$ is set to 0.3 to emphasize the precision. Our method is comparable with most of the saliency detection approaches in terms of the F-Measure. We also show the comparison results of AUC, which reflects global properties by computing the area under the PR curve. Various evaluation methods on different datasets demonstrate that the proposed EIS performs favorably against the state of the art.
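The adaptive threshold and the weighted F-measure are short enough to sketch directly. The sketch uses the standard weighted F-measure with the squared weight set to 0.3; the function names and the toy map are illustrative assumptions.

```python
def adaptive_threshold(saliency):
    """Twice the mean saliency value of the map."""
    values = [v for row in saliency for v in row]
    return 2.0 * sum(values) / len(values)


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted F-measure; beta_sq < 1 gives precision more weight than recall."""
    denominator = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denominator if denominator else 0.0


threshold = adaptive_threshold([[240, 20], [20, 0]])
score = f_measure(precision=1.0, recall=0.5)
```

With `beta_sq=0.3`, a map with perfect precision and 0.5 recall scores higher than one with 0.5 precision and perfect recall, which is the intended bias toward precision.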

5.3 Qualitative Comparisons

Figure 8 shows some example results of eight previous approaches and our EIS algorithm for qualitative comparisons. The IT and PD methods can find salient regions in most cases, but they tend to highlight object boundaries and lose the object information. The SVO and RA methods generate blurry saliency maps and highlight the background. FT is easily affected by high-frequency noise and fails to detect salient objects in all of these examples. LR cannot highlight all the salient pixels and, in all these cases, mislabels small background patches as salient regions. RCJ and GSSP are capable of finding salient regions, but they are less convincing in dealing with challenging scenes. In contrast, our method can locate salient regions with great accuracy and highlight the whole object uniformly with unambiguous boundaries. Furthermore, it can detect more than one object regardless of size and location.

6 Conclusion

In this paper, we propose a novel saliency detection algorithm based on an image retrieval framework. The framework first searches for similar examples from a subset of the CLS-LOC database to train a customized SVM for each test image, then predicts saliency values of object proposals to generate an external saliency map. Since some images with uncommon objects may not have similar examples, we also propose an internal optimization module, which explores the contrast information within the test image by jointly optimizing the superpixel prior, discriminability, and similarity, to assist the image retrieval. The final saliency map is generated by taking a linear combination of the above two maps. We compare the proposed method with 21 state-of-the-art saliency detection approaches and show the results of precision-recall curves, average precisions, recalls, F-Measures and AUCs on four databases: ASD, THUS, MSRA-5000, and Pascal-S. Various results demonstrate the effectiveness and efficiency of our algorithm. In the future, we plan to design more robust image retrieval approaches to further improve the performance of our method.


  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In Computer Vision and Pattern Recognition, pages 1597–1604. IEEE, 2009.
  • [2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
  • [3] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on pattern analysis and machine intelligence, 34(11):2189–2202, 2012.
  • [4] F. R. Bach and Z. Harchaoui. Diffrac: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems, pages 49–56, 2008.
  • [5] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai. Fusing generic objectness and visual saliency for salient object detection. In International Conference on Computer Vision, pages 914–921. IEEE, 2011.
  • [6] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Salient object detection and segmentation. 2011.
  • [7] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Salient object detection and segmentation. IEEE Transactions on pattern analysis and machine intelligence, 2(3), 2013.
  • [8] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In International Conference on Computer Vision, pages 1529–1536. IEEE, 2013.
  • [9] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Global contrast based salient region detection. In Computer Vision and Pattern Recognition, pages 409–416. IEEE, 2011.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [11] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on pattern analysis and machine intelligence, 34(10):1915–1926, 2012.
  • [12] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in neural information processing systems, pages 545–552, 2006.
  • [13] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • [14] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
  • [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
  • [16] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. Saliency detection via absorbing markov chain. In International Conference on Computer Vision, pages 1665–1672. IEEE, 2013.
  • [17] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li. Automatic salient object segmentation based on context and shape prior. In British Machine Vision Conference, volume 6, page 7, 2011.
  • [18] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In Computer Vision and Pattern Recognition, pages 2083–2090. IEEE, 2013.
  • [19] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness. In International Conference on Computer Vision, pages 1976–1983. IEEE, 2013.
  • [20] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Computer Vision and Pattern Recognition, pages 1943–1950. IEEE, 2010.
  • [21] P. Krähenbühl and V. Koltun. Geodesic object proposals. In European Conference on Computer Vision, pages 725–739. Springer, 2014.
  • [22] S. Li, H. Lu, Z. Lin, X. Shen, and B. Price. Adaptive metric learning for saliency detection. IEEE Transactions on Image Processing, 24(11):3321–3331, 2015.
  • [23] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In Computer Vision and Pattern Recognition, pages 280–287. IEEE, 2014.
  • [24] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. IEEE Transactions on pattern analysis and machine intelligence, 33(2):353–367, 2011.
  • [25] S. Lu, V. Mahadevan, and N. Vasconcelos. Learning optimal seeds for diffusion-based salient object detection. In Computer Vision and Pattern Recognition, pages 2790–2797. IEEE, 2014.
  • [26] L. Marchesotti, C. Cifarelli, and G. Csurka. A framework for visual saliency detection with applications to image thumbnailing. In Computer Vision, pages 2232–2239. IEEE, 2009.
  • [27] R. Margolin, A. Tal, and L. Zelnik-Manor. What makes a patch distinct? In Computer Vision and Pattern Recognition, pages 1139–1146. IEEE, 2013.
  • [28] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. In Computer Vision and Pattern Recognition, pages 733–740. IEEE, 2012.
  • [29] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä. Segmenting salient objects from images and videos. In European Conference on Computer Vision, pages 366–379. Springer, 2010.
  • [30] X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In Computer Vision and Pattern Recognition, pages 853–860. IEEE, 2012.
  • [31] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
  • [32] K. Tang, A. Joulin, L.-J. Li, and L. Fei-Fei. Co-localization in real-world images. In Computer Vision and Pattern Recognition, pages 1464–1471. IEEE, 2014.
  • [33] Y. Wei, F. Wen, W. Zhu, and J. Sun. Geodesic saliency using background priors. In European Conference on Computer Vision, pages 29–42. Springer, 2012.
  • [34] Y. Xie, H. Lu, and M.-H. Yang. Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing, 22(5):1689–1698, 2013.
  • [35] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In Computer Vision and Pattern Recognition, pages 1155–1162. IEEE, 2013.
  • [36] Y. Zhai and M. Shah. Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 815–824. ACM, 2006.
  • [37] W. Zhu, S. Liang, Y. Wei, and J. Sun. Saliency optimization from robust background detection. In Computer Vision and Pattern Recognition, pages 2814–2821. IEEE, 2014.