Salient Object Detection Combining a Self-attention Module and a Feature Pyramid Network

04/30/2020 ∙ by Guangyu Ren, et al. ∙ Imperial College London 0

Salient object detection has achieved great improvements by using the Fully Convolutional Network (FCN). However, the FCN-based U-shape architecture may cause dilution of the high-level semantic information during the up-sampling operations in the top-down pathway. This can weaken the ability to localize salient objects and produce degraded boundaries. To overcome this limitation, we propose a novel pyramid self-attention module (PSAM) and adopt an independent feature-complementing strategy. In PSAM, self-attention layers are applied after multi-scale pyramid features to capture richer high-level features and bring larger receptive fields to the model. In addition, a channel-wise attention module is employed to reduce the redundant features of the FPN and provide refined results. Experimental analysis shows that the proposed PSAM effectively contributes to the whole model, so that it outperforms state-of-the-art results on five challenging datasets. Finally, qualitative results show that PSAM generates clear and integral saliency maps, which can provide further help to other computer vision tasks, such as object detection and semantic segmentation.




1 Introduction

Salient object detection or segmentation aims to identify visually distinctive parts of a natural scene. With this capability of providing high-level information, saliency detection is widely applied in computer vision applications, such as object detection Ren et al. (2013); Zhang et al. (2017a); Liu et al. (2019b) and tracking Hong et al. (2015), visual robotic manipulation Yuan et al. (2018); Schillaci et al. (2013), image segmentation Wei et al. (2017); Wang et al. (2018b) and video summarization Ma et al. (2002); Simakov et al. (2008). In early studies, salient object detection was formulated as a binary segmentation problem; however, the connection between salient object detection and other computer vision tasks remained unclear. Nowadays, convolutional neural networks (CNNs) attract more attention in the research community. Compared with classic hand-crafted feature descriptors Lowe (2004); Dalal and Triggs (2005), CNNs have a stronger feature representation ability. Specifically, CNN kernels with small receptive fields can provide local information, while kernels with large receptive fields can provide global information. This characteristic enables CNN-based approaches to detect salient areas with refined boundaries Borji et al. (2014). Thus, CNN-based approaches have become the major research direction in salient object detection.

Recently, Fully Convolutional Networks (FCNs) have become the fundamental framework in salient object detection Liu et al. (2019a); Wang et al. (2019); Qin et al. (2019), as FCNs can accept inputs of arbitrary size and retain richer spatial information than fully connected layers. Although these works have achieved great improvements in performance, they are still restricted by some limitations. FCN-based approaches utilize multiple convolution and pooling layers to produce high-level semantic features, which are helpful for locating objects, but they may lose information during pooling operations. This can lead to degraded boundaries in the detected objects. Besides, when the high-level features are upsampled to generate a score prediction for each pixel, the semantic information is also diluted, which can decrease the ability to localize objects.

In this paper, we propose a novel pyramid self-attention module (PSAM) to overcome the feature-dilution limitation of previous FCN-based approaches. Figure 1(c) shows the inherent problems of Feature Pyramid Networks (FPNs). By incorporating a self-attention module with the multi-scale feature maps of an FPN, the model focuses on the high-level features. This leads to the extraction of features with richer high-level semantic information and larger receptive fields. In addition, a channel-wise attention module is employed to reduce the redundancy in the FPN, which refines the final results. Experimental results show that PSAM improves the performance of salient object detection and achieves state-of-the-art results on five challenging datasets. The contributions of this work can be summarized as: 1) We propose a novel pyramid self-attention structure which makes the model focus more on the high-level features and reduces feature dilution in the top-down pathway. 2) We adopt channel-wise attention to reduce the redundant information in the lateral connections of the FPN and refine the final results.

Figure 1: Qualitative visual results of ablation studies. (a) original images, (b) ground truth, (c) baseline, (d) baseline+self-attention (SA), (e) our method.

2 Related Works

Salient Object Detection

Due to the outstanding feature representation ability of CNNs, hand-crafted feature based methods have been replaced by CNN models. In the work of Li and Yu (2015), Li and Yu used fully connected layers on top of CNN layers to extract features of a single region at different scales. Then, the multi-scale features were used to predict the score for each region. Zhao et al. (2015) utilized two independent CNNs to extract the global context of the full image and the local context of the detailed area, and trained the model jointly. However, spatial information was lost in these CNN-based methods because of the fully connected layers.

Recently, FCN-based methods have attracted more attention in salient object detection. Qin et al. (2019) proposed a boundary-aware salient object detection network which incorporates a predict module and a residual refinement module (RRM). The predict module estimates the saliency map from the raw image, and the RRM refines the results of the predict module; it is trained using the residual between the saliency map and the ground truth. Liu et al. (2019a) introduced a PoolNet structure which has two pooling-based modules: a global guidance module (GGM) and a feature aggregation module (FAM). The GGM was designed to acquire more high-level information around the inputs, tackling the feature dilution problem in the U-shape network structure. The FAM then merges the multi-scale features of the FPN, reducing the aliasing problem caused by up-sampling and enlarging the receptive fields. Experiments show that PoolNet localizes sharpened salient objects more precisely than other baseline approaches. In the work of Wu et al. (2019b), a Cascaded Partial Decoder (CPD) structure that contains two prime branches was proposed. The first branch improves computation speed by dropping features of the shallow layers. The second branch uses the saliency map from the first branch to refine the features of the deeper layers, which ensures both the speed and the accuracy of the framework.

Attention Mechanism

The attention mechanism was first mainly used in the area of Natural Language Processing (NLP). Vaswani et al. (2017) introduced a framework called the Transformer, which replaces recurrent layers with an attention mechanism that captures global dependencies between input and output. This framework also allows parallel computation, leading to faster speed compared with recurrent networks. Beyond sequence models, this kind of attention mechanism is also useful in CNN models. Different from the attention mechanism in sequence models, self-attention was introduced to apply attention within a single context. Bello et al. (2019) proposed the Attention Augmented Convolutional Network, which produces attentional feature maps via a self-attention module and combines these with CNN feature maps to capture spatial dependencies of the input; it achieves large improvements in object classification and detection tasks. The stand-alone self-attention layer was introduced in the work of Ramachandran et al. (2019). It can be used to set up a fully attentional model by replacing all spatial convolution layers with self-attention layers. The self-attention layer leverages components from previous works and proves that it can be used as a stand-alone layer which easily replaces a spatial convolution layer.

Figure 2: Overall pipeline of the proposed model

3 Self-Attention Based FPN

In this section, we describe the proposed architecture, which integrates two attention modules. More specifically, we use a pyramid self-attention module which aims to enhance the high-level semantic features and transmit the enhanced semantic information to different feature levels. In addition, when feature maps are merged in the top-down pathway, a simple channel-wise attention module Hu et al. (2018); Chen et al. (2017); Zhao and Wu (2019) is added in each lateral connection to focus on the high responses of salient objects. The proposed architecture is based on a classic feature pyramid network (FPN) Lin et al. (2017) which exploits ResNet He et al. (2016) as a backbone. This basic FPN architecture has been widely used in many computer vision tasks, especially detection tasks, leading to accurate detection results because of its robust and reasonable structure. As shown in Figure 2, we retain the basic structure and introduce two effective modules to achieve a state-of-the-art performance. A pyramid self-attention module, which is built between the bottom-up and top-down pathways, supports the model in focusing on the high-level features which contain semantic information. This module then transfers the processed high-level information to each feature level in the top-down path. Meanwhile, feature maps from different stages of ResNet pass through a channel-wise attention module to further emphasize the context information.

Figure 3: The structure of Pyramid Self-attention Module.

3.1 Pyramid Self-Attention Module

In this subsection, we describe the proposed module in detail and demonstrate the differences from previous works. Hou et al. (2017) demonstrated that high-level semantic features are more representative and discriminative, leading to the positions of salient objects being located more accurately. Figure 1(c) shows that without any extra attention modules, the FPN baseline generates rough saliency maps with insufficient and incomplete salient objects. Meanwhile, some non-salient objects which should not be detected also appear in the saliency maps. These erroneous predictions are caused by two main challenges which cannot be avoided in the FPN architecture. The first problem is that the high-level information is diluted progressively when it is integrated into different feature levels in the top-down pathway. The second problem is that the architecture can be impacted by non-essential information, which may reduce the final performance of the model. In other words, the FPN architecture detects not only incomplete salient objects but also unnecessary objects. To overcome these two intrinsic problems of the baseline, we propose a novel pyramid self-attention module (PSAM) which contains stand-alone self-attention layers Ramachandran et al. (2019) at different scales, further focusing on important regions and enlarging the receptive field of the model. Specifically, as shown in Figure 3, PSAM first transforms the feature map produced by the bottom-up pathway into multi-scale feature regions, and then each self-attention layer learns to pay more attention to important semantic information. After being processed by the self-attention layers, these multi-scale representations, which contain effective semantic information, are concatenated together to complement the high-level semantic information in the top-down pathway. More technically, let $x$ denote the feature map produced by the top-most layer.
We downsample the feature map into three different scales denoted as $x^{(1)}, x^{(2)}, x^{(3)}$. Given a pixel $x_{ij}$, a corresponding local memory block $\mathcal{N}_k(i,j)$ is extracted from the same feature map; this is a $k \times k$ region which surrounds $x_{ij}$. There are three crucial learnable parameters in this self-attention algorithm: queries, keys and values. We use $W_Q$, $W_K$ and $W_V$ to represent their learnable weights respectively. The final attention output pixel is computed as follows:

$$y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} \mathrm{softmax}_{ab}\left(q_{ij}^{\top} k_{ab}\right) v_{ab}, \qquad q_{ij} = W_Q x_{ij},\; k_{ab} = W_K x_{ab},\; v_{ab} = W_V x_{ab}$$

where $W_Q$, $W_K$, $W_V$ denote the three crucial parameters, $y_{ij}$ denotes the output pixel of a self-attention layer, $(i,j)$ defines the coordinates of $x_{ij}$, and we use $\hat{y}^{(s)}$ to denote the final feature map at scale $s$, obtained by an upsampling operation applied after each self-attention layer. We then concatenate these maps with the original $x$ to generate the final output of the PSAM.
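The per-pixel computation of the stand-alone self-attention layer above can be sketched as follows. This is a minimal NumPy sketch, not the paper's PyTorch implementation; the function name, the square memory block, and the `(C, C)` projection shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention_pixel(x, i, j, k, Wq, Wk, Wv):
    """One output pixel y_ij of a stand-alone self-attention layer.

    x: feature map of shape (H, W, C); (i, j): query position;
    k: extent of the local memory block N_k(i, j);
    Wq, Wk, Wv: learnable (C, C) projections for queries, keys, values.
    """
    H, W, C = x.shape
    q = Wq @ x[i, j]                    # query for the centre pixel
    half = k // 2
    logits, values = [], []
    for a in range(max(0, i - half), min(H, i + half + 1)):
        for b in range(max(0, j - half), min(W, j + half + 1)):
            logits.append(q @ (Wk @ x[a, b]))   # q_ij^T k_ab
            values.append(Wv @ x[a, b])         # v_ab
    weights = softmax(np.array(logits))          # softmax over the block
    return (weights[:, None] * np.array(values)).sum(axis=0)
```

In practice all pixels are computed in parallel, but the loop makes the role of the local memory block explicit.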


Inspired by the previous work of Liu et al. (2019a), we exploit a similar complementing strategy to avoid the dilution of high-level semantic information. However, compared to the previous work, the proposed pyramid self-attention module achieves a state-of-the-art performance. This module is built at the end of the bottom-up pathway and converts the high-level semantic features into different scales, further enlarging the receptive field of the model. Operating on multi-scale high-level feature maps, the attention layers view the semantic features at different scales and can thus achieve a comprehensive attention task. More performance details are given in the ablation study.
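The overall multi-scale flow of PSAM (downsample the top-most map, attend at each scale, upsample, concatenate) can be sketched as below. This is an illustrative sketch only: nearest-neighbour resizing and a caller-supplied `attention` function (assumed shape-preserving) stand in for the actual resizing and self-attention layers.

```python
import numpy as np

def resize_nearest(x, h, w):
    """Nearest-neighbour resize of a (H, W, C) map -- a stand-in for
    the down/up-sampling around each self-attention layer."""
    H, W, _ = x.shape
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return x[rows][:, cols]

def psam(x, attention, scales=(2, 4, 8)):
    """Sketch of PSAM: run `attention` on downsampled copies of the
    top-most feature map x, upsample each result back to (H, W), and
    concatenate the results with x along the channel axis."""
    H, W, _ = x.shape
    outs = [x]
    for s in scales:
        small = resize_nearest(x, H // s, W // s)
        outs.append(resize_nearest(attention(small), H, W))
    return np.concatenate(outs, axis=-1)
```

With three scales, the output has four times the channels of the input, which the subsequent layers fuse back into the top-down pathway.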

Figure 4: The structure of Channel-wise attention Module.

3.2 Channel-Wise Attention

To enhance the context and structural information, lateral connections are used in the top-down pathway, leading to a state-of-the-art performance on detection tasks. However, this operation also introduces some unmeaningful information, which can reduce the performance and impact the final prediction. From Figure 1(c), the two problems caused by feature redundancy are obvious. The first problem is that extra regions which should not be detected appear in the saliency map. The second is that the edges of salient objects are ambiguous. Both problems indicate that further refinement should be applied. Hu et al. (2018) pointed out that different channels carry different semantic features and that channel-wise attention can capture channel-wise dependencies. In other words, channel-wise attention can emphasize the salient objects and alleviate the inaccuracy caused by redundant features in channels. Therefore, we add a simple channel-wise attention module Hu et al. (2018); Chen et al. (2017); Zhao and Wu (2019) to each lateral connection to achieve a refinement task. The structure of channel-wise attention is shown in Figure 4. It consists of one pooling layer and two fully-connected layers, followed by a ReLU Nair and Hinton (2010) and a sigmoid function respectively. First, an operation that squeezes global spatial information into each channel is applied. This step can be easily implemented by average pooling:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

where $c$ refers to the channel index and $H \times W$ to the spatial dimensions of the $c$-th channel $u_c$ of the feature map $U$. After the pooling operation, the generated channel descriptor $z$ is fed into the fully-connected layers to fully capture channel-wise dependencies:

$$s = \sigma\left(\mathrm{fc}_2\left(\delta\left(\mathrm{fc}_1(z)\right)\right)\right)$$

where $\sigma$ refers to the sigmoid function, $\delta$ refers to the ReLU function and $\mathrm{fc}$ denotes a fully-connected layer. Finally, the generated scalar $s_c$ multiplies the feature map $u_c$ to generate a weighted feature map $\tilde{u}_c$:

$$\tilde{u}_c = s_c \cdot u_c$$
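The squeeze-and-excitation steps of the channel-wise attention module can be sketched in a few lines. This is a minimal NumPy sketch, assuming a `(C, H, W)` layout and hypothetical weight shapes `W1: (C//r, C)`, `W2: (C, C//r)` for a reduction ratio `r`; it is not the paper's PyTorch code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(u, W1, b1, W2, b2):
    """Channel-wise attention over a feature map u of shape (C, H, W)."""
    C, H, W = u.shape
    z = u.reshape(C, -1).mean(axis=1)                    # squeeze: global average pooling
    s = sigmoid(W2 @ np.maximum(0.0, W1 @ z + b1) + b2)  # excitation: FC -> ReLU -> FC -> sigmoid
    return s[:, None, None] * u                          # re-weight each channel by s_c
```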


4 Experiments

4.1 Datasets and Evaluation Metrics

For the evaluation of the proposed methodology, we carry out a series of experiments using five popular saliency detection benchmarks. More specifically, we use ECSSD Yan et al. (2013), DUT-OMRON Yang et al. (2013), DUTS-TE Wang et al. (2017), HKU-IS Li and Yu (2015) and SOD Movahedi and Elder (2010). These five datasets contain a variety of objects and structures which remain challenging for salient object detection algorithms to locate and detect precisely. For the training of our model we use the large-scale dataset DUTS Wang et al. (2017), which contains 10533 training images and 5019 testing images. To evaluate the performance of the model, we adopt three representative evaluation metrics: precision-recall curves, the F-measure score and the mean absolute error (MAE). The F-measure indicates the standard overall performance and is computed from precision and recall:

$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

where $\beta^2$ is set to 0.3 by default; precision and recall are obtained by using different thresholds to compare the prediction with the ground truth. The MAE indicates the deviation between the binary saliency map and the ground truth. In other words, this metric quantifies the similarity between the prediction map and the ground-truth mask:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right|$$

where $W$ denotes the width and $H$ the height of the prediction, $P$ denotes the prediction map, which is the output of the model, and $G$ represents the ground truth.
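Both metrics are straightforward to compute; the following NumPy sketch shows one way to do so (the fixed threshold and the small epsilon guards are illustrative choices, not taken from the paper).

```python
import numpy as np

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure with beta^2 = 0.3, from a thresholded prediction map."""
    p = (pred >= threshold).astype(float)
    tp = (p * gt).sum()                      # true positives
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(pred, gt):
    """Mean absolute error between prediction map and ground-truth mask."""
    return np.abs(pred - gt).mean()
```

In practice the PR curve is traced by sweeping `threshold` over [0, 1] and recording the (recall, precision) pairs.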

Figure 5: Results with PR curves on five benchmark datasets: DUTS-TE, ECSSD, HKU-IS, SOD and DUT-OMRON. x-axis represents the recall rate and y-axis represents the precision.

4.2 Implementation Details

Our model is implemented in PyTorch. We use ResNet-50 He et al. (2016) as a backbone, pre-trained on ImageNet Krizhevsky et al. (2012). The proposed architecture is trained on a GTX TITAN X GPU for 24 epochs. As suggested in Liu et al. (2019a), the initial learning rate is set to 5e-5 for the first 15 epochs and then reduced to 5e-6 for the last 9 epochs. We adopt a weight decay of 0.0005 for the Adam Kingma and Ba (2014) optimizer and a binary cross-entropy loss function. Finally, to increase the robustness of the model, we perform data augmentation via random horizontal flipping.
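The step learning-rate schedule above can be written as a small helper (the function name is ours; in PyTorch the same effect is typically achieved with a scheduler, but the rule itself is just this two-branch function):

```python
def learning_rate(epoch):
    """Schedule for 24 training epochs: 5e-5 for the first 15 epochs,
    then 5e-6 for the remaining 9."""
    return 5e-5 if epoch < 15 else 5e-6
```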

Figure 6: Overall comparison of qualitative visual results between our method and selected baseline methods. It shows that our method is capable to provide more complete salient map and smooth boundaries.

4.3 Comparisons with State-of-the-arts

We evaluate our proposed method on five datasets against 11 previous state-of-the-art methods: LEGS Wang et al. (2015), UCF Zhang et al. (2017c), DSS Hou et al. (2017), Amulet Zhang et al. (2017b), R3Net Deng et al. (2018), DGRL Wang et al. (2018a), PiCANet Liu et al. (2018), BMPM Zhang et al. (2018), MLMSNet Wu et al. (2019a), AFNet Feng et al. (2019) and PAGE-Net Wang et al. (2019). For fair comparison, we use the results released by the authors, generated by their original implementations with default parameters. Moreover, all results are evaluated by the same evaluation method without any other processing tools.

4.3.1 Quantitative Comparisons

Figure 5 and Table 1 show the evaluation results of the proposed framework in comparison with eleven state-of-the-art methods on five challenging salient object detection datasets. More specifically, in Figure 5, the PR curve of the proposed methodology (red line) outperforms those of the state-of-the-art methods, indicating that our method is more robust than previous methods. Furthermore, the quantitative results are listed in Table 1. The proposed method achieves higher F-measure scores and lower error scores than the other methods, demonstrating that our model outperforms almost all previous state-of-the-art models on the different testing datasets.

4.3.2 Qualitative Comparisons

Figure 6 illustrates visual comparisons to further show the advantages of our method. Compared to other approaches, the detection results of our method show the best performance across different challenging scenarios; even in fine details, the results are close to the ground truth.

4.4 Ablation Study

In this subsection, we conduct a series of experiments on five datasets to investigate the effectiveness of the two modules. The ablation experiments are trained on the DUTS Wang et al. (2017) training dataset in the same environment. From Table 1, the model which contains both PSAM and the channel-wise attention module achieves the best performance, demonstrating that the proposed modules effectively improve the baseline's salient object detection performance. More specifically, we initially conduct the baseline experiments on the FPN baseline with ResNet-50 as the backbone. This basic model generates rough saliency maps, as shown in Figure 1(c). We then add the pyramid self-attention module (PSAM) on top of the baseline, and the F-measure scores increase significantly on all benchmark datasets, especially on DUTS-TE Wang et al. (2017) and SOD Movahedi and Elder (2010). On this basis, we add channel-wise attention to the model to compose the proposed framework. The final results show that the channel-wise attention modules further increase the performance and alleviate error predictions. Figure 1(d) and (e) demonstrate the effectiveness of the two modules respectively.

Method                          DUTS-TE        ECSSD          HKU-IS         SOD            DUT-OMRON
                                F-score MAE    F-score MAE    F-score MAE    F-score MAE    F-score MAE
LEGS Wang et al. (2015) 0.654 0.138 0.827 0.118 0.770 0.118 0.733 0.196 0.669 0.133
UCF Zhang et al. (2017c) 0.771 0.117 0.910 0.078 0.888 0.074 0.803 0.164 0.734 0.132
DSS Hou et al. (2017) 0.813 0.064 0.907 0.062 0.900 0.050 0.837 0.126 0.760 0.074
Amulet Zhang et al. (2017b) 0.778 0.085 0.914 0.059 0.897 0.051 0.806 0.141 0.743 0.098
R3Net Deng et al. (2018) 0.824 0.066 0.924 0.056 0.910 0.047 0.840 0.136 0.788 0.071
PiCANet Liu et al. (2018) 0.851 0.054 0.931 0.046 0.922 0.042 0.853 0.102 0.794 0.068
DGRL Wang et al. (2018a) 0.828 0.050 0.922 0.041 0.910 0.036 0.845 0.104 0.774 0.062
BMPM Zhang et al. (2018) 0.851 0.048 0.928 0.045 0.920 0.039 0.855 0.107 0.774 0.064
PAGE-Net Wang et al. (2019) 0.838 0.051 0.931 0.042 0.920 0.036 0.841 0.111 0.791 0.062
MLMSNet Wu et al. (2019a) 0.852 0.048 0.928 0.045 0.920 0.039 0.855 0.107 0.774 0.064
AFNet Feng et al. (2019) 0.863 0.045 0.935 0.042 0.925 0.036 0.856 0.109 0.797 0.057
Ours 0.879 0.040 0.944 0.038 0.931 0.034 0.874 0.104 0.813 0.056
Baseline 0.856 0.045 0.933 0.045 0.921 0.037 0.848 0.116 0.785 0.059
Baseline+SA 0.876 0.041 0.940 0.042 0.928 0.034 0.857 0.121 0.803 0.056
Table 1: Quantitative results with F-score and MAE on five challenging datasets: DUTS-TE, ECSSD, HKU-IS, SOD and DUT-OMRON. Our method is compared with 11 competitive baseline methods. The last three rows of the table show the results of the ablation studies.

5 Conclusion

In this paper, we propose a novel end-to-end salient object detection method. Considering the intrinsic problems of the FPN architecture, a pyramid self-attention module (PSAM) is designed. This module contains self-attention layers at multiple scales, enabling it to capture multi-scale high-level features, make the model focus on the high-level semantic information, and further enlarge the receptive field. Furthermore, we employ channel-wise attention in the lateral connections to reduce feature redundancy and refine the prediction results. Experimental results on five challenging datasets demonstrate that our proposed model surpasses 11 state-of-the-art methods, and the ablation experiments demonstrate the effectiveness of the two modules.


  • Bello et al. (2019) I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks. In IEEE International Conference on Computer Vision, pages 3286–3295, 2019.
  • Borji et al. (2014) A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li. Salient object detection: A survey. Computational Visual Media, pages 1–34, 2014.
  • Chen et al. (2017) L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5659–5667, 2017.
  • Dalal and Triggs (2005) N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.
  • Deng et al. (2018) Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P.-A. Heng. R3Net: Recurrent residual refinement network for saliency detection. In International Joint Conference on Artificial Intelligence, pages 684–690, 2018.
  • Feng et al. (2019) M. Feng, H. Lu, and E. Ding. Attentive feedback network for boundary-aware salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1623–1632, 2019.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Hong et al. (2015) S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning, pages 597–606, 2015.
  • Hou et al. (2017) Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr. Deeply supervised salient object detection with short connections. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3203–3212, 2017.
  • Hu et al. (2018) J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
  • Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • Li and Yu (2015) G. Li and Y. Yu. Visual saliency based on multiscale deep features. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5455–5463, 2015.
  • Lin et al. (2017) T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • Liu et al. (2019a) J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang. A simple pooling-based design for real-time salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3917–3926, 2019a.
  • Liu et al. (2018) N. Liu, J. Han, and M.-H. Yang. Picanet: Learning pixel-wise contextual attention for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3089–3098, 2018.
  • Liu et al. (2019b) T. Liu, J.-J. Huang, T. Dai, G. Ren, and T. Stathaki. Gated multi-layer convolutional feature extraction network for robust pedestrian detection. arXiv preprint arXiv:1910.11761, 2019b.
  • Lowe (2004) D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Ma et al. (2002) Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In International Conference on Multimedia, pages 533–542, 2002.
  • Movahedi and Elder (2010) V. Movahedi and J. H. Elder. Design and perceptual validation of performance measures for salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 49–56, 2010.
  • Nair and Hinton (2010) V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.
  • Qin et al. (2019) X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand. Basnet: Boundary-aware salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.
  • Ramachandran et al. (2019) P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
  • Ren et al. (2013) Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang. Region-based saliency detection and its application in object recognition. IEEE Transactions on Circuits and Systems for Video Technology, 24(5):769–779, 2013.
  • Schillaci et al. (2013) G. Schillaci, S. Bodiroža, and V. V. Hafner. Evaluating the effect of saliency detection and attention manipulation in human-robot interaction. International Journal of Social Robotics, 5(1):139–152, 2013.
  • Simakov et al. (2008) D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing visual data using bidirectional similarity. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Wang et al. (2015) L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3183–3192, 2015.
  • Wang et al. (2017) L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan. Learning to detect salient objects with image-level supervision. In IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
  • Wang et al. (2018a) T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji. Detect globally, refine locally: A novel approach to saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3135, 2018a.
  • Wang et al. (2019) W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji. Salient object detection with pyramid attention and salient edges. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1448–1457, 2019.
  • Wang et al. (2018b) X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1354–1362, 2018b.
  • Wei et al. (2017) Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1568–1576, 2017.
  • Wu et al. (2019a) R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding. A mutual learning method for salient object detection with intertwined multi-supervision. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8150–8159, 2019a.
  • Wu et al. (2019b) Z. Wu, L. Su, and Q. Huang. Cascaded partial decoder for fast and accurate salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2019b.
  • Yan et al. (2013) Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1162, 2013.
  • Yang et al. (2013) C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
  • Yuan et al. (2018) X. Yuan, J. Yue, and Y. Zhang. Rgb-d saliency detection: Dataset and algorithm for robot vision. In International Conference on Robotics and Biomimetics, pages 1028–1033, 2018.
  • Zhang et al. (2017a) D. Zhang, D. Meng, L. Zhao, and J. Han. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290, 2017a.
  • Zhang et al. (2018) L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang. A bi-directional message passing model for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1741–1750, 2018.
  • Zhang et al. (2017b) P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In IEEE International Conference on Computer Vision, pages 202–211, 2017b.
  • Zhang et al. (2017c) P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin. Learning uncertain convolutional features for accurate saliency detection. In IEEE International Conference on Computer Vision, pages 212–221, 2017c.
  • Zhao et al. (2015) R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1265–1274, 2015.
  • Zhao and Wu (2019) T. Zhao and X. Wu. Pyramid feature attention network for saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3085–3094, 2019.