Salient object segmentation, which aims to extract the most conspicuous object regions in the visual field, has been an attractive computer vision research topic for decades. A vast family of saliency algorithms has been proposed to tackle the saliency segmentation problem, which is to decide whether a pixel belongs to a noticeable object or to the inconsequential background. Because they mix both object and background, pixels close to the boundary between object and background are error-prone.
Early methods [8, 3, 35] determine saliency using hand-crafted appearance features. These methods often focus on low-level visual features and struggle to achieve satisfactory results on images with complex scenes. Compared with methods based on hand-crafted features and prior knowledge, fully convolutional network (FCN) based frameworks [21, 26, 17] have made remarkable progress by exploiting high-level semantic information. To learn the binarized saliency segmentation, these networks usually adopt cross entropy loss as the objective function. However, cross entropy loss only considers the sample distribution while neglecting the appearance cues of target objects, such as boundaries and inner textures. The boundary in particular is regarded as very important for saliency segmentation.
To leverage boundary information, multi-task architectures [22, 15] were designed to aggregate the features learned from boundary and saliency labels. As the supervision differs between the boundary and saliency branches, simply aggregating these features may cause incompatible interference, so these models are hard to converge. Since the aforementioned saliency frameworks cannot exploit boundary information well when classifying contour pixels, they may produce sub-optimal results with vague boundaries.
Because the final decision on salient object regions relies on spatial context, exploiting contextual information can lessen the distraction caused by insignificant background. Recently, attention mechanisms were exploited by [19, 32, 39] to obtain attended features that capture global contexts. However, as these attended features are generated by a softmax in global form, they emphasize only a few significant pixels and discard other information in the image. Therefore, for high-resolution images, capturing global contexts with softmax-wise attention modules is not a good choice, as it easily leads to overfitting in training.
To address the aforementioned issues of boundary-aware learning and attention mechanisms for saliency segmentation, we propose a novel segmentation loss and an effective global attention module, i.e., Contour Loss and the hierarchical global attention module (HGAM). The aim of Contour Loss is to guide the network to perceive object boundaries and thereby learn the boundary-wise distinctions between salient objects and background. Motivated by the focal loss [18], we apply spatial weight maps in the cross entropy loss, which assign relatively high values to emphasize the pixels near object borders in training. As a result, the trained model is sensitive to boundary-wise distinctions in images. Since Contour Loss focuses on local boundaries, HGAM is proposed to hierarchically attend to global contextual information and alleviate background distractions. Unlike the abovementioned attention modules, which work with softmax, HGAM is based on global contrast and thus can capture global contexts at all resolutions. Our baseline model is based on the FPN [17] architecture with a VGG-16 [27] backbone, refined by employing residual blocks instead of simple convolution layers in the decoder module. With the help of these techniques, our network yields state-of-the-art performance on six benchmarks.
Our main contributions are summarized as follows:
1) We propose Contour Loss to guide networks to perceive salient object boundaries. Consequently, boundary-aware features can be obtained to facilitate final predictions on object boundaries.
2) We present the hierarchical global attention module (HGAM) to attend to global contexts for reducing background distractions.
3) We construct a network based on the FPN architecture and incorporate the proposed methods for joint training.
4) Comprehensive experiments and in-depth analysis explain why the proposed methods outperform prior work. In addition, our model is fast, running at 26 fps on an NVIDIA TITAN X GPU.
2 Related Works
2.1 Salient Object Segmentation
In recent years, various frameworks, including conventional methods and fully convolutional network (FCN) based models, have been presented to address the problems of salient object segmentation. We briefly review these two categories of methods below.
2.1.1 Conventional methods
Conventional saliency detection methods utilize prior knowledge as well as hand-crafted appearance features to capture salient regions. Considering the obvious distinctions between salient regions and background, local contrast is used by [10] to determine whether a pixel is conspicuous. Inspired by the effectiveness of local contrast, Cheng et al. [3] proposed to capture salient regions by global contrast. To exploit different appearance cues for refining saliency quality, multi-level segmentation models are designed by [34, 9] to hierarchically aggregate these cues. As conventional methods only leverage low-level visual features, the lack of semantic information can lead to failures in complex situations.
2.1.2 FCN-based methods
With the development of deep learning, remarkable progress has been made by FCN-based models. Different from conventional methods, high-level semantic features can be exploited by FCNs to achieve better results.
Ronneberger et al. [26] propose the U-Net architecture, which consists of a contracting path, a symmetric expanding path and lateral connections that integrate features with the same resolution. To exploit the potential of deep feature pyramids, Lin et al. [17] present the FPN architecture, which is based on U-Net and employs hierarchical predictions. These architectures are widely followed by later works.
Inspired by RNNs, some recurrent structures have been proposed to tackle saliency segmentation problems. Kuen et al. [12] first design a recurrent network with convolution and deconvolution layers to enhance saliency maps from coarse to fine. Liu and Han [20] present a U-Net based architecture, which refines saliency maps by recursively integrating hierarchical predictions. Wang et al. [30] utilize saliency results as feedback signals to improve saliency performance. Zhang et al. propose a complex recurrent structure to recursively extract and aggregate features. Although these advanced recurrent structures can better leverage the potential of hierarchical features, the lack of heuristic knowledge limits their capability.
The aim of the attention mechanism is to adaptively select significant features, in other words, to alleviate background distractions. Since attention and saliency have similar contextual meanings in images, many researchers have recently adopted attention mechanisms for saliency detection. To alleviate background distractions, Wang et al. [32] obtain attention maps from encoded features to attend to global contexts, while Zhang et al. [39] utilize both spatial and channel-wise attention. Because softmax only emphasizes a few pixels in an image, softmax-wise attention modules can hardly capture global contexts at high resolution. To tackle this problem, Liu et al. [19] propose global and local attention modules to capture global contexts at low resolution and local contexts at high resolution, and Chen et al. [2] employ hierarchical predictions as attention maps, which can attend to global contexts at all resolutions. However, as not all features in background regions are useless for saliency determination, especially in deep layers, predicted maps that are trained to be close to the annotation masks may lose some crucial information. In contrast to the aforementioned attention modules, the proposed HGAM not only captures global contexts at all resolutions but also considers crucial information from background regions.
2.2 Boundary-aware Learning
One of the major challenges in saliency segmentation is to determine the boundaries of conspicuous objects. Some researchers have paid attention to this point.
Since superpixel methods like SLIC [1] obtain regions by aggregating adjacent pixels with similar attributes, they are often adopted to refine saliency results. Yang et al. [35] propose a background-prior method, which utilizes superpixel methods to obtain regions and detects salient regions by ranking their similarity to foreground or background units. To revise vague boundaries, [7, 14, 13] employ superpixel algorithms to generate object contours from the over-segmented regions. Because superpixel methods rely on the distinctiveness of the pixels being grouped, they cannot segment pixels in low-contrast regions well. Besides, these superpixel-based methods often have a huge computational cost.
Instead of using superpixel methods for boundary determination, recent research prefers to leverage contour information directly within a single framework. As a conventional method, [23] builds a two-stream framework for the mixture of texture and contour. Luo et al. [22] and Xin et al. [15] present multi-task network architectures based on U-Net, which predict both the saliency maps and the contour maps of the corresponding salient objects. Due to the great distinctions between saliency and boundary maps, simply aggregating these features leads to inconsistent interference. Therefore, these models are difficult to converge and may generate sub-optimal results.
Different from the abovementioned boundary-aware methods, the proposed Contour Loss helps the model perceive object boundaries by focusing on the boundary pixels, which is more robust and converges more easily.
3 Proposed Methods
Our method integrates a baseline network with Contour Loss and a hierarchical global attention module (HGAM), which aim at acquiring boundary-aware features and at hierarchically integrating global contexts at all resolutions, respectively, to enhance saliency results. We describe our methods and the baseline network in the following subsections. The overall network structure is shown in Fig 2.
3.1 Baseline Network
For the encoder module, we adopt the VGG-16 [27] backbone pretrained on ImageNet [4] for image classification. To adapt it to the saliency segmentation task, we use the backbone to extract feature maps of the input image at 5 levels, each with its own resolution. Since these encoded features are extracted at multiple levels, they contain both low-level visual cues and high-level semantic information at different resolutions. To integrate this multi-level information, we pass the encoded features to the decoder module.
Because residual blocks are better than plain convolution layers at aggregating multi-scale features, our decoder module is built from 5 residual blocks, one per encoded feature. From the encoded features, the decoder generates a residual feature at each level. The operations involved are convolution followed by ReLU layers (with learnable parameters), channel-wise concatenation, and upsampling by a factor of 2. To obtain hierarchical predictions as in FPN, we resample each residual feature to a common resolution to obtain the upsampled features, and then generate one prediction from each upsampled feature using a convolution followed by a Sigmoid layer (with its own parameters).
As the hierarchical predictions are based on features from low to high resolution, they receive supervision ranging from coarse to fine. To better leverage this varied feedback when updating parameters, the training loss (Eq 3) is the weighted sum of the cross entropy losses between each prediction and the annotation mask, with one weight hyperparameter per prediction. In testing, we adopt the final hierarchical prediction as the saliency result.
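The weighted combination of per-level losses described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy predictions, and the choice to average the cross entropy over pixels are assumptions, while the five weights are the values reported in the experimental settings.

```python
import numpy as np

def binary_cross_entropy(pred, mask, eps=1e-7):
    """Mean pixel-wise binary cross entropy between a prediction in (0, 1) and a 0/1 mask."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(mask * np.log(pred) + (1.0 - mask) * np.log(1.0 - pred))))

def hierarchical_loss(preds, mask, weights):
    """Weighted sum of the per-prediction losses, one weight per hierarchical prediction."""
    if len(preds) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    return sum(w * binary_cross_entropy(p, mask) for p, w in zip(preds, weights))

# Toy annotation mask (assumption: 1 marks salient pixels).
mask = np.array([[0.0, 1.0], [1.0, 0.0]])
# Five toy predictions, ordered from coarse (uncertain) to fine (confident).
confidences = (0.6, 0.7, 0.8, 0.9, 0.95)
preds = [np.where(mask == 1.0, c, 1.0 - c) for c in confidences]
weights = [0.3, 0.4, 0.6, 0.8, 1.0]  # weights reported in the experimental settings
total = hierarchical_loss(preds, mask, weights)
```

Because the finest prediction carries the largest weight, its errors dominate the gradient, while the coarser predictions still receive a supervision signal.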
3.2 Contour Loss
Salient object segmentation aims at capturing the most conspicuous objects in input images. Suppose an image contains only two parts: the background and the salient objects. Most pixels lie inside the objects or the background, which means that they are far from the object borders. Intuitively, their contexts are relatively pure, because only object pixels or only background pixels appear in their receptive fields, apart from a few noisy pixels. Consequently, saliency networks can classify these pixels well without auxiliary techniques. However, pixels located at the boundary between background and salient objects are so ambiguous that even experienced annotators find it difficult to determine their labels. From the perspective of features, the vectors extracted from these mixed image pixels fall near the decision hyperplane and act as hard examples. As general saliency networks only apply pixel-wise binary classification, neglecting boundary cues and training all pixels equally with cross entropy loss, they usually predict the broad outline of target objects but are inferior at precise boundaries.
Based on the above observations, we argue that border pixels, as the hard examples in saliency maps, deserve much higher attention in the training phase. Inspired by focal loss [18], assigning higher weights to focus on these hard examples is theoretically and practically convincing. To this end, we apply spatial weight maps in the cross entropy loss, which assign relatively high values to emphasize pixels near the salient object borders. The spatial weight map is built from the annotation mask in four steps. First, dilation and erosion operations are applied to the annotation mask, and the object boundaries are obtained as the difference between the dilated and eroded images. Second, this boundary map is multiplied by a hyperparameter that assigns the high value, which is set to 5 empirically. Third, to give a moderate weight to pixels that are close to but not on the boundaries, a Gaussian function with a limited range is applied. Finally, a matrix of ones with the same resolution is added, so that pixels far from object boundaries have weight 1. Compared with boundary operators such as the Laplace operator, this approach generates thicker object contours to account for the considerable error rates near boundaries.
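A spatial weight map of this kind can be sketched with NumPy alone, as below. This is an illustrative sketch under stated assumptions, not the authors' code: the 5×5 square window for dilation and erosion, the 5-tap Gaussian with `sigma=2.0`, and all function names are hypothetical choices, since the source values for the kernel sizes are not available here. Only the boundary gain of 5 and the added matrix of ones come from the text.

```python
import numpy as np

def _window_reduce(img, size, reduce_fn):
    """Apply reduce_fn (np.max or np.min) over a size x size window around each pixel."""
    r = size // 2
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size)]
    return reduce_fn(np.stack(windows), axis=0)

def _gaussian_blur(img, size, sigma):
    """Separable Gaussian smoothing with a normalized 1-D kernel (size must be odd)."""
    ax = np.arange(size) - size // 2
    kernel = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, size // 2, mode="edge")
    rows = np.stack([np.convolve(row, kernel, mode="valid") for row in padded])    # blur along x
    return np.stack([np.convolve(col, kernel, mode="valid") for col in rows.T]).T  # blur along y

def contour_weight_map(mask, gain=5.0, morph_size=5, blur_size=5, sigma=2.0):
    """Weight map: 1 far from the object border, up to 1 + gain on the border band."""
    mask = mask.astype(np.float64)
    dilated = _window_reduce(mask, morph_size, np.max)   # grayscale dilation
    eroded = _window_reduce(mask, morph_size, np.min)    # grayscale erosion
    boundary = dilated - eroded                          # 1 on a thick band around the border
    return _gaussian_blur(gain * boundary, blur_size, sigma) + 1.0

# Toy mask: a 16 x 16 salient square inside a 32 x 32 image.
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0
weights = contour_weight_map(mask)
```

In this toy example, pixels deep inside the square or far out in the background keep weight 1, while pixels on the border band receive a weight several times larger, with a smooth falloff in between.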
The proposed Contour Loss is then the cross entropy loss in which the term of each pixel is multiplied by the value of the spatial weight map at that pixel, where the cross entropy compares the annotation map with the predicted saliency map. In implementation, since our network outputs multiple intermediate saliency maps, Contour Loss is applied to all of them to supervise the network during training. In other words, when Contour Loss is adopted, it takes the place of the cross entropy loss in Eq 3.
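The effect of weighting the cross entropy spatially can be illustrated with a short NumPy sketch. The function name, the toy weight map, and the choice to sum (rather than average) over pixels are assumptions made for illustration.

```python
import numpy as np

def contour_loss(pred, mask, weight_map, eps=1e-7):
    """Cross entropy in which each pixel's term is scaled by the spatial weight map."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    ce = -(mask * np.log(pred) + (1.0 - mask) * np.log(1.0 - pred))
    return float(np.sum(weight_map * ce))

mask = np.array([[1.0, 1.0], [0.0, 0.0]])
# Hypothetical weight map: two pixels are treated as "near the border" (weight 5).
weight_map = np.array([[1.0, 5.0], [5.0, 1.0]])

pred_good = np.where(mask == 1.0, 0.9, 0.1)
pred_interior_error = pred_good.copy()
pred_interior_error[0, 0] = 0.5   # uncertain on a weight-1 pixel
pred_border_error = pred_good.copy()
pred_border_error[0, 1] = 0.5     # same uncertainty on a weight-5 pixel

loss_interior = contour_loss(pred_interior_error, mask, weight_map)
loss_border = contour_loss(pred_border_error, mask, weight_map)
```

The same prediction error costs more on the heavily weighted pixel, which is how the loss steers training toward the boundary.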
3.3 Hierarchical Global Attention Module
The aim of salient object segmentation is to detect evident object regions, in other words, to remove insignificant regions. Although an original picture may contain multiple objects, not all of them are conspicuous enough to appear in the saliency map. Negligible information can thus lead to a sub-optimal result by distracting the model from the salient regions. To alleviate background distractions, an attention mechanism is a useful auxiliary module for salient object segmentation. Since an attention module can leverage contextual information to generate a weight map, this map can guide the model to suppress insignificant features. However, existing attention modules often adopt the softmax function, which enormously emphasizes a few important pixels and assigns very small values to the others. Therefore, these attention modules cannot attend to global contexts at high resolution, which easily leads to overfitting in training.
Based on these observations, instead of softmax-wise attention modules, we use a novel function based on global contrast to attend to global contextual information. If a region is conspicuous in the feature maps, each pixel in the region is also significant and has a relatively large value, for example, above the mean. Conversely, inconsequential features usually have relatively small values, often below the mean. Thus, an intuitive way to suppress insignificant features is to subtract the average value from the feature maps pixel-wise. After the subtraction, the sign yields a pixel-wise classification of the feature maps: positive pixels represent significant features, while negative ones denote inconsequential features. Accordingly, the attention map (Eq 6) is generated from the input feature map using its average and variance, a regularization term set to 0.1 empirically, and a small constant that avoids division by zero (left at its default setting). Compared with softmax results, the pixel-wise disparity of our attention maps is more reasonable; in other words, our attention method can retain conspicuous regions of the feature maps at high resolution.
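The contrast between a global softmax and a mean-subtracted map can be illustrated with the NumPy sketch below. It is not the attention formula of the paper: it only implements the mean subtraction and variance normalization described in the text, omits the regularization term because its exact role cannot be recovered here, and uses hypothetical function names and a toy feature map.

```python
import numpy as np

def softmax_attention(feat):
    """Global softmax over all spatial positions of a single-channel map."""
    z = np.exp(feat - feat.max())  # subtract the max for numerical stability
    return z / z.sum()

def contrast_attention(feat, eps=1e-5):
    """Mean-subtracted, variance-normalized map: positive means above-average response."""
    return (feat - feat.mean()) / np.sqrt(feat.var() + eps)

# Toy 64 x 64 feature map: a large salient region plus one dominant pixel.
feat = np.zeros((64, 64))
feat[16:48, 16:48] = 4.0
feat[32, 32] = 8.0

soft = softmax_attention(feat)
contrast = contrast_attention(feat)

# How much more weight softmax gives the dominant pixel than an ordinary salient pixel.
dominance = soft[32, 32] / soft[20, 20]
```

Under softmax, the single dominant pixel receives exp(4) times the weight of every other salient pixel and all values are tiny because they must sum to 1, whereas the mean-subtracted map keeps the whole salient region positive and the background negative.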
Since attention maps have no labels of their own, they are usually generated or supervised by predicted maps or ground truth masks, which only retain the salient regions. However, as models may also need information from background regions to determine salient objects, such attention maps may miss some crucial information. In contrast to attention modules like [2], we exploit the feature maps that are close to the predictions to generate unsupervised attention maps. Therefore, our attention maps can not only leverage the strong feedback of the supervision to update themselves, but also contain background information that may be crucial for saliency determination.
To this end, we propose our hierarchical global attention module (HGAM) to capture multi-scale global contexts. As shown in Fig 3, each HGAM receives three inputs: the encoded feature, the upsampled feature close to the prediction, and the message from the previous HGAM. To extract global contextual information from the input features, we apply max-pooling and average-pooling layers, as suggested by [33], to obtain two contextual features. We also obtain a channel-wise compressed feature, while a further feature is generated from the previous HGAM message (Eq 7).
After obtaining these features, we concatenate them channel-wise to generate the HGAM message of the current level. The attention map of this level is likewise generated by Eq 6. Of these two outputs, the message is passed to the next HGAM, while the attention map is used to guide the feature map by element-wise multiplication, which yields the guided feature map (Eq 8).
Unlike the baseline network, since the attention map guides the features by recursively aggregating the multi-scale HGAM messages, we exploit the guided feature to generate the final prediction, which is also included in the training loss. Therefore, in training, the loss in Eq 3 is rewritten to include this additional prediction (Eq 9).
In testing, we adopt this final prediction to evaluate our model.
4.1 Experimental Settings
To evaluate the performance of our model, six public saliency segmentation datasets are used. DUTS [29] is a large-scale saliency benchmark dataset which contains 10,553 images as the training set (DUTS-TR) and 5,019 images as the testing set (DUTS-TE). In the experiments, we adopt DUTS-TR to train our model and DUTS-TE for evaluation. For comprehensive evaluation, we also use SOD [24], PASCAL-S [16], ECSSD [34], HKU-IS [13] and DUT-O [35] for testing, which contain 300, 850, 1,000, 4,447 and 5,168 images respectively. Note that no fine-tuning is carried out for testing on these datasets.
Our experiments are based on the PyTorch [25] framework and run on a PC with a single NVIDIA TITAN X GPU (with 12 GB memory).
For training, we adopt DUTS-TR as the training set and use data augmentation: each image is resampled to a fixed resolution, randomly flipped, and randomly cropped. We employ stochastic gradient descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 1e-4. We set the base learning rate to 1e-3 and fine-tune the VGG-16 backbone with a learning rate 0.05 times the base rate. Since the hierarchical predictions range from coarse to fine, we assign them increasing weights: the five weights are set to 0.3, 0.4, 0.6, 0.8 and 1 respectively in both Eq 3 and Eq 9. The minibatch size is set to 10. Training runs for at most 150 epochs, with the learning rate decayed by a factor of 0.05 every 10 epochs. As one epoch, including training and evaluation, takes less than 500 s, the total training time is below 21 hours.
For testing, following the training settings, we resize the input images to the same resolution and use only the final output. Since the testing time for each image is 0.038 s, our model achieves a speed of 26 fps at this resolution.
To evaluate different algorithms, we adopt three metrics for the quality of saliency maps: the precision-recall (PR) curve, the F-measure [28] and the mean absolute error (MAE).
To evaluate the robustness of saliency results in different thresholds, we utilize the PR curve to demonstrate the relation of precision and recall by thresholding saliency maps from 0 to 255.
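The F-measure is a weighted combination of the precision and recall values of a saliency map, which can be calculated by
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}},$$
where $\beta^2$ is set to 0.3 as suggested in [28]. To alleviate the unfairness caused by the different thresholds used in different papers, we report the maximum F-measure as suggested by [22, 36], which selects the best score over all thresholds from 0 to 255.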
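The maximum F-measure over thresholds can be sketched for a single image as follows. The function names and the `>=` thresholding convention are assumptions, and benchmark toolkits commonly aggregate precision and recall over a whole dataset before computing the score, which this per-image sketch does not do.

```python
import numpy as np

def precision_recall(saliency, gt, threshold):
    """Binarize an 8-bit saliency map at `threshold` and compare it with a boolean mask."""
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 0.0
    recall = tp / gt.sum() if gt.sum() > 0 else 0.0
    return float(precision), float(recall)

def f_measure(precision, recall, beta2=0.3):
    """Weighted combination of precision and recall, with beta^2 = 0.3 by default."""
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom > 0 else 0.0

def max_f_measure(saliency, gt, beta2=0.3):
    """Best F-measure over all integer thresholds from 0 to 255."""
    return max(f_measure(*precision_recall(saliency, gt, t), beta2) for t in range(256))

# Toy example: a perfect saliency map for a 2 x 2 object in a 4 x 4 image.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
saliency = np.where(gt, 255, 0).astype(np.uint8)
best = max_f_measure(saliency, gt)
```

With beta^2 = 0.3 the score leans toward precision: a map with precision 0.5 and recall 1.0 scores about 0.565, lower than the balanced harmonic mean of about 0.667.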
For comprehensive comparisons, we also adopt the MAE metric to evaluate the pixel-wise average absolute difference between the saliency map $S$ and its corresponding ground truth mask $G$,
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$
where $W$ and $H$ represent the width and height of a given picture respectively.
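A minimal NumPy sketch of the MAE computation is given below. It assumes that both maps are already normalized to the range [0, 1], and the function name is illustrative.

```python
import numpy as np

def mean_absolute_error(saliency, gt):
    """Average absolute pixel-wise difference between a saliency map and its mask."""
    saliency = np.asarray(saliency, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if saliency.shape != gt.shape:
        raise ValueError("saliency map and ground truth must have the same shape")
    return float(np.abs(saliency - gt).mean())

# Toy 2 x 2 example: absolute errors are 0.1, 0.2, 0.1 and 0.0.
saliency = [[0.9, 0.2], [0.1, 1.0]]
gt = [[1.0, 0.0], [0.0, 1.0]]
mae = mean_absolute_error(saliency, gt)
```

Unlike the F-measure, this metric needs no threshold and also penalizes low-confidence responses in the background.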
4.2 Comparison with State-of-the-arts
To evaluate the performance, we compare our method with 13 state-of-the-art algorithms on the aforementioned six public benchmarks in terms of visual evaluation, PR curve, maximum F-measure and MAE. These methods include 2 conventional algorithms, DRFI [9] and MR [35], as well as 11 deep learning models: RFCN [30], Amulet [37], UCF [38], NLDF [22], SRM [31], PAGRN [39], BRN [32], CKT [15], BMP [36], PCA [19] and RA [2].
The visual comparison between our method and the other state-of-the-art methods is shown in Fig 7. It can be observed that our method detects the target objects well in various situations, including objects that are very large or very small (rows 1 and 2), objects touching image edges (row 1), objects touching other inconsequential items (row 3), multiple objects (row 4) and objects whose appearance is similar to the background (row 5). It is also worth noting that our results have finer boundaries and more precise localization of salient regions, thanks to Contour Loss and HGAM respectively.
F-measure and MAE.
In Table 1, we show the quantitative comparison between our method and the other methods under the maximum F-measure and MAE metrics. To the best of our knowledge, while using only the VGG-16 backbone and no post-processing such as CRF [11], our model surpasses all existing networks and improves the state-of-the-art performance on the benchmarks by 1 to 2 percent.
In Fig 4, we compare our approach with other state-of-the-art methods in terms of PR curve on 4 benchmarks. It can be observed that our model consistently outperforms all the other methods.
Table 1 also provides the average testing time per image of the state-of-the-art methods on an NVIDIA TITAN X GPU. Our approach takes only 0.038 s (corresponding to 26 fps) to generate a saliency map, which is faster than other mainstream methods.
4.3 Ablation Study
To evaluate the effectiveness of the proposed Contour Loss and HGAM, we show quantitative and visual comparisons under different settings. Table 2 shows the quantitative comparison, which demonstrates that using either Contour Loss or HGAM alone enhances the baseline performance by nearly 2 percent. Incorporating both Contour Loss and HGAM brings a further improvement of 1 to 2 percent on the two large datasets, which indicates that Contour Loss and HGAM refine the saliency results from different aspects.
In Fig 5, compared with the baseline result, Contour Loss obtains finer boundaries, while HGAM is better at eliminating background distractions. Since our full result outperforms the other results in both boundaries and background elimination, incorporating Contour Loss and HGAM appears to lead to mutual promotion in training.
4.4 HGAM Visualization
As shown in Fig 6, we visualize the attention maps generated by HGAM to further understand how it works. We can observe that these attention maps show the locations of salient objects from fine to coarse across the levels, which matches the global attention mechanism at different resolutions well. It is worth noting that, unlike other attention maps, which focus only on salient regions, one of our attention maps assigns higher values to the regions corresponding to background. As another attention map suppresses these background regions well, we reckon that the model also needs to perceive background regions in order to eliminate insignificant features. Moreover, one attention map shows clear boundaries of salient objects, which supports the mutual promotion of Contour Loss and HGAM in boundary-aware learning.
We propose Contour Loss and HGAM to help networks better detect salient objects. Contour Loss forces the network to learn boundary-wise distinctions between salient objects and background, while HGAM enables the model to capture global contextual information at all resolutions. Experimental results on six datasets demonstrate that our approach outperforms 13 state-of-the-art methods under different evaluation metrics.
[1] (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI 34 (11), pp. 2274–2282. Cited by: §2.2.
[2] (2018) Reverse attention for salient object detection. In ECCV. Cited by: §2.1.2, §3.3, Table 1, §4.2.
[3] (2015) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582. Cited by: §1, §2.1.1.
[4] (2009) ImageNet: a large-scale hierarchical image database. In CVPR. Cited by: §3.1.
[5] (2016) Deep residual learning for image recognition. In CVPR. Cited by: Table 1, Figure 7.
[6] (2017) Deeply supervised salient object detection with short connections. In CVPR. Cited by: §3.1.
[7] (2017) Deep level sets for salient object detection. In CVPR. Cited by: §2.2.
[8] (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI (11), pp. 1254–1259. Cited by: §1.
[9] (2013) Salient object detection: a discriminative regional feature integration approach. In CVPR. Cited by: §2.1.1, Table 1, §4.2.
[10] (2011) Center-surround divergence of feature statistics for salient object detection. In ICCV. Cited by: §2.1.1.
[11] (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS. Cited by: §4.2.
[12] (2016) Recurrent attentional networks for saliency detection. In CVPR. Cited by: §2.1.2.
[13] (2015) Visual saliency based on multiscale deep features. In CVPR. Cited by: §2.2, §4.1.
[14] DeepSaliency: multi-task deep neural network model for salient object detection. TIP 25 (8), pp. 3919–3930. Cited by: §2.2.
[15] (2018) Contour knowledge transfer for salient object detection. In ECCV. Cited by: §1, §2.2, Table 1, Figure 7, §4.2.
[16] (2014) The secrets of salient object segmentation. In CVPR. Cited by: §4.1.
[17] (2017) Feature pyramid networks for object detection. In CVPR. Cited by: §1, §1, §2.1.2, §3.1.
[18] (2017) Focal loss for dense object detection. In ICCV. Cited by: §1, §3.2.
[19] (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In CVPR. Cited by: Figure 1, §1, §2.1.2, Table 1, Figure 7, §4.2.
[20] (2016) DHSNet: deep hierarchical saliency network for salient object detection. In CVPR. Cited by: §2.1.2.
[21] (2015) Fully convolutional networks for semantic segmentation. In CVPR. Cited by: §1, §2.1.2.
[22] (2017) Non-local deep features for salient object detection. In CVPR. Cited by: §1, §2.2, Table 1, §4.1, §4.2.
[23] (2019) Direction selective contour detection for salient objects. TCSVT 29 (2), pp. 375–389. Cited by: §2.2.
[24] (2010) Design and perceptual validation of performance measures for salient object segmentation. In CVPR Workshops. Cited by: §4.1.
[25] (2017) Automatic differentiation in PyTorch. In NIPS Workshops. Cited by: §4.1.
[26] (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §1, §2.1.2.
[27] (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §1, Figure 2, §3.1, Table 1, §4.1.
[28] (2009) Frequency-tuned salient region detection. In CVPR. Cited by: §4.1, §4.1.
[29] (2017) Learning to detect salient objects with image-level supervision. In CVPR. Cited by: §4.1, Table 2.
[30] (2016) Saliency detection with recurrent fully convolutional networks. In ECCV. Cited by: §2.1.2, Table 1, §4.2.
[31] (2017) A stagewise refinement model for detecting salient objects in images. In ICCV. Cited by: Table 1, §4.2.
[32] (2018) Detect globally, refine locally: a novel approach to saliency detection. In CVPR. Cited by: §1, §2.1.2, Table 1, Figure 7, §4.2.
[33] (2018) CBAM: convolutional block attention module. In ECCV. Cited by: §3.3.
[34] (2013) Hierarchical saliency detection. In CVPR. Cited by: §2.1.1, §4.1.
[35] (2013) Saliency detection via graph-based manifold ranking. In CVPR. Cited by: §1, §2.2, Table 1, §4.1, §4.2, Table 2.
[36] (2018) A bi-directional message passing model for salient object detection. In CVPR. Cited by: Table 1, Figure 7, §4.1, §4.2.
[37] (2017) Amulet: aggregating multi-level convolutional features for salient object detection. In ICCV. Cited by: §2.1.2, Table 1, §4.2.
[38] (2017) Learning uncertain convolutional features for accurate saliency detection. In ICCV. Cited by: Table 1, Figure 7, §4.2.
[39] (2018) Progressive attention guided recurrent network for salient object detection. In CVPR. Cited by: §1, §2.1.2, Table 1, Figure 7, §4.2.