A Deeper Look at Salient Object Detection: Bi-stream Network with a Small Training Dataset

08/07/2020 ∙ by Zhenyu Wu, et al. ∙ NetEase, Inc 4

Compared with conventional hand-crafted approaches, deep learning based methods have achieved tremendous performance improvements by training exquisitely crafted networks on large-scale training sets. However, do we really need a large-scale training set for salient object detection (SOD)? In this paper, we provide a deeper insight into the interrelationship between SOD performance and the training set. To alleviate the conventional demand for large-scale training data, we provide a feasible way to construct a novel small-scale training set, which contains only 4K images. Moreover, we propose a novel bi-stream network to take full advantage of our proposed small training set. It consists of two feature backbones with different structures and achieves complementary semantic saliency fusion via the proposed gate control unit. To the best of our knowledge, this is the first attempt to use a small-scale training set to outperform state-of-the-art models trained on large-scale training sets; even so, our method achieves leading state-of-the-art performance on five benchmark datasets.




I Introduction

Salient object detection (SOD) aims to estimate the most attractive regions of images or videos. As a pre-processing tool, SOD plays an important role in a wide range of computer vision applications, such as visual tracking [OurPR15, ChenPR16], object retargeting [vinyals2015show], RGB-D completion [CC2020TIP], image retrieval [7004809] and visual question answering [lin2017task].

Inspired by cognitive psychology and neuroscience, the classical SOD models [itti1998model, 8554115, 6331539, OurTIP15] were developed by fusing various saliency cues. However, these cues fail to capture the wide variety of visual features of salient objects. Since the advent of deep learning, SOD performance has improved tremendously thanks to both exquisitely crafted network architectures [MDF, DSS, RADF] and the availability of large-scale, well-annotated training data [cheng2015global, wang2017learning].

Following the single-stream network structure, the most recent SOD methods [CC2019TMM1, DSS, RADF] have focused on how to effectively aggregate multi-level visual feature maps to boost performance. Though remarkable progress has been achieved, these methods have reached a performance bottleneck, because their single-stream structures usually rely on a single feature backbone, which limits their semantic sensing ability. In principle, different network architectures respond differently even to the same image. As a result, complementary semantic deep features can be obtained by simultaneously using two distinct feature backbones; see the pictorial demonstration in Fig. 1.


Fig. 1: Deep features in networks with different architectures are generally complementary, in which these feature maps are obtained from the last convolutional layers.

In terms of the training dataset, the SOD community has reached a consensus on the training protocol: models are trained on the MSRA10K [cheng2015global] or DUTS-TR [wang2017learning] dataset and then tested on other datasets. However, is this training strategy the best choice? Our experimental results yield the following findings: 1) The overall model performance is not always positively correlated with the amount of training data, see the quantitative evidence in Fig. 2; 2) The performance of deep models trained on a single training dataset (MSRA10K or DUTS-TR) is usually limited by the unbalanced semantic distribution, as evidenced in Fig. 4; 3) The MSRA10K and DUTS-TR datasets are complementary.

From the perspective of neuroscience, the human visual system comprises two largely independent subsystems that mediate different classes of visual behaviors [visualParallel, schiller1991parallel]. The subcortical projection from the retina to the cerebral cortex is strongly dominated by the two pathways that are relayed by the magnocellular (M) and parvocellular (P) subdivisions of the lateral geniculate nucleus (LGN). These parallel pathways generally exhibit two main characteristics: 1) The M cells contribute to transient processing (e.g., visual motion perception, eye movement, etc.), while the P cells contribute more to recognition (e.g., object recognition, face recognition, etc.); 2) The M and P pathways are separated in the LGN, but they are recombined later in the visual cortex.

Inspired by the above observations, we first build a semantic-category-balanced small-scale training dataset, namely MD4K (4172 images in total), from the off-the-shelf MSRA10K and DUTS-TR datasets. To take full advantage of the proposed small training set, we then propose a novel bi-stream network consisting of two sub-branches with different network structures, which aims to explore complementary semantic information and obtain a more powerful feature representation for the SOD task. Meanwhile, we devise a novel gate control unit to effectively fuse the complementary information encoded in the different sub-branches. Moreover, we introduce multi-layer attention into the bi-stream network to preserve clear object boundaries. To demonstrate the advantages of our method, we conducted extensive quantitative comparisons against 16 state-of-the-art methods over 5 frequently used datasets. The contributions of this paper can be summarized as follows:

  • We provide a deeper insight into the interrelationship between the performance and training dataset;

  • We propose a novel way to automatically construct a small-scale training set, MD4K, from the off-the-shelf training datasets, and the proposed MD4K consistently boosts the performance of state-of-the-art models;

  • We design a bi-stream network with a novel gate control unit and multi-layer attention module. It can better mine the complementary information encoded in different network structures and help the network take full advantage of the proposed small dataset;

  • Experimental results demonstrate that the proposed model achieves state-of-the-art performance on five datasets in terms of six metrics, which supports the effectiveness and superiority of the proposed method.

Fig. 2: The quantitative performance of 2 state-of-the-art models (CPD19 [CPD] and PoolNet19 [PoolNet]) varies with the training data size, showing that the conventional consensus regarding the relationship between model performance and training set size, namely that “the model performance is positively related to the training set size”, may not always hold.

II Related Works

To simulate the human visual attention, early image SOD methods mainly focus on the hand-crafted visual features, cues and priors such as center prior [liu2011learning, judd2009learning], background cues [li2013saliency, wei2012geodesic], regional contrast [cheng2015global] and other kinds of relevant low-level visual cues [8798692, 7776942]. Due to the space limitation, we only concentrate on deep learning based SOD models here.

Fig. 3: Examples of inappropriate human annotations in the current SOD benchmarks. Such cases are quite common and can be divided into the four groups mentioned above.

II-A Single-stream Model

Generally, deep network performance can be boosted significantly by aggregating multi-level and multi-scale deep features across different layers. As one of the most representative works, Hou et al. [DSS] proposed a top-down model that integrates both high-level and low-level features, achieving much improved SOD performance. Following this rationale, various feature aggregation schemes [SRM, Amulet, RADF, DGRL, 7895181, BMP, liu2016dhsnet, wang2019an, zhao2019egnet] were proposed later. Zhang et al. [Amulet] first integrate multi-level feature maps into multiple resolutions, which simultaneously incorporate semantic information and spatial details; the saliency map is then predicted at each resolution, and the predictions are fused to generate the final saliency map. Liu and Han [liu2016dhsnet] first make a coarse global prediction and then hierarchically and progressively refine the details of the saliency map by integrating local context information. Zhang et al. [BMP] proposed a bi-directional structure with a gate unit to control the information flow between multi-level features. Wang et al. [wang2019an] proposed a scheme that integrates both top-down and bottom-up saliency inference in an iterative and cooperative manner. Zhao et al. [zhao2019egnet] present an edge guidance network for salient object detection that models the complementary salient edge and salient object information in a single network via three steps. Wang et al. [wang2019inferring] build an attentive saliency network that learns to detect salient objects from fixations, which narrows the gap between salient object detection and fixation prediction. Compared with the gate setting proposed in [BMP], the major highlight of our gate control unit is that it achieves full interaction between two different sub-networks by mutually integrating complementary semantic information. Additionally, our gate control unit preserves the non-linear capabilities well, enabling faster convergence and faster training; more details can be found in Sec. IV-A.

II-B Two-stream Network

In recent years, the two-stream network has attracted much attention due to its effectiveness in many computer vision applications, including visual question answering [saito2017dualnet], image recognition [hou2017dualnet, lin2015bilinear] and salient object detection [zhao2015saliency, zhang2019capsal, zhou2020interactive]. Saito et al. [saito2017dualnet] propose to use different kinds of networks to extract image features in order to fully exploit the different information present in different network structures. Lin et al. [lin2015bilinear] propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor. Hou et al. [hou2017dualnet] present a framework named DualNet to learn more accurate representations for image recognition. The core idea of DualNet is to coordinate two parallel DCNNs to learn features complementary to each other, so that richer features can be extracted from the raw images. Recently, the two-stream network structure has also been adopted for SOD. Zhao et al. [zhao2015saliency] proposed a multi-context deep learning framework, in which the global context and the local context are combined in a unified deep learning framework. Zhang et al. [zhang2019capsal] propose a deep neural network model named CapSal, which consists of two sub-networks, to leverage captioning information together with local and global visual contexts for predicting salient regions. Zhou et al. [zhou2020interactive] propose a lightweight two-stream model that uses two branches to learn the representations of salient regions and their contours, respectively. All of the works mentioned above have demonstrated the effectiveness of the two-stream network. Inspired by, but different from, these previous works, we propose a novel bi-stream network consisting of two sub-branches with different network structures, which aims to take advantage of the rich semantic information present in the proposed MD4K dataset.

II-C Attention Mechanism

The “attention mechanism” has been widely used to boost the performance of state-of-the-art methods [AFNet, PAGRN, PiCANet]; here, we introduce several of the most representative approaches. Inspired by the human perception process, the attention mechanism uses high-level information to efficiently guide the bottom-up feedforward process, and it has achieved great success in many tasks. In [li2017instance-level, chen2016attention], an attention model was designed to weight multi-scale features. In [wang2017residual], residual attention modules were stacked to generate deep attention-aware features for image classification. In [hu2019squeeze-and-excitation], channel attention was first proposed to select representative channels. After that, it has been widely applied in various tasks, including semantic segmentation [yu2018learning], image deraining [li2018recurrent] and image super-resolution [zhang2018image]. Recently, Zhang et al. [PAGRN] introduced both spatial-wise and channel-wise attention to the SOD task. Wang et al. [wang2019salient] devise a pyramid attention structure for salient object detection, which enables the network to concentrate more on salient regions while exploiting multi-scale saliency information. Liu et al. [PiCANet] proposed a pixel-wise contextual attention mechanism to selectively integrate the global contexts into the local ones. In [RANet20], a reverse attention block was designed to highlight the prediction of the missing salient object and guide side-output residual learning. In contrast, our multi-layer attention module aims to transfer the high-level localization information to the shallower layers, effectively shrinking the given problem domain.

II-D The Major Highlights of Our Method

In sharp contrast to previous works, which focus merely on elegant network designs, our research may inspire the SOD community to pay more attention to the training data: although this direction is still at an early stage, new state-of-the-art performance can be reached easily. The proposed bi-stream network, which is designed for the proposed small MD4K dataset, aims to take advantage of the rich semantic information present in it. To the best of our knowledge, this is the first attempt to use a “wider” model with a small-scale training set that nevertheless outperforms previous models trained on large-scale training sets.

III A Small-scale Training Set

Given a deep SOD model, its performance usually relies on two factors: 1) the specific training dataset and 2) the amount of training data. Previous works [wang2019salient1, fan2018SOC] have discussed how the selected training dataset influences model performance. In this section, we provide a further and more detailed discussion of the interrelationship between these factors and network performance.

Fig. 4: The semantic category distributions (classified by [zhou2017places]) of the MSRA10K and DUTS-TR datasets, indicating that the two datasets are strongly complementary in their semantics. Only the top-50 categories are shown due to space limitations.

III-A Do We Really Need Large-scale Training Data?

Previous networks with complex structures usually require large-scale training data to reach their best performance. This motivates us to consider a basic question regarding the SOD task: will continually increasing the training data size yield a steady improvement in SOD performance? To clarify this issue, we trained two state-of-the-art SOD methods, CPD19 [CPD] and PoolNet19 [PoolNet], as follows. We first train the target SOD model on the whole DUTS-TR (10K) dataset, then train it again on the DUTS-TR (9K) dataset, obtained by randomly removing 1000 images from the former training set, and repeat this procedure.
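The shrinking-subset protocol described above can be sketched as follows. This is only an illustrative sketch under our own assumptions: the function name `shrinking_subsets`, the fixed random seed, and the choice to stop once fewer than `step` images could be removed are hypothetical and are not specified in the text.

```python
import random


def shrinking_subsets(image_ids, step=1000, seed=0):
    """Return nested training subsets, each obtained by randomly
    removing `step` images from the previous one (hypothetical helper)."""
    rng = random.Random(seed)
    remaining = list(image_ids)
    # Shuffling once and then truncating is equivalent to removing
    # `step` randomly chosen images at every round.
    rng.shuffle(remaining)
    subsets = [list(remaining)]
    while len(remaining) > step:
        remaining = remaining[:-step]
        subsets.append(list(remaining))
    return subsets


if __name__ == "__main__":
    subsets = shrinking_subsets(range(10000))
    print([len(s) for s in subsets])
```

Each subset would then be used to train the target model from scratch, so that the resulting scores can be plotted against the training set size.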

The relationship between the overall performance and the amount of training data can be observed in Fig. 2. As we can see, when the training data increase to 2K, the performance improves significantly. However, as the training data continue to grow, the performance is not always positively correlated with the amount of training data and may even get worse. Moreover, training on the whole DUTS-TR dataset does not give the optimal result. Specifically, in terms of weighted F-measure, the performance of CPD19 on DUT-OMRON improves by about 12.5 after the training data increase to 2K. However, when the training data increase to 3K, the performance falls by 3.2. The optimal performance is obtained when the training data amount to 6K rather than the whole DUTS-TR dataset. Similar conclusions can be drawn on other datasets and metrics.

The primary reasons are two-fold: 1) The semantic categories in the original large-scale training set are unbalanced. For instance, according to the semantic labeling tool [zhou2017places], 351 images in the DUTS-TR dataset are marked with the “coffee shop” semantic label, while fewer than 10 are labeled “campus”. Such considerably redundant semantic scenes contribute little to improving performance. Moreover, previous works [lake2011one, snell2017prototypical] have already demonstrated that CNN based models are able to understand new concepts given just a few examples. 2) There is a considerable number of biased annotations in the DUTS-TR training set, and such biased annotations even worsen the overall performance, as shown in Fig. 2. In Fig. 3, we present several typical inappropriate human annotations, which motivate us to build a cleaner training dataset to further improve SOD performance.

               avg    MAE    avg    MAE    avg    MAE
               0.702  0.069  0.726  0.068  0.888  0.050
               0.738  0.055  0.781  0.040  0.880  0.049
               0.716  0.073  0.732  0.068  0.882  0.050
               0.738  0.056  0.784  0.044  0.880  0.037
               0.734  0.053  0.786  0.042  0.877  0.040
AFNet19(DTS)   0.729  0.057  0.772  0.046  0.871  0.042

TABLE I: Comparisons of the 3 state-of-the-art models trained on different datasets, where MK and DTS stand for MSRA10K and DUTS-TR respectively, and we use bold to emphasize better results.

III-B Which Training Set Should be Selected?

We noticed that most state-of-the-art models are trained on either the MSRA10K or the DUTS-TR dataset and then evaluated on the others. However, this training strategy suffers from a serious limitation: the inconsistency in data distribution between the training and testing datasets may easily lead to the “domain-shift” problem. For example, the images in the widely used training set MSRA10K are characterized by high contrast, center-surround layouts, simple backgrounds and single salient objects. In contrast, the images in the commonly used testing set HKU-IS [zhao2015saliency] are characterized by low contrast, relatively complex backgrounds and, usually, multiple salient objects. Although the DUTS-TR dataset is more complex, it introduces additional challenging problems such as inconsistent saliency ground truth and controversial annotations. This motivates us to combine the advantages of the MSRA10K and DUTS-TR datasets.

In fact, as commonly used training sets, the MSRA10K and DUTS-TR datasets are generally complementary. To back our claim, we have tested the 3 state-of-the-art SOD models in Table III, in which these models are trained on the MSRA10K and DUTS-TR datasets respectively and then tested on others. As shown in Table I, we may only reach sub-optimal performance if we use either the MSRA10K or the DUTS-TR training set alone. Also, we show the semantic category distributions of the MSRA10K and DUTS-TR datasets in Fig. 4, which reveal a large semantic variance between these two datasets and thus their semantic complementarity.

On the other hand, previous works [wang2017learning, zhang2019capsal, zeng2019multi-source, wang2019robust, hsu2017weakly] have already demonstrated that semantic information, especially in cluttered scenes, is beneficial to the SOD task. Wang et al. [wang2019robust] propose an end-to-end deep learning approach for robust co-saliency detection that simultaneously learns a high-level group-wise semantic representation as well as deep visual features of a given image group. To accurately detect salient objects in cluttered scenes, the authors of [zhang2019capsal] argue that the model needs to learn discriminative semantic features for salient objects, such as object categories, attributes and the semantic context. Therefore, it is necessary to build a semantic-category-balanced training dataset to further improve SOD performance.

Fig. 5: The detailed architecture of the proposed bi-stream network. Our bi-stream network is developed on the commonly used ResNet50 and VGG16, using both the newly designed gate control unit (Sec. IV-A) and the scaling-free multi-layer attention (Sec. IV-C) to achieve the complementary status between two parallel sub-branches, which is also capable of taking full advantage of the multi-level deep features as well.

III-C Our Novel Training Dataset Construction

In this section, we build a small, GT-bias-free and semantic-category-balanced training dataset from the MSRA10K and DUTS-TR datasets, namely “MD4K”. The motivation can be summarized in the following 4 aspects: 1) According to our experiments, the performance is not always positively correlated with the amount of training data; 2) The off-the-shelf SOD models cannot achieve their optimal performance by using a single training set alone; 3) Existing training sets contain a large amount of dirty and unbalanced data; 4) The MSRA10K and DUTS-TR datasets are complementary, as mentioned before.

We first divide the MSRA10K and DUTS-TR datasets into 267 categories using the off-the-shelf scene classification algorithm [zhou2017places]. Then, we manually remove all dirty data, leaving 9012 images in the MSRA10K dataset and 9215 images in the DUTS-TR dataset. Interestingly, we found that the semantic category distribution of these 18K images obeys the Pareto Principle, i.e., a small fraction of the scene categories accounts for the majority of the images; in particular, the top-50 scene categories account for a large share of both the MSRA10K and the DUTS-TR datasets. To balance the semantic categories, we randomly select 40 images for each of the top-50 scene categories and 20 images for each of the remaining 217 scene categories. In this way, we finally obtain a small-scale training set containing 4172 images covering 267 semantic categories in total. We choose 4172 images as a balance between training set size and performance; the performance obtained with different amounts of training data is shown in Table II. According to the experimental results, the training set with 4172 images achieves better performance than DUTS-TR while significantly reducing the amount of training data.
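The category-balanced selection described above can be sketched as follows. This is a minimal illustration, not the authors' tool: the function name `build_balanced_subset`, its parameters and the input format (a mapping from image id to scene category) are hypothetical. We also assume that a category holding fewer images than its quota simply contributes all of them; the text does not state this, but the reported total of 4172 is below the nominal 50 × 40 + 217 × 20 = 6340, so some categories presumably fall short of their quota.

```python
import random
from collections import defaultdict


def build_balanced_subset(labels, top_k=50, n_top=40, n_rest=20, seed=0):
    """Sample a category-balanced subset (hypothetical helper).

    labels: dict mapping image id -> scene category.
    The `top_k` most frequent categories get a quota of `n_top` images,
    all other categories a quota of `n_rest` images.
    """
    by_category = defaultdict(list)
    for image_id, category in labels.items():
        by_category[category].append(image_id)

    # Rank categories from most to least frequent.
    ranked = sorted(by_category, key=lambda c: len(by_category[c]), reverse=True)

    rng = random.Random(seed)
    subset = []
    for rank, category in enumerate(ranked):
        quota = n_top if rank < top_k else n_rest
        images = sorted(by_category[category])
        # Assumption: small categories contribute every image they have.
        subset.extend(rng.sample(images, min(quota, len(images))))
    return subset
```

With `top_k=50`, `n_top=40` and `n_rest=20`, this mirrors the quotas stated in the text, applied to the cleaned union of the two source datasets.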

The significance of the proposed MD4K dataset can be summarized as follows: 1) The proposed MD4K alleviates the demand for large-scale training data; 2) The proposed MD4K consistently boosts the performance of state-of-the-art models; 3) MD4K may inspire other researchers on how to build a training set.

Tested on / Trained on DUTS-TR MD1K MD2K MD3K MD4K MD5K MD6K
DUT-OMRON [yang2013saliency] 0.835 0.715 0.794 0.832 0.857 0.864 0.866
DUTS-TE [wang2017learning] 0.879 0.774 0.829 0.863 0.884 0.893 0.897
ECSSD [ecssd] 0.934 0.876 0.876 0.918 0.945 0.947 0.955
HKU-IS [zhao2015saliency] 0.933 0.864 0.885 0.920 0.942 0.948 0.952
PASCAL-S [li2014secrets] 0.885 0.778 0.837 0.864 0.886 0.895 0.897

TABLE II: Performance obtained with MD training sets of different sizes (MD1K to MD6K), compared with training on DUTS-TR. For each test dataset, we use the average max F-measure to evaluate the performance.

IV Proposed Network

So far, we have built a small-scale and high-quality training dataset which consistently boosts state-of-the-art performance, see the quantitative evidence in Table VIII. To improve further, we propose a novel bi-stream network consisting of two feature backbones with different structures, which aims to sense complementary semantic information and take full advantage of our semantically balanced small-scale training set.

IV-A How to Fuse Bi-stream Networks

In this section, we consider how to effectively fuse two different feature backbones, using the feature maps extracted from one sub-branch to benefit the other. We first provide some preliminaries regarding the conventional fusion operations.

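For simplicity, the function $f: \{X^a, X^b\} \rightarrow Y$ represents fusing two feature maps $X^a$ and $X^b$ to generate the output feature $Y$, where $X^a, X^b, Y \in \mathbb{R}^{H \times W \times C}$, and $H$, $W$, $C$ denote the height, width and number of channels, respectively.

1) Element-wise summation, $Y^{sum} = f^{sum}(X^a, X^b)$, which calculates the sum of the two features at the same locations $(i, j)$ and channels $(c)$:

$$Y^{sum}_{i,j,c} = X^a_{i,j,c} + X^b_{i,j,c}. \qquad (1)$$

2) Element-wise maximum, $Y^{max} = f^{max}(X^a, X^b)$, which analogously takes the maximum of the two input feature maps:

$$Y^{max}_{i,j,c} = \max\{X^a_{i,j,c},\, X^b_{i,j,c}\}. \qquad (2)$$

3) Concatenation, $Y^{cat} = f^{cat}(X^a, X^b)$, which stacks the input feature maps channel-wise, so that the output has $2C$ channels:

$$Y^{cat}_{i,j,c} = X^a_{i,j,c}, \qquad Y^{cat}_{i,j,C+c} = X^b_{i,j,c}. \qquad (3)$$

4) Convolution, $Y^{conv} = f^{conv}(X^a, X^b)$, which first employs the concatenation operation to obtain the features $Y^{cat}$ and then convolves them:

$$Y^{conv} = Y^{cat} * W + b, \qquad (4)$$

where $*$ denotes the convolution operation, $W$ represents the convolution filters, and $b$ denotes the bias parameters.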
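The four conventional fusion operations listed above can be sketched in NumPy as follows. This is only an illustration: the function names are ours, feature maps are assumed to be stored as height × width × channels arrays, and the convolution is assumed to use a 1 × 1 kernel (the text does not state the kernel size), which reduces it to a matrix product over the channel axis.

```python
import numpy as np


def fuse_sum(xa, xb):
    """Element-wise summation of two H x W x C feature maps."""
    return xa + xb


def fuse_max(xa, xb):
    """Element-wise maximum of two H x W x C feature maps."""
    return np.maximum(xa, xb)


def fuse_cat(xa, xb):
    """Channel-wise concatenation, giving an H x W x 2C feature map."""
    return np.concatenate([xa, xb], axis=-1)


def fuse_conv(xa, xb, w, b):
    """Concatenate, then apply an assumed 1 x 1 convolution.

    w has shape (2C, C_out) and b has shape (C_out,).
    """
    return fuse_cat(xa, xb) @ w + b


if __name__ == "__main__":
    xa = np.random.rand(8, 8, 16)
    xb = np.random.rand(8, 8, 16)
    w = np.random.rand(32, 16)
    b = np.zeros(16)
    print(fuse_sum(xa, xb).shape, fuse_cat(xa, xb).shape, fuse_conv(xa, xb, w, b).shape)
```

Note that summation becomes a special case of the convolutional fusion when the filter stacks two identity matrices and the bias is zero, which is one way to see why a learned convolution is the most flexible of the four.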

Fig. 6: Qualitative comparisons to the recent state-of-the-art models. Our approach can well locate the salient objects completely with sharp boundaries.

IV-B Bi-stream Fusion via Gate Control Unit

In general, all of the above-mentioned fusion operations directly fuse two input feature maps without considering the feature conflicts between different layers, which easily leads to sub-optimal results, see the quantitative evidence in Table VII. Inspired by previous work [LSTM], we propose a novel gate control unit, consisting of an input gate and an output gate, to control which information flows through the network; Fig. 5 illustrates our network architecture. In our method, the proposed input gate plays a critical role in aggregating feature maps. For clarity, we consider the feature maps produced by each convolutional block of the pre-trained VGG16 feature backbone and, similarly, those of the pre-trained ResNet50 backbone.

We introduce dynamic thresholding into the proposed input gate, so that each side-output of VGG16 with a probability below the threshold is suppressed. Specifically, each side-output of VGG16 is a linear projection modulated by the gates.

In practice, the input gate is multiplied element-wise with the side-output feature matrix, controlling the interactions between the parallel sub-branches hierarchically. The fused bi-stream feature maps are obtained by this gated operation, in which W and b, together with the corresponding gate parameters, are learned, the gate activation is the sigmoid function, and the projection and the gate are combined by element-wise multiplication.

Moreover, previous SOD models directly propagate the feature maps from low layers to high layers without considering whether these features are beneficial to the SOD task. In fact, only a small part of these features is useful, while the others may even degrade performance. To solve this problem, we propose a multiplication based “output gate” to suppress the distractions from non-salient regions. That is, given two consecutive layers, the feature responses in the preceding layer serve as guidance for the next layer to adaptively control which data flow should be propagated, and this procedure can be formulated as Eq. 6.


Here, the weights and biases are learned. In this way, the salient regions, which have high responses, are enhanced while the background regions are suppressed in subsequent layers. Consequently, our gate control unit consistently boosts the performance of the conventional fusion operations, see the quantitative evidence in Table VII.
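The idea of a side-output that is "a linear projection modulated by the gates" can be sketched as below. This is a hedged illustration in the style of a gated linear unit and not the exact formulation of the paper: the function name `gated_side_output`, the parameter names `w_g` and `b_g`, and the use of 1 × 1 projections (matrix products over the channel axis) are all our assumptions, and the way the gated output is subsequently merged with the other backbone is deliberately left out.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_side_output(v, w, b, w_g, b_g):
    """Gate a side-output feature map v of shape H x W x C.

    projection = v @ w + b            (linear projection)
    gate       = sigmoid(v @ w_g + b_g)  (values in (0, 1))
    The gate is multiplied element-wise with the projection, so that
    responses with a low gate value are suppressed.
    """
    projection = v @ w + b
    gate = sigmoid(v @ w_g + b_g)
    return projection * gate
```

A gate value close to 0 removes the corresponding response, while a value close to 1 lets the linear projection pass through unchanged, which is the suppress-or-keep behaviour described for the input gate.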

Differences to the LSTM.
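The gradient of the LSTM-style gating [dauphin2015predicting] can be expressed as:

$$\nabla[\tanh(X) \otimes \sigma(X)] = \tanh'(X)\,\nabla X \otimes \sigma(X) + \sigma'(X)\,\nabla X \otimes \tanh(X), \qquad (7)$$

where $X$ denotes the input of the gate, $\sigma$ is the sigmoid function and $\otimes$ is the element-wise multiplication. Notice that this gradient gradually vanishes due to the down-scaling factors $\tanh'(X)$ and $\sigma'(X)$. In sharp contrast, the gradient of our gate mechanism has a direct path $\nabla X \otimes \sigma(X)$ without any down-scaling operations for the activated gating units in $\sigma(X)$, as in Eq. 8:

$$\nabla[X \otimes \sigma(X)] = \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X. \qquad (8)$$

Thus, the proposed gate control unit outperforms the LSTM significantly, see the quantitative evidence in Table VII, i.e., “Conv w/ GCN (Ours)” vs. “Conv w/ GCN (LSTM)”.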
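The vanishing-gradient argument can be checked numerically in the scalar case. The sketch below assumes that the LSTM-style gate has the form tanh(x)·σ(x) and that the linear gate has the form x·σ(x), as in gated linear units; the function names are ours. For each gate, it returns the two terms of the derivative, the first of which is the path through the gated signal.

```python
import math


def sigmoid(x):
    """Scalar logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_style_grad_terms(x):
    """Terms of d/dx [tanh(x) * sigmoid(x)].

    The first term is scaled by tanh'(x) = 1 - tanh(x)^2, the second
    by sigmoid'(x) = s * (1 - s); both factors shrink as |x| grows.
    """
    s, t = sigmoid(x), math.tanh(x)
    return (1.0 - t * t) * s, s * (1.0 - s) * t


def linear_gate_grad_terms(x):
    """Terms of d/dx [x * sigmoid(x)].

    The first term is sigmoid(x) itself, with no down-scaling factor.
    """
    s = sigmoid(x)
    return s, x * s * (1.0 - s)


if __name__ == "__main__":
    for x in (0.5, 1.0, 3.0):
        print(x, lstm_style_grad_terms(x)[0], linear_gate_grad_terms(x)[0])
```

For an activated gate (large positive x), the first term of the linear gate stays close to 1, whereas the corresponding term of the LSTM-style gate is multiplied by tanh'(x) and therefore shrinks towards 0.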

Images Dataset max MAE max MAE max MAE max MAE max MAE

ResNet50+VGG16 4172 MD4K 0.857 0.044 0.884 0.038 0.945 0.036 0.942 0.031 0.886 0.082

ResNet50+VGG16 10553 DTS 0.835 0.046 0.879 0.041 0.934 0.039 0.933 0.033 0.885 0.089

ResNet50+VGG16 10000 MK 0.828 0.047 0.863 0.044 0.931 0.042 0.917 0.035 0.857 0.088

ResNet50+ResNet50 4172 MD4K 0.833 0.046 0.855 0.041 0.921 0.043 0.916 0.037 0.853 0.087

VGG16+VGG16 4172 MD4K 0.826 0.049 0.849 0.047 0.924 0.042 0.918 0.033 0.844 0.092

VGG16 10553 DTS 0.799 0.058 0.874 0.044 0.941 0.042 0.928 0.036 0.866 0.078

VGG16 10553 DTS 0.793 0.061 0.855 0.050 0.935 0.044 0.921 0.030 0.864 0.075

ResNet50 10553 DTS 0.731 0.062 0.792 0.048 0.904 0.048 0.891 0.039 0.818 0.075

ResNet50 4172 MD4K 0.762 0.052 0.850 0.040 0.934 0.037 0.915 0.032 0.846 0.090

ResNet50 10553 DTS 0.754 0.056 0.841 0.044 0.926 0.037 0.911 0.034 0.843 0.092

ResNet50 4172 MD4K 0.767 0.051 0.863 0.042 0.931 0.040 0.922 0.033 0.859 0.084

ResNet50 10553 DTS 0.763 0.055 0.858 0.040 0.920 0.042 0.917 0.033 0.856 0.093

VGG16 4172 MD4K 0.765 0.054 0.842 0.044 0.932 0.041 0.913 0.034 0.854 0.087

VGG16 10553 DTS 0.759 0.057 0.838 0.046 0.924 0.042 0.910 0.036 0.852 0.089

ResNet34 10553 DTS 0.805 0.057 0.859 0.048 0.942 0.037 0.929 0.032 0.876 0.092

DenseNet169 310K CO+DTS 0.677 0.109 0.722 0.092 0.859 0.096 0.835 0.084 0.781 0.153

VGG19 10553 DTS 0.707 0.071 0.818 0.056 0.904 0.061 0.897 0.048 0.817 0.120

ResNet50 10553 DTS 0.739 0.062 0.806 0.051 0.914 0.049 0.900 0.036 0.856 0.085

VGG16 10000 MK 0.756 0.072 0.786 0.072 0.905 0.060 0.895 0.050 0.817 0.123

ResNeXt 10000 MK 0.460 0.138 0.478 0.136 0.656 0.161 0.583 0.150 0.611 0.203

ResNet50 10553 DTS 0.725 0.069 0.799 0.059 0.905 0.054 0.893 0.046 0.812 0.105

VGG16 10000 MK 0.715 0.098 0.751 0.085 0.904 0.059 0.884 0.052 0.836 0.107

VGG16 10000 MK 0.705 0.132 0.740 0.118 0.897 0.078 0.871 0.074 0.820 0.131

VGG16 2500 MB 0.681 0.092 0.751 0.081 0.856 0.090 0.865 0.067 0.777 0.149

TABLE III: The detailed quantitative comparisons between our method and 16 state-of-the-art models in F-measure and MAE. Top three scores are denoted in red, green and blue, respectively. {MD4K, DTS, MK, MB, VOC, TH, CO} are training datasets which respectively denote {our small dataset, DUTS-TR, MSRA10K, MSRA-B, PASCAL VOC2007, THUS10K, Microsoft COCO}. The symbol “*” indicates that the target models were trained on the MD4K dataset.
Fig. 7: The first row shows the PR curves of the proposed method and other state-of-the-art methods, and the second row shows the F-measure curves. The proposed method performs best on all datasets in terms of all metrics.

IV-C Multi-layer Attention

In general, the predicted saliency maps will lose their details if we use sequential scaling operations (e.g., pooling). Actually, the visual features generated in deep layers are usually abundant in high-level information, while the tiny details are preserved in shallower layers. Previous works have widely taken full advantage of the multi-level and multi-scale deep features, which introduce features in deep layers to shallower layers via short connections, and this topic has been well studied in [DSS].

However, as for our bi-stream network, the overall performance is mainly ensured by the gate mechanism based complementary fusion. Consequently, the feature map quality in each sub-branch is quite limited, which may result in performance degradation if we follow the conventional “low→high” or “high→low” feature connections directly.

Instead of combining multi-level features indiscriminately, the proposed multi-layer attention (MLA) is developed by using feature maps in deep layers , which provide valuable location information for the shallower layers. We demonstrate the MLA dataflow in Fig. 5, and its details can be formulated as follows:


where integrates the information of all channels in , denotes the feature at location , and is the location attention map. Next, these location attention maps are applied to facilitate those low-level features as follows:


where the function denotes the element-wise summation, and stands for the downsampling operation. After obtaining the updated , it will be fed into the decoder part to recover details progressively. Compared with the widely used multi-scale short connections, the proposed MLA improves the overall performance significantly, and the corresponding quantitative evidence can be found in Table VIII.
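
The location-attention idea described above can be illustrated with a minimal sketch. Since the exact formulation is not recoverable from the text here, every concrete choice below is an assumption: channel averaging stands in for the channel-integration function, a spatial softmax produces the location attention map, 2x2 average pooling stands in for the downsampling operation, and the function names are hypothetical. It is not the authors' implementation.

```python
import math

def channel_mean(feature):
    """Collapse a C x H x W feature (nested lists) into one H x W map.
    Assumption: plain averaging stands in for the channel-integration function."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [[sum(feature[c][y][x] for c in range(channels)) / channels
             for x in range(width)] for y in range(height)]

def spatial_softmax(score_map):
    """Normalise an H x W map so that its entries are positive and sum to 1."""
    peak = max(max(row) for row in score_map)  # subtract the max for stability
    exps = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def downsample2x(channel):
    """2x2 average pooling of one H x W channel (H and W assumed even)."""
    return [[(channel[y][x] + channel[y][x + 1]
              + channel[y + 1][x] + channel[y + 1][x + 1]) / 4.0
             for x in range(0, len(channel[0]), 2)]
            for y in range(0, len(channel), 2)]

def apply_location_attention(shallow, deep):
    """Re-weight a shallow C x 2H x 2W feature with a location attention map
    computed from a deep C' x H x W feature (assumed combination rule)."""
    attention = spatial_softmax(channel_mean(deep))
    updated = []
    for channel in shallow:
        small = downsample2x(channel)
        # Element-wise summation of the feature and its attention-weighted copy.
        updated.append([[v + a * v for v, a in zip(row, att_row)]
                        for row, att_row in zip(small, attention)])
    return updated
```

With a uniform deep feature, every location receives the same attention weight, so the shallow feature is simply scaled; a deep feature with a strong response at one location would emphasise the corresponding shallow region instead.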

Method Backbone Images Dataset wF S-m wF S-m wF S-m wF S-m wF S-m
Ours ResNet50+VGG16 4172 MD4K 0.761 0.858 0.804 0.883 0.915 0.936 0.902 0.921 0.816 0.857

ResNet50+VGG16 10553 DTS 0.757 0.847 0.788 0.871 0.908 0.920 0.893 0.914 0.808 0.851

ResNet50+VGG16 10000 MK 0.748 0.843 0.782 0.864 0.902 0.915 0.884 0.907 0.794 0.842

ResNet50+ResNet50 4172 MD4K 0.723 0.834 0.782 0.861 0.891 0.918 0.886 0.907 0.803 0.848

VGG16+VGG16 4172 MD4K 0.716 0.831 0.780 0.867 0.890 0.912 0.874 0.904 0.788 0.102

VGG16 10553 DTS 0.671 0.825 0.743 0.874 0.866 0.917 0.846 0.908 0.757 0.847

VGG16 10553 DTS - 0.824 - 0.861 - 0.915 - 0.903 - 0.847

ResNet50 4172 MD4K 0.722 0.845 0.785 0.874 0.891 0.913 0.879 0.912 0.784 0.839

ResNet50 10553 DTS 0.705 0.825 0.769 0.868 0.889 0.918 0.866 0.906 0.771 0.828

ResNet50 4172 MD4K 0.717 0.851 0.786 0.894 0.893 0.940 0.885 0.923 0.798 0.849

ResNet50 10553 DTS 0.696 0.831 0.775 0.886 0.890 0.926 0.873 0.919 0.781 0.847

VGG16 4172 MD4K 0.712 0.834 0.762 0.874 0.875 0.916 0.863 0.912 0.787 0.845

VGG16 10553 DTS 0.690 0.826 0.747 0.866 0.867 0.914 0.848 0.905 0.772 0.833

ResNet34 10553 DTS 0.752 0.836 0.793 0.865 0.904 0.916 0.889 0.909 0.776 0.819

DenseNet169 310K CO+DTS 0.423 0.756 0.531 0.757 0.652 0.828 0.613 0.818 0.613 0.753

VGG19 10553 DTS 0.601 0.775 0.685 0.837 0.822 0.889 0.805 0.887 0.701 0.793

ResNet50 10553 DTS 0.709 0.806 0.768 0.841 0.891 0.903 0.875 0.895 0.791 0.828

VGG16 10000 MK 0.611 0.813 0.635 0.824 0.802 0.895 0.782 0.888 0.709 0.797

ResNeXt 10000 MK 0.726 0.817 0.648 0.835 0.902 0.910 0.877 0.895 0.737 0.788

ResNet50 10553 DTS 0.607 0.798 0.662 0.835 0.825 0.895 0.802 0.888 0.736 0.817

VGG16 10000 MK 0.563 0.781 0.594 0.803 0.798 0.894 0.767 0.883 0.732 0.820

VGG16 10000 MK 0.465 0.758 0.493 0.778 0.688 0.883 0.656 0.866 0.666 0.808

VGG16 2500 MB 0.481 0.748 0.538 0.790 0.688 0.836 0.677 0.852 0.626 0.749

TABLE IV: The detailed quantitative comparisons between our method and state-of-the-art models in weighted F-measure and S-measure. Top three scores are denoted in red, green and blue, respectively. {MD4K, DTS, MK, MB, VOC, TH, CO} are training datasets which respectively denote {our small dataset, DUTS-TR, MSRA10K, MSRA-B, PASCAL VOC2007, THUS10K, Microsoft COCO}. The symbol “*” indicates that the target models were trained on the MD4K dataset.

V Experiments and Results

V-A Datasets

We have evaluated the performance of the proposed method on five commonly used benchmark datasets, including DUT-OMRON [yang2013saliency], DUTS-TE [wang2017learning], ECSSD [ecssd], HKU-IS [zhao2015saliency] and PASCAL-S [li2014secrets]. DUT-OMRON contains 5,168 high-quality images, each of which has one or more salient objects with a complex background. DUTS-TE has 5,019 images with high-quality pixel-wise annotations, which are selected from the currently largest SOD benchmark DUTS. ECSSD has 1,000 natural images, which contain many semantically meaningful and complex structures. As an extension of the complex scene saliency dataset, ECSSD is obtained by aggregating the images from BSD [martin2004learning] and PASCAL VOC [everingham2010pascal]. HKU-IS contains 4,447 images, most of which have low contrast and more than one salient object. PASCAL-S contains 850 natural images with several objects, which are carefully selected from the PASCAL VOC dataset with 20 object categories and complex scenes.

V-B Evaluation Metrics

We have adopted commonly used quantitative metrics to evaluate our method, including the Precision-recall (PR) curves, the F-measure curves, Mean Absolute Error (MAE), weighted F-measure, and S-measure.

PR curves. Following the previous settings [cheng2015global, achanta2009frequency], we first utilize the standard PR curves to evaluate the performance of our model.
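
As a rough illustration of how such a PR curve is usually obtained, the sketch below binarises a saliency map at every integer threshold and compares each binary mask with the ground truth. The flat-list representation, the 0 to 255 value range, the function names, and the convention of reporting a precision of 1.0 when nothing is predicted as salient are assumptions made for this example, not details taken from the paper.

```python
def precision_recall(saliency, ground_truth, threshold):
    """Precision and recall of a thresholded saliency map.

    saliency: flat list of values in [0, 255]; ground_truth: flat list of 0/1.
    """
    predicted = [s >= threshold for s in saliency]
    true_pos = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    pred_pos = sum(predicted)
    gt_pos = sum(1 for g in ground_truth if g)
    # Assumed convention: an empty prediction counts as perfectly precise.
    precision = true_pos / pred_pos if pred_pos else 1.0
    recall = true_pos / gt_pos if gt_pos else 0.0
    return precision, recall

def pr_curve(saliency, ground_truth, thresholds=range(256)):
    """One (precision, recall) pair per threshold."""
    return [precision_recall(saliency, ground_truth, t) for t in thresholds]
```

For a whole dataset, the per-image pairs are typically averaged at each threshold before plotting.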


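F-measure. The F-measure is a harmonic mean of average precision and average recall. We compute the F-measure as

$$F_{\beta} = \frac{(1+\beta^{2}) \cdot Precision \cdot Recall}{\beta^{2} \cdot Precision + Recall},$$

where we set $\beta^{2}$ to be 0.3 to weigh precision more than recall.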
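
A minimal sketch of this computation is given below, using the standard F-measure definition with the squared weight fixed at 0.3. The function name and the choice to return 0 when both inputs are 0 are assumptions made for the example.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall.

    beta_sq is the squared weight; 0.3 emphasises precision over recall.
    """
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when precision and recall are both 0
    return (1 + beta_sq) * precision * recall / denominator
```

Because the weight is below 1, a high-precision, low-recall result scores higher than the reverse case: `f_measure(0.9, 0.5)` is larger than `f_measure(0.5, 0.9)`.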

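MAE. The MAE is calculated as the average pixel-wise absolute difference between the binary ground truth $G$ and the saliency map $S$, as in Eq. 13:

$$MAE = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x,y)-G(x,y)\right|, \tag{13}$$

where $W$ and $H$ are the width and height of the saliency map $S$, respectively.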
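
The MAE definition translates directly into a few lines of code. The sketch below assumes that both maps are given as nested lists of equal size with values in [0, 1]; the function name is hypothetical.

```python
def mean_absolute_error(saliency, ground_truth):
    """Average pixel-wise absolute difference between two H x W maps."""
    height, width = len(saliency), len(saliency[0])
    total = sum(abs(s - g)
                for s_row, g_row in zip(saliency, ground_truth)
                for s, g in zip(s_row, g_row))
    return total / (width * height)
```

Unlike the F-measure, this score also penalises non-salient pixels that receive a non-zero saliency value, and lower is better.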

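Weighted F-measure. The weighted F-measure [w-fmeasure] defines a weighted Precision, which is a measure of exactness, and a weighted Recall, which is a measure of completeness:

$$F_{\beta}^{w} = \frac{(1+\beta^{2}) \cdot Precision^{w} \cdot Recall^{w}}{\beta^{2} \cdot Precision^{w} + Recall^{w}}.$$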

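S-measure. S-measure [Smeasure] simultaneously evaluates region-aware and object-aware structural similarity between the saliency map and ground truth. It can be written as follows: $S = \alpha \cdot S_{o} + (1-\alpha) \cdot S_{r}$, where $S_{o}$ and $S_{r}$ denote the object-aware and region-aware structural similarity, respectively, and $\alpha$ is set to 0.5.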

Fig. 8: Demonstration of the sub-branch complementary status.
Method Model(MB) Encoder(MB) Decoder(MB) FLOPs(G) Params(M)
Ours 235.5 152.6 82.9 65.53 71.67
CPD19[CPD] 192 95.6 96.4 17.75 47.85
BASNet19[BASNet19] 348.5 87.3 261.2 127.32 87.06
PoolNet19[PoolNet] 278.5 94.7 183.8 88.91 68.26
TABLE V: Comparison of model size, FLOPs and number of parameters between our method and 3 state-of-the-art models.

V-C Implementation Details

The proposed method is implemented in the public deep learning framework PyTorch. We run our model on a machine with an i7-6700 CPU (3.4 GHz and 8 GB RAM) and an NVIDIA GeForce GTX 1070 GPU (with 8 GB memory). Our bi-stream model is trained on the proposed small training dataset (MD4K) and then tested on the other datasets. Due to the GPU memory limitation, we set the mini-batch size to 4. We use the stochastic gradient descent (SGD) method to train our model with a momentum of 0.99 and a weight decay of 0.0005. We use the fixed learning rate policy and set the base learning rate to

. Learning stops after 30K iterations, and we use standard Binary Cross Entropy loss during learning.
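
For reference, the standard binary cross-entropy loss mentioned above can be sketched in plain Python as follows. This is only an illustration of the loss definition on flat lists of per-pixel values; the clamping constant `eps` and the function name are assumptions, and an actual PyTorch training loop would rely on the framework's built-in loss instead.

```python
import math

def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(predictions)
```

A prediction of 0.5 for every pixel yields a loss of ln 2 (about 0.693), and the loss approaches 0 as the predicted saliency map approaches the binary ground truth.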

V-D Comparison with the state-of-the-art Methods

We have compared our method with 16 state-of-the-art models, including DSS17 [DSS], Amulet17 [Amulet], UCF17 [UCF], SRM17 [SRM], R3Net18 [R3Net18], RADF18 [RADF], PAGRN18 [PAGRN], DGRL18 [DGRL], MWS19 [MWS], CPD19 [CPD], AFNet19 [AFNet], PoolNet19 [PoolNet], BASNet19 [BASNet19], R2Net20 [R2Net20], MRNet20 [MRNet20] and RANet20 [RANet20]. For all of these methods, we use the original code with the recommended parameter settings or the saliency maps provided by the authors. Moreover, our results are generated directly by the model without relying on any post-processing, and all the predicted saliency maps are evaluated with the same evaluation code.

Fig. 9: Visual comparison of the proposed model with multi-layer attention (“Ours+MLA”) and without multi-layer attention (“Ours-MLA”).
Method Ours RANet20 R2Net20 MRNet20 BASNet19
FPS 23 42 33 14 25
Method CPD19 PoolNet19 AFNet19 DGRL18 RADF18
FPS 62 27 23 6 18
TABLE VI: Running time comparisons.

Quantitative Comparisons. We first evaluate our model using the PR curves, a commonly used quantitative evaluation tool. As shown in the first row of Fig. 7, our model consistently outperforms the state-of-the-art models on all tested benchmark datasets, including the DUT-OMRON dataset. Our model is also evaluated by the F-measure curves shown in the second row of Fig. 7, which further demonstrate the superiority of our method. The detailed F-measure, MAE, weighted F-measure and S-measure values are provided in Table III and Table IV, in which our method also performs favorably against other state-of-the-art approaches.

Qualitative Comparisons. We demonstrate the qualitative comparisons in Fig. 6. The proposed method not only detects the salient objects accurately and completely, but also preserves subtle details. Specifically, the proposed model adapts to various scenarios, including the object occlusion case (row 1), the complex background case (row 2), the small object case (row 3) and the low contrast case (row 4). Moreover, our method consistently highlights the foreground regions with sharp object boundaries.

To further illustrate the complementary status between VGG16 and ResNet50, Fig. 8 shows the saliency maps of these two sub-branches in mining salient regions. We observe that these two sub-branches are capable of revealing different but complementary salient regions.

Fusion Method maxF MAE maxF MAE maxF MAE
Conv w/ GCN (Ours) 0.857 0.044 0.884 0.038 0.945 0.036
Conv w/ GCN (LSTM) 0.834 0.046 0.864 0.045 0.934 0.042
Conv w/o GCN 0.821 0.049 0.844 0.051 0.927 0.048
Sum w/ GCN 0.848 0.047 0.873 0.044 0.925 0.043
Sum w/o GCN 0.813 0.055 0.845 0.052 0.897 0.049
Concat w/ GCN 0.827 0.049 0.862 0.047 0.908 0.046
Concat w/o GCN 0.802 0.059 0.847 0.058 0.887 0.054
Max w/ GCN 0.818 0.050 0.853 0.048 0.909 0.047
Max w/o GCN 0.813 0.054 0.836 0.054 0.887 0.053

TABLE VII: Performance comparisons of different fusion strategies, where “w/” denotes “with”, “w/o” denotes “without”; GCN: Gate Control Unit; Conv, Sum, Concat, Max are four conventional fusion schemes mentioned in Sec. IV-A. “Conv w/ GCN (LSTM)” denotes the performance using the gate control logic of LSTM.

Running Time and Model Complexity Comparisons. Table VI shows the running time comparisons. This evaluation was conducted on the same machine with an i7-6700 CPU and a GTX 1070 GPU, on which our model achieves 23 FPS. Furthermore, we compare the model size, FLOPs and number of parameters with other popular methods in Table V. In spite of using two feature extractors, our model is not excessively heavy: it is larger than CPD [CPD] (235.5 MB vs. 192 MB), but smaller than PoolNet19 and BASNet19 in both model size and FLOPs. As shown in Table V, previous methods treat the feature backbones as off-the-shelf tools and pay more attention to designing a complex decoder to improve the overall performance. In sharp contrast, the proposed bi-stream network concentrates on the encoder instead of devising a complex decoder and achieves new state-of-the-art performance, showing the importance of the feature extractor.

V-E Component Evaluations

Effectiveness of the Proposed MD4K Dataset. To illustrate the advantages of the proposed dataset, we train the proposed bi-stream network on the MD4K and DUTS-TR datasets respectively. Compared with training on the DUTS-TR dataset, the bi-stream network trained on the MD4K dataset achieves better performance in terms of different measures, which demonstrates the effectiveness of the proposed dataset. Besides, as shown in rows 9-14 of Table III, three state-of-the-art methods (i.e., PoolNet19, CPD19 and AFNet19) are trained on either the DUTS-TR dataset or our MD4K dataset. Clearly, the models trained on the MD4K dataset achieve better performance than the ones trained on the large-scale DUTS-TR dataset, which also shows the effectiveness of the proposed MD4K dataset. To demonstrate the importance of a balanced semantic distribution, in addition to the proposed bi-stream network, we also train 3 state-of-the-art models on M4K and D4K, which are randomly selected from MSRA10K and DUTS-TR respectively, as shown in Table VIII. Without exception, the models trained on the semantically balanced dataset perform significantly better. The primary reason is that a model trained on a dataset with balanced semantic categories learns from more practical scenes, which consequently enhances its generalization ability to other datasets.

max MAE max MAE max MAE
Ours(MD4K) 0.857 0.044 0.884 0.038 0.945 0.036

0.825 0.048 0.838 0.051 0.905 0.048

0.820 0.060 0.823 0.052 0.887 0.050
CPD19(MD4K) 0.762 0.052 0.850 0.040 0.943 0.037

0.721 0.063 0.824 0.048 0.902 0.043

0.722 0.060 0.818 0.056 0.889 0.061

0.767 0.051 0.863 0.042 0.931 0.040

0.738 0.064 0.839 0.047 0.907 0.043

0.733 0.065 0.836 0.048 0.897 0.045

0.765 0.054 0.842 0.044 0.932 0.041
AFNet19(D4K) 0.737 0.065 0.823 0.057 0.891 0.062

0.728 0.063 0.830 0.053 0.895 0.060

w/ MLA
0.857 0.044 0.884 0.038 0.945 0.036

w/o MLA
0.834 0.050 0.858 0.043 0.923 0.044

TABLE VIII: Quantitative evidence regarding the effectiveness of our proposed small-scale training set (MD4K), where D4K (M4K) denotes 4172 images randomly extracted from the DUTS-TR (MSRA10K) dataset.

Effectiveness of the Proposed Bi-stream Network. To demonstrate the effectiveness of the proposed bi-stream network, we also implement it by using other sub-network combinations, i.e., “VGG16+VGG16” and “ResNet50+ResNet50”, see Table III. Compared with the “VGG16+VGG16” and “ResNet50+ResNet50” based models, which are also trained on the MD4K dataset, the proposed bi-stream network achieves better performance. In addition, we report the performance of the proposed bi-stream network trained on the DUTS-TR dataset in the second row of Table III. As we can see, our model trained on DUTS-TR achieves better performance than the state-of-the-art models, which also suggests that the proposed bi-stream network is effective.

Effectiveness of the Gate Control Unit. To validate the exact contribution of the proposed Gate Control Unit (GCN), we first test the 4 previously mentioned fusion schemes (Sec. IV-A) without using our GCN as the baselines. Then, we apply our GCN to these conventional fusion schemes. The corresponding quantitative results can be found in Table VII, in which our GCN boosts the conventional fusion schemes significantly.

Effectiveness of the Multi-layer Attention. As shown in the last two rows of Table VIII (“w/ MLA” vs. “w/o MLA”), the overall performance consistently improves after using the multi-layer attention, e.g., both the F-measure and the MAE improve on the DUT-OMRON dataset. Additionally, Fig. 9 shows that the proposed multi-layer attention is capable of sharpening the object boundaries.

VI Conclusion

In this paper, we have provided a deeper insight into the interrelationship between the SOD performance and the training dataset, including the choice of training dataset and the amount of training data that the model requires. Inspired by our findings, we have built a small, hybrid, and scene category balanced training dataset to alleviate the demand for large-scale training sets. Moreover, the proposed training set can improve the performance of state-of-the-art methods, providing a paradigm for how to effectively design a training set. Meanwhile, we have proposed a novel bi-stream architecture with a gate control unit and multi-layer attention to take full advantage of the proposed small-scale training set. Extensive experiments have demonstrated that the proposed bi-stream network works well with the small training set, achieving new state-of-the-art performance on five benchmark datasets.