
Cross-layer Feature Pyramid Network for Salient Object Detection

Feature pyramid network (FPN) based models, which fuse the semantics and salient details in a progressive manner, have been proven highly effective in salient object detection. However, it is observed that these models often generate saliency maps with incomplete object structures or unclear object boundaries, due to the indirect information propagation among distant layers that makes such fusion structure less effective. In this work, we propose a novel Cross-layer Feature Pyramid Network (CFPN), in which direct cross-layer communication is enabled to improve the progressive fusion in salient object detection. Specifically, the proposed network first aggregates multi-scale features from different layers into feature maps that have access to both the high- and low-level information. Then, it distributes the aggregated features to all the involved layers to gain access to richer context. In this way, the distributed features per layer own both semantics and salient details from all other layers simultaneously, and suffer reduced loss of important information. Extensive experimental results over six widely used salient object detection benchmarks and with three popular backbones clearly demonstrate that CFPN can accurately locate fairly complete salient regions and effectively segment the object boundaries.



1 Introduction

Figure 1: Illustration of the existing feature pyramid fusion based structure and the proposed CFPN. Top panel: (a) existing FPN based context fusion structure; (b) pipeline of the proposed Cross-layer Feature Pyramid Network (CFPN). Bottom panel: (d) and (e) are examples of saliency maps produced by the vanilla FPN based saliency methods PiCANet [21] and PoolNet [20]; (f) saliency maps generated by our CFPN. Clearly, the saliency maps produced by CFPN show clearer object contours and look closer to the ground truth.

Salient object detection aims to locate and segment the most visually distinctive objects or regions in a given image. It serves as a fundamental step in many computer vision tasks like object segmentation [40, 28], visual tracking [8, 10] and photo cropping [19]. Recently, deep learning based approaches [22, 37, 11, 3, 47, 36, 21, 20, 42, 29] have achieved remarkable performance in salient object detection, outperforming traditional methods [32, 4, 45, 44] by a large margin. Among them, approaches leveraging pyramid-style fusion [18, 30, 27, 41], especially the feature pyramid network (FPN [18]) that progressively fuses multi-scale features in a top-down pathway, have received great attention due to their effectiveness in improving localization accuracy and recovering boundary details.

Despite their good performance, there is still large room for improvement in this fusion based approach. As shown in Fig. 1 (a), the pyramid fusion structure stage-wisely fuses high-level semantics with low-level details via lateral connections. However, two drawbacks exist in such an approach. First, low-level visual information, such as object edges, can only be accessed at the final fusion stage, so the saliency maps predicted by these methods tend to have low-quality object boundaries. Second, as pointed out in [20, 42], in such a pyramid fusion structure the high-level semantics are progressively transmitted to the shallower layers, and hence the semantically salient cues captured by deeper layers may be gradually diluted throughout the progressive fusion. As a result, the predicted results tend to have incomplete object structures or over-predicted foreground regions. To alleviate this limitation, attention models [21, 7, 39], gate functions [1, 51, 46], multi-scale feature integration [36, 49], and extra supervision (e.g., edge detection [20], boundary loss [24]) have been proposed in the literature. However, information propagation remains mainly limited to adjacent layers¹ at each fusion stage. Thus, these models still suffer from similar problems, as illustrated in Fig. 1 (d)-(e).

¹In this paper, layers refer to the side-output features of the backbone.

In this work, we propose a novel Cross-layer Feature Pyramid Network (CFPN) that directly exchanges information across different layers to further boost information propagation for better salient object detection. As illustrated in Fig. 1 (b), CFPN is built on FPN but adopts the following novel architecture designs. First, it contains a Cross-layer Feature Aggregation module (CFA) that incorporates multi-scale features from different and distant layers to allow communication among them. In particular, CFA dynamically generates a set of layer-specific aggregation weights to weigh different layer features according to their usefulness for salient object detection. Second, given the reweighted features from CFA, CFPN contains a Cross-layer Feature Distribution module (CFD) that allocates the aggregated features back to their corresponding layers for the subsequent stage-wise fusion. With CFD, the distributed features at each layer have access to both semantics and fine details from all other layers simultaneously, which reduces the loss of important information during the progressive fusion. As a result, better saliency maps can be obtained, as shown in Fig. 1 (f). Clearly, benefiting from more direct information propagation among all the layers, CFPN can predict more complete salient objects with more accurate boundaries.

Our main contributions are summarized as follows:

  • Through analyzing the performance limitations of FPN-like models, we show that establishing direct information communication across multiple layers is important for salient object detection, which has not been considered before.

  • We design two novel modules, i.e., the cross-layer feature aggregation module (CFA) and the cross-layer feature distribution module (CFD), which together allow efficient information communication across multiple layers.

  • We develop the CFPN model based on the above two modules. It brings consistent performance boosts to a variety of backbones, including VGG-16 [31], ResNet-18 [9] and ResNet-50 [9], for salient object detection, and establishes new state-of-the-art results on multiple benchmarks.

2 Related Work

Early salient object detection methods usually rely on hand-crafted features and heuristic priors [4, 45, 32, 2], achieving only limited performance due to the lack of high-level semantic information. Recently, benefiting from convolutional neural networks (CNNs), salient object detection has enjoyed much progress [11, 14, 47, 38].

Some deep saliency methods [14, 16, 33, 50] divide images into patches or superpixels, and extract single- or multi-scale features from each patch or superpixel to determine whether the corresponding image regions are salient. Though they achieve better performance than traditional methods, processing images in a patch-wise way ignores the essential spatial information of the whole image, which limits the accuracy of detecting entire salient objects.

Figure 2: Overall framework of CFPN. It first extracts multi-level local representations with the backbone. Then, a cross-layer feature aggregation module (CFA) and a cross-layer feature distribution module (CFD) are inserted into the feature pyramid network (FPN) to explore the salient regions. Details of CFA are shown in Fig. 3 and Sec. 3.2; details of CFD are presented in Sec. 3.3.

Some more effective models are developed based on fully convolutional networks (FCNs) [23]. Wang et al. [35] exploit low-level cues to generate guidance saliency maps by leveraging a cascaded FCN. Liu et al. [22] develop a two-stage network that first produces coarse saliency maps and then integrates local context information to refine them recurrently and hierarchically. Hou et al. [11] introduce short connections into the HED [43] architecture and predict salient objects based on aggregated saliency maps from each side-output. Wang et al. [36] propose to generate a coarse prediction map via an FCN and then refine it stage-wisely. Zhang et al. [47] utilize multi-level context information for accurate salient object detection with the HED network. In [37], Wang et al. propose to recurrently locate salient objects with local saliency cues. Zhang et al. [46] extract context-aware multi-level features and utilize a bi-directional gated structure to pass messages between them.

Some works introduce the attention mechanism into the network design to exploit multi-level context information for saliency detection. For example, Liu et al. [21] and Zhang et al. [49] both devise attention guided networks in which multiple layer-wise attentions are progressively integrated for saliency detection. Wang et al. [39] first extend regular attention mechanisms with multi-scale information to represent visual saliency contents, and then further improve salient object segmentation performance using salient edge information.

More recently, feature pyramid networks (FPNs) [18], which are designed in a top-down manner, have received growing attention in salient object detection. Liu et al. [20] propose PoolNet, which plugs the topmost-level information into the FPN fusion branch to detect salient objects jointly with edge detection. Wu et al. [42] propose a cascaded partial decoder framework that cascades high-level feature maps to refine the low-level features. In contrast, we propose to detect salient objects by conducting cross-layer communication to enhance the progressive fusion of the FPN branch.

3 Method

3.1 Overall Architecture

Fig. 2 shows the overall architecture of our Cross-layer Feature Pyramid Network (CFPN). It consists of two novel components, i.e. a Cross-layer Feature Aggregation module (CFA) and a Cross-layer Feature Distribution module (CFD). The CFA first adaptively generates a set of fusion weights for enhancing the original features at each layer by allowing information exchange among multiple layers. With this, the features are enhanced to have richer contexts. After CFA, the CFD allocates the aggregated features back to their corresponding layers via multi-scale pooling. Finally, facilitated by the distributed feature maps, CFPN gradually merges them in a top-down manner, similar to FPN, to produce the final saliency output.
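To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of how the pieces could compose. The module names (backbone, cfa, cfd, heads) and the top-down fusion details are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class CFPN(nn.Module):
    """Sketch of the CFPN pipeline: backbone -> CFA -> CFD -> top-down fusion."""
    def __init__(self, backbone, cfa, cfd, heads):
        super().__init__()
        self.backbone = backbone  # yields the five side-output feature maps
        self.cfa = cfa            # Cross-layer Feature Aggregation (Sec. 3.2)
        self.cfd = cfd            # Cross-layer Feature Distribution (Sec. 3.3)
        self.heads = heads        # per-level convs projecting to a common width

    def forward(self, x):
        feats = self.backbone(x)      # [C1, ..., C5], shallow to deep
        f_global = self.cfa(feats)    # aggregated cross-layer feature map
        dist = self.cfd(f_global)     # [D1, ..., D5], one per fusion level
        # FPN-style top-down, stage-wise fusion over the distributed features
        out = self.heads[-1](dist[-1])
        for level in reversed(range(len(dist) - 1)):
            out = F.interpolate(out, size=dist[level].shape[-2:],
                                mode='bilinear', align_corners=False)
            out = self.heads[level](dist[level]) + out
        return out                    # fed to the final saliency predictor
```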

3.2 Cross-layer Feature Aggregation

As described earlier, FPN based approaches often produce incomplete saliency maps due to the gradual dilution of semantics during the progressive fusion; see Rows 1 and 4 in Fig. 4 for illustration. Though recent works [20, 42] propose to aggregate the topmost layer information into the FPN fusion branch, this problem still exists and harms the final results, as demonstrated in Rows 2 and 5 in Fig. 4. To enable direct and more efficient communication among different layers, we propose to improve the fusion mechanism in FPN by aggregating all layer features simultaneously. Specifically, since the importance of different layer features largely depends on the image content, we devise a Cross-layer Feature Aggregation (CFA) module to adaptively predict a set of weights according to the importance of each level feature for aggregation. In this way, the features more useful for salient object detection are promoted.

Denote the multi-level features output by the first pooling layer and the following four convolutional blocks of the ResNet [9] backbone as $\{F_1, F_2, \dots, F_L\}$. We first append a convolutional layer at each level for dimension reduction. CFA then applies global average pooling at each level to squeeze its spatial information, and further concatenates the channel-wise statistics from all the levels to integrate local and global contexts into a multi-scale representation. Formally, given each level feature $F_l$, CFA calculates the channel-wise global representation $\mathbf{g}$ by

$$\mathbf{g} = \mathcal{C}\Big(\tfrac{1}{H_1 W_1}\textstyle\sum_{x,y} F_1(x,y),\; \dots,\; \tfrac{1}{H_L W_L}\textstyle\sum_{x,y} F_L(x,y)\Big) \in \mathbb{R}^{C}, \tag{1}$$

where $\mathcal{C}(\cdot)$ is the concatenation function, $C$ is the channel number of the global representation $\mathbf{g}$, $l \in \{1, \dots, L\}$ is the index over the local feature levels, and $(x, y)$ is the spatial coordinate of the feature map at each level.

Figure 3: Detailed illustration of the proposed cross-layer feature aggregation module (CFA). The learned layer-wise fusion weights are used to enhance the features of each layer. GAP refers to the global average pooling operation, c⃝ is the feature concatenation operation, and FC refers to the fully connected layer.

We leverage the aggregated information to make the features at each level focus on salient regions instead of the overall feature maps. To this end, CFA learns a layer-wise fusion weight $\mathbf{w}$ from $\mathbf{g}$ using a simple gating mechanism, i.e.,

$$\mathbf{w} = \sigma\big(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{g})\big), \tag{2}$$

where $\mathbf{W}_1 \in \mathbb{R}^{d \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{L \times d}$ are two fully connected layers inspired by SENet [12], $d$ denotes the transformed dimension of the global representation, which is set to 128 empirically, $\delta(\cdot)$ denotes the ReLU activation function, and $\sigma(\cdot)$ is the sigmoid gating function. With the fusion weight $\mathbf{w}$, we dynamically enhance each original layer feature by

$$\tilde{F}_l = w_l \cdot F_l, \tag{3}$$

where $w_l$ is the $l$-th element of $\mathbf{w}$, and $\cdot$ denotes the scalar multiplication between $w_l$ and $F_l$. In this way, the adaptively enhanced multi-level features form a compact global image representation for guiding accurate saliency detection. To be more specific, we first upsample the enhanced features to the same resolution as $\tilde{F}_1$ by bilinear interpolation, and then concatenate them to generate the global feature map $F_G$. Formally, this process can be expressed as

$$F_G = \mathcal{C}\big(\tilde{F}_1,\, \mathrm{Up}(\tilde{F}_2),\, \dots,\, \mathrm{Up}(\tilde{F}_L)\big), \tag{4}$$

where $\mathcal{C}(\cdot)$ refers to the concatenation operation and $\mathrm{Up}(\cdot)$ denotes the upsampling function with bilinear interpolation.
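For readers who prefer code, below is a minimal PyTorch sketch of the CFA computation in Eqs. (1)-(4). The class layout and the sigmoid gate follow the SENet-style description above; the channel handling and naming are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFA(nn.Module):
    """Cross-layer Feature Aggregation sketch, following Eqs. (1)-(4)."""
    def __init__(self, in_channels, hidden_dim=128):
        super().__init__()
        self.num_levels = len(in_channels)
        # Eq. (2): two FC layers (SENet-style gating), one weight per level
        self.fc1 = nn.Linear(sum(in_channels), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, self.num_levels)

    def forward(self, feats):               # feats: list of [B, C_l, H_l, W_l]
        # Eq. (1): per-level global average pooling, then concatenation
        stats = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats]
        g = torch.cat(stats, dim=1)         # [B, sum(C_l)]
        # Eq. (2): layer-wise fusion weights via a simple gating mechanism
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(g))))    # [B, num_levels]
        # Eq. (3): scale each level feature by its scalar weight
        enhanced = [f * w[:, l].view(-1, 1, 1, 1) for l, f in enumerate(feats)]
        # Eq. (4): upsample to the largest resolution and concatenate
        size = enhanced[0].shape[-2:]
        up = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
              for f in enhanced]
        return torch.cat(up, dim=1)         # aggregated global feature map F_G
```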

Figure 4: (a) Example input images and corresponding ground-truth labels. (b-f) Visualizations of progressive fusion feature maps at different levels from FPN (Rows 1, 4), PoolNet [20] (Rows 2, 5), and CFD (Rows 3, 6). (g) Saliency maps generated from FPN (Rows 1, 4), PoolNet [20] (Rows 2, 5), and CFD (Rows 3, 6), respectively. As can be seen, with our CFD, the feature maps at each level contain richer contexts, which more precisely highlight the whole salient objects (Row 3) and effectively suppress the over-predicted foreground regions (Row 6), compared to the vanilla FPN based decoder branch (Rows 1, 2, 4, 5).

3.3 Cross-layer Feature Distribution

Given the aggregated features from the preceding CFA module, a direct way to produce the saliency map is to convolve the integrated feature with a new convolutional layer. Although this method can detect salient objects with richer contexts, the prediction from such single-stage inference is still not satisfactory, as shown in Fig. 6 (b). Instead, we propose to combine the aggregated feature maps with FPN and infer salient regions in a stage-wise fusion manner. Unlike vanilla FPN, each layer feature now has access to the full spectrum of the multi-level representation during the stage-wise fusion, thanks to the aggregation of multi-scale features by the CFA module. Thus, the aforementioned limitations of FPN are largely alleviated. To this end, we devise a Cross-layer Feature Distribution module (CFD) to allocate multi-level features by performing multi-scale pooling over the aggregated feature $F_G$. In this way, both semantics and salient details can be adaptively accessed at each level of fusion, which boosts the stage-wise fusion in FPN and helps better predict the whole salient objects, as shown in Rows 3 and 6 in Fig. 4.

Specifically, CFD first feeds $F_G$ to average pooling layers with pyramid downsampling rates to convert the aggregated features into different scale spaces. Taking the ResNet version of FPN as an example, the downsampling rates corresponding to the five levels are {1, 1, 2, 4, 8}, respectively. Then, a convolutional layer along with batch normalization and an activation function is appended after each downsampling operation to regenerate feature maps with channel numbers {64, 128, 256, 256, 256}, which serve as the distributed features for the corresponding fusion levels. In this way, since the distributed feature maps at each fusion level simultaneously incorporate semantics and fine details, more discriminative and complementary representations are well preserved along the progressive fusion path, and the fusion effect is greatly enhanced for superior performance.
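A minimal PyTorch sketch of this distribution step is given below. The pyramid pooling rates and output channel numbers follow the ResNet example above, while the 3×3 kernel size of the per-level convolution is an assumption, since the exact kernel is not specified here.

```python
import torch.nn as nn
import torch.nn.functional as F

class CFD(nn.Module):
    """Cross-layer Feature Distribution sketch (Sec. 3.3)."""
    def __init__(self, in_channels, out_channels=(64, 128, 256, 256, 256),
                 rates=(1, 1, 2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_channels, c, kernel_size=3, padding=1),
                          nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for c in out_channels])

    def forward(self, f_global):            # f_global: [B, C, H, W] from CFA
        dist = []
        for rate, conv in zip(self.rates, self.convs):
            x = f_global if rate == 1 else F.avg_pool2d(f_global,
                                                        kernel_size=rate,
                                                        stride=rate)
            dist.append(conv(x))            # distributed feature for one level
        return dist                         # [D1, ..., D5], shallow to deep
```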

3.4 Model Training

Given the input image set $X$ and the corresponding annotations $Y$, we train our network with local and global saliency prediction jointly. Based on our experiments, this scheme ensures that salient objects are uniformly highlighted and backgrounds are suppressed.

With the CFA, we obtain the aggregated feature $F_G$. The global saliency map $S_g$ is then predicted with a readout function $\phi_g: F_G \mapsto S_g$. For the local saliency map $S_l$, after learning the local representations from CFD and the stage-wise fusion, a prediction function $\phi_l$ is used to produce $S_l$ directly. The network is trained with the loss function

$$\mathcal{L}(\theta) = \ell_{bce}(S_g, Y) + \ell_{bce}(S_l, Y), \tag{5}$$

where $\theta$ denotes the network parameters used to generate the saliency maps $S_g$ and $S_l$. Here $\ell_{bce}$ is the balanced binary cross entropy loss

$$\ell_{bce}(S, Y) = -\beta \sum_{(x,y)\in Y_+} \log P(x,y) \;-\; (1-\beta) \sum_{(x,y)\in Y_-} \log\big(1 - P(x,y)\big), \tag{6}$$

where $(x,y)$ denotes a pixel coordinate, $Y_+$ and $Y_-$ are the foreground and background label sets, respectively, $\beta = |Y_-| / (|Y_+| + |Y_-|)$ is the loss weight, and $P(x,y)$ is the salient confidence score at $(x,y)$.
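For reference, a short PyTorch sketch of this joint objective follows. The foreground/background weighting uses the common HED-style balanced cross-entropy form given in Eq. (6); the function names are illustrative.

```python
import torch

def balanced_bce(pred, target, eps=1e-6):
    """Balanced binary cross entropy of Eq. (6); `pred` holds saliency
    probabilities in [0, 1] and `target` the binary ground truth."""
    pos = (target > 0.5).float()
    neg = 1.0 - pos
    num_pos, num_neg = pos.sum(), neg.sum()
    beta = num_neg / (num_pos + num_neg + eps)   # weight on foreground pixels
    loss = -(beta * pos * torch.log(pred + eps)
             + (1.0 - beta) * neg * torch.log(1.0 - pred + eps))
    return loss.sum()

def total_loss(s_global, s_local, target):
    """Eq. (5): joint supervision on the global and local saliency maps."""
    return balanced_bce(s_global, target) + balanced_bce(s_local, target)
```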

4 Experiments

4.1 Settings

Datasets

To evaluate the proposed approach, we experiment on six saliency detection benchmark datasets, including ECSSD [44], PASCAL-S [17], DUT-OMRON [45], HKU-IS [15], SOD [26] and DUTS-test [34], which respectively contain 1,000, 850, 5,168, 4,447, 300 and 5,019 natural complex images with manually labeled pixel-wise ground-truths.

Implementation Details

We perform all experiments using the Adam [13] optimizer with an initial learning rate of 5e-5, momentum of 0.9, weight decay of 5e-4, and batch size of 14. Following previous works [20, 42, 21, 49, 46, 36], we use the training set of the DUTS [34] dataset to train the proposed model. The training samples are augmented through random rotation and horizontal flipping. The backbone (VGG-16 [31], ResNet-18 [9], and ResNet-50 [9]) parameters of our network are initialized with the corresponding models pretrained on ImageNet [5], and the rest are randomly initialized. In both the training and testing phases, input images are resized to a fixed resolution. Different from some recent saliency models trained with extra supervision constraints (e.g., boundary [29, 37], edge [7, 39, 20]) or post-processing operations (e.g., CRF [11, 21]), our network simply uses pixel-level saliency annotations, with no extra processing used when generating the final saliency maps.
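As a concrete illustration of these hyper-parameters, a minimal PyTorch optimizer setup is sketched below; the placeholder module merely stands in for the CFPN network.

```python
import torch
import torch.nn as nn

# Illustrative training-setup sketch matching the hyper-parameters stated above.
# `model` is only a placeholder standing in for the CFPN network.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-5,                 # initial learning rate
    betas=(0.9, 0.999),      # beta1 = 0.9 plays the role of the momentum term
    weight_decay=5e-4,
)
```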

Evaluation Metrics

We adopt three metrics: precision-recall (PR) curves, F-measure, and mean absolute error (MAE). For the F-measure, we report the maximum F-measure (MaxF) for evaluating our method and the state-of-the-art approaches, similar to recent studies [47, 49, 11, 37, 48, 24, 6, 20].
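For clarity on how these scores are typically computed, here is a short NumPy sketch of MAE and the maximum F-measure over a threshold sweep; the choice of β² = 0.3 and a 255-threshold sweep follows common practice in the salient object detection literature and is an assumption here.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over a sweep of binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binarized = pred >= t
        tp = np.logical_and(binarized, gt).sum()
        precision = tp / (binarized.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```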

4.2 Ablation Studies

We first analyze the contribution of each module in our method, namely CFA and CFD, to the overall performance. Then, different configurations of feature enhancement strategies are compared to validate our CFA design. Finally, by allocating different numbers of layer features over the aggregated feature map, we verify the effect of the CFD design on improving the progressive fusion for detecting salient regions. All ablation experiments are conducted with the ResNet-50 backbone on the DUT-OMRON [45], PASCAL-S [17] and DUTS-TE [34] datasets.

Figure 5: Visualization of feature maps generated by directly aggregating the original multi-layer features (Row 2) and by the CFA module (Row 3). Clearly, the feature maps from CFA more precisely capture the positions and contours of the salient objects (Row 3).

Effectiveness of CFA and CFD

Based on the ResNet-50 backbone, we compare two variants against the FPN baseline: w/ CFA and w/ CFA + CFD. Fig. 4, Fig. 5, and Fig. 6 show some visual results, and Tab. 1 reports the MaxF and MAE scores of CFA and CFD on two challenging datasets.

  • w/ CFA: Compared with the plain Res50 backbone (Row 1 in Tab. 1, w/o CFA), adding CFA (Row 3 in Tab. 1) obviously brings performance gains in terms of both MaxF and MAE scores. Besides, compared to the vanilla FPN (Row 2 in Tab. 1), CFA consistently performs better, with margins of 2.1% and 2.9% in MaxF on the DUT-O and PASCAL-S datasets, respectively. This validates the effectiveness of our dynamic cross-layer feature aggregation strategy.

    From the visualization results in Fig. 5, comparing Row 2 (w/o CFA) and Row 3 (w/ CFA), the feature maps after CFA provide more discriminative information for distinguishing foregrounds from cluttered backgrounds, and thus locate the entire salient object better than those without CFA. Moreover, by adaptively aggregating multi-layer features, CFA greatly improves the quality of the generated global saliency maps, as shown in Fig. 6 (b). These results clearly demonstrate that saliency detection benefits from dynamic feature aggregation with information exchanged across multiple layers.

    Figure 6: Visualization of saliency maps predicted from the aggregated feature $F_G$, FPN based models, and our method. (a) Source images. (b-f) Results of backbone + CFA, CASNet [42], PiCANet [21], PoolNet [20], and backbone + CFA + CFD. (g) Ground truth.
    No. Module DUT-O[45] PASCAL-S[17]
    MaxF  MAE  MaxF  MAE 
    1 Res50 0.761 0.084 0.833 0.128
    2 Res50 + FPN 0.796 0.065 0.845 0.087
    3 Res50 + CFA 0.817 0.061 0.874 0.079
    4 Res50 + CFA + CFD 0.834 0.053 0.886 0.072
    Table 1: Ablation analysis w.r.t. effectiveness of CFA/CFD. Res50 is the ResNet-50 backbone. CFA and CFD in our method are important for improving performance. Best and second best results are shown in black and red, respectively.
  • w/ CFD: Comparing Rows 3 and 4 in Tab. 1, collaborating with CFD (Row 4) improves the MaxF scores by margins of 1.7% and 1.2% on the DUT-O and PASCAL-S datasets, and decreases the MAE from 0.061 to 0.053 on DUT-O and from 0.079 to 0.072 on PASCAL-S, respectively. Moreover, compared with the FPN results, applying both CFA and CFD greatly improves performance in both MaxF and MAE.

    Fig. 4 shows the feature maps at each level after CFD. Obviously, comparing Rows 3 and 6 (w/ CFD) with Rows 1, 2, 4, and 5 (w/o CFD), the distributed feature maps at each fusion level provide rich semantics and clear object boundaries, ensuring that the entire salient objects can be segmented with sharp boundaries (Rows 3 and 6, column (g)).

    Fig. 6 (b) and (f) show corresponding saliency maps without and with CFD, respectively. Clearly, inaccurate saliency results, e.g., over-predicted or incomplete objects and blurred object boundaries, are greatly improved by collaborating with the CFD. These results consistently demonstrate the effectiveness of CFD.

Module DUT-O[45] PASCAL-S[17] DUTS-TE[34]
MaxF  MAE  MaxF  MAE  MaxF  MAE 
(A) 0.803 0.069 0.861 0.081 0.863 0.049
(B) 0.811 0.064 0.868 0.078 0.870 0.047
(C) 0.813 0.062 0.870 0.079 0.872 0.047
(D) 0.817 0.061 0.874 0.079 0.875 0.045
Table 2: Ablation analysis w.r.t. different configurations of CFA. Our CFA design (D) achieves better performance than the other settings. Best results are shown in red.
Figure 7: Comparison of saliency maps generated by our method and previous state-of-the-art methods. Our method not only locates the entire foreground salient objects but also effectively suppresses cluttered backgrounds, even in challenging scenes. Best viewed in color.

Configurations of CFA

We here analyze the effectiveness of our CFA design, which simultaneously considers multi-level features for adaptive layer-wise reweighting during aggregation. We compare our approach against the following baselines:

  1. (A) No reweighting: the feature maps from each layer are directly concatenated, followed by a conv layer for saliency map prediction.

  2. (B) Non-learnable reweighting: we apply global average pooling (GAP) on each level's features to obtain the layer-wise weights and multiply them with the original features for aggregation before producing the global feature map.

  3. (C) Independent layer-wise reweighting: similar to (B), we apply GAP on each level's features, followed by two fully connected layers, before multiplying with the original features. This is performed independently on each level before concatenation.

  4. (D) Collaborative layer-wise reweighting: we apply our CFA module to learn a set of layer-wise weights by simultaneously considering the information from all layers for aggregation, as discussed in Sec. 3.2.

Tab. 2 reports the quantitative results of the above settings. As can be observed, both (B) and (C) significantly outperform (A). This confirms that dynamically leveraging multi-level features is crucial for saliency detection. However, (B) and (C) are inferior to (D), because these two designs reweight each level's features using only their own global statistics, ignoring the channel interdependencies among different levels. On the contrary, with collaborative layer-wise reweighting, CFA clearly achieves better performance. These results indicate that the design of CFA plays an important role in boosting saliency performance.

Configurations of CFD

To better illustrate the distribution process in CFD, we allocate the aggregated feature to different numbers of level features. Tab. 3 reports the corresponding comparison results in terms of MaxF and MAE on two challenging datasets. Compared with Row 1 in Tab. 3, the CFD module (Rows 2-6) contributes substantially to producing better saliency results. This further demonstrates that the stage-wise fusion performs better than single-stage fusion for saliency detection. Besides, as the aggregated feature is distributed into 1 to 5 levels for progressive fusion, the performance is gradually improved, illustrating that each level feature in CFD plays an important role in the progressive fusion.

No. Settings PASCAL-S[17] DUTS-TE[34]
MaxF  MAE  MaxF  MAE 
1 (D) 0.874 0.079 0.875 0.045
2 1 level 0.880 0.075 0.882 0.040
3 2 levels 0.879 0.076 0.885 0.040
4 3 levels 0.881 0.074 0.887 0.038
5 4 levels 0.883 0.073 0.889 0.038
6 5 levels 0.886 0.072 0.896 0.035
Table 3: Ablation analysis of CFD with different distribution configurations, where the setting indicates how many levels the aggregated feature is distributed to. (D) refers to the w/o CFD setting defined in Tab. 2. Each level feature in CFD contributes substantially to the progressive fusion. Best results are highlighted in red.

4.3 Comparison with State-of-the-Arts

We compare our proposed method with 14 deep saliency detection methods, including DCL [16], DSS [11], NLDF [24], Amulet [47], SRM [36], DGRL [37], R3Net [6], BMPM [46], PAGR [49], PiCANet [21], AFNet [7], BASNet [29], CASNet [42], and PoolNet [20]. For fair comparison, we use the public comparison results provided by [25], whose saliency maps are generated from the source code released by the authors or directly provided by them. We evaluate all the competitors with the same evaluation code.

Visual Comparison

Fig. 7 shows visual comparisons of the proposed model (Ours) with previous state-of-the-art methods. Our model highlights salient objects closest to the ground-truth maps in various challenging scenarios, including images with cluttered backgrounds and foregrounds (Rows 3, 4), objects with appearances similar to the background (Rows 1, 3, 4), multiple instances of the same object (Rows 2, 5), and objects occluded by background objects (Rows 3, 4). More importantly, our model can well segment the entire objects (Rows 1, 2, 3, 5) with clear salient object boundaries (Rows 1-5), demonstrating the effectiveness of the proposed CFPN.

Methods Backbone ECSSD [44] PASCAL-S [17] DUTS-TE [34] HKU-IS [15] SOD [26] DUT-OMRON [45]
MaxF  MAE  MaxF  MAE  MaxF  MAE  MaxF  MAE  MaxF  MAE  MaxF  MAE 
VGG backbone
DCL CVPR2016  [16] VGG-16 0.890 0.088 0.805 0.125 0.782 0.088 0.885 0.072 0.823 0.141 0.739 0.097
DSS CVPR2017  [11] VGG-16 0.916 0.053 0.836 0.096 0.825 0.057 0.911 0.041 0.844 0.121 0.771 0.066
NLDF CVPR2017  [24] VGG-16 0.905 0.063 0.831 0.099 0.812 0.066 0.902 0.048 0.841 0.124 0.753 0.080
Amulet ICCV2017  [47] VGG-16 0.915 0.059 0.837 0.098 0.778 0.085 0.895 0.052 0.806 0.141 0.742 0.098
BMPM CVPR2018  [46] VGG-16 0.929 0.045 0.862 0.074 0.851 0.049 0.921 0.039 0.855 0.107 0.774 0.064
PAGR CVPR2018  [49] VGG-19 0.927 0.061 0.856 0.093 0.855 0.056 0.918 0.048 - - 0.771 0.071
PiCANet CVPR2018  [21] VGG-16 0.931 0.047 0.868 0.077 0.851 0.054 0.921 0.042 0.853 0.102 0.794 0.068
AFNet CVPR2019  [7] VGG-16 0.935 0.042 0.868 0.071 0.862 0.046 0.923 0.036 0.856 0.109 0.797 0.057
PoolNet CVPR2019  [20] VGG-16 0.936 0.047 0.857 0.078 0.876 0.043 0.928 0.035 0.859 0.115 0.817 0.058
Ours (VGG) VGG-16 0.943 0.040 0.874 0.071 0.885 0.038 0.937 0.031 0.870 0.097 0.829 0.054
ResNet backbone
SRM ICCV2017  [36] ResNet-50 0.917 0.054 0.847 0.085 0.827 0.059 0.906 0.046 0.843 0.127 0.769 0.069
DGRL CVPR2018  [37] ResNet-50 0.922 0.041 0.854 0.078 0.829 0.056 0.910 0.036 0.845 0.104 0.774 0.062
R3Net IJCAI2018  [6] ResNeXt 0.931 0.046 0.845 0.097 0.828 0.059 0.917 0.038 0.836 0.136 0.792 0.061
PiCANet CVPR2018  [21] ResNet-50 0.935 0.047 0.881 0.087 0.860 0.051 0.919 0.043 0.858 0.109 0.803 0.065
BASNet CVPR2019  [29] ResNet-34 0.942 0.037 0.854 0.076 0.860 0.047 0.928 0.032 0.851 0.114 0.805 0.056
CASNet CVPR2019  [42] ResNet-50 0.939 0.037 0.864 0.072 0.865 0.043 0.925 0.034 - - 0.797 0.056
PoolNet CVPR2019  [20] ResNet-50 0.940 0.042 0.863 0.075 0.886 0.040 0.934 0.032 0.867 0.100 0.830 0.055
Ours (Res18) ResNet-18 0.942 0.039 0.879 0.074 0.887 0.039 0.933 0.032 0.872 0.085 0.821 0.055
Ours (Res50) ResNet-50 0.948 0.035 0.886 0.072 0.896 0.035 0.940 0.029 0.873 0.083 0.834 0.053
Table 4: Comparisons of max F-measure and MAE values on VGG [31] and ResNet [9] backbones are reported. Results of our method are shown in blue, black, and red, respectively. With different backbones, the proposed method consistently achieves better performance than the previous state-of-the-arts. Best viewed in color.
Figure 8: Precision and recall curves on the ECSSD [44], HKU-IS [14], and DUTS-TE [34] datasets. The proposed method outperforms previous state-of-the-art methods on all the datasets. Best viewed in color.

F-measure and MAE Comparison

Tab. 4 reports the MaxF and MAE scores of our method with different backbones (VGG-16 [31], ResNet-18 [9], and ResNet-50 [9]) compared with other methods. CFPN achieves excellent results on all the datasets with the same backbones across both metrics. In particular, with both the VGG-16 [31] and ResNet-50 [9] backbones, CFPN shows significantly improved MaxF scores compared with the second best PoolNet [20] on the more challenging benchmarks PASCAL-S (VGG-16: 0.874 vs 0.857; ResNet-50: 0.886 vs 0.863), DUTS-TE (VGG-16: 0.885 vs 0.876; ResNet-50: 0.896 vs 0.886), and HKU-IS (VGG-16: 0.937 vs 0.928; ResNet-50: 0.940 vs 0.934). More importantly, when using ResNet-18 [9] as the backbone, our CFPN not only outperforms all the previous VGG backbone approaches significantly, but also beats most of the ResNet-50 based methods, especially on the more challenging datasets including PASCAL-S, SOD, and DUTS-TE. These results clearly illustrate the superior performance and robustness of CFPN.

PR Curves Comparison

We also show the precision-recall curves in Fig. 8. Due to limited space, we only show the PR curves of the previous methods implemented with the ResNet-50 backbone over three widely used datasets. As can be seen, the PR curves of our CFPN, shown as solid red lines, consistently lie above those of all previous models on all datasets. These results convincingly demonstrate the effectiveness of our method.

5 Conclusion

In this paper, we identify a key limitation of FPN based saliency methods, i.e., the indirect information propagation between deeper and shallower layers, and present a novel architecture, CFPN, for salient object detection. It consists of two essential modules: a cross-layer feature aggregation module and a cross-layer feature distribution module. Benefiting from these two collaborative modules, efficient information communication across multiple layers is achieved, which reduces the information loss during the stage-wise fusion of FPN and thus leads to more accurate saliency results. Comprehensive experiments on popular saliency detection benchmarks demonstrate the effectiveness and robustness of the proposed CFPN.

References

  • [1] M. Amirul Islam, M. Kalash, and N. D. Bruce (2018) Revisiting salient object detection: simultaneous detection, ranking, and subitizing of multiple salient objects. In CVPR, pp. 7142–7150. Cited by: §1.
  • [2] A. Borji, M. Cheng, Q. Hou, H. Jiang, and J. Li (2014) Salient object detection: a survey. Computational Visual Media, pp. 1–34. Cited by: §2.
  • [3] S. Chen, X. Tan, B. Wang, and X. Hu (2018) Reverse attention for salient object detection. In ECCV, pp. 236–252. Cited by: §1.
  • [4] M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. Hu (2015) Global contrast based salient region detection. TPAMI 37 (3), pp. 569–582. Cited by: §1, §2.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.1.
  • [6] Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P. Heng (2018) R3Net: recurrent residual refinement network for saliency detection. In IJCAI, pp. 684–690. Cited by: §4.1, §4.3, Table 4.
  • [7] M. Feng, H. Lu, and E. Ding (2019) Attentive feedback network for boundary-aware salient object detection. In CVPR, pp. 1623–1632. Cited by: §1, §4.1, §4.3, Table 4.
  • [8] W. Feng, R. Han, Q. Guo, J. Zhu, and S. Wang (2019) Dynamic saliency-aware regularization for correlation filter-based object tracking. IEEE Transactions on Image Processing 28 (7), pp. 3232–3245. Cited by: §1.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: 3rd item, §3.2, §4.1, §4.3, Table 4.
  • [10] S. Hong, T. You, S. Kwak, and B. Han (2015) Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, pp. 597–606. Cited by: §1.
  • [11] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr (2017) Deeply supervised salient object detection with short connections. In CVPR, pp. 5300–5309. Cited by: §1, §2, §2, §4.1, §4.1, §4.3, Table 4.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §3.2.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [14] G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In CVPR, pp. 5455–5463. Cited by: §2, §2, Figure 8.
  • [15] G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In CVPR, pp. 5455–5463. Cited by: §4.1, Table 4.
  • [16] G. Li and Y. Yu (2016) Deep contrast learning for salient object detection. In CVPR, pp. 478–487. Cited by: §2, §4.3, Table 4.
  • [17] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In CVPR, pp. 280–287. Cited by: §4.1, §4.2, Table 1, Table 2, Table 3, Table 4.
  • [18] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §1, §2.
  • [19] H. Ling (2018) A deep network solution for attention and aesthetics aware photo cropping. IEEE TPAMI. Cited by: §1.
  • [20] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In CVPR, Cited by: Figure 1, §1, §1, §2, Figure 4, §3.2, Figure 6, §4.1, §4.1, §4.3, §4.3, Table 4.
  • [21] N. Liu, J. Han, and M. Yang (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In CVPR, pp. 3089–3098. Cited by: Figure 1, §1, §1, §2, Figure 6, §4.1, §4.3, Table 4.
  • [22] N. Liu and J. Han (2016) DHSNet: deep hierarchical saliency network for salient object detection. In CVPR, pp. 678–686. Cited by: §1, §2.
  • [23] J. Long, E. Shelhamer, and T. Darrell (2014) Fully convolutional networks for semantic segmentation. TPAMI 39 (4), pp. 640–651. Cited by: §2.
  • [24] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P. Jodoin (2017) Non-local deep features for salient object detection. In CVPR, pp. 6593–6601. Cited by: §1, §4.1, §4.3, Table 4.
  • [25] Mengyang Feng (2018) Evaluation toolbox for salient object detection. Note: https://github.com/ArcherFMY/sal_eval_toolbox Cited by: §4.3.
  • [26] V. Movahedi and J. H. Elder (2010) Design and perceptual validation of performance measures for salient object segmentation. In CVPRW, pp. 49–56. Cited by: §4.1, Table 4.
  • [27] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: §1.
  • [28] F. Porikli (2015) Saliency-aware geodesic video object segmentation. In CVPR, pp. 3395–3402. Cited by: §1.
  • [29] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In CVPR, pp. 7479–7489. Cited by: §1, §4.1, §4.3.
  • [30] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
  • [31] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: 3rd item, §4.1, §4.3, Table 4.
  • [32] J. Wang, H. Jiang, Z. Yuan, M. M. Cheng, X. Hu, and N. Zheng (2017) Salient object detection: a discriminative regional feature integration approach. IJCV 123 (2), pp. 1–18. Cited by: §1, §2.
  • [33] L. Wang, H. Lu, X. Ruan, and M. Yang (2015) Deep networks for saliency detection via local estimation and global search. In CVPR, pp. 3183–3192. Cited by: §2.
  • [34] L. Wang, H. Lu, Y. Wang, M. Feng, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In CVPR, pp. 136–145. Cited by: Figure 8, §4.1, §4.1, §4.2, Table 2, Table 3, Table 4.
  • [35] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan (2016) Saliency detection with recurrent fully convolutional networks. In ECCV, pp. 825–841. Cited by: §2.
  • [36] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu (2017) A stagewise refinement model for detecting salient objects in images. In ICCV, pp. 4019–4028. Cited by: §1, §1, §2, §4.1, §4.3, Table 4.
  • [37] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In CVPR, pp. 3127–3135. Cited by: §1, §2, §4.1, §4.1, §4.3, Table 4.
  • [38] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling (2019) Salient object detection in the deep learning era: an in-depth survey. arXiv preprint arXiv:1904.09146. Cited by: §2.
  • [39] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji (2019) Salient object detection with pyramid attention and salient edges. In CVPR, pp. 1448–1457. Cited by: §1, §2, §4.1.
  • [40] Y. Wei, X. Liang, Y. Chen, X. Shen, M. M. Cheng, J. Feng, Y. Zhao, and S. Yan (2015) STC: a simple to complex framework for weakly-supervised semantic segmentation. TPAMI 39 (11), pp. 2314–2320. Cited by: §1.
  • [41] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding (2019) A mutual learning method for salient object detection with intertwined multi-supervision. In CVPR, pp. 8150–8159. Cited by: §1.
  • [42] Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In CVPR, pp. 3907–3916. Cited by: §1, §1, §2, §3.2, Figure 6, §4.1, §4.3, Table 4.
  • [43] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In ICCV, pp. 1395–1403. Cited by: §2.
  • [44] Q. Yan, L. Xu, J. Shi, and J. Jia (2013) Hierarchical saliency detection. In CVPR, pp. 1155–1162. Cited by: §1, Figure 8, §4.1, Table 4.
  • [45] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR, pp. 3166–3173. Cited by: §1, §2, §4.1, §4.2, Table 1, Table 2, Table 4.
  • [46] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang (2018) A bi-directional message passing model for salient object detection. In CVPR, pp. 1741–1750. Cited by: §1, §2, §4.1, §4.3, Table 4.
  • [47] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan (2017) Amulet: aggregating multi-level convolutional features for salient object detection. In ICCV, pp. 202–211. Cited by: §1, §2, §2, §4.1, §4.3, Table 4.
  • [48] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin (2017) Learning uncertain convolutional features for accurate saliency detection. In ICCV, pp. 212–221. Cited by: §4.1.
  • [49] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In CVPR, pp. 714–722. Cited by: §1, §2, §4.1, §4.1, §4.3, Table 4.
  • [50] R. Zhao, W. Ouyang, H. Li, and X. Wang (2015) Saliency detection by multi-context deep learning. In CVPR, pp. 1265–1274. Cited by: §2.
  • [51] T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. In CVPR, pp. 3085–3094. Cited by: §1.