Feature matters for salient object detection. Existing methods mainly focus on designing sophisticated structures to incorporate multi-level features and filter out cluttered features. We present the Progressive Feature Polishing Network (PFPN), a simple yet effective framework that progressively polishes the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher-level context information. Moreover, it keeps the dimensions and hierarchical structure of the feature maps, which makes it flexible to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically with an increasing number of FPMs. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics.
Salient object detection, which aims to extract the most attractive regions in an image, is widely used in computer vision tasks, including video compression, visual tracking, and image retrieval.
Benefiting from the hierarchical structure of CNNs, deep models can extract multi-level features that contain both low-level local details and high-level global semantics. To make use of both detailed and semantic information, the multi-level context can be integrated in a straightforward way, by concatenation or element-wise addition of features from different levels. However, as the features can be cluttered and inaccurate at some levels, this kind of simple feature integration tends to give suboptimal results. Therefore, most recent progress focuses on designing a sophisticated integration of these multi-level features. We point out three drawbacks of current methods. First, many methods [40, 19] employ a U-Net-like structure in which information flows from high level to low level during feature aggregation, while BMPM uses bidirectional message passing between consecutive levels to incorporate semantic concepts and fine details. However, these integrations, performed indirectly among multi-level features, may be deficient because of the long-term dependency problem they incur. Second, other works [42, 37, 12] recursively refine the predicted results in a deep-to-shallow manner to supplement details. However, predicted saliency maps have already lost rich information, so the capability of such refinement is limited. Third, although valuable human priors can be introduced by designing sophisticated structures to incorporate multi-level features, this process can be complicated and the resulting structure might lack generality.
To make full use of semantic and detailed information, we present a novel Progressive Feature Polishing Network (PFPN) for salient object detection, which is simple and tidy, yet effective. First, PFPN progressively polishes the features of every level in parallel, in a recurrent manner. With gradual polishing, cluttered information is dropped and the multi-level features are rectified. As this parallel structure keeps the feature levels of the backbone, common decoder structures can be easily applied. In one feature polishing step, the feature of each level is updated directly with the fusion of all deeper-level features. Therefore, high-level semantic information is integrated directly into all low-level features, which avoids the long-term dependency problem. In summary, the progressive feature polishing network greatly improves the multi-level representations, and even with the simplest concatenation-based feature fusion, PFPN detects salient objects accurately. Our contributions are as follows:
We propose a novel multi-level representation refinement method for salient object detection, as well as a simple and tidy framework PFPN to progressively polish the features in a recurrent manner.
For each polishing step, we propose the FPM to refine the representations, which preserves the dimensions and hierarchical structure of the feature maps. It integrates high level semantic information directly to all low level features to avoid the long-term dependency problem.
Empirical evaluations show that our proposed method significantly outperforms state-of-the-art methods on five benchmark datasets under various evaluation metrics.
Traditional salient object detection methods mainly focus on heuristic saliency priors and low-level handcrafted features, such as center prior, boundary prior, and color contrast.
In recent years, deep convolutional networks have achieved impressive results in various computer vision tasks and have also been introduced to salient object detection. Early attempts at deep saliency models include the work of Li, which exploits multi-scale CNN contextual features to predict the saliency of each image segment, and that of Zhao, which utilizes both local and global context to score each superpixel. While these methods achieve obvious improvements over handcrafted methods, assigning a single saliency prediction to each image patch discards spatial information and results in low prediction resolution. To solve this problem, many methods based on the Fully Convolutional Network have been proposed to generate pixel-wise saliency. Roughly speaking, these methods can be categorized into two lines.
Although multi-level features extracted by a CNN contain rich information about both high-level semantics and low-level details, the reduced spatial resolution and the likely inaccuracy at some feature levels make the design of sophisticated feature integration structures an active line of work. Lin adopts RefineNet to gradually merge high-level and low-level features from the backbone in a bottom-up manner. Wang propose to better localize salient objects by exploiting contextual information with an attention mechanism. Zhuge employ a structure which embeds prior information to generate attentive features and filter out cluttered information. Unlike the above methods, which design sophisticated structures for information fusion, we use a simple structure to polish multi-level features recurrently and in parallel. Meanwhile, the multi-level structure is kept, so the polished multi-level features can be used in common decoder modules. Zhang use bidirectional message passing between consecutive levels to incorporate semantic concepts and fine details. However, incorporating features only between adjacent levels results in a long-term dependency problem. Our method directly aggregates the features of all higher levels at each polishing step, so high-level information can be fused sufficiently into lower-level features over multiple steps.
Another line focuses on progressively refining the predicted saliency map by rectifying previous errors. DHSNet first learns a coarse global prediction and then progressively refines the details of the saliency map by integrating local context features. Wang propose to recurrently apply an encoder-decoder structure to the previously predicted saliency map to perform refinement. DSS adopts short connections to progressively refine saliency maps. CCNN cascades local saliency refiners to refine the details of the initially predicted saliency map. However, since the predicted results suffer more severe information loss than the original representations, the refinement might be deficient. Different from these methods, our approach progressively improves the multi-level representations in a recurrent manner instead of attempting to rectify the predicted results. Besides, most previous refinements are performed in a deep-to-shallow manner, in which only the features specific to each step are exploited at that step. In contrast, our method polishes the representations at every level with multi-level context information at each step. Moreover, many methods utilize an extra refinement module, either as part of their model or as a post-process, to further recover the details of the predicted results, such as DenseCRF [12, 19], BRN and GFRN. In contrast, our method delivers superior performance without such modules.
In this section, we first describe the architecture overview of the proposed Progressive Feature Polishing Network (PFPN). Then we detail the structure of the Feature Polishing Module (FPM) and the design of feature fusion module. Finally we present some implementation details.
In this work, we propose the Progressive Feature Polishing Network (PFPN) for salient object detection. An overview of this architecture is shown in Fig. 2. Our model consists of four kinds of modules: the Backbone, two Transition Modules (TM), a series of Feature Polishing Modules (FPM), and a Fusion Module (FM).
The input image is first fed into the backbone network to extract multi-scale features. The choice of backbone is flexible; ResNet-101 is used in this paper to be consistent with previous work, and results of a VGG-16 version are also reported in the experiments. Specifically, the ResNet-101 network can be grouped into five blocks by a series of downsampling operations with a stride of 2. The outputs of these blocks are used as the multi-level feature maps: Conv-1, Res-2, Res-3, Res-4, Res-5. To reduce feature dimensions and keep the implementation tidy, these feature maps are passed through the first transition module (TM1 in Fig. 2), in which the features at each level are transformed in parallel to the same number of channels (256 in our implementation) by 1x1 convolutions. A series of Feature Polishing Modules (FPMs) is then applied to these features successively to improve them progressively; Fig. 2 shows an example. In each FPM, high-level features are directly introduced to all lower-level features, which is efficient and loses notably less information than indirect propagation. The inputs and outputs of an FPM have the same dimensions, and all FPMs share the same network structure. We use different parameters for each FPM in the expectation that they learn to focus on increasingly refined details. Experiments show that the model with two FPMs outperforms the state of the art and runs at 20 fps, while the accuracy of saliency predictions converges at three FPMs with only marginal improvements over two. We then apply the second transition module (TM2 in Fig. 2), which consists of a bilinear upsampling followed by a 1x1 convolution, to interpolate all features to the original input resolution and reduce their dimension to 32. Finally, a fusion module (FM) integrates the multi-scale features and produces the final saliency map. Owing to the more accurate representations after the FPMs, the FM is implemented with a simple concatenation strategy. Our network is trained in an end-to-end manner.
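The data flow described above can be followed by tracking tensor shapes alone. The sketch below is only bookkeeping, not a network: it assumes a 256x256 input, the standard ResNet-101 channel widths (64, 256, 512, 1024, 2048), and that each of the five levels halves the resolution of the previous one. The function name `pfpn_shapes` and its arguments are hypothetical.

```python
# Shape bookkeeping for the PFPN pipeline (illustrative; no learning involved).
# Assumptions: square input, standard ResNet-101 channel widths, and a total
# stride of 2**(i+1) for backbone level i (0-based).

BACKBONE_CHANNELS = [64, 256, 512, 1024, 2048]  # Conv-1, Res-2, ..., Res-5
TM1_CHANNELS = 256  # every level is projected to this width by a 1x1 conv
TM2_CHANNELS = 32   # width of every level after the second transition module


def pfpn_shapes(input_size=256, num_fpm=2):
    """Return the (channels, height, width) of every level after each stage."""
    sizes = [input_size // (2 ** (i + 1)) for i in range(len(BACKBONE_CHANNELS))]
    stages = {}
    stages["backbone"] = [(c, s, s) for c, s in zip(BACKBONE_CHANNELS, sizes)]
    stages["tm1"] = [(TM1_CHANNELS, s, s) for s in sizes]
    # Each FPM keeps both the channel width and the spatial size of every level.
    for t in range(1, num_fpm + 1):
        stages[f"fpm{t}"] = list(stages["tm1"])
    # TM2 upsamples every level to the input resolution and reduces channels.
    stages["tm2"] = [(TM2_CHANNELS, input_size, input_size) for _ in sizes]
    # The fusion module concatenates all levels along the channel axis.
    stages["fm_input"] = (TM2_CHANNELS * len(sizes), input_size, input_size)
    return stages
```

Because every FPM preserves the shapes produced by TM1, changing `num_fpm` adds or removes stages without affecting anything downstream, which is what allows the number of FPMs to be varied freely.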
The Feature Polishing Module (FPM) plays a core role in our proposed PFPN. FPM is a simple yet effective module that can be combined with any deep convolutional backbone to polish the feature representation. It keeps the multi-level structure of the representations generated by CNNs, such as the backbone or a preceding FPM, and learns to update them with residual connections.
For $N$ feature maps $F = \{f_i \mid i = 1, \dots, N\}$, ordered from shallow to deep, FPM generates $N$ polished feature maps $F^p = \{f_i^p \mid i = 1, \dots, N\}$ with the same sizes. As shown in Fig. 2, FPM consists of $N$ parallel FPM blocks, each of which corresponds to a separate feature map and is denoted as FPM-$k$. Specifically, a series of short connections from the deeper side to the shallower side are adopted. As a result, higher-level features with global information are injected directly into lower ones to help better discover the salient regions. Taking FPM1-3 in Fig. 2 as an example, all features of Res-3, Res-4, Res-5 are utilized through short connections to update the features of Res-3. FPM also takes advantage of residual connections so that it can update the features and gradually filter out the cluttered information. This is illustrated by the connection surrounding each FPM block in Fig. 2.
The FPM-$k$ block is formulated as Eq. 1:

$$
c_j^k = \begin{cases} \mathrm{upsample}(\mathrm{conv}(f_j)), & j > k \\ \mathrm{conv}(f_j), & j = k \end{cases}
\qquad
u^k = \mathrm{conv}_{1\times 1}\big(\mathrm{cat}(c_k^k, c_{k+1}^k, \dots, c_N^k)\big),
\qquad
f_k^p = f_k + u^k \tag{1}
$$

It takes in the $N - k + 1$ feature maps $\{f_j \mid j = k, \dots, N\}$. Each feature map $f_j$ is first passed through a convolutional layer and, for $j > k$, upsampled to the size of $f_k$, giving $c_j^k$. These features are then combined by concatenation along channels and fused by a 1x1 convolutional layer to reduce the dimension, obtaining $u^k$. Finally, $u^k$ is used as a residual function to update the original feature map $f_k$, computing $f_k^p$ by element-wise addition. An example of this procedure is illustrated in Fig. 3.
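The concatenate, fuse, and add pattern of an FPM block can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the paper's implementation: feature maps are square `(C, H, W)` arrays ordered from shallow to deep, resizing uses nearest-neighbour lookup, any per-level convolution applied before concatenation is omitted, and the fusion weights are plain matrices. All function names are hypothetical, and levels are indexed from 0.

```python
import numpy as np


def upsample_nearest(x, size):
    """Resize a (C, H, W) map to (C, size, size) by nearest-neighbour lookup."""
    _, h, w = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[:, rows][:, :, cols]


def conv1x1(x, weight):
    """Apply a 1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def fpm_block(features, k, weight):
    """Polish level k using level k and every deeper level.

    features: list of (C, H_i, W_i) arrays ordered from shallow to deep.
    weight:   (C, C * (len(features) - k)) fusion weights for this block.
    """
    target = features[k]
    size = target.shape[1]
    # Bring level k and all deeper levels to the resolution of level k.
    resized = [upsample_nearest(f, size) for f in features[k:]]
    # Concatenate along channels, then reduce back to C channels.
    fused = conv1x1(np.concatenate(resized, axis=0), weight)
    # Residual update: the output has exactly the shape of level k.
    return target + fused


def fpm(features, weights):
    """Run every FPM block on the same, not yet polished, inputs."""
    return [fpm_block(features, k, weights[k]) for k in range(len(features))]
```

Since `fpm` returns a list with the same shapes as its input, several such modules with separate weights can be chained by feeding the output of one into the next.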
We use the Fusion Module (FM) to finally integrate the multi-level features and detect salient objects. As a result of our refined features, the FM can be quite simple. As illustrated in Fig. 2, the multi-level features from TM2 are first concatenated and then fed into two successive convolutional layers with 3x3 kernels. Finally, a 1x1 convolutional layer followed by a sigmoid function is applied to obtain the final saliency map.
We use the cross-entropy loss between the final predicted saliency map and the ground truth to train our model end-to-end. Following previous works [12, 19, 42], side outputs are also employed to calculate auxiliary losses. In detail, 1x1 convolutional layers are applied to the multi-level feature maps before the Fusion Module to obtain a series of intermediate results. The total loss is as follows:

$$
L = L_{ce}(P, G) + \sum_{i} \lambda_i \, L_{ce}(P_i, G)
$$

where $P$ is the final result of our model, $P_i$ denotes the $i$-th intermediate result, and $G$ represents the ground truth. The weights $\lambda_i$ are set empirically to bias towards the final result.
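A loss of this form, a cross-entropy term on the final map plus down-weighted cross-entropy terms on the side outputs, can be sketched as follows. The weight value `0.5` is a hypothetical placeholder, since the text only states that the weights are set empirically to favour the final result; the function names are likewise illustrative, and all predictions are assumed to be probability maps at the resolution of the ground truth.

```python
import numpy as np


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a probability map and a binary mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    per_pixel = target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)
    return float(-np.mean(per_pixel))


def total_loss(final_pred, side_preds, target, side_weight=0.5):
    """Loss on the final map plus weighted auxiliary losses on side outputs.

    side_weight is an assumed value below 1 so that the final prediction
    dominates; the source does not state the actual weights.
    """
    loss = bce(final_pred, target)
    for pred in side_preds:
        loss += side_weight * bce(pred, target)
    return loss
```

With two side outputs and `side_weight=0.5`, the auxiliary terms together contribute at most as much as the final term when all predictions are equally wrong.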
We implement our method with the PyTorch framework. The last average pooling layer and the fully connected layer of the pre-trained ResNet-101 are removed. We initialize the backbone layers with weights pre-trained on the ImageNet classification task and randomly initialize the remaining layers. Following the source code of PiCA released by its authors and FQN, we freeze the BatchNorm statistics of the backbone.
Table 1: Quantitative comparison on the five benchmark datasets. For each dataset, the columns report MAE, max F, mean F, and S-measure.
We conduct experiments on five well-known benchmark datasets: ECSSD, HKU-IS, PASCAL-S, DUT-OMRON and DUTS. ECSSD consists of 1,000 images and contains salient objects with complex structures at multiple scales. HKU-IS consists of 4,447 images, most of which are chosen to contain multiple disconnected salient objects. PASCAL-S includes 850 natural images, which are selected from the PASCAL VOC 2010 segmentation challenge and are pixel-wise annotated. DUT-OMRON is a challenging dataset of 5,168 high-quality images, each containing one or more salient objects in fairly complex scenes. DUTS is a large-scale dataset of 15,572 images selected from the ImageNet DET and SUN datasets. It is split into two parts: 10,553 images for training and 5,019 for testing. We evaluate the performance of different salient object detection algorithms with four main metrics: precision-recall curves (PR curves), F-measure, mean absolute error (MAE), and S-measure. By binarizing the predicted saliency map with thresholds in [0, 255], a sequence of precision and recall pairs is calculated for each image of the dataset. The PR curve is plotted using the average precision and recall of the dataset at different thresholds. F-measure is calculated as a weighted combination of Precision and Recall:

$$
F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}
$$

where $\beta^2$ is usually set to 0.3 to emphasize Precision more than Recall, as suggested in prior work.
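For a single image, MAE and the F-measure at one threshold can be computed as in the sketch below. It assumes the saliency map is scaled to [0, 1] and the ground truth is a binary mask, uses the common setting of 0.3 for the squared beta weight, and adds a small `eps` to avoid division by zero. Dataset-level protocols that average precision and recall over all images before computing the F-measure are not reproduced here; the function names are hypothetical.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    return float(np.mean(np.abs(pred - gt)))


def f_measure(pred, gt, threshold, beta2=0.3, eps=1e-8):
    """F-measure of the saliency map binarized at `threshold`."""
    binary = pred >= threshold
    gt = gt.astype(bool)
    true_positive = np.logical_and(binary, gt).sum()
    precision = true_positive / (binary.sum() + eps)
    recall = true_positive / (gt.sum() + eps)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))


def f_curve(pred, gt, num_thresholds=256):
    """F-measure at evenly spaced thresholds; its maximum is a per-image max F."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return np.array([f_measure(pred, gt, t) for t in thresholds])
```

Because the squared beta weight is below 1, a prediction with low precision and perfect recall scores lower than one with perfect precision and equally low recall.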
Following the conventional practice [19, 37, 37], our proposed model is trained on the training set of the DUTS dataset. We also perform a data augmentation similar to  during training to mitigate over-fitting. Specifically, the image is first resized to 300x300 and then a 256x256 image patch is randomly cropped from it. Random horizontal flipping is also applied. We use the Adam optimizer to train our model, without validation, until the training loss converges. The initial learning rate is set to 1e-4 and the overall training procedure takes about 16,000 iterations. For testing, the images are scaled to 256x256 before being fed into the network, and the predicted saliency maps are then bilinearly interpolated to the size of the original image.
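The augmentation steps (resize to 300x300, random 256x256 crop, random horizontal flip) must be applied identically to the image and its ground-truth mask. A minimal NumPy sketch follows; the nearest-neighbour resize and the flip probability of 0.5 are assumptions made to keep the example dependency-free, since the source specifies neither, and the function names are hypothetical.

```python
import numpy as np


def resize_nearest(array, size):
    """Resize an (H, W, ...) array to (size, size, ...) by nearest-neighbour lookup."""
    h, w = array.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return array[rows][:, cols]


def augment(image, mask, rng, resize_to=300, crop=256):
    """Resize, take one random crop, and flip horizontally with probability 0.5.

    The same geometric transform is applied to the image and to its mask.
    """
    image = resize_nearest(image, resize_to)
    mask = resize_nearest(mask, resize_to)
    top = int(rng.integers(0, resize_to - crop + 1))
    left = int(rng.integers(0, resize_to - crop + 1))
    image = image[top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image, mask
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the crop offsets and flips reproducible across runs.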
We compare our proposed model with 16 state-of-the-art methods. For fair comparison, the metrics of these 16 methods are obtained from a public leaderboard or their original papers, and we evaluate our method in the same way as the leaderboard. Unless otherwise mentioned, we report the results of our model with ResNet-101 as the backbone and two FPMs. The saliency maps for visual comparisons are provided by the authors.
Quantitative Evaluation. The quantitative performance of all methods can be found in Table 1 and Fig. 4. Table 1 shows the comparisons of MAE and F-measure. Note that is adopted by almost all methods except DEF , which only reports . We report the MAE, F-measure and S-measure of our method for a direct comparison. Our ResNet-based model achieves the best results and consistently outperforms all other methods on all five datasets under different measurements, demonstrating the effectiveness of our proposed model. Moreover, our VGG-based model also ranks at the top among VGG-based methods. This confirms that our proposed feature polishing method is effective and compatible with different backbone structures. In Fig. 4, we compare the PR curves and F-measure curves of different approaches on the five datasets. The PR curves of our method are better than those of other methods by a significant margin. In addition, the F-measure curves of our method lie consistently above those of other methods. This verifies the robustness of our method.
Visual Comparison. Fig. 5 shows some example results of our model along with six other state-of-the-art methods for visual comparison. We observe that our method gives superior results for complex backgrounds (rows 1-2) and low-contrast scenes (rows 3-4). It also recovers meticulous details (rows 5-6; note the suspension cable of the Golden Gate Bridge and the legs of the mantis). From this comparison, we can see that our method performs robustly in the face of these challenges and produces better saliency maps.
To demonstrate that our proposed method can work with different backbones, we describe how it is applied to the multi-level features computed from VGG-16. This adaptation is straightforward. VGG-16 contains 13 convolutional layers and 2 fully connected layers, along with 5 max-pooling layers which split the network into 6 blocks. The 2 fully connected layers are first transformed into convolutional layers, and then the 6 blocks generate outputs with decreasing spatial resolutions, i.e. 256, 128, 64, 32, 16, 8, if the input image is set to the fixed size of 256x256. These multi-level feature maps are fed into PFPN as described in the Approach section to obtain the saliency map. Table 1 shows the comparisons with other VGG-based state-of-the-art methods and Table 2 shows the evaluations with various numbers of FPMs. Our method based on VGG-16 also shows excellent performance, which confirms that it is effective for feature refinement and generalizes to different backbones.
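Table 2: Ablation on the number of FPMs on ECSSD and DUTS-TE. PFPN-V denotes the VGG-16 based model; ‡ marks the variant whose two FPMs share the same weights.

| Model | ECSSD MAE | ECSSD max F | ECSSD mean F | ECSSD S | DUTS-TE MAE | DUTS-TE max F | DUTS-TE mean F | DUTS-TE S |
|---|---|---|---|---|---|---|---|---|
| PFPN (0 FPM) | 0.048 | 0.928 | 0.894 | 0.911 | 0.052 | 0.851 | 0.811 | 0.862 |
| PFPN (1 FPM) | 0.036 | 0.946 | 0.921 | 0.928 | 0.040 | 0.884 | 0.848 | 0.883 |
| PFPN (2 FPM)‡ | 0.041 | 0.944 | 0.914 | 0.924 | 0.043 | 0.876 | 0.840 | 0.884 |
| PFPN (2 FPM) | 0.033 | 0.949 | 0.926 | 0.932 | 0.037 | 0.888 | 0.858 | 0.887 |
| PFPN (3 FPM) | 0.032 | 0.950 | 0.929 | 0.932 | 0.037 | 0.888 | 0.862 | 0.889 |
| PFPN-V (0 FPM) | 0.057 | 0.911 | 0.883 | 0.890 | 0.058 | 0.825 | 0.793 | 0.837 |
| PFPN-V (1 FPM) | 0.045 | 0.931 | 0.905 | 0.908 | 0.046 | 0.853 | 0.825 | 0.862 |
| PFPN-V (2 FPM) | 0.040 | 0.938 | 0.915 | 0.916 | 0.042 | 0.868 | 0.836 | 0.864 |
| PFPN-V (3 FPM) | 0.040 | 0.939 | 0.915 | 0.920 | 0.043 | 0.868 | 0.839 | 0.873 |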
Feature Polishing Module. To confirm the effectiveness of the proposed FPM, we conduct an ablation study that varies the number of FPMs employed. The results with the number of FPMs ranging from 0 to 3 on ECSSD and DUTS-TE are shown in Table 2. With no FPM, the two transition modules are directly connected; otherwise, the FPMs are applied in sequence between the two transition modules, as illustrated in Fig. 2. Other settings, including the loss and training strategy, are kept the same for these evaluations. For the ResNet-based models, FPM significantly boosts performance over the plain baseline with no FPM, and performance increases gradually as more FPMs are used. The models with 1 FPM and 2 FPMs each bring a clear improvement over the previous setting. With 3 FPMs, the accuracy converges and the improvement is marginal. Similar phenomena can be observed with the VGG-based PFPN. This supports our argument that multiple FPMs progressively polish the representations so as to improve the final results. We suppose the accuracy converges due to the limited scale of the current dataset. We also conduct an experiment in which the 2 FPMs of a PFPN share the same weights. Compared with PFPN (0 FPM), this variant improves greatly; however, compared with PFPN (1 FPM) and PFPN (2 FPMs), its performance degrades. Although a shared FPM can still refine multi-level features, separate weights let each FPM learn to refine features according to its refinement stage.
Table 3: Effect of DenseCRF post-processing, reported as MAE, max F, mean F, and S-measure.
DenseCRF. The densely connected conditional random field (DenseCRF) is widely used by many methods [12, 19] as a post-process to refine the predicted results. We investigate the effects of DenseCRF on our method. The results are listed in Table 3. DSS reports results with DenseCRF. Results both with and without DenseCRF are reported for PiCA and our method. We can see that previous works benefit from the long-range pixel similarity prior brought by DenseCRF. Furthermore, even without DenseCRF post-processing, our method performs better than other models with DenseCRF. However, DenseCRF does not benefit our method: it only improves the MAE on a few datasets, while decreasing the F-measure on all datasets. This indicates that our method already captures sufficient information about the salient objects from the data, so the heuristic prior fails to provide further help.
In this section, we present an intuitive understanding of the feature polishing procedure. Since directly visualizing the intermediate features is not straightforward, we instead compare the results of our model with different numbers of FPMs. Several example saliency maps are shown in Fig. 1 and Fig. 6. The quality of the predicted saliency maps improves monotonically with an increasing number of FPMs, which is consistent with the quantitative results in Table 2. Specifically, the model with no FPM can roughly detect the salient objects in the images, which benefits from the rich semantic information of multi-level feature maps. As more FPMs are employed, more details are recovered and cluttered results are eliminated.
We have presented a novel Progressive Feature Polishing Network for salient object detection. PFPN focuses on improving the multi-level representations by progressively polishing the features in a recurrent manner. For each polishing step, a Feature Polishing Module is designed to directly integrate high level semantic concepts to all lower level features, which reduces information loss. Although the overall structure of PFPN is quite simple and tidy, empirical evaluations show that our method significantly outperforms 16 state-of-the-art methods on five benchmark datasets under various evaluation metrics.
This work was supported by Alibaba Group through Alibaba Innovative Research Program. Xiaogang Jin is supported by the Key Research and Development Program of Zhejiang Province (No. 2018C03055) and the National Natural Science Foundation of China (Grant Nos. 61972344, 61732015).
IEEE Transactions on Neural Networks 5 (2), pp. 157–166.
2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 23–30.
Visual saliency based on multiscale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455–5463.
Salient object detection using cascaded convolutional neural networks and adversarial learning. IEEE Transactions on Multimedia.
Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.
Saliency detection by multi-context deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1274.
Thirty-Third AAAI Conference on Artificial Intelligence.