Is Depth Really Necessary for Salient Object Detection?

05/30/2020 · by Jiawei Zhao, et al.

Salient object detection (SOD) is a crucial and preliminary task for many computer vision applications, and it has made steady progress with deep CNNs. Most existing methods rely mainly on RGB information to distinguish salient objects, which struggles in some complex scenarios. To address this, many recent RGBD-based networks adopt the depth map as an independent input and fuse its features with the RGB information. Taking the advantages of both RGB and RGBD methods, we propose a novel depth-aware salient object detection framework with the following designs: 1) it uses depth information only as training supervision and relies solely on RGB information in the testing phase; 2) it comprehensively optimizes SOD features with multi-level depth-aware regularizations; 3) the depth information also serves as an error-weighted map to correct the segmentation process. Combining these designs, we make the first attempt to realize a unified depth-aware framework that takes only RGB information as input for inference. It not only surpasses the state-of-the-art performance on five public RGB SOD benchmarks, but also outperforms RGBD-based methods on five RGBD benchmarks by a large margin, while using less information and a lighter-weight implementation. The code and model will be publicly available.


1. Introduction

Salient object detection (SOD) aims to detect and segment the objects that visually attract human attention most. With the proposal of large datasets (Ju et al., 2014; Peng et al., 2014; Niu et al., 2012; Yan et al., 2013; Li and Yu, 2015; Wang et al., 2017) and deep learning techniques (He et al., 2016; Long et al., 2015), recent works have made significant progress in accurately segmenting salient objects, which serves as an important prerequisite for a wide range of computer vision tasks, such as semantic segmentation (Lai and Gong, 2016), visual tracking (Hong et al., 2015), and image retrieval (Shao and Brady, 2006).

Figure 1. Motivation of our depth-aware salient object detection. b): captured ground-truth depth. c): depth awareness predicted by DASNet. d): depth-aware error weights for saliency correction. e) and f): saliency maps generated by two RGBD SOD models.

Recent years have witnessed significant progress in the field of salient object detection. Previous works (Cheng et al., 2014a; Yan et al., 2013; Klein and Frintrop, 2011; Liu et al., 2018; Jun Wei, 2020; Wu et al., 2019; Zhao and Wu, 2019; Su et al., 2019; Qin et al., 2019) take only RGB information as input, which is relatively lightweight and can be easily trained end-to-end. For example, Wu et al. (Wu et al., 2019) propose a coarse-to-fine feature aggregation framework to generate saliency maps. However, reasoning about salient regions cannot be well resolved when there exist multiple contrasting region proposals or ambiguous object contours. In such cases, depth information can serve as complementary guidance to resolve overlap and viewpoint issues, which benefits salient object detection.

Combining RGB information with auxiliary depth inputs, recent research efforts (Han et al., 2017; Piao et al., 2019; Zhao et al., 2019) have verified the effectiveness of depth in improving the object segmentation process. These methods usually introduce an additional depth stream to encode the depth map and then fuse the RGB stream and the depth stream to detect the salient objects. For example, Han et al. (Han et al., 2017) propose a two-stream network to extract RGB features and depth features, and then fuse them with a combination layer. Piao et al. (Piao et al., 2019) propose a two-stream network and fuse paired multi-level side-output features to refine the final saliency results. The main drawbacks of RGBD-based methods are twofold. On the one hand, the additional depth branch introduces heavy computation costs compared to methods with bare RGB inputs. On the other hand, the segmentation process heavily relies on the acquisition of depth maps, which are often unavailable in extreme conditions or realistic industrial applications. Keeping these cues in mind, a natural concern arises: is depth information really necessary for salient object detection, and what role should depth play in it?

Taking the strengths of RGB and RGBD methods while discarding their drawbacks, we set out to create a unified framework that only uses depth information as supervision in the training stage. Hence the network can take only RGB images as inputs, while remaining aware of depth prior knowledge through the learnt network parameters. That is to say, we use depth information to regularize the learning process of salient object detection (see Fig. 1). First, we force the feature maps at different levels of the network to be aware of depth information. This can be achieved in a multi-task learning manner, by learning the object segmentation and estimating the depth map simultaneously. The estimated depth awareness map is shown in Fig. 1 c). Although the estimated depth map is not as accurate as the captured one (Fig. 1 b)), it focuses on more contrastive depth regions, which are desirable for the segmentation process. Second, the estimated depth awareness can also be considered an indicator for finding the most ambiguous regions. We calculate the logarithmic error map between the estimated and ground-truth depth to generate an adaptive weight map (Fig. 1 d)). The network is then forced to pay more attention to pixels with higher error-weighted responses, so that some semantic confusions can be resolved. Compared with other state-of-the-art RGBD-based models (Zhao et al., 2019; Fan et al., 2019) in Fig. 1 e) and f), the proposed approach better handles saliency confusions while generating clear object boundaries.

In this paper, we make three insightful designs to construct our framework (Fig. 2), which makes full use of training data from multiple sources, i.e., data from RGB and RGBD sources can be separately fed into the framework with different learning constraints to promote the final performance. To achieve this, we first propose a depth awareness module that regularizes the features at different network stages while learning the object segmentation in the meantime. This forces the segmentation features to be aware of contrastive objects in the depth field. Second, we propose a generalized channel-aware fusion module (CAF) to aggregate the features from top to bottom levels in these two related branches. The final depth features and segmentation features are then fused with the same CAF module in this coarse-to-fine scheme. Last but not least, we utilize a depth error-weighted map to emphasize ambiguous saliency regions, i.e., objects salient in images but not in depth, or vice versa. These regions receive more attention in the overall learning procedure to alleviate object confusions and generate clear object boundaries. Experimental evidence demonstrates the effectiveness of promoting RGBD salient object detection with only RGB inputs and the potential of promoting RGB tasks with auxiliary training depth.

Contributions of this paper are summarized as follows: 1) We first set out a novel setting that uses depth data as training priors to facilitate salient object detection and propose a unified framework to solve this important problem. 2) We propose a channel-aware fusion module (CAF) to comprehensively fuse multi-level features, which retains rich details and pays more attention to the significant features. 3) We propose a novel joint depth awareness module to facilitate the understanding of saliency and design a depth-aware error loss to mine ambiguous pixels. 4) Experimental evidence demonstrates that the proposed model achieves state-of-the-art performance on both five RGBD benchmarks and five RGB benchmarks.

2. Related Work

RGB-based salient object detection. Early traditional RGB SOD methods mainly rely on hand-crafted cues such as color contrast (Cheng et al., 2014a), texture (Yan et al., 2013) and local/global contrast (Klein and Frintrop, 2011). Borji et al. (Borji et al., 2015) comprehensively review these methods, covering both deep learning and conventional techniques. Recently, CNN-based RGB SOD methods have achieved impressive improvements over traditional ones (Cheng et al., 2014a; Yan et al., 2013; Klein and Frintrop, 2011). Most of them follow an end-to-end architecture as shown in Fig. 3 a). Liu et al. (Liu et al., 2018) utilize pixel-wise contextual attention to selectively attend to global and local context information. Wu et al. (Wu et al., 2019) propose a coarse-to-fine aggregation framework, which discards low-level features to reduce complexity. Zhao et al. (Zhao and Wu, 2019) propose a pyramid feature attention network, which adopts channel-wise and spatial attention to focus more on valuable features. Su et al. (Su et al., 2019) propose a boundary-aware network that fuses boundary and interior features with a compensation mechanism in an adaptive manner. Qin et al. (Qin et al., 2019) design a hybrid loss to focus on the boundary quality of salient objects. Wei et al. (Jun Wei, 2020) propose a cross-feature module to fuse features of different levels.

RGBD-based salient object detection. Although existing RGB methods have achieved very high performance, they may fail when dealing with complex scenarios, e.g., low contrast or occlusions. It has been shown that depth is an important and effective cue for saliency detection (Desingh et al., 2013), especially in such complex scenarios. Existing RGB-D SOD methods mainly extract salient features from the RGB image and the depth map respectively, and then fuse them in early or late network stages. Peng et al. (Peng et al., 2014) directly concatenate RGB-D pairs as 4-channel inputs to predict saliency maps. Han et al. (Han et al., 2017) propose a two-stream network to extract RGB features and depth features, and then fuse them with a combination layer. Chen et al. (Chen and Li, 2018) propose a progressive fusion strategy in a coarse-to-fine manner. Zhao et al. (Zhao et al., 2019) propose a fluid pyramid integration strategy to make full use of depth-enhanced features. Piao et al. (Piao et al., 2019) develop a two-stream network and fuse paired multi-level side-output features to refine the final salient object detection results.

Figure 2. The overall architecture of our model. Our depth-awareness SOD framework is mainly composed of three parts, i.e., a salient object detection module, a depth awareness module and an error-weighted correction. ASPP denotes atrous spatial pyramid pooling. CAF denotes the proposed channel-aware fusion module. DEC denotes the proposed depth error-weighted correction. The dashed line denotes supervision.

Single-image depth estimation. Monocular depth estimation can be divided into three categories according to the input: monocular video (Wang et al., 2018, 2019), stereo image pairs (Garg et al., 2016; Tosi et al., 2019) and single images (Eigen et al., 2014; Eigen and Fergus, 2015; Laina et al., 2016; Fu et al., 2018; Yin et al., 2019). Taking a single image as input is the hardest case because no geometric information is available. Thanks to powerful deep networks such as VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), single-image depth estimation has been boosted to a new level of accuracy. Eigen et al. (Eigen et al., 2014; Eigen and Fergus, 2015) propose the first CNN-based framework for single-image depth estimation, which applies a stage-wise multi-scale network to refine the depth estimation. Laina et al. (Laina et al., 2016) introduce a fully convolutional architecture and design a reverse Huber loss to counteract the over-smoothing effect of the L2 norm. Fu et al. (Fu et al., 2018) propose a spacing-increasing discretization strategy to discretize depth and recast depth estimation as an ordinal regression problem. Yin et al. (Yin et al., 2019) propose a global geometric constraint to improve depth estimation accuracy. As depth is an important cue for many vision tasks, many works utilize multi-task learning to jointly perform depth estimation and other per-pixel vision tasks, such as semantic segmentation (Mousavian et al., 2016) and surface normal estimation (Yin et al., 2019).

3. Methodology

3.1. Overview

Depth-Awareness SOD Network. In this section, we present a novel joint Depth-Awareness SOD Network (DASNet) for RGBD-based and RGB-based salient object detection, which is mainly composed of three modules, i.e., the SOD module, the depth awareness module and the depth error-weighted correction (see Fig. 2). The first two modules share similar structures but focus on different tasks, being supervised by saliency maps and depth maps respectively. Both of them utilize our proposed channel-aware fusion module (CAF) to fuse high-level and low-level features. Combining these two branches, we finally refine the saliency results with the proposed depth error-weighted correction, which mines hard pixels under the supervision of the depth maps.

Figure 3. Different types of SOD architectures. a): typical RGB-based SOD network architecture. b): typical RGBD-based SOD network architecture. c): proposed depth-awareness SOD network architecture.

Relations and discussions. Our intuition comes from the RGB and RGBD salient object detection tasks shown in Fig. 3. The conventional RGB SOD in Fig. 3 a) takes the original image as input with an encoder-decoder framework. With depth as an auxiliary input in Fig. 3 b), the overall framework requires two independent encoders to extract the depth and RGB features separately, on which the main computation costs usually lie. Moreover, the depth and RGB encoders are trained separately, and the relationships between these multi-modal data are not fully explored.

Taking only RGB inputs and saving computation costs, the depth-aware salient object detection in Fig. 3 c) provides a new perspective for utilizing depth data in this segmentation task. In the testing phase, the network takes only the RGB image as input, and the object segmentation is regularized by the depth-awareness constraints imposed in the training phase. In this manner, the network not only builds an explicit relationship between depth and SOD, but also saves the additional cost of depth feature extraction.
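As a rough illustration of the single-encoder design in Fig. 3 c), the following PyTorch sketch shows how one shared ResNet-50 backbone can feed a saliency decoder and a depth-awareness decoder. The decoder and fusion modules are placeholders (not the authors' released code), and only the RGB tensor is needed at inference.

```python
import torch.nn as nn
import torchvision


class DASNetSkeleton(nn.Module):
    """Sketch of Fig. 3 c): one shared RGB encoder, two decoders.
    `sod_decoder`, `depth_decoder` and `fusion` are placeholder modules."""

    def __init__(self, sod_decoder, depth_decoder, fusion):
        super().__init__()
        backbone = torchvision.models.resnet50()  # load ImageNet weights in practice
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.sod_decoder = sod_decoder      # supervised by saliency masks
        self.depth_decoder = depth_decoder  # supervised by depth maps (training only)
        self.fusion = fusion                # e.g. CAF fusing the two branches

    def forward(self, rgb):
        x = self.stem(rgb)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                 # multi-level features shared by both branches
        saliency = self.sod_decoder(feats)
        depth = self.depth_decoder(feats)   # drives the depth-awareness losses
        return self.fusion(saliency, depth), depth
```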

3.2. Channel-Aware Fusion Module

The crucial problem in salient object detection is to select the most discriminative features and pass them on in a coarse-to-fine scheme. However, aggregating features from different levels in an encoder-decoder fashion usually leads to missing details or introduces ambiguous features, both of which hinder optimization. Notably, this phenomenon appears more frequently when aggregating features from different domains. Therefore, a selective feature fusion strategy is in high demand, especially for RGBD salient object understanding.

Toward this end, we propose a novel Channel-Aware Fusion module (CAF), which adaptively selects the discriminative features for object understanding. Instead of using different specific structures for different aggregation strategies in previous works (Su et al., 2019; Chen et al., 2020; Piao et al., 2019), we advocate using a generalized module to fuse any common types of features, e.g., features from different levels and features from different sources.

Figure 4. The proposed channel-aware fusion module. Blocks denote basic convolutional units and G is the fused output.

The proposed CAF has several meaningful designs, which are illustrated in Fig. 4. First, given two source features $F_a$ and $F_b$, we use a pixel-wise multiplication to enhance the common pixels in the feature maps while suppressing the ambiguous ones. The enhanced features are then concatenated with the features transformed by a lightweight encoder $\mathcal{E}$. This can be formally represented as:

$F_c = \mathcal{E}\big(\big[\,\mathcal{E}_a(F_a) \odot \mathcal{E}_b(F_b),\ \mathcal{E}_a(F_a),\ \mathcal{E}_b(F_b)\,\big]\big)$,   (1)

where $[\cdot\,,\cdot]$ and $\odot$ denote the feature concatenation operation and pixel-wise multiplication respectively. Each encoder $\mathcal{E}$ is typically composed of a convolutional layer followed by Batch Normalization and a ReLU activation. In particular, when aggregating multi-level features, $F_a$ and $F_b$ are first upsampled to the same scale, which is omitted in Fig. 4 for clarity.

After obtaining the rich feature $F_c$ by (1), the second main concern is how to select the most relevant features that respond strongly to the segmentation target. Inspired by the channel-attention mechanism (Hu et al., 2018; Chen et al., 2017b), we propose to use global features to obtain a contextual understanding for the attention weights. $F_c$ is squeezed with a global average pooling, followed by a sigmoid normalization, and reshaped as a vector to align with the feature channels. This serialized operation has the form:

$z = \mathrm{GAP}(F_c)$,   (2)
$v = \sigma\big(\phi(z)\big)$,   (3)

where $\phi$ is a linear transformation that reorganizes the pooled features and $v$ denotes the learnt attention weights. Features relevant to the salient target can therefore be made prominent in each group of source features $F_a$ and $F_b$. This is achieved by a channel-aware attention mechanism:

$G = \mathcal{D}\big(v \odot F_c\big) + \mathcal{D}_r\big(F_c\big)$,   (4)

where $\mathcal{D}$ denotes the typical decoder and $\mathcal{D}_r$ denotes the typical decoder that reduces the dimension back to that of the original input. Hence the features relevant to the target object are enhanced in the final output $G$. In addition, to keep the whole framework lightweight, the channel dimension is empirically reduced while still achieving state-of-the-art performance.
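To make Eqs. (1)-(4) concrete, here is a minimal PyTorch sketch of a channel-aware fusion block. The channel width, kernel sizes, the assumption that both inputs are already projected to the same channel count, and the exact placement of the attention vector are our reading of the prose above, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k=3):
    """Lightweight encoder/decoder unit: Conv + BatchNorm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class CAF(nn.Module):
    """Channel-Aware Fusion sketch: enhance common responses by multiplication,
    concatenate with the transformed inputs (Eq. 1), then re-weight channels
    with a squeeze-style attention vector (Eqs. 2-4). `ch` is an assumption."""

    def __init__(self, ch=64):
        super().__init__()
        self.enc_a = conv_bn_relu(ch, ch)
        self.enc_b = conv_bn_relu(ch, ch)
        self.enc_fuse = conv_bn_relu(3 * ch, ch)
        self.fc = nn.Linear(ch, ch)            # linear transform phi in Eq. (3)
        self.dec = conv_bn_relu(ch, ch)        # decoder D in Eq. (4)
        self.red = conv_bn_relu(ch, ch, k=1)   # reduced skip path in Eq. (4)

    def forward(self, fa, fb):
        if fa.shape[-2:] != fb.shape[-2:]:     # align multi-level features
            fb = F.interpolate(fb, size=fa.shape[-2:], mode="bilinear", align_corners=False)
        ta, tb = self.enc_a(fa), self.enc_b(fb)
        common = ta * tb                                        # emphasise pixels both sources agree on
        f = self.enc_fuse(torch.cat([common, ta, tb], dim=1))   # Eq. (1)
        v = torch.sigmoid(self.fc(f.mean(dim=(2, 3))))          # GAP + sigmoid, Eqs. (2)-(3)
        v = v[:, :, None, None]
        return self.dec(v * f) + self.red(f)                    # Eq. (4)
```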

3.3. Depth-awareness Constraint

What role does depth play in salient object detection? To answer this question, we propose an innovative depth-awareness constraint with two complementary aspects, i.e., multi-level depth awareness and depth error-weighted correction. These two aspects work collaboratively to regularize the salient features to be aware of contrastive depth regions and contextual saliency confusions, which facilitates the segmentation process at different learning stages.

Multi-level depth awareness. As mentioned in Section 3.2, the key issue in salient object detection lies in the utilization of multi-level features from different network stages. Besides the aggregation strategy, another form of exploitation is to regularize the features to focus on meaningful regions, which provides useful contextual information before aggregation. Taking advantage of the depth information and the hierarchical network architecture, we force the segmentation features to focus on depth regions at different network learning stages, as elaborated in Fig. 2. This means that at each learning stage, the features should be aware of the object information as well as the contrastive depth regions. We use an additional depth branch to regress the ground-truth depth.

With this collaborative learning of SOD and depth regression, we further fuse these two modules to refine the salient objects (see Fig. 2), which builds strong correlations between the two types of features. Notably, this refinement is also handled by our proposed CAF, with the same segmentation supervision at multiple levels. As a result, the salient features take the predominant place in the final optimization and the depth information serves as a guiding prior.

Depth error-weighted correction. To exploit the depth information more thoroughly, we further propose a depth error-weighted correction (DEC), which regularizes hard pixels with higher weights wherever the predicted depth makes mistakes. By design, the network naturally tends to respond strongly to the salient regions and form a holistic salient object. However, this also guides the predicted depth features to focus on salient regions, causing a severe misalignment between the predicted depth and the ground truth. Remarkably, the regions where the predicted depth makes mistakes are usually the semantically ambiguous regions, to which we need to pay more attention during the learning process.

To resolve this misalignment as well as to exploit it, we introduce a logarithmic depth error weight. Let $d$ and $\hat{d}$ be the predicted depth and the ground-truth depth respectively. The error weight of each pixel has the form:

$W_{ij} = \dfrac{1}{w \times h} \sum_{(p,q) \in \Omega_{ij}} \big(\log d_{pq} - \log \hat{d}_{pq}\big)^2$,   (5)

where $w$ and $h$ are the width and height of the error window $\Omega_{ij}$ centered at pixel $(i, j)$, so that the error of the central pixel is represented by the mean error of a local region. The detailed ablations used to decide $w$ and $h$ can be found in Tab. 5. In this way, the ambiguous pixels are given more attention in the early learning phase. As the optimization proceeds, the regularized features become depth-aware and the errors are progressively corrected. This learning progress is shown in Fig. 5, where the highly-responded regions in the error map shrink along with the learning stages. This verifies that the finally optimized features are aware of the depth information and better at handling semantic confusions.
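A possible implementation of the error weights in Eq. (5) is sketched below, assuming a squared log-difference averaged by a mean filter over the w×h window; the window size and the exact error definition are assumptions kept consistent with the logMSE supervision in Sec. 3.4.

```python
import torch
import torch.nn.functional as F


def depth_error_weights(pred_depth, gt_depth, w=5, h=5, eps=1e-6):
    """Sketch of Eq. (5): per-pixel logarithmic depth error averaged over a
    local w x h window (both maps are Bx1xHxW tensors in (0, 1])."""
    log_err = (torch.log(pred_depth + eps) - torch.log(gt_depth + eps)) ** 2
    kernel = torch.ones(1, 1, h, w, device=log_err.device) / (w * h)
    # A mean filter turns the pixel-wise error into the local-window average of Eq. (5).
    return F.conv2d(log_err, kernel, padding=(h // 2, w // 2))
```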

Figure 5. Qualitative visualization of the depth error weights during the training stage, at epochs 2, 16, and 32.

3.4. Learning Objective

Our overall learning objective is composed of three parts, as in Fig. 2: the SOD module, the depth awareness module and the error-weighted correction. Let $S$ and $G$ be the predicted saliency mask and the corresponding ground truth. The SOD module is supervised with the BCE loss:

$\mathcal{L}_{bce} = -\dfrac{1}{N}\sum_{i}\big[G_i \log S_i + (1 - G_i)\log(1 - S_i)\big]$.   (6)

However, the BCE loss usually leads to noisy predictions that do not form a holistic object. To make the salient object have clear boundaries, we adopt an IoU (Intersection over Union) loss (Jun Wei, 2020; Qin et al., 2019) as the auxiliary loss:

$\mathcal{L}_{iou} = 1 - \dfrac{\sum_i S_i G_i}{\sum_i \big(S_i + G_i - S_i G_i\big)}$.   (7)
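For reference, a compact PyTorch version of the two saliency losses in Eqs. (6)-(7) could look as follows; the batch-reduction details are assumptions.

```python
import torch
import torch.nn.functional as F


def saliency_loss(pred_logits, gt):
    """Eq. (6) pixel-wise BCE plus Eq. (7) soft IoU, for Bx1xHxW predictions."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt)     # Eq. (6)
    pred = torch.sigmoid(pred_logits)
    inter = (pred * gt).sum(dim=(2, 3))
    union = (pred + gt - pred * gt).sum(dim=(2, 3))
    iou = 1.0 - (inter / (union + 1e-6)).mean()                   # Eq. (7)
    return bce + iou
```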

For the depth awareness module, we adopt the log mean square error (logMSE) (Eigen et al., 2014; Eigen and Fergus, 2015) as supervision to generate a smooth depth map, which meanwhile provides the error weights $W$:

$\mathcal{L}_{depth} = \dfrac{1}{N}\sum_{i}\big(\log d_i - \log \hat{d}_i\big)^2$.   (8)

For the error-weighted correction module, we adopt an error-weighted BCE loss to attach more importance to wrongly-predicted pixels:

$\mathcal{L}_{dec} = -\dfrac{\sum_i W_i \big[G_i \log S_i + (1 - G_i)\log(1 - S_i)\big]}{\sum_i W_i}$.   (9)
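Similarly, the depth-awareness loss of Eq. (8) and the error-weighted BCE of Eq. (9) could be sketched as below; the normalization by the weight sum is an assumption.

```python
import torch
import torch.nn.functional as F


def depth_and_dec_losses(pred_depth, gt_depth, pred_logits, gt_mask, weights, eps=1e-6):
    """Eq. (8): logMSE for the depth branch; Eq. (9): BCE re-weighted by the
    Eq. (5) error map `weights` to focus on wrongly-predicted pixels."""
    depth_loss = ((torch.log(pred_depth + eps) - torch.log(gt_depth + eps)) ** 2).mean()
    bce_map = F.binary_cross_entropy_with_logits(pred_logits, gt_mask, reduction="none")
    dec_loss = (weights * bce_map).sum() / (weights.sum() + eps)
    return depth_loss, dec_loss
```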

This error loss adopts the same supervision as the SOD module, i.e., the binary segmentation mask. To implement the multi-level supervision in a unified framework, the overall loss is formulated as:

$\mathcal{L} = \sum_{l=1}^{5} \lambda_l \big(\mathcal{L}_{bce}^{l} + \mathcal{L}_{iou}^{l} + \mathcal{L}_{depth}^{l} + \mathcal{L}_{dec}^{l}\big)$,   (10)

where $\lambda_l$ denotes the weight of the loss at level $l$, and the number of levels is set to 5 to match the five stages of ResNet. Here we follow GCPANet (Chen et al., 2020) and set $\lambda$ as [1, 0.8, 0.6, 0.4, 0.2].
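Putting the pieces together, Eq. (10) amounts to a weighted sum over the five supervision levels; the composition of each per-level term follows our reading above.

```python
def total_loss(per_level_losses, lambdas=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """Eq. (10): each entry of `per_level_losses` is assumed to already sum the
    BCE, IoU, depth and DEC terms of one ResNet stage."""
    assert len(per_level_losses) == len(lambdas)
    return sum(lam * loss for lam, loss in zip(lambdas, per_level_losses))
```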

Methods NJUD-TE NLPR-TE STEREO DES SSD (four columns per dataset: Fmax ↑, Fmean ↑, MAE ↓, Sm ↑)
DF (Qu et al., 2017) .804 .744 .141 .763 .778 .682 .085 .802 .757 .616 .141 .757 .766 .566 .093 .752 .735 .709 .142 .747
AFNet (Wang and Gong, 2019) .775 .764 .100 .772 .771 .755 .058 .799 .823 .806 .075 .825 .728 .713 .068 .770 .687 .672 .118 .714
CTMF (Han et al., 2017) .845 .788 .085 .849 .825 .723 .056 .860 .831 .786 .086 .848 .844 .765 .055 .863 .729 .709 .099 .776
MMCI (Chen et al., 2019) .852 .813 .079 .858 .815 .729 .059 .856 .863 .812 .068 .873 .822 .750 .065 .848 .781 .748 .082 .813
PCF (Chen and Li, 2018) .872 .844 .059 .877 .841 .794 .044 .874 .860 .845 .064 .875 .804 .763 .049 .842 .807 .786 .062 .841
TANet (Chen and Li, 2019) .874 .844 .060 .878 .863 .796 .041 .886 .861 .828 .060 .871 .827 .795 .046 .858 .810 .767 .063 .839
CPFP(Zhao et al., 2019) .876 .850 .053 .879 .869 .840 .036 .888 .874 .842 .051 .879 .838 .815 .038 .872 .766 .747 .082 .807
DMRA (Piao et al., 2019) .886 .872 .051 .886 .879 .855 .031 .899 .868 .847 .066 .835 .888 .857 .030 .900 .844 .821 .058 .857
D3Net (Fan et al., 2019) .889 .860 .051 .895 .885 .853 .030 .904 .881 .844 .054 .904 .885 .859 .030 .904 .847 .818 .058 .866
Ours .911 .894 .042 .902 .929 .907 .021 .929 .915 .894 .037 .910 .928 .892 .023 .908 .881 .857 .042 .885
Table 1. Performance comparison with 9 state-of-the-art RGBD-based SOD methods on five benchmarks. Smaller MAE and larger Fmax, Fmean and Sm indicate better performance. The best results are highlighted in bold.
Figure 6. Qualitative comparison between state-of-the-art RGBD-based methods and our approach. The saliency maps produced by our model are clearly more accurate than those of the other methods in various challenging scenarios.

4. Experiments

4.1. Datasets and Evaluation Metrics

RGBD-based SOD datasets. To evaluate the RGBD performance of the proposed approach, we conduct experiments on five benchmarks (Ju et al., 2014; Peng et al., 2014; Niu et al., 2012; Cheng et al., 2014b; Zhu and Li, 2017): NJUD (Ju et al., 2014) with 1,985 images captured by a Fuji W3 stereo camera, NLPR (Peng et al., 2014) with 1,000 images captured by Kinect, STEREO (Niu et al., 2012) with 1,000 images collected from the Internet, DES (Cheng et al., 2014b) with 135 images captured by Kinect, and SSD (Zhu and Li, 2017) with 80 images picked from stereo movies. Following (Zhao et al., 2019; Piao et al., 2019), we split 1,500 samples from NJUD and 700 samples from NLPR for training; the remaining images in these two datasets and the other three datasets are used for testing.

RGB-based SOD datasets. To verify the effectiveness on RGB datasets, we adopt five RGB benchmarks (Wang et al., 2017; Yang et al., 2013; Yan et al., 2013; Li et al., 2014; Li and Yu, 2015): DUTS (Wang et al., 2017) with 15,572 images, ECSSD (Yan et al., 2013) with 1,000 images, DUT-OMRON (Yang et al., 2013) with 5,168 images, PASCAL-S (Li et al., 2014) with 850 images, and HKU-IS (Li and Yu, 2015) with 4,447 images. DUTS is currently the largest SOD dataset. Following (Wang et al., 2017), we split 10,553 images (DUT-TR) from DUTS for training and 5,019 images (DUT-TE) for testing; the other four datasets are also used for testing.

Evaluation metrics. To quantitatively evaluate the performance of our approach and the state-of-the-art methods, we adopt 4 commonly used metrics on both RGB-based and RGBD-based methods: max F-measure ($F_{max}$), mean F-measure ($F_{mean}$), mean absolute error (MAE) and the structure similarity measure ($S_m$) (Fan et al., 2017).

The F-measure assesses Precision and Recall comprehensively. $F_\beta$ is computed from Precision and Recall pairs as follows:

$F_\beta = \dfrac{(1 + \beta^2)\cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$,   (11)

where we set $\beta^2 = 0.3$ to emphasize Precision more than Recall, and compute $F_{max}$ and $F_{mean}$ using different thresholds as in (Borji et al., 2015).

We use $S_m$ to measure structural similarity for a more comprehensive evaluation. $S_m$ combines the region-aware ($S_r$) and object-aware ($S_o$) structural similarity as follows:

$S_m = \alpha \cdot S_o + (1 - \alpha) \cdot S_r$,   (12)

where we set $\alpha = 0.5$ as suggested in (Fan et al., 2017).
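As a reference for the simpler metrics, the sketch below computes MAE and a threshold-sweep F-measure (Eq. 11) with beta^2 = 0.3; the number of thresholds and the use of the sweep mean as Fmean follow common practice rather than a stated specification.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a [0, 1] saliency map and binary ground truth."""
    return np.abs(pred - gt).mean()


def f_measures(pred, gt, beta2=0.3, num_thresholds=255):
    """Eq. (11): sweep thresholds, compute precision/recall pairs, return (Fmax, Fmean)."""
    gt = gt > 0.5
    scores = []
    for t in np.linspace(0, 1, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        scores.append((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8))
    scores = np.asarray(scores)
    return scores.max(), scores.mean()
```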

4.2. Implementation Details

We adopt ResNet-50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) as our backbone. The atrous rates of the ASPP follow the prior work (Chen et al., 2017a) and are set to (6, 12, 18). In the training stage, we resize each image to a fixed resolution and adopt horizontal flipping, random cropping and multi-scale resizing as data augmentation. We use the SGD optimizer with a batch size of 32 for 32 epochs. Inspired by (Jun Wei, 2020; Chen et al., 2020), we adopt warm-up and linear decay strategies to adjust the learning rate, with a maximum learning rate of 0.005 for the ResNet-50 backbone and 0.05 for the other parts. We set the momentum and weight decay to 0.9 and 5e-4, respectively. It takes only 1 hour to train a model for the RGBD-based task and 3 hours for the RGB-based task on one NVIDIA 1080Ti GPU.
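A minimal sketch of such a warm-up plus linear-decay schedule with two peak learning rates is given below; the 5% warm-up ratio is an assumption, not a value from the paper.

```python
def lr_at_step(step, total_steps, max_lr, warmup_ratio=0.05):
    """Warm-up + linear decay: ramp linearly to `max_lr`, then decay to zero.
    The 5% warm-up ratio is an assumption, not a value from the paper."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max_lr * (1.0 - progress)


# Two parameter groups so the ResNet-50 backbone peaks at 0.005 and the rest at 0.05:
# for group, peak in zip(optimizer.param_groups, (0.005, 0.05)):
#     group["lr"] = lr_at_step(global_step, total_steps, peak)
```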

For RGBD-based salient object detection, we utilize both the RGB images and the depth maps from the training sets to train our model. During testing, we only need RGB images as inputs to predict saliency maps on the RGBD test sets. For RGB-based salient object detection, we first estimate depth maps for DUT-TR directly with a pre-trained VNLNet (Yin et al., 2019), which performs well on single-image depth estimation. We then use DUT-TR and its corresponding predicted depth maps to train our model. During inference, we only need RGB images as inputs to predict saliency maps on the RGB test sets. The PyTorch implementation will be publicly available.¹

¹ The link is masked due to the blind-review policy.
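A rough sketch of the pseudo-depth step for the RGB-only setting is shown below: run any pre-trained monocular depth estimator (VNLNet in the paper; here `depth_model` is an abstract callable) over DUT-TR and cache normalized depth maps for training. The input resolution and the min-max normalization are assumptions.

```python
import os
import torch
from PIL import Image
from torchvision import transforms


@torch.no_grad()
def cache_pseudo_depth(image_dir, out_dir, depth_model, device="cuda"):
    """Run a pre-trained single-image depth estimator over a folder of RGB
    images and save the predictions as 8-bit depth maps for later training."""
    os.makedirs(out_dir, exist_ok=True)
    prep = transforms.Compose([transforms.Resize((352, 352)),  # input size is an assumption
                               transforms.ToTensor()])
    depth_model.eval().to(device)
    for name in sorted(os.listdir(image_dir)):
        img = Image.open(os.path.join(image_dir, name)).convert("RGB")
        depth = depth_model(prep(img).unsqueeze(0).to(device)).squeeze().cpu()
        depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)   # normalise to [0, 1]
        out_name = os.path.splitext(name)[0] + ".png"
        Image.fromarray((depth.numpy() * 255).astype("uint8")).save(os.path.join(out_dir, out_name))
```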

Methods ECSSD DUT-TE DUT-OMRON HKU-IS PASCAL-S (four columns per dataset: Fmax ↑, Fmean ↑, MAE ↓, Sm ↑)
BMPM (Zhang et al., 2018a) .929 .894 .045 .911 .851 .762 .049 .861 .774 .698 .064 .808 .921 .875 .039 .905 .862 .803 .073 .840
PAGR (Zhang et al., 2018b) .927 .894 .061 .889 .854 .784 .056 .838 .771 .711 .071 .775 .918 .886 .048 .887 .854 .803 .094 .815
R3Net (Deng et al., 2018) .929 .883 .051 .910 .829 .716 .067 .837 .793 .690 .067 .819 .910 .853 .047 .894 .837 .775 .101 .809
PiCA-R (Liu et al., 2018) .935 .901 .047 .918 .860 .816 .051 .868 .803 .762 .065 .829 .919 .880 .043 .905 .881 .851 .077 .845
BANet (Su et al., 2019) .939 .917 .041 .924 .872 .829 .040 .879 .782 .750 .061 .832 .923 .893 .037 .913 .847 .839 .079 .852
PoolNet (Liu et al., 2019) .944 .915 .039 .921 .880 .809 .040 .883 .808 .747 .055 .833 .933 .899 .032 .916 .869 .822 .074 .845
BASNet (Qin et al., 2019) .943 .880 .037 .916 .859 .791 .048 .866 .805 .756 .056 .836 .928 .895 .032 .909 .857 .775 .078 .832
CPD-R (Wu et al., 2019) .939 .917 .037 .918 .865 .805 .043 .869 .797 .747 .056 .825 .925 .891 .034 .905 .864 .824 .072 .842
F3Net (Jun Wei, 2020) .945 .925 .033 .924 .890 .840 .035 .888 .813 .766 .053 .838 .937 .910 .028 .917 .880 .840 .064 .855
GCPANet (Chen et al., 2020) .948 .919 .035 .927 .888 .817 .040 .891 .812 .748 .056 .839 .938 .898 .031 .920 .876 .836 .064 .861
Ours .950 .932 .032 .927 .896 .853 .034 .894 .827 .783 .050 .845 .942 .917 .027 .922 .885 .849 .064 .860
Table 2. Performance comparison with 10 state-of-the-art RGB-based SOD methods on five benchmarks. Smaller MAE and larger Fmax, Fmean and Sm correspond to better performance. The best results are highlighted in bold.
Figure 7. Qualitative comparison between state-of-the-art RGB-based methods and our approach. The saliency maps produced by our model are clearly more accurate than those of the other methods in various challenging scenarios.

4.3. Comparisons with the state-of-the-art

RGBD-based SOD benchmark. As shown in Tab. 1, we compare our model, denoted as DASNet, with 9 state-of-the-art methods, including DF (Qu et al., 2017), AFNet (Wang and Gong, 2019), CTMF (Han et al., 2017), MMCI (Chen et al., 2019), PCF (Chen and Li, 2018), TANet (Chen and Li, 2019), CPFP (Zhao et al., 2019), DMRA (Piao et al., 2019) and D3Net (Fan et al., 2019). For fair comparison, we obtain the saliency maps from the reported results. Our approach surpasses all 9 state-of-the-art RGBD-based saliency detection methods on the five benchmarks. As shown in Tab. 1, our method clearly sets a new performance record while taking no depth images as inputs, which actually puts our model at a disadvantage in the comparison. On several metrics, our model outperforms the next best method by over 3%, which shows its strong capability to utilize depth information for more precise saliency maps.

In Fig. 6, we show the saliency maps predicted by our model and by other approaches. Among all the methods, our model performs best in both completeness and clarity. In the first, second and third rows, our method obtains more accurate and clearer saliency maps than the others despite ambiguous depth cues. In the fourth and fifth rows, our method obtains more complete results than the others. Our framework utilizes depth cues much better in various challenging scenarios. Besides, the object boundaries predicted by our model are clearer and sharper than those of the others.

RGB-based SOD benchmark. As shown in Tab. 2, we compare our proposed DASNet with 10 state-of-the-art methods, i.e., BMPM (Zhang et al., 2018a), PAGR (Zhang et al., 2018b), R3Net (Deng et al., 2018), PiCANet (Liu et al., 2018), PoolNet (Liu et al., 2019), BANet (Su et al., 2019), CPD (Wu et al., 2019), BASNet (Qin et al., 2019), F3Net (Jun Wei, 2020) and GCPANet (Chen et al., 2020). As the table shows, DASNet still outperforms the other methods and ranks first on all datasets and almost all metrics. Notably, this performance is achieved with only estimated depth maps as training priors. We believe that with captured real depth data the final performance would improve further, which is validated on the RGBD benchmarks.

As shown in Fig. 7, comparing the visual results of different methods, our approach shows an advantage in completeness and clarity. In the first and second rows, our method distinguishes foreground from background and obtains more accurate results than the other methods in complex scenarios with similar foreground and background. In the third row, our method obtains more complete results in complex low-contrast scenarios, where the other methods may fail to detect the salient objects. In the fourth and fifth rows, our method provides accurate object localization when salient objects touch image boundaries. Besides, the object boundaries predicted by our model are clearer and sharper than those of the others.

Components (cumulative) Fmean ↑ MAE ↓ (NJUD-TE)
BCE .838 .058
BCE + CAF .853 .056
BCE + CAF + DAM .857 .051
BCE + CAF + DAM + DEC .871 .048
BCE + CAF + IoU .875 .047
BCE + CAF + IoU + DAM .880 .045
BCE + CAF + IoU + DAM + DEC .886 .043
BCE + CAF + IoU + DAM + DEC + MLS .894 .042
Table 3. Ablation study for different components on NJUD-TE (Fmean and MAE). BCE, IoU and DEC are the loss functions described above. CAF denotes the proposed channel-aware fusion module. DAM denotes the depth awareness module. MLS represents multi-level supervision.

Figure 8. Qualitative results on RGBD datasets. The third column, without depth awareness, struggles to handle complex scenarios with similar foreground and background, while our model in the fourth column shows better performance.

4.4. Performance Analysis

To investigate the effectiveness of each key component of our model, we first conduct a thorough ablation study and then measure the computational complexity of state-of-the-art models to show its superiority. Finally, an experiment for choosing hyper-parameters can be found in Tab. 5.

Channel-aware fusion. To evaluate the effectiveness of our feature fusion module, we reconstruct our model with different ablation factors. Tab. 3 shows the ablations on the NJUD-TE dataset. In the first row, we build our baseline with widely-used lateral connections between different levels of features, fused by pixel-wise summation. In the second row, we replace this fusion strategy with the proposed CAF. This more effective fusion strategy improves the Fmean of the baseline from 0.838 to 0.853.

Depth-awareness constraint. We then test our proposed DAM and DEC on the baselines using only the BCE loss and using both the BCE and IoU losses, respectively. Compared with the model using CAF and only the BCE loss, our proposed DAM and DEC improve Fmean by 1.8% in total. Compared with the baseline using CAF with both the BCE and IoU losses, DAM and DEC improve Fmean from 0.875 to 0.886 and reduce MAE from 0.047 to 0.043. Finally, we add multi-level supervision to refine the results. As shown in Tab. 3, all components contribute to the performance improvement, which demonstrates the necessity of each component of our model for obtaining the best saliency detection results. Qualitative results can be found in Fig. 8. In the third column, our model without DAM and DEC is confused in regions with similar foreground and background. With DAM and DEC, our model distinguishes these confusing features and generates more accurate and clearer saliency maps.

Computational efficiency. Tab. 4 reports the parameters and computational cost, measured in multiply-adds (MAdds), of our model and other open-sourced RGB-based and RGBD-based models. Our model achieves obviously higher performance in a lightweight fashion. Notably, CPD-R (Wu et al., 2019) discards the features of the two shallowest layers to improve computational efficiency, but sacrifices the accuracy and clarity of its results. For fair comparison, we obtain the deployment code released by the authors and evaluate all models with the same configuration.
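For reproducing Tab. 4 style numbers, a generic profiler such as thop can be used as sketched below; whether the paper used this exact tool, and the 352×352 input size, are assumptions (thop reports MACs, which we list as MAdds here).

```python
import torch
from thop import profile  # third-party profiler: pip install thop


def count_complexity(model, input_size=(1, 3, 352, 352)):
    """Return (Params in M, MAdds in G) for a model on a dummy RGB input."""
    dummy = torch.randn(*input_size)
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    return params / 1e6, macs / 1e9
```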

Hyper-parameters. To evaluate the effectiveness of the error window as well as to find adequate window sizes in (5), we tune (w, h) over different sizes and choose the setting that achieves the best performance. This suggests that the error weight should be locally aware in order to generate clear object details, and that enlarging the local receptive field of the error-weighted correction module within an adequate range is effective for reaching higher scores.

Methods Platform Params(M) MAdds(G)
RGB&RGBD Ours pytorch 36.68 11.57
RGB GCPANet (Chen et al., 2020) pytorch 67.06 26.61
BASNet (Qin et al., 2019) pytorch 87.06 97.51
CPD-R (Wu et al., 2019) pytorch 47.85 7.19
BANet (Su et al., 2019) caffe 55.90 35.83
RGBD CPFP (Zhao et al., 2019) caffe 72.94 21.25
DMRA (Piao et al., 2019) pytorch 59.66 113.09
Table 4. Complexity comparison with RGB-based models and RGBD-based models. Models ranking first and second are shown in bold and underlined, respectively.
Fmax .924 .929 .925 .929 .926 .927
Fmean .895 .904 .898 .907 .904 .897
MAE .024 .021 .022 .021 .023 .022
Sm .924 .928 .926 .929 .926 .925
Table 5. Error-correction results on NLPR-TE with different window sizes (w, h); each column corresponds to one window size.

5. Conclusions

In this paper, we rethink the role of depth in salient object detection and propose a new perspective: incorporating depth constraints into the learning process rather than using captured depth as an input. To exploit depth information more deeply, we develop a multi-level depth awareness constraint and a depth error-weighted loss to alleviate saliency confusions. These designs make our model lightweight and free of depth input at inference. Experimental results show that, with only RGB inputs, the proposed network not only surpasses state-of-the-art RGBD methods by a large margin but also demonstrates its effectiveness in RGB application scenarios.

References

  • A. Borji, M. Cheng, H. Jiang, and J. Li (2015) Salient object detection: a benchmark. IEEE transactions on image processing 24 (12), pp. 5706–5722. Cited by: §2, §4.1.
  • H. Chen, Y. Li, and D. Su (2019) Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for rgb-d salient object detection. Pattern Recognition 86, pp. 376–385. Cited by: Table 1, §4.3.
  • H. Chen and Y. Li (2018) Progressively complementarity-aware fusion network for rgb-d salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3051–3060. Cited by: §2, Table 1, §4.3.
  • H. Chen and Y. Li (2019) Three-stream attention-aware network for rgb-d salient object detection. IEEE Transactions on Image Processing 28 (6), pp. 2825–2835. Cited by: Table 1, §4.3.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017a) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §4.2.
  • L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017b) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5659–5667. Cited by: §3.2.
  • Z. Chen, Q. Xu, R. Cong, and Q. Huang (2020) Global context-aware progressive aggregation network for salient object detection. arXiv preprint arXiv:2003.00651. Cited by: §3.2, §3.4, §4.2, §4.3, Table 2, Table 4.
  • M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2014a) Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 569–582. Cited by: §1, §2.
  • Y. Cheng, H. Fu, X. Wei, J. Xiao, and X. Cao (2014b) Depth enhanced saliency detection method. In Proceedings of international conference on internet multimedia computing and service, pp. 23–27. Cited by: §4.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.2.
  • Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P. Heng (2018) R3Net: recurrent residual refinement network for saliency detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 684–690. Cited by: §4.3, Table 2.
  • K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar (2013) Depth really matters: improving visual salient region detection with depth.. In BMVC, Cited by: §2.
  • D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pp. 2650–2658. Cited by: §2, §3.4.
  • D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §2, §3.4.
  • D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In Proceedings of the IEEE international conference on computer vision, pp. 4548–4557. Cited by: §4.1, §4.1.
  • D. Fan, Z. Lin, Z. Zhang, M. Zhu, and M. Cheng (2019) Rethinking rgb-d salient object detection: models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781. Cited by: §1, Table 1, §4.3.
  • H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. Cited by: §2.
  • R. Garg, V. K. BG, G. Carneiro, and I. Reid (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756. Cited by: §2.
  • J. Han, H. Chen, N. Liu, C. Yan, and X. Li (2017) CNNs-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE transactions on cybernetics 48 (11), pp. 3171–3183. Cited by: §1, §2, Table 1, §4.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2, §4.2.
  • S. Hong, T. You, S. Kwak, and B. Han (2015) Online tracking by learning discriminative saliency map with convolutional neural network. In International conference on machine learning, pp. 597–606. Cited by: §1.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §3.2.
  • R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu (2014) Depth saliency based on anisotropic center-surround difference. In 2014 IEEE international conference on image processing (ICIP), pp. 1115–1119. Cited by: §1, §4.1.
  • Q. H. Jun Wei (2020) F3Net: fusion, feedback and focus for salient object detection. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §2, §3.4, §4.2, §4.3, Table 2.
  • D. A. Klein and S. Frintrop (2011) Center-surround divergence of feature statistics for salient object detection. In 2011 International Conference on Computer Vision, pp. 2214–2219. Cited by: §1, §2.
  • B. Lai and X. Gong (2016) Saliency guided dictionary learning for weakly-supervised image parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3630–3639. Cited by: §1.
  • I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pp. 239–248. Cited by: §2.
  • G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5455–5463. Cited by: §1, §4.1.
  • Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287. Cited by: §4.1.
  • J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3917–3926. Cited by: §4.3, Table 2.
  • N. Liu, J. Han, and M. Yang (2018) Picanet: learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3089–3098. Cited by: §1, §2, §4.3, Table 2.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • A. Mousavian, H. Pirsiavash, and J. Košecká (2016) Joint semantic segmentation and depth estimation with deep convolutional networks. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 611–619. Cited by: §2.
  • Y. Niu, Y. Geng, X. Li, and F. Liu (2012) Leveraging stereopsis for saliency analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 454–461. Cited by: §1, §4.1.
  • H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji (2014) Rgbd salient object detection: a benchmark and algorithms. In European conference on computer vision, pp. 92–109. Cited by: §1, §2, §4.1.
  • Y. Piao, W. Ji, J. Li, M. Zhang, and H. Lu (2019) Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7254–7263. Cited by: §1, §2, §3.2, Table 1, §4.1, §4.3, Table 4.
  • X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) Basnet: boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7479–7489. Cited by: §1, §2, §3.4, §4.3, Table 2, Table 4.
  • L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang (2017) RGBD salient object detection via deep fusion. IEEE Transactions on Image Processing 26 (5), pp. 2274–2285. Cited by: Table 1, §4.3.
  • L. Shao and M. Brady (2006) Specific object retrieval based on salient regions. Pattern Recognition 39 (10), pp. 1932–1948. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian (2019) Selectivity or invariance: boundary-aware salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3799–3808. Cited by: §1, §2, §3.2, §4.3, Table 2, Table 4.
  • F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia (2019) Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9799–9809. Cited by: §2.
  • C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030. Cited by: §2.
  • L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145. Cited by: §1, §4.1.
  • N. Wang and X. Gong (2019) Adaptive fusion for rgb-d salient object detection. IEEE Access 7, pp. 55277–55284. Cited by: Table 1, §4.3.
  • R. Wang, S. M. Pizer, and J. Frahm (2019) Recurrent neural network for (un-) supervised learning of monocular video visual odometry and depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5555–5564. Cited by: §2.
  • Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3907–3916. Cited by: §1, §2, §4.3, §4.4, Table 2, Table 4.
  • Q. Yan, L. Xu, J. Shi, and J. Jia (2013) Hierarchical saliency detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1155–1162. Cited by: §1, §1, §2, §4.1.
  • C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3166–3173. Cited by: §4.1.
  • W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5684–5693. Cited by: §2, §4.2.
  • L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang (2018a) A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1741–1750. Cited by: §4.3, Table 2.
  • X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang (2018b) Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 714–722. Cited by: §4.3, Table 2.
  • J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, and L. Zhang (2019) Contrast prior and fluid pyramid integration for rgbd salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3927–3936. Cited by: §1, §1, Table 1, §4.1, §4.3, Table 4.
  • T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3085–3094. Cited by: §1, §2, §2.
  • C. Zhu and G. Li (2017) A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3008–3014. Cited by: §4.1.