BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network

07/06/2020 ∙ by Deng-Ping Fan, et al. ∙ Nankai University

Multi-level feature fusion is a fundamental topic in computer vision for detecting, segmenting, and classifying objects at various scales. When multi-level features meet multi-modal cues, the optimal fusion problem becomes a hot potato. In this paper, we make the first attempt to leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to develop a novel cascaded refinement network. In particular, we 1) propose a bifurcated backbone strategy (BBS) to split the multi-level features into teacher and student features, and 2) utilize a depth-enhanced module (DEM) to excavate informative parts of depth cues from the channel and spatial views. This fuses RGB and depth modalities in a complementary way. Our simple yet efficient architecture, dubbed Bifurcated Backbone Strategy Network (BBS-Net), is backbone independent, runs in real-time (48 fps), and significantly outperforms 18 SOTAs on seven challenging datasets using four metrics.


1 Introduction

Multi-modal and multi-level feature fusion [37] is essential for many computer vision tasks, such as object detection [8, 68, 20, 26], semantic segmentation [28, 32, 65, 29], co-attention tasks [18, 71] and classification [38, 40, 51]. Here, we attempt to utilize this idea for RGB-D salient object detection (SOD) [72, 7], which aims at finding and segmenting the most visually prominent object(s) [2, 73] in a scene according to the RGB and depth cues.

To efficiently integrate the RGB and depth cues for SOD, researchers have explored several multi-modal strategies [3, 6], and have achieved encouraging results. Existing RGB-D SOD methods, however, still face the following challenges:

(1) Effectively aggregating multi-level features. As discussed in [43, 61], teacher features (note that we use the terms 'high-level features & low-level features' and 'teacher features & student features' interchangeably) provide discriminative semantic information that serves as strong guidance for locating salient objects, while student features carry affluent details that are beneficial for refining edges. Therefore, previous RGB-D SOD algorithms focus on leveraging multi-level features, either via a progressive merging process [46, 74] or by using a dedicated aggregation strategy [50, 72]. However, these operations directly fuse multi-level features without considering level-specific characteristics, and thus suffer from the inherent noise often introduced by low-level features [7, 63]. Thus, some methods tend to get distracted by the background (e.g., first row in Fig. 1).


Figure 1: Saliency maps of state-of-the-art (SOTA) CNN-based methods (i.e., DMRA [50], CPFP [72], TANet [7], and our BBS-Net) and methods based on hand-crafted features (i.e., SE [27] and LBE [24]). Our method generates higher-quality saliency maps and more effectively suppresses background distractors in challenging scenarios (first row: complex background; second row: noisy depth).

(2) Excavating informative cues from the depth modality. Previous methods combine RGB and depth cues by regarding the depth map as a fourth-channel input [13, 49] or fusing RGB and depth modalities by simple summation [22, 23] or multiplication [9, 75]. These algorithms treat depth and RGB information the same and ignore the fact that depth maps mainly focus on the spatial relations among objects, whereas RGB information captures color and texture. Thus, such simple combinations are not efficient due to the modality difference. Besides, depth maps are sometimes low-quality, which may introduce feature noise and redundancy into the network. As an example, the depth map shown in the second row of Fig. 1 is blurry and noisy, and that is why many methods, including the top-ranked model (DMRA-iccv19 [50]), fail to detect the complete salient object.

To address these issues, we propose a novel Bifurcated Backbone Strategy Network (BBS-Net) for RGB-D salient object detection. As shown in Fig. 2 (b), BBS-Net consists of two cascaded decoder stages. In the first stage, teacher features are aggregated by a standard cascaded decoder to generate an initial saliency map $S_1$. In the second stage, student features are refined by an element-wise multiplication with the initial saliency map and are then integrated by another cascaded decoder to predict the final map $S_2$.

To the best of our knowledge, BBS-Net is the first work to explore the cascaded refinement mechanism for the RGB-D SOD task. Our main contributions are as follows:

  • We exploit multi-level features in a bifurcated backbone strategy (BBS) to suppress distractors in the lower layers. This strategy is based on the observation that high-level features provide discriminative semantic information without redundant details [43, 63], which may contribute significantly to eliminating distractors in lower layers.

  • To fully capture the informative cues in the depth map and improve the compatibility of RGB and depth features, we introduce a depth-enhanced module (DEM), which contains two sequential attention mechanisms: channel attention and spatial attention. The channel attention utilizes the inter-channel relations of the depth features, while the spatial attention aims to determine where informative depth cues are carried.

  • We demonstrate that the proposed BBS-Net exceeds SOTAs on seven public datasets, by a large margin. Our experiments show that our framework has strong scalability in terms of various backbones. This means that the bifurcated backbone strategy via a cascaded refinement mechanism is promising for multi-level and multi-modal learning tasks. In addition, the model runs in real-time (48 fps) on a single GTX 1080Ti GPU, making it a potential solution for real-world applications.


Figure 2: (a) Existing multi-level feature aggregation methods for RGB-D SOD [3, 72, 50, 7, 74, 60, 46]. (b) In this paper, we propose to adopt a bifurcated backbone strategy (BBS) to split the multi-level features into student and teacher features. The initial saliency map $S_1$ is utilized to refine the student features to effectively suppress distractors. Then, the refined features are passed to another cascaded decoder to generate the final saliency map $S_2$.

2 Related Works

Although RGB-based SOD has been thoroughly studied in recent years [5, 69, 58, 67, 39], most algorithms fail under complicated scenarios (e.g., cluttered backgrounds [16], low-intensity environments, or varying illuminations) [7, 50]. As a complementary modality to RGB information, depth cues contain rich spatial distance information [50] and contribute significantly to understanding challenging scenes. Therefore, researchers have started to solve the SOD problem by combining RGB images with depth information [15].

Traditional Models. Previous RGB-D SOD algorithms mainly focused on hand-crafted features [9, 75]. Some of these methods largely relied on contrast-based cues by calculating color, edge, texture, and region contrast to measure the saliency in a local region. For example, [15] adopted region-based contrast to compute contrast strengths for the segmented regions. In [10], the saliency value of each pixel depended on the color contrast and surface normals. However, the local contrast methods focused on the boundaries of salient objects and were easily affected by high-frequency content [52]. Therefore, some algorithms, such as global contrast [11], spatial prior [9], and background prior [54], proposed to calculate the saliency by combining local and global information. To effectively combine saliency cues from the RGB and depth modalities, researchers have explored various fusion strategies. Some methods [49, 13] regarded depth images as fourth-channel inputs and processed the RGB and depth channels together (early fusion). This operation seems simple but disregards the differences between the RGB and depth modalities and thus cannot achieve reliable results. Therefore, to effectively extract the saliency information from the two modalities separately, some algorithms [22, 75] first leveraged two backbones to predict saliency maps and then fused the saliency results (late fusion). Besides, considering that the RGB and depth modalities may positively influence each other, yet other methods [24, 34] fused RGB and depth features in a middle stage and then predicted the saliency maps from the fused features (middle fusion). In fact, these three fusion strategies are also explored in current deep models, and our model can be considered a middle-fusion approach.

Deep Models. Early deep algorithms [52, 54] first extracted hand-crafted features and then fed them to CNNs to compute saliency confidence scores. However, these methods need to design low-level features first and cannot be trained in an end-to-end manner. More recently, researchers have exploited CNNs to extract RGB and depth features in a bottom-up way [30]. Compared with hand-crafted features, deep features contain more semantic and contextual information, can better capture representations of the RGB and depth modalities, and achieve encouraging performance. The success of these deep models [6, 50] stems from two aspects of feature fusion. The first is the extraction of multi-scale features from different layers and their effective fusion. The second is the mechanism for fusing features from the two different modalities.

To effectively aggregate multi-scale features, researchers have designed various network architectures. For example, [46] fed a four-channel RGB-D image into a single backbone and then obtained saliency maps from the side-out features of each level (single stream). Chen et al. [3] leveraged two networks to extract RGB features and depth features respectively, and then fused them in a progressively complementary way (double stream). Further, to exploit cross-modal complements in the bottom-up feature extraction process, Chen et al. [7] proposed a three-stream network that contains two modality-specific streams and a parallel cross-modal distillation stream to learn supplementary features (three streams). However, depth maps are often of low quality and thus may contain a lot of noise and misleading information, which greatly decreases the performance of SOD models. To address this problem, Zhao et al. [72] designed a contrast-enhanced network that improves the quality of depth maps via a contrast prior. Fan et al. [19] proposed a depth depurator unit that evaluates the quality of depth images and automatically filters out low-quality maps. Recent works have also explored uncertainty [66], bilateral attention [70], and a joint learning strategy [25], achieving good performance.


Figure 3: The architecture of BBS-Net. Feature Extraction: 'Conv1'~'Conv5' denote different layers from ResNet-50 [31]. Multi-level features ($f_i^{d}$, $i = 1, \dots, 5$) from the depth branch are enhanced by the (a) DEM and then fused with features (i.e., $f_i^{rgb}$) from the RGB branch. Stage 1: cross-modal teacher features ($f_3^{cm}$, $f_4^{cm}$, $f_5^{cm}$) are first aggregated by the (b) cascaded decoder to produce the initial saliency map $S_1$. Stage 2: student features ($f_1^{cm}$, $f_2^{cm}$, $f_3^{cm}$) are then refined by the initial saliency map and integrated by another cascaded decoder to predict the final saliency map $S_2$.

3 Proposed Method

3.1 Overview

Existing popular RGB-D SOD models directly aggregate multi-level features (Fig. 2(a)). As shown in Fig. 3, the network flow of our BBS-Net is different from the above-mentioned models. We first introduce the bifurcated backbone strategy with the cascaded refinement mechanism in Sec. 3.2. To fully use the informative cues in the depth map, we then introduce a new depth-enhanced module (Sec. 3.3).

3.2 Bifurcated Backbone Strategy (BBS)

We propose to excavate the rich semantic information in high-level cross-modal features to suppress background distractors in a cascaded refinement way. We adopt a bifurcated backbone strategy (BBS) to divide the multi-level cross-modal features into two groups, i.e., $\mathcal{G}_1$ = {Conv1, Conv2, Conv3} and $\mathcal{G}_2$ = {Conv3, Conv4, Conv5}, with Conv3 as the split point. Each group still preserves the original multi-scale information.
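In code, this grouping is simply an overlapping split of the five side-out features. The sketch below assumes the features are collected in a Python list with 0-based indices; it is illustrative rather than the released implementation.

```python
def bifurcate(features):
    """Split the five side-out cross-modal features at Conv3 (BBS grouping).
    features: [f1, f2, f3, f4, f5]; Conv3 belongs to both groups."""
    students = features[0:3]   # {Conv1, Conv2, Conv3}
    teachers = features[2:5]   # {Conv3, Conv4, Conv5}
    return students, teachers
```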

Cascaded Refinement Mechanism. To effectively leverage the features of the two groups, the whole network is trained with a cascaded refinement mechanism. This mechanism first produces an initial saliency map $S_1$ from the three cross-modal teacher features (i.e., $f_3^{cm}$, $f_4^{cm}$, $f_5^{cm}$) and then improves the details of the initial saliency map with the three cross-modal student features (i.e., $f_1^{cm}$, $f_2^{cm}$, $f_3^{cm}$), which are themselves refined by the initial saliency map. Using this mechanism, our model can iteratively refine the details in the low-level features. This is based on the observation that high-level features contain rich global contextual information which is beneficial for locating salient objects, while low-level features carry much detailed micro-level information that can contribute significantly to refining the boundaries. In other words, this strategy efficiently eliminates noise in low-level cross-modal features by exploring the specialties of multi-level features, and predicts the final saliency map in a progressive refinement manner.

Specifically, we first compute the cross-modal features $f_i^{cm}$ by merging the RGB and depth features processed by the DEM (Fig. 3(a)). In stage one, the three cross-modal teacher features (i.e., $f_3^{cm}$, $f_4^{cm}$, $f_5^{cm}$) are aggregated by the first cascaded decoder, which is formulated as:

$S_1 = T_1\big(D_1(f_3^{cm}, f_4^{cm}, f_5^{cm})\big), \qquad (1)$

where $S_1$ is the initial saliency map, $D_1$ is the first cascaded decoder, and $T_1$ represents two simple convolutional layers that change the channel number from 32 to 1. In stage two, the initial saliency map is leveraged to refine the three cross-modal student features, which is defined as:

$f_i^{cm\prime} = f_i^{cm} \odot S_1, \quad i \in \{1, 2, 3\}, \qquad (2)$

where $f_i^{cm\prime}$ ($i \in \{1, 2, 3\}$) denotes the refined features and $\odot$ represents element-wise multiplication. Then, the three refined student features are integrated by another decoder followed by a progressively transposed module (PTM), which is defined as:

$S_2 = T_2\big(D_2(f_1^{cm\prime}, f_2^{cm\prime}, f_3^{cm\prime})\big), \qquad (3)$

where $S_2$ is the final saliency map, $T_2$ represents the PTM, and $D_2$ denotes the second cascaded decoder. Finally, we jointly optimize the two stages by defining the total loss:

$\mathcal{L} = \ell_{bce}(S_1, G) + \lambda\, \ell_{bce}(S_2, G), \qquad (4)$

in which $\ell_{bce}$ represents the widely used binary cross-entropy loss, $G$ is the ground truth, and $\lambda$ controls the trade-off between the two parts of the loss. The $\ell_{bce}$ is computed as:

$\ell_{bce}(S, G) = -\sum_{(x, y)} \big[ G(x, y) \log S(x, y) + \big(1 - G(x, y)\big) \log\big(1 - S(x, y)\big) \big], \qquad (5)$

in which $S$ is the predicted saliency map and $G$ denotes the ground-truth binary saliency map.
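To make the two-stage flow concrete, the following is a minimal PyTorch-style sketch of the cascaded refinement and the joint objective of Eq. (1)–(5). The argument names (decoder1, decoder2, t1, ptm), the use of logits with BCEWithLogitsLoss, and the map resizing are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cascaded_refinement(f_cm, decoder1, decoder2, t1, ptm):
    """Two-stage flow of Eq. (1)-(3).
    f_cm: list of the five cross-modal features [f1, ..., f5]."""
    # Stage 1: aggregate teacher features (Conv3-5) into an initial map (Eq. 1).
    s1_logits = t1(decoder1(f_cm[2], f_cm[3], f_cm[4]))
    # Stage 2: refine each student feature (Conv1-3) with the initial map (Eq. 2).
    refined = []
    for f in f_cm[:3]:
        s1 = F.interpolate(torch.sigmoid(s1_logits), size=f.shape[2:],
                           mode='bilinear', align_corners=False)
        refined.append(f * s1)
    # Aggregate the refined features and upsample with the PTM (Eq. 3).
    s2_logits = ptm(decoder2(refined[0], refined[1], refined[2]))
    return s1_logits, s2_logits

def total_loss(s1_logits, s2_logits, gt, lam=1.0):
    """Joint BCE objective of Eq. (4)-(5); lam is the trade-off weight."""
    bce = nn.BCEWithLogitsLoss()
    up = lambda x: F.interpolate(x, size=gt.shape[2:], mode='bilinear',
                                 align_corners=False)
    return bce(up(s1_logits), gt) + lam * bce(up(s2_logits), gt)
```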

Cascaded Decoder. Given the two groups of multi-level, cross-modal features ($\mathcal{G}_1$ and $\mathcal{G}_2$) fused from the RGB and depth features of different layers, we need to efficiently utilize the multi-level, multi-scale information in each group to carry out our cascaded refinement strategy. Thus, we introduce a light-weight cascaded decoder [63] to aggregate the two groups of multi-level, cross-modal features. As shown in Fig. 3(b), the cascaded decoder contains three global context modules (GCM) and a simple feature aggregation strategy. The GCM is refined from the RFB module [45] with an additional branch to enlarge the receptive field and a residual connection [31] to preserve the original information. Specifically, as illustrated in Fig. 3(c), the GCM module contains four parallel branches. In every branch, a 1×1 convolution is first applied to reduce the channel number to 32. For the $k$-th branch ($k \in \{2, 3, 4\}$), a convolution with a kernel size of $(2k-1) \times (2k-1)$ and a dilation rate of 1 is applied, followed by another $3 \times 3$ convolution with a dilation rate of $2k-1$. The goal here is to extract global contextual information from the cross-modal features. Next, the outputs of the four branches are concatenated together and their channel number is reduced to 32 with a 1×1 convolution. Finally, the concatenated features form a residual connection with the input feature. The outputs of the GCM modules in the two cascaded decoders are defined by:

$f_i^{gcm} = \mathrm{GCM}(f_i^{cm}),\ i \in \{3, 4, 5\}$ (stage one); $\quad f_i^{gcm} = \mathrm{GCM}(f_i^{cm\prime}),\ i \in \{1, 2, 3\}$ (stage two). $\qquad (6)$
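As a reference, here is a sketch of a GCM-style block in PyTorch; the exact kernel sizes, dilation rates, and the extra 1×1 projection on the residual path are assumptions based on the RFB-style description above, not the authors' exact code.

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module sketch: four parallel (dilated) branches,
    concatenation, and a residual connection."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in range(4):                          # branches k = 0..3
            ops = [nn.Conv2d(in_ch, out_ch, 1)]     # 1x1 reduction to 32 channels
            if k > 0:
                ksize, dil = 2 * k + 1, 2 * k + 1   # e.g. 3, 5, 7
                ops += [nn.Conv2d(out_ch, out_ch, ksize, padding=ksize // 2),
                        nn.Conv2d(out_ch, out_ch, 3, padding=dil, dilation=dil)]
            self.branches.append(nn.Sequential(*ops))
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)    # concat -> 32 channels
        self.skip = nn.Conv2d(in_ch, out_ch, 1)         # residual projection
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return self.relu(out + self.skip(x))            # residual connection
```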

To further improve the representational ability of cross-modal features, we leverage a pyramid multiplication and concatenation feature aggregation strategy to integrate the cross-modal features $f_i^{gcm}$. As shown in Fig. 3(b), each refined feature is first updated by multiplying it with all higher-level features:

$f_i^{gcm\prime} = f_i^{gcm} \odot \prod_{k=i+1}^{k_{max}} \mathrm{Conv}\big(\mathrm{Up}(f_k^{gcm})\big), \qquad (7)$

where $i \in \{3, 4\}$ ($k_{max} = 5$) for the first decoder and $i \in \{1, 2\}$ ($k_{max} = 3$) for the second. $\mathrm{Conv}(\cdot)$ represents the standard 3×3 convolution operation, $\mathrm{Up}(\cdot)$ denotes the upsampling operation when the features are not at the same scale, and $\odot$ represents element-wise multiplication. Second, the updated features are aggregated by a progressive concatenation strategy to generate the output:

$S = T\Big(\mathrm{Conv}\big(\big[f_{k_{max}-2}^{gcm\prime};\ \mathrm{Up}\big(\mathrm{Conv}([f_{k_{max}-1}^{gcm\prime};\ \mathrm{Up}(f_{k_{max}}^{gcm\prime})])\big)\big]\big)\Big), \qquad (8)$

where $S$ is the generated saliency map and $[\cdot\,;\,\cdot]$ denotes the concatenation of two features. In the first stage, $T$ represents two sequential convolutional layers ($T_1$), while it denotes the PTM module ($T_2$) in the second stage. The output (88×88) of the second decoder is 1/4 of the ground-truth resolution (352×352), so directly up-sampling the output to the ground-truth size would lose some details. To address this problem, we design a simple yet effective progressively transposed module (PTM, Fig. 3(d)) to predict the final saliency map $S_2$ in a progressive upsampling way. It is composed of two sequential residual-based transposed blocks [33] and three sequential convolutions. Each residual-based transposed block consists of a convolution and a residual-based transposed convolution.
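The pyramid multiplication and progressive concatenation of Eq. (7)–(8) can be sketched as follows for one three-level group. The conv blocks (convs, fuse1, fuse2, head) are supplied by the caller, and their exact composition is an assumption; the routine only illustrates the data flow.

```python
import torch
import torch.nn.functional as F

def pyramid_aggregate(f_low, f_mid, f_high, convs, fuse1, fuse2, head):
    """Sketch of pyramid multiplication (Eq. 7) + progressive concatenation (Eq. 8).
    convs: three 3x3 conv blocks; fuse1/fuse2: conv blocks applied after each concat;
    head: T1 (two convs) in stage 1 or the PTM in stage 2."""
    up = lambda x, ref: F.interpolate(x, size=ref.shape[2:], mode='bilinear',
                                      align_corners=False)
    # Eq. (7): each feature is multiplied with (upsampled, convolved) higher levels.
    f_high_u = f_high
    f_mid_u = f_mid * convs[0](up(f_high, f_mid))
    f_low_u = f_low * convs[1](up(f_mid, f_low)) * convs[2](up(f_high, f_low))
    # Eq. (8): progressive concatenation from the highest level down.
    x = fuse1(torch.cat([f_mid_u, up(f_high_u, f_mid_u)], dim=1))
    x = fuse2(torch.cat([f_low_u, up(x, f_low_u)], dim=1))
    return head(x)
```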

Note that our cascaded refinement mechanism is different from the recent refinement mechanisms R3Net [14], CRN [4], and RFCN [59] in its usage of multi-level features and the initial map. The obvious difference and superiority of our design is that we only need one round of saliency refinement to obtain a good performance, while R3Net, CRN, and RFCN all need more iterations, which will increase the training time and computational resources. In addition, our cascaded strategy is different from CPD [63] in that it exploits the details in low-level features and semantic information in high-level features, while suppressing the noise in low-level features, simultaneously.

3.3 Depth-Enhanced Module (DEM)

There are two main problems when trying to fuse RGB and depth features. One is the compatibility of the two due to the intrinsic modality difference, and the other is the redundancy and noise in low-quality depth features. Inspired by [62], we introduce a depth-enhanced module (DEM) to improve the compatibility of multi-modal features and to excavate the informative cues from the depth features.

Specifically, let $f_i^{rgb}$ and $f_i^{d}$ denote the feature maps of the $i$-th ($i \in \{1, 2, 3, 4, 5\}$) side-out layer from the RGB and depth branches, respectively. As shown in Fig. 3, each DEM is added before each side-out feature map from the depth branch to improve the compatibility of the depth features. Such a side-out process enhances the saliency representation of depth features and preserves the multi-level, multi-scale information. The fusion process of the two modalities is formulated as:

$f_i^{cm} = f_i^{rgb} \oplus \mathrm{DEM}(f_i^{d}), \qquad (9)$

where $\oplus$ denotes element-wise addition and $f_i^{cm}$ represents the cross-modal features of the $i$-th layer. As illustrated in Fig. 3(a), the DEM includes a sequential channel attention operation and a spatial attention operation. The operation of the DEM is defined as:

$\mathrm{DEM}(f_i^{d}) = S_{att}\big(C_{att}(f_i^{d})\big), \qquad (10)$

where $S_{att}(\cdot)$ and $C_{att}(\cdot)$ denote the spatial and channel attention, respectively. More specifically,

$C_{att}(f) = M\big(P_{max}(f)\big) \otimes f, \qquad (11)$

where $P_{max}(\cdot)$ represents the global max pooling operation for each feature map, $f$ denotes the input feature map, $M(\cdot)$ is a multi-layer (two-layer) perceptron, and $\otimes$ denotes multiplication by dimension broadcast. The spatial attention is implemented as:

$S_{att}(f) = \mathrm{Conv}\big(R_{max}(f)\big) \otimes f, \qquad (12)$

where $R_{max}(\cdot)$ represents the global max pooling operation for each point in the feature map along the channel axis. Our depth-enhanced module is different from previous RGB-D models, which fuse the corresponding multi-level features from the RGB and depth branches by direct concatenation [3, 7, 74], enhance the depth map by a contrast prior [72], or process the multi-level depth features with a simple convolutional layer [50]. To the best of our knowledge, we are the first to introduce the attention mechanism to excavate informative cues from depth features in multiple side-out layers. Our experiments (see Tab. 4 and Fig. 5) show the effectiveness of this approach in improving the compatibility of multi-modal features.

Moreover, the spatial and channel attention mechanisms are different from the operation proposed in [62]. We only leverage a single global max pooling [48] operation to excavate the most critical cues in the depth features and reduce the complexity of the module simultaneously, which is based on the intuition that SOD aims at finding the most important area in an image.
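A compact PyTorch sketch of a DEM-style block following Eq. (10)–(12) is given below. The MLP reduction ratio, the 7×7 spatial kernel, and the sigmoid gating are assumptions (only the max-pooling choices are stated above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DEM(nn.Module):
    """Depth-enhanced module sketch: sequential channel and spatial attention,
    both driven only by global max pooling (cf. Eq. 10-12)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # two-layer perceptron M(.)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, f_d):
        # Channel attention (Eq. 11): global max pooling over each feature map.
        w = torch.sigmoid(self.mlp(F.adaptive_max_pool2d(f_d, 1)))
        f_c = f_d * w                                   # broadcast multiplication
        # Spatial attention (Eq. 12): max pooling along the channel axis.
        m = torch.sigmoid(self.spatial(f_c.max(dim=1, keepdim=True).values))
        return f_c * m
```

With such a module, the fusion of Eq. (9) amounts to `f_cm = f_rgb + dem(f_d)` at each level, assuming element-wise addition as the fusion operation.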

4 Experiments

4.1 Experimental Settings

Datasets. We tested seven challenging RGB-D SOD datasets, i.e., NJU2K [34], NLPR [49], STERE [47], SIP [19], DES [9], LFSD [41], and SSD [76].

Training/Testing. Following the same training settings as in [72, 50], we use 1,485 samples from the NJU2K dataset and 700 samples from the NLPR dataset as our training set. The remaining images in the NJU2K and NLPR datasets, together with the whole STERE, DES, LFSD, SSD, and SIP datasets, are used for testing.

Evaluation Metrics. We adopt four widely used metrics: S-measure ($S_\alpha$) [17], maximum E-measure ($E_\xi$) [21], maximum F-measure ($F_\beta$) [1], and mean absolute error (MAE, $M$). Evaluation code: http://dpfan.net/d3netbenchmark/.
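For reference, a minimal NumPy sketch of two of these metrics (MAE and max F-measure) is shown below. $S_\alpha$ and $E_\xi$ are more involved and are omitted; the threshold sweep granularity and $\beta^2 = 0.3$ follow common practice in the cited metric papers rather than this paper's exact evaluation code.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0,1] saliency map and a binary mask."""
    return np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64)))

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over a sweep of binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        bin_pred = pred >= t
        tp = np.logical_and(bin_pred, gt).sum()
        precision = tp / (bin_pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```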

Contenders. We compare the proposed BBS-Net with ten models that use hand-crafted features [9, 12, 24, 27, 34, 42, 49, 53, 56, 75] and eight models based on deep learning [3, 6, 7, 30, 50, 52, 60, 72]. We train and test the above models using their default settings, as proposed in the original papers. For models without released source code, we use their published results for comparison.

Inference Time. In terms of speed, BBS-Net achieves 48 fps on a single GTX 1080Ti, which is suitable for real-time applications.

Implementation Details. We perform our experiments using the PyTorch [57] framework on a single GTX 1080Ti GPU. Parameters of the backbone network (ResNet-50 [31]) are initialized from the model pre-trained on ImageNet [36]. We discard the last pooling and fully connected layers of ResNet-50 and leverage the output of each of its five convolutional blocks as the side-out feature maps. The two branches do not share weights; the only difference between them is that the depth branch has its input channel number set to 1. The other parameters are initialized with the PyTorch default settings. We use the Adam algorithm [35] to optimize the proposed model. The initial learning rate is set to 1e-4 and is divided by 10 every 60 epochs. We resize the input RGB and depth images to 352×352 for both the training and test phases. All training images are augmented using multiple strategies (i.e., random flipping, rotating, and border clipping). We train our model with a mini-batch size of 10 (the batch size is also 10 in the testing phase) for 200 epochs.
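A minimal sketch of this optimization setup is given below; the helper name and the simplified augmentation pipeline (only a resize and a random flip) are illustrative assumptions, not the full pipeline described above.

```python
import torch.optim as optim
from torchvision import transforms

def build_training_setup(model):
    """Optimizer, learning-rate schedule, and a simplified input transform."""
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    # Divide the learning rate by 10 every 60 epochs.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    transform = transforms.Compose([
        transforms.Resize((352, 352)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    return optimizer, scheduler, transform
```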

Dataset / Metric | Hand-crafted-features-based models: LHM [49], CDB [42], DESM [9], GP [53], CDCP [75], ACSD [34], LBE [24], DCMC [12], MDSF [56], SE [27] | CNN-based models: DF [52], AFNet [60], CTMF [30], MMCI [6], PCF [3], TANet [7], CPFP [72], DMRA [50] | BBS-Net (Ours). The scores in each row below follow this column order.

NJU2K [34]
$S_\alpha\uparrow$ .514 .624 .665 .527 .669 .699 .695 .686 .748 .664 .763 .772 .849 .858 .877 .878 .879 .886 .921
$F_\beta\uparrow$ .632 .648 .717 .647 .621 .711 .748 .715 .775 .748 .804 .775 .845 .852 .872 .874 .877 .886 .920
$E_\xi\uparrow$ .724 .742 .791 .703 .741 .803 .803 .799 .838 .813 .864 .853 .913 .915 .924 .925 .926 .927 .949
$M\downarrow$ .205 .203 .283 .211 .180 .202 .153 .172 .157 .169 .141 .100 .085 .079 .059 .060 .053 .051 .035

NLPR [49]
$S_\alpha\uparrow$ .630 .629 .572 .654 .727 .673 .762 .724 .805 .756 .802 .799 .860 .856 .874 .886 .888 .899 .930
$F_\beta\uparrow$ .622 .618 .640 .611 .645 .607 .745 .648 .793 .713 .778 .771 .825 .815 .841 .863 .867 .879 .918
$E_\xi\uparrow$ .766 .791 .805 .723 .820 .780 .855 .793 .885 .847 .880 .879 .929 .913 .925 .941 .932 .947 .961
$M\downarrow$ .108 .114 .312 .146 .112 .179 .081 .117 .095 .091 .085 .058 .056 .059 .044 .041 .036 .031 .023

STERE [47]
$S_\alpha\uparrow$ .562 .615 .642 .588 .713 .692 .660 .731 .728 .708 .757 .825 .848 .873 .875 .871 .879 .835 .908
$F_\beta\uparrow$ .683 .717 .700 .671 .664 .669 .633 .740 .719 .755 .757 .823 .831 .863 .860 .861 .874 .847 .903
$E_\xi\uparrow$ .771 .823 .811 .743 .786 .806 .787 .819 .809 .846 .847 .887 .912 .927 .925 .923 .925 .911 .942
$M\downarrow$ .172 .166 .295 .182 .149 .200 .250 .148 .176 .143 .141 .075 .086 .068 .064 .060 .051 .066 .041

DES [9]
$S_\alpha\uparrow$ .562 .645 .622 .636 .709 .728 .703 .707 .741 .741 .752 .770 .863 .848 .842 .858 .872 .900 .933
$F_\beta\uparrow$ .511 .723 .765 .597 .631 .756 .788 .666 .746 .741 .766 .728 .844 .822 .804 .827 .846 .888 .927
$E_\xi\uparrow$ .653 .830 .868 .670 .811 .850 .890 .773 .851 .856 .870 .881 .932 .928 .893 .910 .923 .943 .966
$M\downarrow$ .114 .100 .299 .168 .115 .169 .208 .111 .122 .090 .093 .068 .055 .065 .049 .046 .038 .030 .021

LFSD [41]
$S_\alpha\uparrow$ .553 .515 .716 .635 .712 .727 .729 .753 .694 .692 .783 .738 .788 .787 .786 .801 .828 .839 .864
$F_\beta\uparrow$ .708 .677 .762 .783 .702 .763 .722 .817 .779 .786 .817 .744 .787 .771 .775 .796 .826 .852 .859
$E_\xi\uparrow$ .763 .766 .811 .824 .780 .829 .797 .856 .819 .832 .857 .815 .857 .839 .827 .847 .863 .893 .901
$M\downarrow$ .218 .225 .253 .190 .172 .195 .214 .155 .197 .174 .145 .133 .127 .132 .119 .111 .088 .083 .072

SSD [76]
$S_\alpha\uparrow$ .566 .562 .602 .615 .603 .675 .621 .704 .673 .675 .747 .714 .776 .813 .841 .839 .807 .857 .882
$F_\beta\uparrow$ .568 .592 .680 .740 .535 .682 .619 .711 .703 .710 .735 .687 .729 .781 .807 .810 .766 .844 .859
$E_\xi\uparrow$ .717 .698 .769 .782 .700 .785 .736 .786 .779 .800 .828 .807 .865 .882 .894 .897 .852 .906 .919
$M\downarrow$ .195 .196 .038 .180 .214 .203 .278 .169 .192 .165 .142 .118 .099 .082 .062 .063 .082 .058 .044

SIP [19]
$S_\alpha\uparrow$ .511 .557 .616 .588 .595 .732 .727 .683 .717 .628 .653 .729 .716 .833 .842 .835 .850 .806 .879
$F_\beta\uparrow$ .574 .620 .669 .687 .505 .763 .751 .618 .698 .661 .657 .712 .694 .818 .838 .830 .851 .821 .883
$E_\xi\uparrow$ .716 .737 .770 .768 .721 .838 .853 .743 .798 .771 .759 .819 .829 .897 .901 .895 .903 .875 .922
$M\downarrow$ .184 .192 .298 .173 .224 .172 .200 .186 .167 .164 .185 .118 .139 .086 .071 .075 .064 .085 .055
Table 1: Quantitative comparison of models using S-measure ($S_\alpha$), max F-measure ($F_\beta$), max E-measure ($E_\xi$), and MAE ($M$) scores on seven datasets. $\uparrow$ ($\downarrow$) denotes that the higher (lower) the better. The best score in each row is highlighted in bold.

4.2 Comparison with SOTAs

Quantitative Results. As shown in Tab. 1, our method performs favorably against all algorithms based on hand-crafted features as well as SOTA CNN-based methods by a large margin, in terms of all four evaluation metrics ($S_\alpha$, $F_\beta$, $E_\xi$, MAE), with consistent gains over the best competing algorithms (ICCV'19 DMRA [50] and CVPR'19 CPFP [72]) on the seven challenging datasets.


Figure 4: Qualitative visual comparison of the proposed model versus 8 SOTAs.

Visual Comparison. Fig. 4 provides sample saliency maps predicted by the proposed method and several SOTA algorithms. Visualizations cover simple scenes (a) and various challenging scenarios, including small objects (b), multiple objects (c), complex backgrounds (d), and low-contrast scenes (e). First, (a) is an easy example. The flower in the foreground is evident in the original RGB image, but the depth image is low-quality and contains some misleading information. The top two models, i.e., DMRA and CPFP, fail to predict the whole extent of the salient object due to the interference from the depth map. Our method can eliminate the side-effects of the depth map by utilizing the complementary depth information more effectively. Second, two examples of small objects are shown in (b). Despite the handle of the teapot in the first row being tiny, our method can accurately detect it. Third, we show two examples with multiple objects in the image in (c). Our method locates all salient objects in the image. It segments the objects better and generates sharper edges compared to other algorithms. Even though the depth map in the first row of (c) lacks clear information, our algorithm predicts the salient objects correctly. Fourth, (d) shows two examples with complex backgrounds. Here, our method produces reliable results, while other algorithms confuse the background as a salient object. Finally, (e) presents two examples in which the contrast between the object and background is low. Many algorithms fail to detect and segment the entire extent of the salient object. Our method produces satisfactory results by suppressing background distractors and exploring the informative cues from the depth map.

Models NJU2K [34] NLPR [49] STERE [47] DES [9] LFSD [41] SSD [76] SIP [19]
CPFP [72] .879 .053 .888 .036 .879 .051 .872 .038 .828 .088 .807 .082 .850 .064
DMRA [50] .886 .051 .899 .031 .835 .066 .900 .030 .839 .083 .857 .058 .806 .085
BBS-Net(VGG-16) .916 .039 .923 .026 .896 .046 .908 .028 .845 .080 .858 .055 .874 .056
BBS-Net(VGG-19) .918 .037 .925 .025 .901 .043 .915 .026 .852 .074 .855 .056 .878 .054
BBS-Net(ResNet-50) .921 .035 .930 .023 .908 .041 .933 .021 .864 .072 .882 .044 .879 .055
Table 2: Performance comparison ($S_\alpha\uparrow$ / MAE$\downarrow$ per dataset) using different backbones.
# Settings NJU2K [34] NLPR [49] STERE [47] DES [9] LFSD [41] SSD [76] SIP [19]
1 Low3 .881 .051 .882 .038 .832 .070 .853 .044 .779 .110 .805 .080 .760 .108
2 High3 .902 .042 .911 .029 .886 .048 .912 .026 .845 .080 .850 .058 .833 .073
3 All5 .905 .042 .915 .027 .891 .045 .901 .028 .845 .082 .848 .060 .839 .071
4 BBS-NoRF .893 .050 .904 .035 .843 .072 .886 .039 .804 .105 .839 .069 .843 .076
5 BBS-RH .913 .040 .922 .028 .881 .054 .919 .027 .833 .085 .872 .053 .866 .063
6 BBS-RL (ours) .921 .035 .930 .023 .908 .041 .933 .021 .864 .072 .882 .044 .879 .055
Table 3: Comparison of different feature aggregation strategies ($S_\alpha\uparrow$ / MAE$\downarrow$ per dataset) on seven datasets.

5 Discussion

Scalability. There are three popular backbone architectures (i.e., VGG-16 [55], VGG-19 [55] and ResNet-50 [31]) that are used in deep RGB-D models. To further validate the scalability of the proposed method, we provide performance comparisons using different backbones in Tab. 2. We find that our BBS-Net exceeds the SOTA methods (e.g., CPFP [72], and DMRA [50]) with all of these popular backbones, showing the strong scalability of our framework.

Aggregation Strategies. We conduct several experiments to validate the effectiveness of our cascaded refinement mechanism. Results are shown in Tab. 3 and Fig. 5(a). 'Low3' means that we only integrate the student features (Conv1~3) using the decoder, without the refinement from the initial map, for training and testing. Student features contain abundant details that are beneficial for refining object edges, but at the same time introduce a lot of background distraction. Integrating only low-level features therefore produces unsatisfactory results: it generates many distractors or fails to locate the salient objects (Fig. 5(a)). 'High3' only integrates the teacher features (Conv3~5) using the decoder to predict the saliency map. Compared with student features, teacher features are 'sophisticated' and thus contain more semantic information. As a result, they help locate the salient objects and preserve edge information, so integrating teacher features leads to better results. 'All5' aggregates features from all five levels (Conv1~5) directly using a single decoder for training and testing. It achieves results comparable to 'High3' but may retain background noise introduced by the student features. 'BBS-NoRF' indicates that we directly remove the refinement flow of our model, which leads to poor performance. 'BBS-RH' can be seen as the reverse of our cascaded refinement mechanism: teacher features (Conv3~5) are first refined by an initial map aggregated from the student features (Conv1~3) and are then integrated to generate the final saliency map. It performs worse than our final mechanism (BBS-RL) because, with this reverse refinement strategy, noise in the student features cannot be effectively suppressed. Besides, compared to 'All5', our method fully utilizes the features at different levels, and thus achieves significant performance improvement with fewer background distractors and sharper edges (i.e., 'BBS-RL' in Fig. 5(a)).

# Settings (BM / CA / SA / PTM) NJU2K [34] NLPR [49] STERE [47] DES [9] LFSD [41] SSD [76] SIP [19]
1 BM .908 .045 .918 .029 .882 .055 .917 .027 .842 .083 .862 .057 .864 .066
2 BM+CA .913 .042 .922 .027 .896 .048 .923 .025 .840 .086 .855 .057 .868 .063
3 BM+SA .912 .045 .918 .029 .891 .054 .914 .029 .855 .083 .872 .054 .869 .063
4 BM+CA+SA .919 .037 .928 .026 .900 .045 .924 .024 .861 .074 .873 .052 .869 .061
5 BM+CA+SA+PTM .921 .035 .930 .023 .908 .041 .933 .021 .864 .072 .882 .044 .879 .055
Table 4: Ablation study of our BBS-Net ($S_\alpha\uparrow$ / MAE$\downarrow$ per dataset). 'BM' = base model. 'CA' = channel attention. 'SA' = spatial attention. 'PTM' = progressively transposed module.
Strategies NJU2K [34] NLPR [49] STERE [47] DES [9] LFSD [41] SSD [76] SIP [19]
Element-wise sum .915 .037 .925 .025 .897 .045 .925 .022 .856 .073 .868 .050 .880 .052
Cascaded decoder .921 .035 .930 .023 .908 .041 .933 .021 .864 .072 .882 .044 .879 .055
Table 5: Effectiveness analysis of the cascaded decoder ($S_\alpha\uparrow$ / MAE$\downarrow$ per dataset).


Figure 5: (a): Visual comparison of different aggregation strategies, (b): Visual effectiveness of gradually adding modules. ‘#’ denotes the corresponding row of Tab. 4.

Impact of Different Modules. As shown in Tab. 4 and Fig. 5(b), we conduct an ablation study to test the effectiveness of different modules in our BBS-Net. The base model (BM) is our BBS-Net without additional modules (i.e., CA, SA, and PTM). Note that the BM alone already performs better than the SOTA methods on almost all datasets, as shown in Tab. 1 and Tab. 4. Adding the channel attention (CA) and spatial attention (SA) modules enhances performance on most of the datasets; see the results shown in the second and third rows of Tab. 4. When we combine the two modules (fourth row in Tab. 4), the performance is greatly improved on all datasets compared with the BM. We can easily conclude from the '#3' and '#4' columns in Fig. 5(b) that the spatial attention and channel attention mechanisms in the DEM allow the model to focus on the informative parts of the depth features, which results in better suppression of background distraction. Finally, we add a progressively transposed block after the second decoder to gradually upsample the feature map to the same resolution as the ground truth. The results in the fifth row of Tab. 4 and the '#5' column of Fig. 5(b) show that the PTM achieves impressive performance gains on all datasets and generates sharper edges with fine details.

To further analyze the effectiveness of the cascaded decoder, we experiment with replacing it by an element-wise summation mechanism. That is, we first transform the features from different layers to the same dimension using convolution and upsampling operations and then fuse them by element-wise summation. The experimental results in Tab. 5 demonstrate the effectiveness of the cascaded decoder.
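A sketch of this element-wise-summation baseline is shown below; the 32-channel projection width mirrors the decoder setting and is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class SumFusion(nn.Module):
    """Summation baseline of Tab. 5: project every level to a common channel
    width, upsample to the finest resolution, then sum."""
    def __init__(self, in_channels, out_ch=32):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)

    def forward(self, feats):
        target = feats[0].shape[2:]                 # finest spatial resolution
        outs = [F.interpolate(p(f), size=target, mode='bilinear',
                              align_corners=False)
                for p, f in zip(self.proj, feats)]
        return sum(outs)
```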

Methods CPD [63] PoolNet [43] PiCANet [44] PAGRN [69] R3Net [14] Ours (w/o D) Ours (w/ D)
NJU2K [34] .894 .046 .887 .045 .847 .071 .829 .081 .837 .092 .914 .038 .921 .035
NLPR [49] .915 .025 .900 .029 .834 .053 .844 .051 .798 .101 .925 .026 .930 .023
DES [9] .897 .028 .873 .034 .854 .042 .858 .044 .847 .066 .912 .025 .933 .021
Table 6: Comparison with SOTA RGB SOD methods on three datasets ($S_\alpha\uparrow$ / MAE$\downarrow$ per dataset). 'w/o D' and 'w/ D' represent training and testing the proposed method without/with depth.

Benefits of the Depth Map. To explore whether or not the depth information really contributes to SOD performance, we conduct two experiments in Tab. 6: (i) We compare the proposed method with five SOTA RGB SOD methods (i.e., CPD [63], PoolNet [43], PiCANet [44], PAGRN [69], and R3Net [14]) when neglecting the depth information. We train and test these methods using the same training and testing sets as our BBS-Net. The proposed method (i.e., Ours (w/ D)) significantly exceeds the SOTA RGB SOD methods due to its usage of depth information. (ii) We train and test the proposed method without using the depth information by setting the inputs of the depth branch to zero (i.e., Ours (w/o D)). Comparing the results of Ours (w/ D) with Ours (w/o D), we find that the depth information effectively improves the performance of the proposed model. Together, these two experiments demonstrate the benefit of depth information for SOD, since depth maps can be seen as prior knowledge that provides spatial-distance information and contour guidance for detecting salient objects.

6 Conclusion

We presented a new multi-level multi-modality learning framework that demonstrates state-of-the-art performance on seven challenging RGB-D salient object detection datasets using several evaluation measures, and has real-time speed (48 fps). Our BBS-Net is based on a novel bifurcated backbone strategy (BBS) with a cascaded refinement mechanism. Importantly, our simple architecture is backbone independent, making it promising for further research on other related topics, including semantic segmentation, object detection and classification.

Acknowledgments

This work was supported by the Major Project for New Generation of AI Grant (NO. 2018AAA0100403), NSFC (NO.61876094, U1933114), Natural Science Foundation of Tianjin, China (NO.18JCYBJC15400, 18ZXZNGX00110), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR), and the Fundamental Research Funds for the Central Universities.

References

  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. In CVPR, pp. 1597–1604. Cited by: §4.1.
  • [2] A. Borji, M. Cheng, H. Jiang, and J. Li (2015) Salient object detection: a benchmark. IEEE TIP 24 (12), pp. 5706–5722. Cited by: §1.
  • [3] H. Chen and Y. Li (2018) Progressively complementarity-aware fusion network for RGB-D salient object detection. In CVPR, pp. 3051–3060. Cited by: Figure 2, §1, §2, §3.3, §4.1, Table 1.
  • [4] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In CVPR, pp. 1511–1520. Cited by: §3.2.
  • [5] S. Chen, X. Tan, B. Wang, H. Lu, X. Hu, and Y. Fu (2020) Reverse attention-based residual network for salient object detection. IEEE TIP 29, pp. 3763–3776. Cited by: §2.
  • [6] H. Chen, Y. Li, and D. Su (2019) Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. IEEE TOC 86, pp. 376–385. Cited by: §1, §2, §4.1, Table 1.
  • [7] H. Chen and Y. Li (2019) Three-stream attention-aware network for RGB-D salient object detection. IEEE TIP 28 (6), pp. 2825–2835. Cited by: Figure 1, Figure 2, §1, §1, §2, §2, §3.3, §4.1, Table 1.
  • [8] G. Cheng, J. Han, P. Zhou, and D. Xu (2018) Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE TIP 28 (1). Cited by: §1.
  • [9] Y. Cheng, H. Fu, X. Wei, J. Xiao, and X. Cao (2014) Depth enhanced saliency detection method. In ICIMCS, pp. 23–27. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6.
  • [10] A. Ciptadi, T. Hermans, and J. M. Rehg (2013) An in depth view of saliency. In BMVC, Cited by: §2.
  • [11] R. Cong, J. Lei, H. Fu, J. Hou, Q. Huang, and S. Kwong (2019) Going from RGB to RGBD saliency: a depth-guided transformation model. IEEE TOC, pp. 1–13. Cited by: §2.
  • [12] R. Cong, J. Lei, C. Zhang, Q. Huang, X. Cao, and C. Hou (2016) Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE SPL 23 (6), pp. 819–823. Cited by: §4.1, Table 1.
  • [13] R. Cong, J. Lei, H. Fu, Q. Huang, X. Cao, and N. Ling (2019) HSCS: hierarchical sparsity based co-saliency detection for RGBD images. IEEE TMM 21 (7), pp. 1660–1671. Cited by: §1, §2.
  • [14] Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P. Heng (2018) R3net: recurrent residual refinement network for saliency detection. In IJCAI, pp. 684–690. Cited by: §3.2, Table 6, §5.
  • [15] K. Desingh, D. Rajan, and C. V. Jawahar (2013) Depth really matters: improving visual salient region detection with depth. In BMVC, pp. 1–11. Cited by: §2, §2.
  • [16] D. Fan, M. Cheng, J. Liu, S. Gao, Q. Hou, and A. Borji (2018) Salient objects in clutter: bringing salient object detection to the foreground. In ECCV, pp. 186–202. Cited by: §2.
  • [17] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In ICCV, pp. 4548–4557. Cited by: §4.1.
  • [18] D. Fan, Z. Lin, G. Ji, D. Zhang, H. Fu, and M. Cheng (2020) Taking a deeper look at co-salient object detection. In CVPR, pp. 2919–2929. Cited by: §1.
  • [19] D. Fan, Z. Lin, Z. Zhang, M. Zhu, and M. Cheng (2020) Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. IEEE TNNLS. Cited by: §2, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
  • [20] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In CVPR, pp. 8554–8564. Cited by: §1.
  • [21] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment measure for binary foreground map evaluation. In IJCAI, pp. 698–704. Cited by: §4.1.
  • [22] X. Fan, Z. Liu, and G. Sun (2014) Salient region detection for stereoscopic images. In DSP, pp. 454–458. Cited by: §1, §2.
  • [23] Y. Fang, J. Wang, M. Narwaria, P. Le Callet, and W. Lin (2014) Saliency detection for stereoscopic images. IEEE TIP 23 (6), pp. 2625–2636. Cited by: §1.
  • [24] D. Feng, N. Barnes, S. You, and C. McCarthy (2016) Local background enclosure for RGB-D salient object detection. In CVPR, pp. 2343–2350. Cited by: Figure 1, §2, §4.1, Table 1.
  • [25] K. Fu, D. Fan, G. Ji, and Q. Zhao (2020) JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In CVPR, pp. 3052–3062. Cited by: §2.
  • [26] S. Gao, Y. Tan, M. Cheng, C. Lu, Y. Chen, and S. Yan (2020) Highly efficient salient object detection with 100k parameters. In ECCV, Cited by: §1.
  • [27] J. Guo, T. Ren, and J. Bei (2016) Salient object detection for RGB-D image via saliency evolution. In ICME, pp. 1–6. Cited by: Figure 1, §4.1, Table 1.
  • [28] J. Han, L. Yang, D. Zhang, X. Chang, and X. Liang (2018) Reinforcement cutting-agent learning for video object segmentation. In CVPR, pp. 9080–9089. Cited by: §1.
  • [29] Q. Han, K. Zhao, J. Xu, and M. Cheng (2020) Deep hough transform for semantic line detection. In ECCV, Cited by: §1.
  • [30] J. Han, H. Chen, N. Liu, C. Yan, and X. Li (2018) CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TOC 48 (11), pp. 3171–3183. Cited by: §2, §4.1, Table 1.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Figure 3, §3.2, §4.1, §5.
  • [32] X. He, S. Yang, G. Li, H. Li, H. Chang, and Y. Yu (2019) Non-local context encoder: robust biomedical image segmentation against adversarial attacks. In AAAI, pp. 8417–8424. Cited by: §1.
  • [33] X. Hu, K. Yang, L. Fei, and K. Wang (2019) Acnet: attention based network to exploit complementary features for RGBD semantic segmentation. In ICIP, pp. 1440–1444. Cited by: §3.2.
  • [34] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu (2014) Depth saliency based on anisotropic center-surround difference. In ICIP, pp. 1115–1119. Cited by: §2, §4.1, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6.
  • [35] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.
  • [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114. Cited by: §4.1.
  • [37] G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In CVPR, pp. 5455–5463. Cited by: §1.
  • [38] G. Li, X. Zhu, Y. Zeng, Q. Wang, and L. Lin (2019) Semantic relationships guided representation learning for facial action unit recognition. In AAAI, pp. 8594–8601. Cited by: §1.
  • [39] H. Li, G. Chen, G. Li, and Y. Yu (2019) Motion guided attention for video salient object detection. In ICCV, pp. 7274–7283. Cited by: §2.
  • [40] J. Li, Y. Song, J. Zhu, L. Cheng, Y. Su, L. Ye, P. Yuan, and S. Han (2019) Learning from large-scale noisy web data with ubiquitous reweighting for image classification. IEEE TPAMI. Cited by: §1.
  • [41] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu (2014) Saliency detection on light field. In CVPR, pp. 2806–2813. Cited by: §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
  • [42] F. Liang, L. Duan, W. Ma, Y. Qiao, Z. Cai, and L. Qing (2018) Stereoscopic saliency model using contrast and depth-guided-background prior. Neurocomputing 275, pp. 2227–2238. Cited by: §4.1, Table 1.
  • [43] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In CVPR, pp. 3917–3926. Cited by: item 1, §1, Table 6, §5.
  • [44] N. Liu, J. Han, and M. Yang (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In CVPR, pp. 3089–3098. Cited by: Table 6, §5.
  • [45] S. Liu, D. Huang, and Y. Wang (2018) Receptive field block net for accurate and fast object detection. In ECCV, pp. 404–419. Cited by: §3.2.
  • [46] Z. Liu, S. Shi, Q. Duan, W. Zhang, and P. Zhao (2019) Salient object detection for RGB-D image by single stream recurrent convolution neural network. Neurocomputing 363, pp. 46–57. Cited by: Figure 2, §1, §2.
  • [47] Y. Niu, Y. Geng, X. Li, and F. Liu (2012) Leveraging stereopsis for saliency analysis. In CVPR, pp. 454–461. Cited by: §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.
  • [48] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pp. 685–694. Cited by: §3.3.
  • [49] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji (2014) RGBD salient object detection: a benchmark and algorithms. In ECCV, pp. 92–109. Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6.
  • [50] Y. Piao, W. Ji, J. Li, M. Zhang, and H. Lu (2019) Depth-induced multi-scale recurrent attention network for saliency detection. In ICCV, pp. 7254–7263. Cited by: Figure 1, Figure 2, §1, §1, §2, §2, §3.3, §4.1, §4.1, §4.2, Table 1, Table 2, §5.
  • [51] L. Qiao, Y. Shi, J. Li, Y. Wang, T. Huang, and Y. Tian (2019) Transductive episodic-wise adaptive metric for few-shot learning. In ICCV, pp. 3603–3612. Cited by: §1.
  • [52] L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang (2017) RGBD salient object detection via deep fusion. IEEE TIP 26 (5), pp. 2274–2285. Cited by: §2, §2, §4.1, Table 1.
  • [53] J. Ren, X. Gong, L. Yu, W. Zhou, and M. Ying Yang (2015) Exploiting global priors for RGB-D saliency detection. In CVPRW, pp. 25–32. Cited by: §4.1, Table 1.
  • [54] R. Shigematsu, D. Feng, S. You, and N. Barnes (2017) Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features. In ICCVW, pp. 2749–2757. Cited by: §2, §2.
  • [55] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.
  • [56] H. Song, Z. Liu, H. Du, G. Sun, O. Le Meur, and T. Ren (2017) Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE TIP 26 (9), pp. 4204–4216. Cited by: §4.1, Table 1.
  • [57] B. Steiner, Z. DeVito, S. Chintala, S. Gross, A. Paszke, F. Massa, A. Lerer, G. Chanan, Z. Lin, E. Yang, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In NIPS, pp. 8024–8035. Cited by: §4.1.
  • [58] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian (2019) Selectivity or invariance: boundary-aware salient object detection. In ICCV, pp. 3798–3807. Cited by: §2.
  • [59] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan (2018) Salient object detection with recurrent fully convolutional networks. IEEE TPAMI 41 (7), pp. 1734–1746. Cited by: §3.2.
  • [60] N. Wang and X. Gong (2019) Adaptive fusion for RGB-D salient object detection. IEEE Access 7, pp. 55277–55284. Cited by: Figure 2, §4.1, Table 1.
  • [61] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In CVPR, pp. 3127–3135. Cited by: §1.
  • [62] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In ECCV, pp. 3–19. Cited by: §3.3, §3.3.
  • [63] Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In CVPR, pp. 3907–3916. Cited by: item 1, §1, §3.2, §3.2, Table 6.
  • [64] Z. Wu, L. Su, and Q. Huang (2019) Stacked cross refinement network for edge-aware salient object detection. In ICCV, pp. 7264–7273. Cited by: §5.
  • [65] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang (2019) Joint learning of saliency detection and weakly supervised semantic segmentation. In ICCV, pp. 7223–7233. Cited by: §1.
  • [66] J. Zhang, D. Fan, Y. Dai, S. Anwar, F. S. Saleh, T. Zhang, and N. Barnes (2020) UC-Net: uncertainty inspired RGB-D saliency detection via conditional variational autoencoders. In CVPR, pp. 8582–8591. Cited by: §2.
  • [67] L. Zhang, J. Wu, T. Wang, A. Borji, G. Wei, and H. Lu (2020) A multistage refinement network for salient object detection. IEEE TIP 29, pp. 3534–3545. Cited by: §2.
  • [68] Q. Zhang, N. Huang, L. Yao, D. Zhang, C. Shan, and J. Han (2020) RGB-T salient object detection via fusing multi-level CNN features. IEEE TIP 29, pp. 3321–3335. Cited by: §1.
  • [69] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In CVPR, pp. 714–722. Cited by: §2, Table 6, §5.
  • [70] Z. Zhang, Z. Lin, J. Xu, W. Jin, S. Lu, and D. Fan (2020) Bilateral attention network for RGB-D salient object detection. arXiv preprint arXiv:2004.14582. Cited by: §2.
  • [71] Z. Zhang, J. Xu, and M. Cheng (2020) Gradient-induced co-saliency detection. In ECCV, Cited by: §1.
  • [72] J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, and L. Zhang (2019) Contrast prior and fluid pyramid integration for RGBD salient object detection. In CVPR, pp. 3927–3936. Cited by: Figure 1, Figure 2, §1, §1, §2, §3.3, §4.1, §4.1, §4.2, Table 1, Table 2, §5.
  • [73] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet: Edge guidance network for salient object detection. In ICCV, pp. 8779–8788. Cited by: §1.
  • [74] C. Zhu, X. Cai, K. Huang, T. H. Li, and G. Li (2019) PDNet: prior-model guided depth-enhanced network for salient object detection. In ICME, pp. 199–204. Cited by: Figure 2, §1, §3.3.
  • [75] C. Zhu, G. Li, W. Wang, and R. Wang (2017) An innovative salient object detection using center-dark channel prior. In ICCVW, pp. 1509–1515. Cited by: §1, §2, §4.1, Table 1.
  • [76] C. Zhu and G. Li (2017) A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In ICCVW, pp. 3008–3014. Cited by: §4.1, Table 1, Table 2, Table 3, Table 4, Table 5.