M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection

09/16/2021
by   Xian Fang, et al.
Nankai University

Salient object detection is a fundamental topic in computer vision. Previous RGB-D based methods often suffer from the incompatibility of multi-modal feature fusion and the insufficiency of multi-scale feature aggregation. To tackle these two dilemmas, we propose a novel multi-modal and multi-scale refined network (M2RNet). Three essential components are presented in this network. The nested dual attention module (NDAM) explicitly exploits the combined features of the RGB and depth flows. The adjacent interactive aggregation module (AIAM) gradually integrates the neighboring features of high, middle and low levels. The joint hybrid optimization loss (JHOL) endows the predictions with prominent outlines. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.

1 Introduction

Salient object detection (SOD) aims to identify the most conspicuous objects that attract human attention in a scene. It has been successfully applied in various fields, such as image retrieval Gao et al. (2015); Yang et al. (2015), robot navigation Craye et al. (2016), person re-identification Zhao et al. (2013) and many more.

SOD methods have exhibited broad prospects owing to the powerful representation ability of convolutional neural networks (CNNs) LeCun et al. (1998) and fully convolutional networks (FCNs) Long et al. (2015). Most of them resort to RGB information alone, which makes it difficult to achieve satisfactory results in complex scenes. At present, depth information has become increasingly popular thanks to the emergence of affordable and portable devices. As a supplement to RGB features, depth features provide rich distance information. However, the inherent differences between the two modalities lead to a bottleneck in feature fusion. Moreover, although the features at each scale carry detail or semantic information, it is hard to aggregate them adequately.

Figure 1: Visual comparison of our method and two state-of-the-art approaches on several examples. (a) RGB; (b) Depth; (c) Ground truth; (d) Ours; (e) S2MA Liu et al. (2020); (f) CPFP Zhao et al. (2019a).

Figure 2: The overall architecture of the proposed M2RNet. The backbone of our network is VGG-16 Simonyan and Zisserman (2014).

To this end, we propose a novel multi-modal and multi-scale refined network (M2RNet) for RGB-D salient object detection. Specifically, the network is composed of the presented nested dual attention module (NDAM), adjacent interactive aggregation module (AIAM) and joint hybrid optimization loss (JHOL). In NDAM, we sequentially leverage channel and spatial attention to explicitly learn what and where is meaningful, thereby emphasizing important RGB and depth information and suppressing unnecessary information, for the purpose of boosting the fusion of RGB and depth features. In AIAM, we leverage the interaction of progressive and jumping connections in parallel to gradually learn information at rich resolutions, for the purpose of boosting the aggregation of high-level, middle-level and low-level features. In JHOL, we are devoted to guaranteeing inter-class discrimination and intra-class consistency by taking the local and global correlation of each pixel into account. These three components work together to achieve remarkable detection results. A visual comparison is shown in Fig. 1. Compared with the saliency maps obtained by state-of-the-art approaches, those of our method are more accurate.

Our main contributions are summarized as follows:

  • We propose a multi-modal and multi-scale refined network (M2RNet), which is equipped with the presented nested dual attention module (NDAM), adjacent interactive aggregation module (AIAM) and joint hybrid optimization loss (JHOL). Our network is capable of refining the multi-modal and multi-scale features simultaneously, with nearly no extra computational cost under the specific supervision.

  • We conduct extensive experiments on seven datasets and demonstrate that our method achieves consistently superior performance against 12 state-of-the-art approaches, including three RGB salient object detection approaches and nine RGB-D salient object detection approaches, in terms of six evaluation metrics.

2 Related Work

2.1 RGB Saliency Detection

Lots of RGB salient object detection methods have been developed during the past decades.

For instance, Zhang et al. Zhang et al. (2018) proposed a PAGR, which selectively integrates multi-level contextual information. Liu et al. Liu et al. (2018) proposed a PiCANet, which generates attention over the context regions for each pixel. Su et al. Su et al. (2019) proposed a BANet, which enhances the feature selectivity at boundaries and keeps the feature invariance at interiors. Zhao et al. Zhao et al. (2019b) proposed an EGNet, which explores edge information to preserve salient object boundaries. Wu et al. Wu et al. (2019) proposed a CPD, which uses a cascaded partial decoder to discard low-level features. Liu et al. Liu et al. (2019) proposed a PoolNet, which explores the potential of pooling. Zhang et al. Zhang et al. (2019) proposed a CapSal, which leverages image captioning for detection. Qin et al. Qin et al. (2019) proposed a BASNet, which focuses on end-to-end boundary-aware detection. Pang et al. Pang et al. (2020b) proposed a MINet, which exchanges information between multi-scale features. Wei et al. Wei et al. (2020) proposed an LDF, which decouples the saliency label into a body map and a detail map for iterative information exchange.

2.2 RGB-D Saliency Detection

Most recently, RGB-D salient object detection methods have rapidly aroused the concern of researchers and made impressive progress.

For instance, Chen and Li Chen and Li (2019) proposed a TANet, which combines the bottom-up stream and the top-down stream to learn cross-modal complementarity. Wang and Gong Wang and Gong (2019) proposed an AFNet, which adaptively fuses the predictions from the separate RGB and depth streams using a switch map. Piao et al. Piao et al. (2019) proposed a DMRA, which involves residual connections, multi-scale weighting and recurrent attention. Zhao et al. Zhao et al. (2019a) proposed a CPFP by making use of contrast prior and fluid pyramid feature fusion. Fan et al. Fan et al. (2019) proposed a D3Net, which automatically discards low-quality depth maps via a gate connection. Li et al. Li et al. (2020a) proposed an ICNet, which learns the optimal conversion of RGB features and depth features to autonomously merge them. Pang et al. Pang et al. (2020a) proposed an HDFNet, in which the features of the network are densely connected and passed through a dynamic dilated pyramid. Fan et al. Fan et al. (2020) proposed a BBS-Net, in which multi-level features are partitioned into teacher and student features in a cascade network. Zhao et al. Zhao et al. (2020) proposed a DANet, which explores early fusion and middle fusion between RGB and depth. Li et al. Li et al. (2020b) proposed a CMWNet, which weights the fusion of low, middle and high levels to encourage feature interaction. Fu et al. Fu et al. (2020) proposed a JL-DCF for joint learning and densely-cooperative fusion. Zhang et al. Zhang et al. (2020) proposed a UC-Net, which learns the distribution of saliency maps with a conditional variational autoencoder. Piao et al. Piao et al. (2020) proposed an A2dele, which uses network prediction and attention as two bridges to transfer knowledge from the depth stream to the RGB stream. Liu et al. Liu et al. (2020) proposed an S2MA, which reweights the mutual attention to filter out unreliable modality information.

3 Proposed Method

3.1 Network Overview

The proposed multi-modal and multi-scale refined network (M2RNet) is an encoder-decoder architecture that covers the nested dual attention module (NDAM), the adjacent interactive aggregation module (AIAM) and the joint hybrid optimization loss (JHOL), as shown in Fig. 2. The encoder produces multi-level features from the RGB branch and the depth branch, and the decoder produces the corresponding output features. The combined features of the RGB flow and the depth flow are strengthened by NDAM to obtain the corresponding enhanced features. For each group of three consecutive features (high, middle and low levels), AIAM produces the aggregated features. To facilitate the optimization, we embed JHOL as an auxiliary loss.


Figure 3: Illustration of NDAM.

Figure 4: Illustration of AIAM.
Dataset | Evaluation metric (rows within each dataset block, top to bottom: S-measure ↑, maximum F-measure ↑, average F-measure ↑, weighted F-measure ↑, E-measure ↑, MAE ↓) | Methods (columns, left to right): EGNet Zhao et al. (2019b), CPD Wu et al. (2019), PoolNet Liu et al. (2019), DF Qu et al. (2017), CTMF Han et al. (2017), MMCI Chen et al. (2019), TANet Chen and Li (2019), AFNet Wang and Gong (2019), DMRA Piao et al. (2019), CPFP Zhao et al. (2019a), D3Net Fan et al. (2019), S2MA Liu et al. (2020), M2RNet (ours)
STEREO Niu et al. (2012) 0.757 0.848 0.873 0.871 0.825 0.752 0.879 0.891 0.890 0.899
0.789 0.848 0.877 0.878 0.848 0.802 0.889 0.897 0.895 0.913
0.742 0.771 0.829 0.835 0.807 0.762 0.830 0.833 0.855 0.867
0.549 0.698 0.760 0.787 0.752 0.647 0.817 0.815 0.825 0.851
0.838 0.870 0.905 0.916 0.887 0.816 0.907 0.911 0.926 0.929
0.141 0.086 0.068 0.060 0.075 0.086 0.051 0.054 0.051 0.042
NLPR Peng et al. (2014) 0.867 0.885 0.867 0.769 0.860 0.856 0.886 0.798 0.899 0.888 0.905 0.915 0.918
0.857 0.889 0.844 0.753 0.840 0.841 0.876 0.816 0.888 0.888 0.905 0.910 0.921
0.800 0.840 0.791 0.682 0.723 0.729 0.795 0.746 0.855 0.821 0.832 0.846 0.862
0.774 0.829 0.771 0.524 0.691 0.688 0.789 0.699 0.846 0.819 0.833 0.855 0.848
0.910 0.925 0.900 0.838 0.869 0.871 0.916 0.884 0.942 0.923 0.932 0.937 0.941
0.047 0.037 0.046 0.099 0.056 0.856 0.041 0.060 0.031 0.036 0.034 0.030 0.033
RGBD135 Cheng et al. (2014) 0.876 0.891 0.886 0.681 0.863 0.848 0.858 0.770 0.900 0.872 0.904 0.941 0.934
0.900 0.910 0.906 0.626 0.865 0.839 0.853 0.775 0.907 0.882 0.917 0.944 0.937
0.843 0.869 0.864 0.573 0.778 0.762 0.795 0.730 0.866 0.829 0.876 0.906 0.910
0.780 0.824 0.807 0.383 0.686 0.650 0.739 0.641 0.843 0.787 0.831 0.892 0.903
0.930 0.930 0.940 0.806 0.911 0.904 0.919 0.874 0.944 0.927 0.956 0.974 0.971
0.037 0.032 0.032 0.132 0.055 0.065 0.046 0.068 0.030 0.038 0.030 0.021 0.019
LFSD Li et al. (2014) 0.818 0.806 0.826 0.776 0.796 0.787 0.801 0.738 0.847 0.828 0.832 0.837 0.842
0.838 0.834 0.846 0.854 0.815 0.813 0.827 0.780 0.872 0.850 0.849 0.862 0.861
0.803 0.808 0.790 0.811 0.780 0.779 0.786 0.742 0.849 0.813 0.801 0.820 0.825
0.745 0.753 0.757 0.618 0.695 0.663 0.718 0.671 0.811 0.775 0.756 0.772 0.786
0.854 0.856 0.852 0.841 0.851 0.840 0.845 0.810 0.899 0.867 0.860 0.876 0.874
0.102 0.097 0.094 0.151 0.120 0.132 0.111 0.133 0.075 0.088 0.099 0.094 0.088
NJU2K Ju et al. (2015) 0.869 0.862 0.872 0.735 0.849 0.859 0.878 0.771 0.886 0.895 0.910
0.880 0.880 0.887 0.790 0.857 0.868 0.888 0.804 0.896 0.903 0.922
0.846 0.853 0.850 0.744 0.779 0.803 0.844 0.766 0.872 0.819 0.841
0.808 0.821 0.816 0.553 0.731 0.749 0.812 0.699 0.853 0.839 0.854
0.905 0.908 0.908 0.818 0.864 0.878 0.909 0.846 0.921 0.891 0.904
0.060 0.059 0.057 0.151 0.085 0.079 0.061 0.103 0.051 0.051 0.049
DUT-RGBD Piao et al. (2019) 0.872 0.874 0.892 0.719 0.830 0.791 0.808 0.888 0.749 0.903 0.903
0.897 0.892 0.907 0.775 0.842 0.804 0.823 0.908 0.787 0.909 0.925
0.861 0.863 0.866 0.748 0.790 0.751 0.771 0.883 0.735 0.866 0.892
0.797 0.819 0.829 0.514 0.681 0.626 0.703 0.852 0.636 0.856 0.864
0.914 0.915 0.924 0.842 0.882 0.855 0.866 0.930 0.815 0.921 0.935
0.060 0.059 0.050 0.150 0.097 0.113 0.093 0.048 0.100 0.046 0.042
SIP Fan et al. (2019) 0.653 0.716 0.833 0.835 0.720 0.800 0.850 0.864 0.882
0.704 0.720 0.840 0.851 0.756 0.847 0.870 0.882 0.902
0.673 0.684 0.795 0.809 0.705 0.815 0.819 0.831 0.868
0.406 0.535 0.712 0.748 0.617 0.734 0.788 0.793 0.840
0.794 0.824 0.886 0.894 0.815 0.858 0.899 0.903 0.921
0.185 0.139 0.086 0.075 0.118 0.088 0.064 0.063 0.049
Table 1: Quantitative comparison of different methods on seven datasets in terms of six evaluation metrics. ↑ and ↓ indicate that larger and smaller scores are better, respectively. The best three results are highlighted in red, blue and green. Marked methods use VGG-19 Simonyan and Zisserman (2014) or ResNet-50 He et al. (2016) as the backbone; if not marked, VGG-16 Simonyan and Zisserman (2014) is used as the backbone. '–' means no data available.

3.2 Nested Dual Attention Module

There are some main issues in fusing RGB and depth features. The key one is that RGB and depth features are incompatible to a certain extent, owing to the inherent differences between the two modalities. Besides, low-quality depth maps inevitably bring more noise than cues.

In view of these issues, we propose a nested dual attention module (NDAM) to promote the coordination of multi-modal features and to reduce the noise contamination of the depth map. The two-phase attention mechanism is elaborately designed to mine potential features. Here, the channel attention of each phase is responsible for excavating the inter-channel relationship of features, while the spatial attention of each phase is responsible for excavating the inter-spatial relationship of features. By directly merging the RGB features and the depth features, the combined features are easily calculated. Furthermore, the corresponding enhanced features are eventually obtained after the reinforcement of the nested attention. The procedure of RGB and depth feature fusion can be described as:

(1)
(2)

where the two operators denote the channel attention and the spatial attention, respectively. The nested attention is roughly divided into two phases, as shown in Fig. 3.

In the first phase, given an intermediate feature of size channel × height × width, the channel attention and the spatial attention are defined as:

(3)
(4)

where the operators denote the transpose operation and the Softmax function, respectively. Note that the channel dimension of the feature in Eq. (4) is reduced to 1/8 of the original for computational efficiency.

In the second phase, for the feature, which can be reshaped to match the first-phase feature, the channel attention and the spatial attention are defined as:

(5)
(6)

where the operators denote the element-wise multiplication, the Sigmoid function, the multi-layer perceptron, the global max pooling operation and the global max pooling along the channel dimension, respectively. Note that we use global max pooling rather than global average pooling, since our goal is to find the area with the biggest visual influence.
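To make the two-phase mechanism concrete, the following PyTorch sketch applies a non-local style channel/spatial attention in the first phase and a Sigmoid-gated channel/spatial attention built on global max pooling in the second phase to the merged RGB and depth features. It is an illustrative reading of the description above rather than the authors' released code; the residual additions, the 7×7 spatial convolution, the additive merging of the two modalities and the module names are assumptions, while the 1/8 channel reduction and the preference for max pooling follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhaseOneAttention(nn.Module):
    """First phase: non-local style channel and spatial attention (softmax affinities, transposes)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # query/key channels reduced to 1/8 of the original, as stated for Eq. (4)
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                                    # B x C x HW
        # channel attention: C x C affinity between the feature and its transpose
        chan = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)  # B x C x C
        x = (chan @ flat).view(b, c, h, w) + x
        # spatial attention: HW x HW affinity from reduced-channel query/key
        q = self.query(x).view(b, -1, h * w).transpose(1, 2)       # B x HW x C/8
        k = self.key(x).view(b, -1, h * w)                         # B x C/8 x HW
        attn = torch.softmax(q @ k, dim=-1)                        # B x HW x HW
        v = self.value(x).view(b, c, h * w)                        # B x C x HW
        return (v @ attn.transpose(1, 2)).view(b, c, h, w) + x

class PhaseTwoAttention(nn.Module):
    """Second phase: Sigmoid-gated channel then spatial attention built on global max pooling."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(1, 1, 7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        gmp = F.adaptive_max_pool2d(x, 1).view(b, c)     # global max pooling (preferred over average)
        x = x * torch.sigmoid(self.mlp(gmp)).view(b, c, 1, 1)
        smap = x.max(dim=1, keepdim=True).values         # max pooling along the channel dimension
        return x * torch.sigmoid(self.spatial(smap))

class NDAMSketch(nn.Module):
    """Merge RGB and depth features, then refine them with the two nested attention phases."""
    def __init__(self, channels):
        super().__init__()
        self.phase1 = PhaseOneAttention(channels)
        self.phase2 = PhaseTwoAttention(channels)

    def forward(self, f_rgb, f_depth):
        combined = f_rgb + f_depth   # "directly merging" the two modalities (addition is an assumption)
        return self.phase2(self.phase1(combined))
```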

Figure 5: The PR curves of different methods on seven datasets. (a) STEREO Niu et al. (2012); (b) NLPR Peng et al. (2014); (c) RGBD135 Cheng et al. (2014); (d) LFSD Li et al. (2014); (e) NJU2K Ju et al. (2015); (f) DUT-RGBD Piao et al. (2019); (g) SIP Fan et al. (2019).
Figure 6: The F-measure curves of different methods on seven datasets. (a) STEREO Niu et al. (2012); (b) NLPR Peng et al. (2014); (c) RGBD135 Cheng et al. (2014); (d) LFSD Li et al. (2014); (e) NJU2K Ju et al. (2015); (f) DUT-RGBD Piao et al. (2019); (g) SIP Fan et al. (2019).

3.3 Adjacent Interactive Aggregation Module

In general, shallower features carry more detail information, while deeper features carry more semantic information. Aggregating multi-level features with different resolutions enables the context information to be integrated as sufficiently as possible.

Relying on this, we propose an adjacent interactive aggregation module (AIAM) to guide the interaction of multi-scale features. In this way, the neighboring features, which are closely correlated, constantly complement each other. The interaction among the features of each triple of adjacent levels helps generate the desired resulting features. The procedure of feature aggregation across different levels can be described as:

(7)
(8)

where the two operators denote two kinds of feature interactions, as shown in Fig. 4. The difference between them is that the interaction of the former is progressive, while that of the latter is jumping in the initial stage. After that, the relevant features repeatedly pass through several convolutional, batch normalization and ReLU layers, and are finally associated with a residual block.
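The sketch below spells out one plausible reading of this aggregation for a triple of adjacent features, assuming all three inputs have the same channel count: the progressive path fuses high, middle and low levels step by step, while the jumping path lets the high level skip the middle one. Layer sizes, the bilinear upsampling and the final residual block layout are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class AIAMSketch(nn.Module):
    """Aggregate three adjacent-level features with parallel progressive and jumping interactions."""
    def __init__(self, channels):
        super().__init__()
        self.prog1 = conv_bn_relu(2 * channels, channels)   # high + middle
        self.prog2 = conv_bn_relu(2 * channels, channels)   # (high + middle) + low
        self.jump = conv_bn_relu(2 * channels, channels)    # high + low, skipping the middle level
        self.merge = conv_bn_relu(2 * channels, channels)
        self.res = nn.Sequential(conv_bn_relu(channels, channels),
                                 nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels))

    def forward(self, f_high, f_mid, f_low):
        size = f_low.shape[-2:]
        f_high = F.interpolate(f_high, size=size, mode='bilinear', align_corners=False)
        f_mid = F.interpolate(f_mid, size=size, mode='bilinear', align_corners=False)
        # progressive interaction: neighbouring levels fused step by step
        prog = self.prog1(torch.cat([f_high, f_mid], dim=1))
        prog = self.prog2(torch.cat([prog, f_low], dim=1))
        # jumping interaction: the high level is connected to the low level in the initial stage
        jump = self.jump(torch.cat([f_high, f_low], dim=1))
        # conv-BN-ReLU refinement followed by a residual block
        out = self.merge(torch.cat([prog, jump], dim=1))
        return F.relu(out + self.res(out))
```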

3.4 Joint Hybrid Optimization Loss

Let P and G denote the predicted saliency map and the ground truth saliency map, respectively. As the most classical loss function, the binary cross entropy loss (BCEL), symbolized by \mathcal{L}_{bce}, can be formulated as:

\mathcal{L}_{bce} = -\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\big[ G_{ij}\log P_{ij} + (1-G_{ij})\log(1-P_{ij}) \big]   (9)

where H and W are the height and width of the image, respectively.
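A minimal sketch of Eq. (9) as reconstructed above, averaging the per-pixel cross entropy over an H×W map; with the default mean reduction, PyTorch's built-in binary cross entropy gives the same value.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred, gt, eps=1e-7):
    """Binary cross entropy of Eq. (9); pred and gt are saliency maps with values in [0, 1]."""
    pred = pred.clamp(eps, 1.0 - eps)                    # avoid log(0)
    per_pixel = -(gt * torch.log(pred) + (1.0 - gt) * torch.log(1.0 - pred))
    return per_pixel.mean()                              # average over all H x W pixels

# equivalent built-in: F.binary_cross_entropy(pred, gt)
```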

Figure 7: Qualitative comparison of some methods on several representative examples. (a) RGB; (b) Depth; (c) Ground truth; (d) Ours; (e) D3Net Fan et al. (2019); (f) DMRA Piao et al. (2019); (g) TANet Chen and Li (2019); (h) DF Qu et al. (2017).

To reinforce the capability of supervision from both local and global aspects, we propose a joint hybrid optimization loss (JHOL), symbolized by \mathcal{L}_{jhol}, as an auxiliary loss function, which can be formulated as:

(10)

where the four trade-off parameters control the balance among the four terms of the loss. For the sake of simplicity, they are all set to 1. To be specific, the four terms are defined as:

(11)
(12)
(13)
(14)

On the one hand, two of the terms focus on the local correlation of each pixel to ensure a certain degree of discrimination. On the other hand, the other two terms focus on the global correlation of each pixel to ensure a certain degree of consistency. As a result, \mathcal{L}_{jhol} jointly makes the model capture the foreground region as smoothly as possible and filter out the background region as steadily as possible.

Therefore, the total loss can be written as:

\mathcal{L}_{total} = \mathcal{L}_{bce} + \lambda \mathcal{L}_{jhol}   (15)

where \lambda is the parameter to control the trade-off between \mathcal{L}_{bce} and \mathcal{L}_{jhol}. In practice, it is also set to 1.
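Since the concrete forms of the four JHOL terms are not recoverable from this version of the text, the sketch below only illustrates how Eqs. (10) and (15) compose: four weighted auxiliary terms added on top of the BCE term, with all weights set to 1. The BCE, L1 and soft-IoU terms used here are common local/global choices inserted purely as placeholders, not the paper's actual definitions.

```python
import torch
import torch.nn.functional as F

def iou_term(pred, gt, eps=1e-7):
    # a common global-correlation term (soft IoU), used here only as an illustrative placeholder
    inter = (pred * gt).sum(dim=(-2, -1))
    union = (pred + gt - pred * gt).sum(dim=(-2, -1))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def jhol(pred, gt, terms, weights=(1.0, 1.0, 1.0, 1.0)):
    # Eq. (10): a weighted sum of four terms; all trade-off parameters are set to 1 in the paper
    return sum(w * t(pred, gt) for t, w in zip(terms, weights))

def total_loss(pred, gt, lam=1.0):
    # Eq. (15): BCE plus the JHOL auxiliary loss, with the trade-off parameter also set to 1
    terms = [F.binary_cross_entropy, F.l1_loss, iou_term, iou_term]   # placeholder local/global terms
    return F.binary_cross_entropy(pred, gt) + lam * jhol(pred, gt, terms)
```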

No. Baseline +NDAM +AIAM +JHOL
+ + + + + + + +
1 0.885 0.820 0.775 0.899 0.872 0.064
2 0.893 0.833 0.790 0.907 0.880 0.060
3 0.887 0.827 0.779 0.903 0.875 0.063
4 0.894 0.837 0.796 0.910 0.882 0.059
5 0.896 0.839 0.804 0.914 0.885 0.058
6 0.890 0.832 0.788 0.906 0.878 0.061
7 0.891 0.865 0.812 0.919 0.873 0.055
8 0.891 0.865 0.812 0.919 0.873 0.055
9 0.892 0.812 0.786 0.899 0.866 0.061
10 0.893 0.848 0.819 0.916 0.876 0.053
11 0.887 0.819 0.778 0.900 0.871 0.063
12 0.899 0.856 0.831 0.921 0.880 0.050
13 0.913 0.873 0.854 0.929 0.897 0.043
Table 2: Ablation study of our method in terms of the uniform evaluation metrics. ↑ and ↓ indicate that larger and smaller scores are better, respectively. The best results are in bold.
Figure 8: Visual comparison of the impact of each component of our method on several examples. (a) RGB; (b) Depth; (c) Ground truth; (d) Baseline; (e) +NDAM; (f) +AIAM; (g) +JHOL; (h) +NDAM+AIAM+JHOL.

4 Experiments

4.1 Experimental Setup

Datasets. We choose seven benchmark datasets, including STEREO Niu et al. (2012), NLPR Peng et al. (2014), RGBD135 Cheng et al. (2014), LFSD Li et al. (2014), NJU2K Ju et al. (2015), DUT-RGBD Piao et al. (2019) and SIP Fan et al. (2019), as the experimental material. STEREO, also known as SSB1000, contains 1000 pairs of stereoscopic images gathered from the Internet. NLPR contains 1000 images taken under different illumination conditions. RGBD135, also called DES, contains 135 images of indoor scenes. LFSD is relatively small and contains 100 images. NJU2K contains 1985 images collected from the Internet, 3D movies and photographs. DUT-RGBD contains 1200 images taken in varied real-life situations. SIP is relatively new and contains 929 images of humans. Following Pang et al. (2020a), we use 700 samples from NLPR, 1485 samples from NJU2K, and 800 samples from DUT-RGBD as the training set. The remaining samples and the other datasets are used as the testing set.

Evaluation Metrics. We employ six typical evaluation metrics, including S-measure Fan et al. (2017), maximum F-measure Achanta et al. (2009), average F-measure Achanta et al. (2009), weighted F-measure Margolin et al. (2014), E-measure Fan et al. (2018) and mean absolute error (MAE) Perazzi et al. (2012), to comprehensively evaluate the performance of the competitors. In addition, we also plot the precision-recall (PR) curves and F-measure curves.
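As a concrete reference for the simpler metrics, the sketch below computes MAE and the F-measure at a single binarization threshold with the conventional β² = 0.3; the maximum and average F-measures sweep a set of thresholds and take the maximum or mean of the resulting scores. S-measure and E-measure follow the cited definitions and are omitted here.

```python
import torch

def mae(pred, gt):
    # mean absolute error between a predicted saliency map and its ground truth, both in [0, 1]
    return (pred - gt).abs().mean().item()

def f_measure(pred, gt, threshold=0.5, beta2=0.3, eps=1e-7):
    # F-measure at one binarization threshold, with the conventional beta^2 = 0.3 weighting
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall + eps)).item()
```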

Implementation Details.

We implement our method based on the PyTorch toolbox with a single GeForce RTX 2080 Ti GPU. VGG-16 Simonyan and Zisserman (2014) is adopted as the backbone. Each input image is simply resized to 320×320 and then fed into the network to obtain the prediction, without any other pre-processing (e.g., HHA Gupta et al. (2014)) or post-processing (e.g., CRF Krähenbühl and Koltun (2011)). To avoid over-fitting, flipping, cropping and rotation are used for data augmentation. The stochastic gradient descent (SGD) optimizer is used with a batch size of 4, a momentum of 0.9 and a weight decay of 5e-4. Training is stopped after 30 epochs.
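The stated hyper-parameters translate into a straightforward training skeleton; the model, dataset and learning rate below are placeholders (the learning rate is not given in this text), and the full method would use the total loss of Eq. (15) instead of the bare BCE shown here.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, train_set, epochs=30, device='cuda'):
    # SGD with batch size 4, momentum 0.9 and weight decay 5e-4, as stated in the text
    loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,   # learning rate: an assumed value
                                momentum=0.9, weight_decay=5e-4)
    model.to(device).train()
    for _ in range(epochs):
        # the dataset is assumed to resize inputs to 320x320 and apply flip/crop/rotation augmentation
        for rgb, depth, gt in loader:
            rgb, depth, gt = rgb.to(device), depth.to(device), gt.to(device)
            pred = model(rgb, depth)
            loss = F.binary_cross_entropy(pred, gt)   # the full method adds the JHOL auxiliary loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```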

4.2 Comparison with State-of-the-arts

The proposed method is compared with 12 other state-of-the-art approaches, including three RGB approaches (i.e., EGNet Zhao et al. (2019b), CPD Wu et al. (2019) and PoolNet Liu et al. (2019)) and nine RGB-D approaches (i.e., DF Qu et al. (2017), CTMF Han et al. (2017), MMCI Chen et al. (2019), TANet Chen and Li (2019), AFNet Wang and Gong (2019), DMRA Piao et al. (2019), CPFP Zhao et al. (2019a), D3Net Fan et al. (2019) and S2MA Liu et al. (2020)). For fair comparisons, all saliency maps of these methods are provided by the authors or computed by their released codes with default settings.

Quantitative Comparison. The quantitative comparison results are reported in Table 1. It can be seen that our method performs best in almost all cases. For an intuitive comparison, the PR curves and F curves are shown in Fig. 5 and Fig. 6, respectively. Obviously, the curves generated by our method are closer to the top and straighter in a large range than others, which reflects its excellence and stability.

Qualitative Comparison. The qualitative comparison results are shown in Fig. 7. It can be observed that our method can handle a wide variety of challenging scenes, such as blurred foreground, cluttered background, low contrast and multiple objects. More specifically, our method yields a clear foreground, a clean background, complete structures and sharp boundaries. These results prove that our method is able to utilize cross-modal complementary information, which not only achieves reinforcement from reliable depth maps but also prevents contamination from unreliable depth maps.

4.3 Ablation Studies

A series of ablation studies is conducted to investigate the impact of each core component of our method. The ablation results are reported in Table 2. The baseline, corresponding to scheme No. 1 (i.e., the 1st row), refers to an FPN-like network Lin et al. (2017). It should be pointed out that the uniform evaluation metrics are defined as weighted sums of the per-dataset scores, weighted by the proportion of each dataset among all test datasets. The visual comparison results are shown in Fig. 8.
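A small sketch of this dataset-proportional weighting, following the stated definition; the example values simply reuse the NLPR and LFSD S-measure scores from Table 1 with illustrative test-set sizes.

```python
def uniform_metric(per_dataset_scores, dataset_sizes):
    # weighted sum of per-dataset scores, each weighted by that test set's share of all test samples
    total = sum(dataset_sizes.values())
    return sum(score * dataset_sizes[name] / total
               for name, score in per_dataset_scores.items())

# e.g. uniform_metric({'NLPR': 0.918, 'LFSD': 0.842}, {'NLPR': 300, 'LFSD': 100})
```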

Effect of NDAM. NDAM consists of two parts, namely the attentions of the first and second phases. By adding them into the baseline individually and collectively, corresponding to schemes No. 2-3 (i.e., the 2nd and 3rd rows) and scheme No. 4 (i.e., the 4th row), the performance is well improved. This confirms that NDAM does indeed offer additional valuable information from the channel and spatial perspectives.

Effect of AIAM. Similarly, AIAM consists of two parts, namely the progressive interaction and the jumping interaction. By adding them into the baseline individually and collectively, corresponding to schemes No. 5-6 (i.e., the 5th and 6th rows) and scheme No. 7 (i.e., the 7th row), the performance is greatly improved. This reveals the advantage of the feature aggregation of AIAM.

Effect of JHOL. Also similarly, JHOL consists of four parts, i.e., its four loss terms. By adding them into the baseline individually and collectively, corresponding to schemes No. 8-11 (i.e., the 8th to 11th rows) and scheme No. 12 (i.e., the 12th row), the performance is significantly improved. In particular, when these parts are collectively added into the baseline, there are improvements of 1.4%, 3.6%, 5.6%, 2.2%, 0.8% and 1.4% in terms of the uniform evaluation metrics, in order. This suggests that the use of JHOL is crucial for the task.

5 Conclusion

In this paper, we propose the multi-modal and multi-scale refined network named M2RNet for detecting salient objects. To start with, we present the nested dual attention module, which boosts the fusion of multi-modal features via phased channel and spatial attention. Next, we present the adjacent interactive aggregation module, which boosts the aggregation of multi-scale features in the form of progressive and jumping connections. Last but not least, we present the joint hybrid optimization loss, which alleviates the imbalance of pixels from both local and global aspects. Exhaustive experimental results demonstrate the superiority of our method over 12 state-of-the-art approaches.

CRediT authorship contribution statement

Xian Fang: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Writing - original draft, Visualization. Jinchao Zhu: Data Curation, Writing - review & editing, Visualization. Ruixun Zhang: Writing - review & editing. Xiuli Shao: Writing - review & editing. Hongpeng Wang: Writing - review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was supported by the National Key R&D Program of China under Grant 2019YFB1311804, the National Natural Science Foundation of China under Grant 61973173, 91848108 and 91848203, and the Technology Research and Development Program of Tianjin under Grant 18ZXZNGX00340 and 20YFZCSY00830.

References

  • R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    ,
    pp. 1597–1604. Cited by: §4.1.
  • H. Chen, Y. Li, and D. Su (2019) Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition 86, pp. 376–385. Cited by: Table 1, §4.2.
  • H. Chen and Y. Li (2019) Three-stream attention-aware network for RGB-D salient object detection. IEEE Transactions on Image Processing 28 (6), pp. 2825–2835. Cited by: §2.2, Figure 7, Table 1, §4.2.
  • Y. Cheng, H. Fu, X. Wei, J. Xiao, and X. Cao (2014) Depth enhanced saliency detection method. In Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS), pp. 23–27. Cited by: Figure 5, Figure 6, Table 1, §4.1.
  • C. Craye, D. Filliat, and J. Goudou (2016) Environment exploration for object-based visual saliency learning. In Proceedings of the International Conference on Robotics and Automation (ICRA), pp. 2303–2309. Cited by: §1.
  • D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: A new way to evaluate foreground maps. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 4548–4557. Cited by: §4.1.
  • D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 698–704. Cited by: §4.1.
  • D. Fan, Z. Lin, Z. Zhang, M. Zhu, and M. Cheng (2019) Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781. Cited by: §2.2, Figure 5, Figure 6, Figure 7, Table 1, §4.1, §4.2.
  • D. Fan, Y. Zhai, A. Borji, J. Yang, and L. Shao (2020) BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 275–292. Cited by: §2.2.
  • K. Fu, D. Fan, G. Ji, and Q. Zhao (2020) JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3052–3062. Cited by: §2.2.
  • Y. Gao, M. Shi, D. Tao, and C. Xu (2015) Database saliency for fast image retrieval. IEEE Transactions on Multimedia 17 (3), pp. 359–369. Cited by: §1.
  • S. Gupta, R. Girshick, P. Arbeláez, and J. Malik (2014) Learning rich features from RGB-D images for object detection and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 345–360. Cited by: §4.1.
  • J. Han, H. Chen, N. Liu, C. Yan, and X. Li (2017) CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Cybernetics 48 (11), pp. 3171–3183. Cited by: Table 1, §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: Table 1.
  • R. Ju, Y. Liu, T. Ren, L. Ge, and G. Wu (2015) Depth-aware salient object detection using anisotropic center-surround difference. Signal Processing: Image Communication 38, pp. 115–126. Cited by: Figure 5, Figure 6, Table 1, §4.1.
  • P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with gaussian edge potentials. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), pp. 109–117. Cited by: §4.1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • G. Li, Z. Liu, and H. Ling (2020a) ICNet: Information conversion network for RGB-D based salient object detection. IEEE Transactions on Image Processing 29, pp. 4873–4884. Cited by: §2.2.
  • G. Li, Z. Liu, L. Ye, Y. Wang, and H. Ling (2020b) Cross-modal weighting network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 665–681. Cited by: §2.2.
  • N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu (2014) Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2806–2813. Cited by: Figure 5, Figure 6, Table 1, §4.1.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125. Cited by: §4.3.
  • J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3917–3926. Cited by: §2.1, Table 1, §4.2.
  • N. Liu, J. Han, and M. Yang (2018) PiCANet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3089–3098. Cited by: §2.1.
  • N. Liu, N. Zhang, and J. Han (2020) Learning selective self-mutual attention for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13756–13765. Cited by: Figure 1, §2.2, Table 1, §4.2.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §1.
  • R. Margolin, L. Zelnik-Manor, and A. Tal (2014) How to evaluate foreground maps?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.1.
  • Y. Niu, Y. Geng, X. Li, and F. Liu (2012) Leveraging stereopsis for saliency analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 454–461. Cited by: Figure 5, Figure 6, Table 1, §4.1.
  • Y. Pang, L. Zhang, X. Zhao, and H. Lu (2020a) Hierarchical dynamic filtering network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 235–252. Cited by: §2.2, §4.1.
  • Y. Pang, X. Zhao, L. Zhang, and H. Lu (2020b) Multi-scale interactive network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9413–9422. Cited by: §2.1.
  • H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji (2014) RGBD salient object detection: A benchmark and algorithms. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 92–109. Cited by: Figure 5, Figure 6, Table 1, §4.1.
  • F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung (2012) Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 733–740. Cited by: §4.1.
  • Y. Piao, W. Ji, J. Li, M. Zhang, and H. Lu (2019) Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 7254–7263. Cited by: §2.2, Figure 5, Figure 6, Figure 7, Table 1, §4.1, §4.2.
  • Y. Piao, Z. Rong, M. Zhang, W. Ren, and H. Lu (2020) A2dele: Adaptive and attentive depth distiller for efficient RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9060–9069. Cited by: §2.2.
  • X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7479–7489. Cited by: §2.1.
  • L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang (2017) RGBD salient object detection via deep fusion. IEEE Transactions on Image Processing 26 (5), pp. 2274–2285. Cited by: Figure 7, Table 1, §4.2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Figure 2, Table 1, §4.1.
  • J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian (2019) Selectivity or invariance: Boundary-aware salient object detection. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 3799–3808. Cited by: §2.1.
  • N. Wang and X. Gong (2019) Adaptive fusion for RGB-D salient object detection. IEEE Access 7, pp. 55277–55284. Cited by: §2.2, Table 1, §4.2.
  • J. Wei, S. Wang, Z. Wu, C. Su, Q. Huang, and Q. Tian (2020) Label decoupling framework for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13025–13034. Cited by: §2.1.
  • Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3907–3916. Cited by: §2.1, Table 1, §4.2.
  • X. Yang, X. Qian, and Y. Xue (2015) Scalable mobile image retrieval by exploring contextual saliency. IEEE Transactions on Image Processing 24 (6), pp. 1709–1721. Cited by: §1.
  • J. Zhang, D. Fan, Y. Dai, S. Anwar, F. S. Saleh, T. Zhang, and N. Barnes (2020) UC-Net: Uncertainty inspired RGB-D saliency detection via conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8582–8591. Cited by: §2.2.
  • L. Zhang, J. Zhang, Z. Lin, H. Lu, and Y. He (2019) CapSal: Leveraging captioning to boost semantics for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6024–6033. Cited by: §2.1.
  • X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 714–722. Cited by: §2.1.
  • J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, and L. Zhang (2019a) Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3927–3936. Cited by: Figure 1, §2.2, Table 1, §4.2.
  • J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019b) EGNet: Edge guidance network for salient object detection. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 8779–8788. Cited by: §2.1, Table 1, §4.2.
  • R. Zhao, W. Ouyang, and X. Wang (2013) Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3586–3593. Cited by: §1.
  • X. Zhao, L. Zhang, Y. Pang, H. Lu, and L. Zhang (2020) A single stream network for robust and real-time RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–662. Cited by: §2.2.