PDNet: Prior-Model Guided Depth-Enhanced Network for RGB-D Salient Object Detection
Fully convolutional neural networks (FCNs) have shown outstanding performance in many computer vision tasks, including salient object detection. However, two issues still need to be addressed in deep-learning-based saliency detection. One is the lack of a sufficiently large amount of annotated data to train a network. The other is the lack of robustness when extracting salient objects from images containing complex scenes. In this paper, we present a new architecture, PDNet, a robust prior-model guided depth-enhanced network for RGB-D salient object detection. In contrast to existing works, in which RGB-D values of image pixels are fed directly to a network, the proposed architecture is composed of a master network for processing RGB values and a sub-network that makes full use of depth cues and incorporates depth-based features into the master network. To overcome the limited size of the labeled RGB-D datasets available for training, we employ a large conventional RGB dataset to pre-train the master network, which proves to contribute largely to the final accuracy. Extensive evaluations over five benchmark datasets demonstrate that our proposed method performs favorably against state-of-the-art approaches.
When humans look at an image, they always focus on a subset of the whole image; this selectivity is called visual attention. Visual attention is a neurobiological process that filters out irrelevant information and highlights the most noticeable foreground information. A variety of computational models have been developed to simulate this mechanism, with applications in object tracking, image montage and image compression. In general, saliency detection algorithms can be categorized into two groups: top-down [4, 5, 6] and bottom-up [7, 8, 9, 10, 11] approaches. Top-down approaches are task-driven and need supervised learning, while bottom-up approaches usually use low-level cues, such as color features, distance features and heuristic saliency features. One of the most widely used heuristic saliency features is contrast, such as pixel-based or patch-based contrast.
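As a concrete illustration of the contrast heuristic mentioned above, a minimal global pixel-contrast saliency map can be sketched in NumPy. This is a simplified illustration, not any specific published method; the distance-to-mean-colour approximation stands in for a full pairwise contrast computation.

```python
import numpy as np

def pixel_contrast_saliency(img):
    """Naive global-contrast saliency: a pixel is salient if its colour is
    far from the image's mean colour (an approximation of pairwise contrast).
    img: H x W x 3 float array in [0, 1]."""
    mean_color = img.reshape(-1, 3).mean(axis=0)       # global mean colour
    sal = np.linalg.norm(img - mean_color, axis=2)     # per-pixel colour distance
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)  # scale to [0, 1]
```

On an image with a small bright patch on a dark background, the patch pixels receive the highest saliency values, which is the behavior these heuristics are designed to capture.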
Most previous works on saliency detection focus on 2D images. In our view, this leaves limited potential for further research. First, 3D data is more suitable than 2D for real applications; second, as visual scenes become more and more complex, 2D data alone is not enough for extracting salient objects. Recent advances in 3D data acquisition techniques, such as Time-of-Flight sensors and the Microsoft Kinect, have motivated the adoption of structural features, improving the discrimination between different objects with similar appearance. Saliency detection on RGB-D images will expedite a variety of real applications, such as 3D content surveillance, retrieval, and image recognition.
In addition to RGB information, depth has been shown to be a practical cue for saliency estimation [6, 1, 2]. However, it is still difficult to train a network effectively due to the limited amount of annotated RGB-D data. Besides, how to integrate the additional depth information into the RGB framework remains a key issue to be addressed.
To resolve the above-mentioned limitations, in this paper we propose a novel prior-model guided depth-enhanced network (PDNet). The PDNet is composed of a master network and a sub-network. The master network is a convolution-deconvolution pipeline. The convolution stage serves as a feature extractor that transforms the input image into a hierarchical, rich feature representation, while the deconvolution stage serves as a shape restorer to recover the resolution and segment the salient object in fine detail from the background. The sub-network can be treated as an encoder convolution architecture; it processes the depth map as input and enhances the robustness of the master network. To address the problem of insufficient RGB-D data for training, we employ a large RGB dataset to pre-train our master network. This pre-training step, performed before training the network on RGB-D data, proves to contribute dramatically to accuracy improvement. Fig. 1 illustrates the pipeline of our model.
In summary, the main contributions of this work are as follows:
We propose a novel deep network (PDNet) for saliency detection on RGB-D images, where we utilize an RGB-based prior-model to guide the main learning stage.
Unlike the existing works, we process the depth cue in an independent encoder network, which can make full use of depth cues and assist the main-stream network.
Compared with previous works, the proposed method demonstrates dramatic performance improvements on five benchmark datasets.
In this section, we present a brief review of saliency detection methods on both RGB and RGB-D images.
Over the past decades, many salient object detection methods have been developed. The majority of these methods are based on low-level hand-crafted features [8, 9, 10]. A complete survey of these methods is beyond the scope of this paper; we refer readers to a recent survey paper for details.
Recently, with the development of deep learning and the growth of annotated data in RGB-based salient object detection datasets, convolutional neural networks have achieved remarkable performance in salient object detection. A lot of research effort has been devoted to developing deep architectures that learn useful features characterizing salient objects or regions. For instance, Zhu et al. present two-channeled perceiving residual pyramid networks to generate high-resolution and high-quality results for saliency detection. Li et al. fine-tune the fully connected layers of multiple CNNs to predict the saliency degree of each superpixel. These methods achieve good performance; however, images with complex backgrounds remain challenging. Therefore, additional auxiliary features should be exploited to assist saliency detection.
Compared with RGB saliency detection, RGB-D saliency has received less research attention. Zhu et al. propose a framework based on depth mining and use multilayer backpropagation to exploit the depth cue. Cheng et al. compute salient stimuli in both color and depth spaces. Peng et al. provide a simple fusion framework that combines existing RGB-produced saliency with new depth-induced saliency. Ju et al. propose a saliency method applied to depth images, based on anisotropic center-surround difference. Guo et al. propose a salient object detection method for RGB-D images based on an evolution strategy.
However, owing to the limited size of RGB-D datasets, some deep-learning-based methods feed the network with many pre-extracted low-level hand-crafted features. And almost all existing methods [17, 5] integrate the depth cue with RGB information directly, as a fourth input dimension, to train the network. By contrast, our proposed method adopts a pre-trained RGB network as a prior-model and learns the depth cue independently, which remedies the drawbacks of existing methods.
As shown in Fig.1, the proposed PDNet contains two main components: the prior-model guided master network and the depth-enhanced subsidiary network. The master network is based on the convolution-deconvolution architecture. The subsidiary network acts like an encoder, extracting depth cues. The proposed model will be discussed in detail in the following sections.
The master network is based on an encoder-decoder architecture. VGG is used in the encoder part of the proposed model; in addition, we employ copy-crop and multi-feature concatenation techniques to utilize hierarchical features effectively.
Here are the details of the proposed FCN network. Each convolution layer is followed by a Batch Normalization (BN) layer to improve the speed of convergence, and then the Rectified Linear Unit (ReLU) activation function adds non-linearity. Every kernel size is 3×3, as used in other deep networks. VGG-16 and VGG-19 are tested for the encoder model; the experimental results are shown in the next section.
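A single convolution-BN-ReLU encoder block of this kind can be sketched in NumPy. This is an illustration of the per-layer computation only, with made-up shapes and inference-form batch normalization; the actual model is of course built in a deep-learning framework.

```python
import numpy as np

def conv3x3(x, w, b):
    """'Same'-padded 3x3 convolution.
    x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,)."""
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[-1]))
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))            # zero 'same' padding
    for i in range(H):
        for j in range(W):
            # contract the 3x3xCin receptive field against the kernel
            out[i, j] = np.tensordot(xp[i:i+3, j:j+3], w, axes=3) + b
    return out

def conv_bn_relu(x, w, b, mean, var, gamma, beta, eps=1e-5):
    """One encoder block: 3x3 conv -> batch norm (inference form) -> ReLU."""
    y = conv3x3(x, w, b)
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta  # batch normalization
    return np.maximum(y, 0.0)                           # ReLU non-linearity
```

With an identity center tap on one output channel, a constant input passes through almost unchanged (up to the BN epsilon), which makes the block easy to sanity-check.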
The copy-crop technique is used to add low-level features from the early stages, improving the fine details of the saliency map in the up-sampling stage.
The multi-feature concatenation technique is mainly based on a loss-fusion pattern. It is used to reasonably combine both low-level and high-level features for accurate saliency detection and loss fusion. Features from different blocks in the decoder part pass through a single convolution layer with a linear activation function to produce pyramid outputs. These outputs are concatenated and fed to a final convolutional layer, to which the sigmoid activation function is applied. Then, the pixel-wise binary cross entropy between the predicted saliency map S and the ground-truth saliency mask G is computed by:

L(S, G) = −Σ_{(x,y)} [G(x,y) log S(x,y) + (1 − G(x,y)) log(1 − S(x,y))],

where (x, y) is a pixel location in the image.
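The pixel-wise binary cross entropy described above can be written directly in NumPy; this is a sketch, with `eps` clipping added for numerical stability (not part of the loss definition itself).

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    """Mean pixel-wise binary cross entropy between a predicted saliency
    map and the binary ground-truth mask (both in [0, 1])."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))
```

The loss approaches zero when predictions match the mask and grows quickly for confident wrong predictions, which is what drives the decoder to sharpen object boundaries.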
Given an input image I, the salient object detection network produces a saliency map S from a set of weights W. Salient object detection is posed as a regression problem, and the saliency value of each pixel in S can be described as:

S(x, y) = F(R(x, y); W),

where R(x, y) corresponds to the receptive field of location (x, y) in I. Once the network is trained, W is fixed and used to detect salient objects in any input image.
Considering the limited size of RGB-D datasets, we employ RGB-based saliency detection datasets for pre-training. We utilize the MSRA10K dataset as well as the DUTS-TR dataset. MSRA10K includes 10,000 images with high-quality pixel-wise annotations. The DUTS dataset is currently the largest saliency detection benchmark, containing 10,553 training images (DUTS-TR) and 5,019 test images (DUTS-TE). Before feeding the training images into our proposed model, each image, along with its ground truth, is rescaled to the same size of 224×224 and normalized to [0, 1].
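The preprocessing step can be sketched as follows. Nearest-neighbour resizing is used here purely for illustration; the interpolation method actually used is not specified in the paper.

```python
import numpy as np

def preprocess(img, size=224):
    """Resize an H x W x 3 uint8 image to size x size (nearest neighbour)
    and scale pixel values to [0, 1]."""
    H, W = img.shape[:2]
    rows = np.arange(size) * H // size      # source row for each target row
    cols = np.arange(size) * W // size      # source column for each target column
    resized = img[rows][:, cols]            # index-based nearest-neighbour resize
    return resized.astype(np.float32) / 255.0
```

The same transform (minus the /255 scaling) is applied to the binary ground-truth masks so that predictions and labels stay aligned.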
After pre-training the master network, we obtain the prior-model weights W_p, which guide the post-training weights W_t. Thus, we can obtain a prior-model guided saliency map S_p, denoted as:

S_p(x, y) = F(R(x, y); W_p, W_t),

where W_p is the prior-model weight, fixed by pre-training the master network.
In order to obtain the features of an input depth map, we apply a subsidiary network to encode the depth cue and incorporate the depth-based features acquired by the subsidiary network as a convolution layer into the proposed master network. We denote the input depth map of this convolution layer as D. Its corresponding output is:

Z = W_d ∗ D + b,

where b is the bias and W_d is the depth-enhanced weight matrix obtained by the subsidiary network.
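The idea of applying weights predicted from the depth cue can be sketched as follows. Here `subnet` is a hypothetical stand-in for the depth encoder, and the 1×1-convolution-style channel mixing is our simplification of the layer, not the authors' exact wiring.

```python
import numpy as np

def depth_enhanced_layer(depth_feat, subnet, bias):
    """Apply a weight matrix predicted by the subsidiary network to depth
    features. depth_feat: (H, W, C); subnet returns a (C, C) matrix W_d
    (a 1x1-convolution-style simplification of Z = W_d * D + b)."""
    w_d = subnet(depth_feat)            # depth-enhanced weight matrix W_d
    return depth_feat @ w_d + bias      # channel mixing plus bias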
The output features of the subsidiary network are directly used as the weight matrix for the prior-model guided master network. The subsidiary network can therefore be viewed as a depth-enhanced weight prediction network that encodes the depth representation into the master network. Eq. 3 can be rewritten as:

S_p(x, y) = F(R(x, y); W_p, λ · W_d),

where λ is the combination weight factor of the depth-based feature maps obtained via the sub-network, which is based on the number of feature maps, denoted as:

λ = N_d / N_c,

where N_c is the number of RGB-based feature maps obtained via the encoder part of the prior-model guided master network and N_d the number of depth-based feature maps.
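The fusion of RGB and depth feature maps with the combination factor λ can be sketched as below. Deriving λ from the channel-count ratio is our reading of "based on the number of feature maps" and should be treated as an assumption; the explicit `lam` argument covers the λ = 1, λ < 1 and λ > 1 settings studied in the ablation.

```python
import numpy as np

def fuse_features(rgb_feats, depth_feats, lam=None):
    """Concatenate RGB feature maps with lambda-scaled depth feature maps."""
    if lam is None:
        # assumed default: ratio of depth to RGB channel counts
        lam = depth_feats.shape[-1] / rgb_feats.shape[-1]
    return np.concatenate([rgb_feats, lam * depth_feats], axis=-1)
```

The fused tensor then continues through the decoder of the master network in place of the RGB-only features.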
The proposed model is implemented in Python. It is evaluated on a machine equipped with an Intel i7-7700 CPU and an NVIDIA GTX 1060 GPU (with 6 GB of memory). The parameters of the hierarchical feature layers are initialized from a truncated normal distribution. The Adam optimizer is used for training, with the learning rate decayed from 0.001 to 0.0001. The master network of PDNet is pre-trained on the large-scale (20,553-image) RGB datasets, which takes about 26 hours (15 epochs). Then we fix the parameters in the encoder part of the master network and train the full PDNet on the RGB-D datasets, which takes about 10 minutes per epoch. During inference, it runs at 15 fps (VGG-16) on average.
In this section, we evaluate the proposed method on five RGB-D datasets.
NJU2000. The NJU2000 dataset contains 2000 stereo images together with the corresponding depth maps and manually labeled ground truths. The depth maps are generated using an optical flow method. We randomly split this dataset into two parts: 1500 images for training and 500 for testing (NJU2000-TE).
NLPR. The NLPR RGB-D salient object detection dataset contains 1000 images captured by the Microsoft Kinect in various indoor and outdoor scenarios. We randomly split this dataset into two parts: 500 images for training and 500 for testing (NLPR-TE).
LFSD. This dataset contains 100 images with depth information and manually labeled ground truths. The depth information was captured using the Lytro light field camera. All images in this dataset were used for testing.
RGBD135. This dataset has 135 indoor images taken by the Kinect. All images in this dataset were used for testing.
SSD100. This dataset is built from three stereo movies. It contains 80 images with both indoor and outdoor scenes. All images in this dataset were used for testing.
Three of the most widely used evaluation metrics are used to evaluate the performance of the different saliency algorithms: precision-recall (PR) curves, F-measure and mean absolute error (MAE).
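Two of these metrics can be computed directly; the following is a minimal sketch, using a fixed binarization threshold for the F-measure and the β² = 0.3 weighting that is customary in the salient object detection literature.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and the ground truth."""
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, beta2=0.3, thresh=0.5):
    """F-measure at a fixed threshold, with the customary beta^2 = 0.3."""
    pred = sal >= thresh
    mask = gt > 0.5
    tp = np.logical_and(pred, mask).sum()       # true-positive pixels
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

In practice the F-measure is often swept over all thresholds (as in the PR curves) and the maximum reported; the fixed threshold here keeps the sketch short.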
[Table I: ablation results (MAE and F-measure) on NJU2000-TE and NLPR-TE]
[Table II: comparison results on NJU2000-TE, NLPR-TE, LFSD, RGBD135 and SSD100]
To validate the effectiveness of our proposed network, we design a baseline and evaluate five variants of it. The baseline is the master network of the proposed PDNet without prior-model guidance, trained with four-dimensional RGB-D data. The five variants are:
The master network of the proposed PDNet with prior-model guidance, trained with three-dimensional RGB data.
The master network of the proposed PDNet without prior training, connected to the subsidiary depth-enhanced network.
The proposed PDNet with the combination weight factor λ equal to 1.
The proposed PDNet with the combination weight factor λ less than 1 (here we take four samples of λ < 1 and average their results to represent this situation).
The proposed PDNet with the combination weight factor λ larger than 1 (here we take four samples of λ > 1 and average their results to represent this situation).
Table I shows the MAE and F-measure validation results on two RGB-D datasets, and the visual results of the ablation study are shown in Fig. 4. We can clearly see the gains accumulated by each added component. In summary, this shows that each variation of our algorithm contributes to generating the optimal final saliency map. The λ setting that achieves approximately the best performance is adopted in the following experiments.
In this section, we compare our method with three state-of-the-art methods developed for RGB images (BSCA15, LIP15, and HS16) and seven saliency methods designed specifically for RGB-D images (DES14, NLPR14, ACSD15, SE, TPF17, DF17 and CTMF17). We use the code provided by the authors to reproduce their experiments. For all the compared methods, we use the default settings suggested by the authors.
Fig. 2 provides a visual comparison of our approach with the above-mentioned approaches. It can be observed that our proposed method produces fine details while highlighting the attention-grabbing salient regions.
In this paper, we propose a novel PDNet for RGB-D saliency detection. We adopt a prior-model guided master network to process the RGB information of images, and the master network is pre-trained on conventional RGB datasets to overcome the limited size of annotated RGB-D data. Instead of treating the depth map as a fourth-dimensional input, we design an independent sub-network for extracting depth information, which proves to be better than the previous treatment. Extensive experiments demonstrate that the prior-model provides a solid foundation for salient object detection, and that additionally integrating an independent depth-enhanced network contributes largely to the final accuracy. To encourage future work, we release our source code at https://github.com/ChunbiaoZhu/PDNet/.