Bilateral Attention Network for RGB-D Salient Object Detection
Most existing RGB-D salient object detection (SOD) methods focus on the foreground region when utilizing the depth images. However, the background also provides important information in traditional SOD methods for promising performance. To better explore salient information in both foreground and background regions, this paper proposes a Bilateral Attention Network (BiANet) for the RGB-D SOD task. Specifically, we introduce a Bilateral Attention Module (BAM) with a complementary attention mechanism: foreground-first (FF) attention and background-first (BF) attention. The FF attention focuses on the foreground region in a gradual refinement style, while the BF attention recovers potentially useful salient information in the background region. Benefiting from the proposed BAM, our BiANet can capture more meaningful foreground and background cues, and shift more attention to refining the uncertain details between foreground and background regions. Additionally, we extend our BAM by leveraging multi-scale techniques for better SOD performance. Extensive experiments on six benchmark datasets demonstrate that our BiANet outperforms other state-of-the-art RGB-D SOD methods in terms of objective metrics and subjective visual comparison. Our BiANet can run at up to 80 fps on 224×224 RGB-D images with an NVIDIA GeForce RTX 2080Ti GPU. Comprehensive ablation studies also validate our contributions.
Salient object detection (SOD) aims to segment the most attractive objects in an image. As a fundamental computer vision task, SOD has been widely applied to many vision applications, such as visual tracking [28, 32], image segmentation [20, 23, 43], and video analysis [53, 47]. Most existing SOD methods [21, 31, 51] mainly deal with RGB images. However, they usually produce inaccurate SOD results in scenarios with similar textures, complex backgrounds, or homogeneous objects [46, 52]. With the popularity of depth sensors in smartphones, depth information, e.g., 3D layout and spatial cues, is crucial for reducing the ambiguity in RGB images, and serves as an important supplement to improve SOD performance.
Recently, RGB-D SOD has received increasing research attention [5, 37]. Early RGB-D SOD works [35, 40, 42] introduced the depth contrast as an important prior for the SOD task. The recent work of CPFP utilized the depth contrast prior to design an effective loss. These methods essentially explore depth information to shift more priority to the foreground region [4, 2]. However, as demonstrated in [29, 48, 49], understanding what the background is can also promote SOD performance. Several traditional methods [26, 50] predict salient objects jointly from the complementary foreground and background information, which is largely ignored by current RGB-D SOD networks.
In this paper, we propose a Bilateral Attention Network (BiANet) to collaboratively learn complementary foreground and background features from both the RGB and depth streams for better RGB-D SOD performance. As shown in Figure 2, our BiANet employs a two-stream architecture, and the side outputs from the RGB and depth streams are concatenated at multiple stages. Firstly, we use the high-level semantic features to locate the foreground and background regions. However, the initial saliency map is coarse and low-resolution. To enhance the coarse saliency map, we design a Bilateral Attention Module (BAM), which is composed of complementary foreground-first (FF) and background-first (BF) attention mechanisms. The FF attention shifts attention to the foreground region to gradually refine its saliency prediction, while the BF attention focuses on the background region to recover potential salient regions around the boundaries. By bilaterally exploring the foreground and background cues, the model produces more accurate predictions, as shown in Figure 1. Secondly, we propose a multi-scale extension of BAM (MBAM) to effectively learn multi-scale contextual information, capturing both local and global saliency cues to further improve the SOD performance. Extensive experiments on six benchmark datasets demonstrate that our BiANet achieves better performance than previous state-of-the-art RGB-D SOD methods, and is very fast owing to its simple architecture.
In summary, our main contributions are three-fold:
We propose a simple yet effective Bilateral Attention Module (BAM) to collaboratively explore foreground and background cues, leveraging the rich foreground and background information from the depth images.
Our BiANet achieves better performance on six popular RGB-D SOD datasets under nine standard metrics, and presents better visual effects (e.g., more details and sharper edges) than the state-of-the-art methods.
Our BiANet runs at 34–80 fps on an NVIDIA GeForce RTX 2080Ti GPU under different settings, and is a feasible solution for real-world applications.
The remainder of this paper is organized as follows. In §II, we briefly survey the related work. In §III, we present the proposed Bilateral Attention Network (BiANet) for RGB-D Salient Object Detection. Extensive experiments are conducted in §IV to evaluate its performance when compared with state-of-the-art RGB-D SOD methods on six benchmark datasets. The conclusion is given in §V.
RGB-D salient object detection (SOD) aims to segment the most attractive object(s) in a pair of cross-modal RGB and depth images. Early methods mainly focus on extracting low-level saliency cues from RGB and depth images, exploring object distance, difference of Gaussians, graph knowledge, multi-level discriminative saliency fusion, multi-contextual contrast [8, 35], and background enclosure, etc. However, these methods often produce inaccurate saliency predictions, due to the lack of high-level feature representations.
Recently, deep neural networks (DNNs) have been employed to investigate high-level representations of cross-modal fusion of RGB and depth images, with much better SOD performance. Most of these DNNs [3, 17, 44] first extract the RGB and depth features separately and then fuse them in the shallow, middle, or deep layers of the network. The methods of [4, 5, 27, 37] further improved the SOD performance by fusing cross-modal features in multi-level stages instead of as a one-off integration. Zhao et al.  also took the enhanced depth image as attention maps to boost RGB features in multiple stages with better SOD performance.
There are great differences between the distributions of foreground and background, so it is necessary to explore their respective cues. Among traditional methods, some works focus on jointly reasoning about salient areas in the foreground and background. Yang et al. proposed a two-stage method for SOD. It first takes the four boundaries of the input as background seeds to infer foreground queries via graph-based manifold ranking. Then, it ranks the graph depending on the foreground seeds in the same manner for the final detection. This method is enlightening, but it has obvious limitations: 1) It is inappropriate to use the four boundaries directly as background, because the foreground may be connected to the boundaries. 2) Aggregation at the super-pixel level also results in rough outputs. For limitation 1), Liang et al. introduced the depth map to distinguish foreground and background regions instead of simply assuming the boundaries to be background. The depth map shows clear disparity in most scenes; thus, it can support more precise localization. For limitation 2), Li et al. further used regularized random walks ranking to formulate pixel-wise saliency maps, which mitigates the blocky artifacts caused by super-pixel aggregation. Nevertheless, depending only on these low-level priors, traditional methods cannot accurately locate the initial foreground and background regions.
Recently, Chen et al.  proposed to gradually explore saliency regions from the background using reverse attention, but they ignored the contribution of foreground cues to the final detection. As far as we know, how to jointly refine the salient objects from the foreground and background regions is still an open problem in deep RGB-D SOD methods.
In this section, we first introduce the overall architecture of our BiANet, and then present the bilateral attention module (BAM) as well as its multi-scale extension (MBAM).
As shown in Figure 2, our Bilateral Attention Network (BiANet) contains three main steps: feature extraction, prediction up-sampling, and bilateral attention residual compensation. We extract multi-level features from the RGB and depth streams. With increasing network depth, the high-level features become more potent for capturing global context, while losing object details. When we up-sample the high-level predictions, the saliency maps become blurred, e.g., the edges become uncertain. Thus, we use the proposed Bilateral Attention Module (BAM) to distinguish the foreground and background regions.
We encode RGB and depth information with two streams. Specifically, both the RGB and depth streams employ the five convolutional blocks of VGG-16 as the standard backbone, and each attaches an additional convolution group with three convolutional layers to predict its saliency map. Unlike previous works [17, 57, 3], we explore the cross-modal fusion of RGB and depth features at multiple stages, rather than fusing them once at a low or high stage. The i-th side output from the RGB stream and the corresponding side output from the depth stream are concatenated into a feature tensor, with max-pooling used to align feature resolutions where needed. The coarse saliency map is derived from the highest-level feature tensor, and the remaining tensors are prepared for the BAMs in our BiANet to further refine the up-sampled saliency maps, by distinguishing the uncertain regions as foreground or background in a top-down manner.
The initial saliency map predicted from the high-level features is coarse and low-resolution, but useful for predicting the initial positions of the foreground and background, since it contains rich semantic information. To refine the basic saliency map, a lower-level feature with more details is used to predict the residual component between the higher-level prediction and the ground truth, with the help of BAM. We add the predicted residual component to the up-sampled higher-level prediction to obtain a refined prediction, and so on; that is,
where up(·) denotes up-sampling. Finally, our BiANet obtains the saliency map by applying a sigmoid function to the finest-level prediction.
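A minimal sketch of this top-down residual refinement (Equation 1), assuming nearest-neighbour up-sampling as a stand-in for the network's interpolation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(s):
    # nearest-neighbour stand-in for the network's up-sampling
    return s.repeat(2, axis=0).repeat(2, axis=1)

def refine_top_down(coarse_logits, residuals):
    """coarse_logits: lowest-resolution prediction; residuals: residual maps
    from the BAMs, ordered high level -> low level, each at 2x the previous
    resolution. Implements S_i = up(S_{i+1}) + R_i, then a final sigmoid."""
    s = coarse_logits
    for r in residuals:
        s = upsample2x(s) + r
    return sigmoid(s)
```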
To obtain better residuals and distinguish the up-sampled foreground and background regions, we design a bilateral attention module (BAM) to enable our BiANet to discriminate between the foreground and background. In our BAM, the higher-level prediction serves as a foreground-first (FF) attention map, and the reversed prediction serves as a background-first (BF) attention map, combining the bilateral attention on foreground and background. In Figure 3, one can see that the residual generated by BAM possesses high contrast at the object boundaries. More details are described in Sections III-B and III-C.
Deep supervision is widely used in the SOD task [14, 19]. It clarifies the optimization goals for each step of the network and accelerates the convergence of training. For quick convergence, we also apply deep supervision on the depth stream output, the RGB stream output, and each top-down side output. The total loss function of our BiANet is
in which the weight coefficients are set to fixed values in our experiments, and the binary cross-entropy loss is formulated as
In the above equation, and , and denotes the total pixel number.
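A sketch of the deeply supervised objective with hypothetical helper names, assuming the elided weight coefficients equal 1:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p))

def total_loss(side_preds, depth_pred, rgb_pred, gt, alpha=1.0, beta=1.0):
    """Sum of per-side-output BCE losses plus weighted losses on the
    depth-stream and RGB-stream outputs (weights assumed to be 1)."""
    loss = sum(bce(s, gt) for s in side_preds)
    return loss + alpha * bce(depth_pred, gt) + beta * bce(rgb_pred, gt)
```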
Given the initial foreground and background, how to refine the prediction using higher-resolution cross-modal features is the focus of this paper. Considering that the distributions of foreground and background are quite different, we design a bilateral attention module using a pair of reversed attention components to learn features from the foreground and background respectively, and then jointly refine the prediction. As can be seen in Figure 2, to focus more on the foreground, we use the up-sampled predictions from the higher level as foreground-first (FF) attention maps after they are activated by a sigmoid, and the background-first (BF) attention maps are generated by subtracting the FF maps from an all-ones matrix.
Then, as shown in Figure 2, we apply FF and BF to weight the side-output features in two branches, respectively, and further predict the residual component jointly.
where the channel-reduced feature is obtained by applying 32 convolutions to reduce the computational cost, and the feature extraction operation consists of 32 convolution kernels followed by a ReLU layer. The two branches do not share parameters, and their outputs are concatenated. The prediction layer outputs a single-channel residual map via a convolution kernel after the same feature extraction operation. Once the residual is obtained, the refined prediction is obtained via Equation 1.
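A minimal sketch of the attention weighting inside BAM (the per-branch convolutions and the residual prediction layer are omitted; names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bilateral_attention(features, higher_pred_logits):
    """features: (C, H, W) side-output features; higher_pred_logits: (H, W)
    up-sampled higher-level prediction. FF = sigmoid(prediction),
    BF = 1 - FF; each branch weights the features, then both branches
    are concatenated for joint residual prediction."""
    ff = sigmoid(higher_pred_logits)      # foreground-first attention map
    bf = 1.0 - ff                         # background-first attention map
    ff_feat = features * ff[None, :, :]   # FF branch input
    bf_feat = features * bf[None, :, :]   # BF branch input
    return np.concatenate([ff_feat, bf_feat], axis=0)
```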
To better understand the working mechanism of BAM, in Figure 3 we visualize the channel-wise averaged features from BAMs at different levels. In BAM, the original features are first fed into two branches by multiplying them with the FF and BF attention maps, respectively. The result of this direct multiplication is illustrated in the left half of the yellow (FF features) and blue (BF features) boxes. We can see that the FF branch shifts attention to the foreground area predicted from its higher level to explore foreground saliency cues. After a convolution layer, more priority is given to the uncertain area. Complementarily, the BF branch focuses on the background area to explore background cues, looking for possible salient objects within it. In our BiANet, the top-down prediction up-sampling is a process in which the resolution of salient objects is gradually increased, which results in uncertain, coarse edges. We can see that both the FF and BF features focus on the uncertain area (such as object boundaries). The low-level, high-resolution FF branch eliminates the overflow of the uncertain area, while the BF branch eliminates the parts of the uncertain area that do not belong to the background. That is an important reason why BiANet performs better on details and is prone to predicting sharp edges. After the joint inference, we can see that the bilaterally enhanced features contain more discriminative spatial information about the foreground and background. The generated residual components exhibit sharp contrast at the edges, suppressing the background area and strengthening the foreground regions.
Salient objects in a scene vary in location, size, and shape. Thus, exploring multi-scale context in the high-level layers benefits scene understanding [45, 56]. To this end, we extend our BAM to a multi-scale version, in which groups of dilated convolutions are used to extract pyramid representations from the undetermined foreground and background areas. Specifically, the module can be described as
where the outputs of all branches are merged by concatenation. Each branch consists of convolution kernels with 32 channels and a ReLU layer, and the pyramid branches use groups of dilated convolutions with rates of 3, 5, and 7.
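A single-channel sketch of the pyramid branches (the real module operates on 32-channel features and includes a plain convolution branch; names are hypothetical):

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """3x3 dilated convolution on a 2-D map with 'same' zero padding."""
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * rate:i * rate + H, j * rate:j * rate + W]
    return out

def pyramid_features(x, kernels, rates=(3, 5, 7)):
    """Stack responses at the three dilation rates used by MBAM."""
    return np.stack([dilated_conv3x3(x, k, r) for k, r in zip(kernels, rates)])
```

Larger dilation rates enlarge the receptive field without adding parameters, which is why they help capture global context at the high levels.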
We recommend applying the MBAM to high-level cross-modal features, which need receptive fields of different sizes to explore multi-scale context. MBAM effectively improves the detection performance but introduces a certain computational cost. Thus, the number of MBAMs should be a trade-off in practical applications. In Section IV-C3, we discuss in detail how the number of MBAMs affects detection performance and computational cost.
To intuitively observe the gain brought by MBAM, we visualize the averaged foreground-first feature maps from MBAMs and BAMs in Figure 4. In the second row, the feature maps are obtained from the model with three MBAMs at its top three levels, while in the last row, all the feature maps are collected from BAMs. We can see that the target object (horse) accounts for a large proportion of the scene. Without the ability to perceive multi-scale information effectively, the BAM fails to capture the accurate global salient regions at high levels, finally leading to incomplete predictions. When introducing the multi-scale extension, the higher-level features achieve stronger spatial representation, which helps locate the salient object more completely.
Following D3Net, we use the training set containing 1485 and 700 image pairs from the NJU2K and NLPR datasets, respectively. We employ the Adam optimizer with an initial learning rate of 0.0001. The batch size is set to 8, and we train our BiANet for 25 epochs in total. The input images are resized to 224×224 for both training and testing, and the output saliency maps are resized back to the original size for evaluation. Accelerated by an NVIDIA GeForce RTX 2080Ti, our BiANet takes about 2 hours to train, and runs at 34–80 fps (with different numbers of MBAMs) on 224×224 inputs.
Quantitative comparisons of our BiANet with nine deep-learning-based methods and five traditional methods on six popular datasets in terms of S-measure, maximum F-measure (max F), mean F-measure (mean F), adaptive F-measure (adp F), maximum E-measure (max E), mean E-measure (mean E), adaptive E-measure (adp E), and mean absolute error (MAE). An up-arrow means that larger values indicate a better model, while a down-arrow means the opposite. For traditional methods, the statistics are computed over the whole datasets rather than on the test sets.
We conduct experiments on six widely used RGB-D SOD datasets. NJU2K and NLPR are two popular large-scale RGB-D SOD datasets containing 1985 and 1000 images, respectively. DES contains 135 indoor images with fine structures, collected with a Microsoft Kinect. STERE contains 1000 internet images, whose depth maps are generated from stereo images using a SIFT flow algorithm. SSD is a small-scale but high-resolution dataset with 400 images. SIP is a high-quality RGB-D SOD dataset with 929 person images.
We employ nine metrics to comprehensively evaluate these methods. The Precision-Recall (PR) curve shows the precision and recall of the predicted saliency map at different binarization thresholds. The F-measure is computed as the weighted harmonic mean of the thresholded precision and recall; we report the maximum F-measure (max F), mean F-measure (mean F), and adaptive F-measure (adp F). The Mean Absolute Error (MAE) directly estimates the average pixel-wise absolute difference between the prediction and the binary ground-truth map. The S-measure is an advanced metric that takes region-aware and object-aware structural similarity into consideration. The E-measure is the recently proposed enhanced alignment measure for binary map evaluation, which combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. Similar to the F-measure, we report the maximum E-measure (max E), mean E-measure (mean E), and adaptive E-measure (adp E).
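For reference, the two simplest metrics can be sketched as follows (β² = 0.3 is the standard F-measure setting; helper names are hypothetical):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and the binary ground truth."""
    return np.mean(np.abs(pred - gt))

def f_measure(pred, gt, thresh, beta2=0.3):
    """F-measure at one binarization threshold."""
    binary = pred >= thresh
    gt = gt.astype(bool)
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

The max, mean, and adaptive variants differ only in how the threshold is chosen: the best over all thresholds, the average over all thresholds, or twice the mean saliency value, respectively.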
We compared with 14 state-of-the-art RGB-D SOD methods, including 5 traditional methods: ACSD , LBE , DCMC , MDSF , and SE , and 9 DNN-based methods: DF , AFNet , CTMF , MMCI , PCF , TANet , CPFP , DMRA , and D3Net . The codes and saliency maps of these methods are provided by the authors.
The complete quantitative evaluation results are listed in Table I. The comparison methods are presented from right to left according to their comprehensive performance on these metrics, where a lower MAE indicates a better model and the other metrics are the opposite. We also plot the PR curves of these methods in Figure 5. One can see that our BiANet achieves remarkable advantages over the comparison methods. DMRA and D3Net are well-matched on these datasets. On the large-scale NJU2K and NLPR datasets, our BiANet outperforms the second best with a 3% improvement. On the DES dataset, compared to methods that depend heavily on depth information, our BiANet also achieves a 3.8% improvement, indicating that it makes more efficient use of depth information. Although the SSD dataset is high-resolution, the quality of its depth maps is poor; our BiANet still exceeds D3Net, which is specifically designed for robustness to low-quality depth maps. Our BiANet also performs the best on SIP, a challenging dataset with complex scenes and multiple objects.
To further demonstrate the effectiveness of our BiANet, we visualize the saliency maps of our BiANet and the other top five methods in Figure 6. One can see that the target object in the 1st column is tiny, and its white shoes and hat are hard to distinguish from the background; our BiANet effectively utilizes the depth information, while the others are disturbed by RGB background clutter. The inputs in the 2nd column are challenging because the depth map is mislabeled and the RGB image was taken in a dark environment with low contrast. Our BiANet successfully detects the target sculpture and eliminates the interference of the flowers and the base of the sculpture, while D3Net mistakenly detects a closer rosette, and DMRA loses the part of the object that is similar to the background. The 3rd column shows the ability of our BiANet to detect complex structures of salient objects: among these methods, only our BiANet completely discovers the chairs, including their fine legs. The 4th column is a multi-object scene. Because there is no depth difference between the three salient windows below and the wall, they are not reflected in the depth map, while the three windows above are clearly visible; in this case the depth map can mislead subsequent segmentation, but our BiANet detects the multiple objects from the RGB image with less noise. The 5th column is also a multi-object scene, where the bottom half of the depth map is confused by interference from the ground, making the legs of the persons in the image very difficult to detect; nevertheless, our BiANet successfully detects all the legs. The last column shows a large-scale object whose color and depth are hardly distinguishable from the background. The large scale, low color contrast, and lack of discriminative depth information make this scene very challenging. Fortunately, our BiANet is robust in this scene.
In this section, we mainly investigate: 1) the benefits of the bilateral attention mechanism to our BiANet; 2) the effectiveness of BAM at different levels for RGB-D SOD; 3) the further improvements brought by MBAM at different levels; 4) the benefits of combining BAM and MBAM for RGB-D SOD; and 5) the impact of different backbones on our BiANet for RGB-D SOD.
We conduct ablation studies on the large-scale NJU2K and STERE datasets to investigate the contributions of the different mechanisms in the proposed method. The baseline model used here contains a VGG-16 backbone and a residual refinement structure; it takes RGB images as input without depth information. The performance of this basic network without any additional mechanisms is shown in Table II, No. 1. Based on this network, we gradually add different mechanisms and test various combinations. These candidates are depth information (Dep), foreground-first attention (FF), background-first attention (BF), and the multi-scale extension (ME). In Table II, No. 3, applying FF improves the performance to some extent. This benefit comes from the foreground cues being learned effectively by shifting attention to the foreground objects, which is also reflected in Figure 7: the foreground objects are detected more accurately. However, without a good understanding of background cues, the model tends to mistake some background objects, such as the red house in the third row, or fails to find complete foreground objects due to the lack of exploration of background regions. We obtain similar accuracy when using BF only, as shown in No. 4. It excels at distinguishing salient from non-salient areas in the background, and can help find more complete regions of the salient object in the uncertain background; however, with too much attention focused on the background and without a good understanding of the foreground cues, background noise is sometimes introduced. When we combine FF with BF to form our BAM and apply it to all side outputs, the performance is boosted: BAM increases S-measure by 0.9% and max F-measure by 1.2% compared with No. 2. When we replace the BAMs at the top three levels with MBAMs, the performance is further improved. In Figure 7, compared to No. 2 without BAM, the salient objects detected by No. 6 possess higher confidence, sharper edges, and less background noise.
To verify that our BAM is effective at each feature level, we apply BAM to each side output of the No. 2 model's feature extractor, respectively. That is, in each experiment, BAM is applied to one side output, while the others undergo plain convolutions. From Table III, we can see that BAMs at every level facilitate a universal improvement in detection performance. In addition, we find that BAMs applied at the lower levels contribute more to the results.
In Table II, compared with No. 5, No. 6 carries out the multi-scale extension on its three highest levels. This extension effectively improves the performance of the model. To better show the gain of MBAM at each feature level, similar to Table III, we apply MBAM to each side output of the No. 2 model, respectively. The experimental results are recorded in Table IV, where MBAMs at different levels bring different degrees of improvement. Comparing Table III and Table IV, we observe an interesting phenomenon: BAM applied at the lower levels brings more improvement, while MBAM is more effective at the higher levels.
The observation above suggests that when using BAM and MBAM together, we should give priority to the multi-scale expansion of the higher-level BAMs. Therefore, we expand the BAMs from top to bottom until all BAMs are converted into MBAMs, and record the final detection performance and computational cost during this gradual expansion in Table V. Starting from the highest level and gradually increasing the number of MBAMs to three, the performance improves steadily, but the computational cost also increases. At the lower levels, adding MBAM has no obvious effect, which is in line with our expectations. Besides, due to the high resolution, extending the lower-level BAMs increases the computational cost and reduces robustness. The number of MBAMs should balance the accuracy and speed requirements of the application scenario. In scenarios with higher speed requirements, we recommend not using MBAM: our most lightweight model achieves 80 fps while maintaining significant performance advantages, and its parameter size and FLOPs are superior to the SOTA methods D3Net and DMRA. In scenarios where high accuracy is required, we suggest applying no more than three MBAMs on the higher-level features.
We implement BiANet with several other widely used backbones to demonstrate the effectiveness of the proposed bilateral attention mechanism on different feature extractors. Specifically, in addition to VGG-16, we provide the results of BiANet with VGG-11, ResNet-50, and Res2Net-50. Compared with VGG-16, VGG-11 is a lighter backbone; as shown in Table VI, although its accuracy is slightly lower than VGG-16, it still reaches SOTA performance at a faster speed. Stronger backbones bring more remarkable improvements. For example, when we employ ResNet-50 as the backbone, like D3Net, our BiANet brings a 1.5% improvement on NJU2K in terms of MAE compared with D3Net. When armed with Res2Net-50, BiANet achieves a 3.8% improvement on NJU2K in terms of max F-measure compared with the SOTA methods.
In Figure 8, we illustrate some failure cases where our BiANet works in extreme environments. BiANet explores saliency cues bilaterally in the foreground and background regions using the relationships provided by the depth information. However, when the foreground regions indicated by the depth information do not belong to the salient object, the prediction is likely to be misled. The first two columns in Figure 8 are typical examples, where our BiANet mistakenly takes the object closest to the observer as the target and gives a wrong prediction. The other situation that may cause failure is when BiANet encounters coarse depth maps in complex scenarios (see the last two columns). In the third column, the depth map provides inaccurate spatial information, which affects the detection of details. In the last column, the inaccurate depth map and the confusing RGB information make BiANet fail to locate the target object.
In this paper, we propose a fast yet effective Bilateral Attention Network (BiANet) for the RGB-D salient object detection (SOD) task. To better utilize the foreground and background information, we propose a bilateral attention module (BAM) that combines the complementary foreground-first and background-first attention mechanisms. To fully exploit multi-scale techniques, we extend our BAM to its multi-scale version (MBAM), capturing better global information. Extensive experiments on six benchmark datasets demonstrate that our BiANet, benefiting from the BAM and MBAM modules, outperforms previous state-of-the-art RGB-D SOD methods in terms of both quantitative and qualitative performance. The proposed BiANet runs at real-time speed on a single GPU, making it a potential solution for various real-world applications.