Semantic segmentation, which aims to assign a semantic label to each pixel, is a long-standing task. Besides exploiting various contextual information from visual cues [fcn, danet, he2019dynamic, fu2019adaptive, spgnet, zhang2019acfnet], depth data have recently been used as supplementary information to RGB data to improve segmentation accuracy [park2017rdfnet, depthaware, pap, 3dneighbourconv, he2017std2p, lstmcf, cheng2017locality, hung2019incorporating]. Depth data naturally complement RGB signals by adding 3D geometry to 2D visual information; this geometry is robust to illumination changes and helps to better distinguish objects.
Although significant advances have been achieved in RGB semantic segmentation, directly feeding the complementary depth data into existing RGB semantic segmentation frameworks [fcn] or simply ensembling the results of the two modalities [cheng2017locality] might lead to inferior performance. The key challenges lie in two aspects. (1) The substantial variations between the RGB and depth modalities. RGB and depth data show different characteristics. How to effectively identify their differences and unify the two types of information into an efficient representation for semantic segmentation is still an open problem. (2) The uncertainty of depth measurements. Depth data provided with existing benchmarks are mainly captured by Time-of-Flight or structured-light cameras, such as Kinect, Asus Xtion and RealSense. The depth measurements are generally noisy due to different object materials and the limited distance measurement range. The noise is more apparent in out-door scenes and results in undesirable segmentation, as shown in Fig. 1.
Most existing RGB-D based methods focus on the first challenge. Standard practice is to use the depth data (either the raw depth map or its encoded representation, the HHA map, which includes horizontal disparity, height above ground and norm angle; for more detail about HHA, please refer to [hha]) as another input and adopt Fully Convolutional Network (FCN)-like architectures with feature fusion schemas, e.g., convolution and modality-based affinity, to fuse the features of the two modalities [park2017rdfnet, cheng2017locality, hu2019acnet, xing2019coupling]. The fused feature is then used to recalibrate the subsequent RGB feature responses or predicted results. Although these methods provide plausible solutions to unify the two types of information, they assume that the input depth data are accurate and well aligned with the RGB signals. This assumption might not hold, which makes these methods sensitive to in-the-wild samples. Moreover, how to ensure that the network fully utilizes information from both modalities remains an open problem. Recently, some works [pap, padnet] attempt to tackle the second challenge by reducing the network's sensitivity to the quality of depth measurements. Instead of using depth data as an extra input, they distill depth features via multi-task learning and regard depth data as extra supervision for training. Specifically, [padnet] introduces a two-stage framework, which first predicts several intermediate tasks including depth estimation and then uses the outputs of these intermediate tasks as the multi-modal input to the final tasks. [pap] proposes pattern-affinitive propagation that jointly predicts depth, surface normals and semantic segmentation to capture correlative information between modalities. We argue that there is an inherent inefficacy in such designs, i.e., the interaction and correlation of RGB and depth information are only implicitly modeled. The complementarity of the two types of data for semantic segmentation is therefore not well studied in these approaches.
Motivated by the above observations, we propose to tackle both challenges in a simple yet effective framework by introducing a novel cross-modality guided encoder to FCN-like RGB-D semantic segmentation backbones. The key idea of the proposed framework is to leverage both the channel-wise and spatial-wise correlation of the two modalities to first squeeze the exceptional feature responses of depth, which effectively suppresses feature responses from low-quality depth measurements, and then use the suppressed depth representations to refine the RGB features. In practice, we apply these steps bi-directionally, because in-door RGB sources also contain noisy features. In contrast to depth data, noisy RGB features are usually caused by the similar appearance of different neighboring objects. We denote the two processes as depth-feature recalibration and RGB-feature recalibration, respectively. We therefore introduce a new gate unit, namely the Separation-and-Aggregation Gate (SA-Gate), to improve the quality of the multi-modality representation by encouraging the network to first recalibrate and spotlight the modality-specific feature of each modality, and then selectively aggregate the informative features from both modalities for the final segmentation. To effectively take advantage of the differences between the features of the two modalities, we further introduce Bi-direction Multi-step Propagation (BMP), which encourages the two streams to better preserve their specificity during the information interaction process in the encoder stage.
Our contributions are three-fold:
We propose a novel bi-directional cross-modality guided encoder for RGB-D semantic segmentation. With the proposed SA-Gate and BMP modules, we can effectively diminish the influence of noisy depth measurements while still incorporating sufficient complementary information to form discriminative representations for segmentation.
Comprehensive evaluation on the NYUD V2 dataset shows significant improvements when our approach is integrated into state-of-the-art RGB semantic segmentation networks, which demonstrates the generalization ability of our encoder as a plug-and-play module.
The proposed method achieves state-of-the-art performance on both in-door and challenging out-door semantic segmentation datasets.
2 Related Work
2.1 RGB-D Semantic Segmentation
With the development of depth sensors, there has recently been a surge of interest in leveraging depth data as a geometric augmentation for the RGB semantic segmentation task, dubbed RGB-D semantic segmentation [park2017rdfnet, depthaware, kong2018recurrent, cfn, pap, chen20203d]. According to the role that depth information plays in the architecture, current RGB-D based methods can be roughly divided into two categories.
Most works treat depth data as an additional input source to recalibrate the RGB feature responses, either implicitly or explicitly. Long et al. [fcn] show that simply averaging the final score maps of the RGB and D modalities helps enforce inter-object discrimination in the in-door setting. Li et al. [lstmcf] use LSTM layers to selectively fuse the features from the two input modalities. With a similar target, [cheng2017locality] proposes locality-sensitive deconvolution networks along with a gated fusion module. Several recent works [DBLP:conf/eccv/WangWTSW16, deng2019rfbnet, hu2019acnet] extend the RGB feature recalibration process from the final outputs of a dual-path network to different stages of the backbone, encouraging better recalibration with multi-level cross-modality feature fusion. To guide the recalibration with explicit cross-modality interaction modeling, some works [kong2018recurrent, depthaware, 3dgnn, xing20192] tailor general 2D operations to 2.5D behaviors with depth guidance. For example, [depthaware] proposes depth-aware convolution and pooling operations to help recalibrate RGB feature responses in depth-consistent regions. [kong2018recurrent] proposes a depth-aware gate module that adaptively selects the pooling field size in a CNN according to object scale. 3DGNN [3dgnn] introduces a 3D graph neural network to model accurate context with the geometry cues provided by depth. Alternatively, some approaches regard the depth data as an extra supervision signal to recalibrate the RGB counterpart in a multi-task learning manner. For example, [pap] proposes a pattern affinity propagation network to regularize and boost complementary tasks. [padnet] introduces a multi-modal distillation model to pass valid messages from depth to RGB features.
Different from previous works, which assume an ideal quality of the depth source and mainly focus on the in-door setting, we try to extend the task to in-the-wild environments, e.g., the CityScapes dataset. The out-door setting is more challenging due to the inevitable noisy signals contained in the depth data. In this work, we recalibrate RGB feature responses from a filtered depth representation and vice versa, which effectively strengthens the representations of both modalities.
2.2 Attention Mechanism
Attention mechanisms have been widely used in various computer vision tasks, serving as tools to spotlight the most representative and informative regions of input signals [danet, woo2018cbam, wang2017residual, hu2018squeeze, sknet, non-local]. For example, to improve the performance of image/video classification, SENet [hu2018squeeze] introduces a self-recalibrating gating mechanism that models the importance of different channels of the feature maps. In a similar spirit, SKNet [sknet] designs a channel-wise attention module that selects kernel sizes to adaptively adjust its receptive field size based on multiple scales of input information. [non-local] introduces a non-local operation which explores the similarity of each pair of points in space. For the segmentation task, a well-designed attention module can encourage the network to learn helpful context information effectively. For instance, DFN [dfn] introduces a channel attention block to select the more discriminative features from multi-level feature maps to obtain more accurate semantic information. DANet [danet] proposes two types of attention modules to model the semantic inter-dependencies in the spatial and channel dimensions, respectively.
However, the main challenge of the RGB-D semantic segmentation task is how to make full use of cross-modality data under the substantial variations and noisy signals between modalities. The proposed SA-Gate is the first to focus on the noisy features of cross-modality data by tailoring the attention mechanism. The SA-Gate module is specialized to first suppress the exceptional noisy features of depth data and recalibrate the counterpart RGB feature responses in a unified manner, and then to fuse the cross-modality information with a softmax gating that is guided by the recalibrated features, achieving effective and efficient cross-modality feature aggregation.
RGB-D semantic segmentation needs to aggregate features from both the RGB and depth modalities. However, both modalities inevitably contain noisy information. Specifically, depth measurements are inaccurate due to the characteristics of depth sensors, and RGB features might generate confusing results due to the high appearance similarity between objects. An effective cross-modality aggregation scheme should be able to identify the strengths of each feature as well as unify the most informative cross-modality features into an efficient representation. To this end, we put forward a novel cross-modality guided encoder. The overall framework of the proposed approach is depicted in Fig. 2 (a); it consists of a cross-modality guided encoder and a segmentation decoder. Given RGB-D data as inputs (note that we use the HHA map to encode the depth measurements), our encoder recalibrates and fuses the complementary information from the two modalities via the SA-Gate unit, and then propagates the fused multi-modal features along with modality-specific features via the Bi-direction Multi-step Propagation (BMP) module. The information is then decoded by a segmentation decoder network to generate the segmentation map. We detail each component in the remaining parts of this section.
3.1 Bi-direction Guided Encoder
Separation-and-Aggregation (SA) Gate. To ensure informative feature propagation between modalities, the SA-Gate is designed with two operations. One is feature recalibration on each single modality, and the other is cross-modality feature aggregation. These operations correspond to the Feature Separation (FS) and Feature Aggregation (FA) parts, as illustrated in Fig. 2 (b).
Feature Separation (FS). We take the depth stream as an example. Due to the physical characteristics of depth sensors, noisy signals in the depth modality frequently show up in regions close to object boundaries or on partial surfaces outside the range of the depth sensor, as shown in the second column of Fig. 3. Hence, the network is expected to first filter the noisy signals surrounding these local regions, to avoid propagating misleading information while recalibrating the complementary RGB modality and aggregating cross-modality features. In practice, we exploit highly confident activations in the RGB stream to filter out exceptional depth activations at the same level. To do so, the global spatial information of both modalities is first embedded and squeezed into a cross-modality attention vector. We achieve this by global average pooling along the channel-wise dimensions of the two modalities, followed by concatenation and an MLP operation to obtain the attention vector. Suppose we have two input feature maps denoted as and ; the above operations can be formulated as
where denotes the concatenation of feature maps from two modalities, refers to global average pooling, is the cross-modality global descriptor for collecting expressive statistics for the whole inputs. Then, the cross-modality attention vector for the depth input is learned by
where denotes the MLP network and denotes the sigmoid function, which scales the weight values into (0, 1). By doing so, the network can take advantage of the most informative visual appearance and geometry features, and thus tends to suppress the importance of noisy features in the depth stream. We then obtain a less noisy depth representation, namely the filtered HHA, through a channel-wise multiplication between the input depth feature maps and the cross-modality gate:
With a filtered depth representation counterpart, the RGB feature responses could be recalibrated with more accurate depth information. We devise the recalibration operation as the summation of the two modalities:
where denotes the recalibrated RGB feature maps. The idea behind the formula is that, instead of using an element-wise product that treats the depth features as recalibration coefficients to reweight the RGB features, the summation acts as an offset that refines the RGB feature responses at the corresponding positions. The two choices are compared in Table 2.
In practice, we implement the recalibration step in a symmetric and bi-directional manner, so that low-confidence activations in the RGB stream are also suppressed in the same way, and the filtered RGB information in turn recalibrates the depth feature responses to form a more robust depth representation . We visualize the HHA feature maps before and after the Feature Separation part in Fig. 3. The RGB counterpart is shown in the supplementary material.
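The Feature Separation step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names (`feature_separation`, `rgb_in`, `hha_in`, `w1`, `b1`, `w2`, `b2`) are hypothetical, and the two-layer MLP with a ReLU hidden layer is an assumed choice, since the text only says that an MLP followed by a sigmoid is used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_separation(rgb_in, hha_in, w1, b1, w2, b2):
    """One direction of FS: filter the depth features, then recalibrate RGB.

    rgb_in, hha_in: feature maps of shape (C, H, W).
    w1 (hidden, 2C), b1 (hidden,), w2 (C, hidden), b2 (C,): hypothetical MLP weights.
    """
    # Global average pooling over the spatial positions of the concatenated
    # maps gives a cross-modality descriptor of length 2C.
    descriptor = np.concatenate([rgb_in, hha_in], axis=0).mean(axis=(1, 2))
    # Assumed two-layer MLP (ReLU hidden layer), then a sigmoid in (0, 1).
    hidden = np.maximum(w1 @ descriptor + b1, 0.0)
    attention = sigmoid(w2 @ hidden + b2)            # shape (C,)
    # Channel-wise multiplication yields the filtered depth representation.
    hha_filtered = attention[:, None, None] * hha_in
    # Summation uses the filtered depth as an offset on the RGB features.
    rgb_rec = rgb_in + hha_filtered
    return hha_filtered, rgb_rec, attention

# Toy inputs with random weights, only to show the shapes involved.
rng = np.random.default_rng(0)
C, H, W, hidden_dim = 4, 5, 6, 8
rgb = rng.standard_normal((C, H, W))
hha = rng.standard_normal((C, H, W))
params = (rng.standard_normal((hidden_dim, 2 * C)), np.zeros(hidden_dim),
          rng.standard_normal((C, hidden_dim)), np.zeros(C))
hha_f, rgb_rec, att = feature_separation(rgb, hha, *params)
```

The symmetric direction would call the same function with the two inputs swapped and a separate set of weights, giving a filtered RGB map and a recalibrated depth map.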
Feature Aggregation (FA). RGB and D features are strongly complementary to each other. To make full use of their complementarity, we need to complementarily aggregate the cross-modality features at a certain position in space according to their characterization capabilities. To achieve this, we consider both characteristics of these two modalities and generate spatial-wise gates for both and to control information flow of each modality feature map with soft attention mechanism, which is visualized in Figure 2 (b) and marked by the second red frame. To make the gate more precise, we use recalibrated RGB and HHA feature maps from FS part, i.e., and , to generate the gate. We first concatenate these two feature maps to combine their features at a certain position in space. Then we define two mapping functions to map high-dimensional feature to two different spatial-wise gates:
where is the concatenated feature, is the spatial-wise gate for RGB feature map, and is the spatial-wise gate for HHA feature map. In practice, we use a convolution to implement this mapping function. A softmax function is applied on these two gates:
where and . is the weight assigned to each position in the RGB feature map and is the weight assigned to each position in the HHA feature map. The final merged feature can be obtained by weighting the RGB and HHA maps:
So far, we have added the gated RGB and HHA feature maps to obtain the fused feature maps . Since the SA-Gate is injected into the encoder stage, we then average the fused features and the original input to obtain and respectively, which is similar in spirit to residual learning.
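The Feature Aggregation step can be sketched in NumPy as below. This is an illustrative sketch, not the authors' code: all names are hypothetical, the mapping is assumed to be a 1x1 convolution with a single output channel and no bias (the text only says "a convolution"), and the softmax weights are assumed to be applied to the input feature maps.

```python
import numpy as np

def feature_aggregation(rgb_rec, hha_rec, rgb_in, hha_in, w_rgb, w_hha):
    """Spatial-wise softmax gating of two modalities.

    rgb_rec, hha_rec: recalibrated maps from the FS part, shape (C, H, W).
    rgb_in, hha_in:   input maps, shape (C, H, W).
    w_rgb, w_hha:     hypothetical 1x1-conv weights of shape (2C,).
    """
    concat = np.concatenate([rgb_rec, hha_rec], axis=0)   # (2C, H, W)
    # A 1x1 convolution with one output channel is a per-position
    # weighted sum over the channels.
    g_rgb = np.einsum("c,chw->hw", w_rgb, concat)
    g_hha = np.einsum("c,chw->hw", w_hha, concat)
    # Softmax over the two gates at every position (max-shifted for stability).
    g_max = np.maximum(g_rgb, g_hha)
    e_rgb, e_hha = np.exp(g_rgb - g_max), np.exp(g_hha - g_max)
    a_rgb = e_rgb / (e_rgb + e_hha)
    a_hha = e_hha / (e_rgb + e_hha)
    # Weighted merge; the (H, W) weights broadcast over the channels.
    merged = a_rgb * rgb_in + a_hha * hha_in
    # Average the fused feature with each original input.
    rgb_out = 0.5 * (merged + rgb_in)
    hha_out = 0.5 * (merged + hha_in)
    return merged, rgb_out, hha_out, a_rgb, a_hha

rng = np.random.default_rng(0)
C, H, W = 4, 5, 6
rgb = rng.standard_normal((C, H, W))
hha = rng.standard_normal((C, H, W))
# Stand-ins for the recalibrated maps that the FS part would produce.
rgb_rec, hha_rec = rgb + 0.1 * hha, hha + 0.1 * rgb
merged, rgb_out, hha_out, a_rgb, a_hha = feature_aggregation(
    rgb_rec, hha_rec, rgb, hha,
    rng.standard_normal(2 * C), rng.standard_normal(2 * C))
```

Because the two weights sum to one at every position, `merged` is a convex combination of the two inputs, so its numerical scale stays within the range spanned by the RGB and HHA features.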
Bi-directional Multi-step Propagation (BMP). Because the two weights at each position are normalized to sum to 1, the numerical scale of the weighted feature does not differ significantly from that of the input RGB or HHA features. Therefore, it has no negative influence on the learning of the encoder or the loading of pre-trained parameters. For each layer , we use the output generated by the -th SA-Gate to refine the raw output of the -th layer in the encoder: , . This is a bi-directional propagation process, and the refined results are propagated to the next layer in the encoder for more accurate and efficient encoding of the two modalities.
3.2 Segmentation Decoder
The decoder can adopt almost any decoder design from SOTA RGB-based segmentation networks, since the SA-Gate is a plug-and-play module that makes good use of complementary cross-modality information in the encoder stage. We show the results of combining our encoder with different decoders in Table 6. We choose DeepLabV3+ [v3+] as our decoder because it achieves the best performance.
We conduct comprehensive experiments on the in-door NYU Depth V2 and out-door CityScapes datasets in terms of two metrics: mean Intersection-over-Union (mIoU) and pixel accuracy (pixel acc.). We also evaluate our model on the SUN-RGBD dataset (please refer to the supplemental material for more details).
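For reference, both metrics can be computed from a class confusion matrix. The sketch below is a common formulation, not the authors' evaluation code; the function names and the `ignore_label` value of 255 are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_label=255):
    """Rows index the ground-truth class, columns the predicted class."""
    valid = gt != ignore_label                      # skip ignored pixels
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_pixel_acc(conf):
    tp = np.diag(conf).astype(np.float64)
    # Union = ground-truth pixels + predicted pixels - intersection.
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    present = union > 0                             # classes that occur at all
    miou = (tp[present] / union[present]).mean()
    pixel_acc = tp.sum() / conf.sum()
    return miou, pixel_acc

# Tiny example: 3 classes, one ignored pixel.
gt = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
conf = confusion_matrix(pred, gt, num_classes=3)
miou, acc = miou_and_pixel_acc(conf)
# Per-class IoU here is 1/2, 2/3 and 1, so mIoU is about 0.722; pixel acc. is 4/5.
```

In practice the confusion matrix is accumulated over all test images before the two metrics are computed.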
NYU Depth V2 [nyudv2] contains 1449 RGB-D images with 40-class labels, of which 795 images are used for training and the remaining 654 images are used for testing.
CityScapes [cordts2016cityscapes] contains images from 50 cities. There are 2975 images for training, 500 for validation and 1525 for testing. Each image has a resolution of 2048×1024 and is finely annotated with pixel-level labels of 19 semantic classes. We do not use the additional coarse annotations in our experiments.
4.2 Implementation Details
We use the PyTorch framework. For data augmentation, we use random horizontal flipping and random scaling. When comparing with SOTA methods, we adopt flipping and multi-scale inference as test-time augmentation to boost performance. More details are given in the supplemental material.
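The flipping and multi-scale inference strategy can be sketched as follows. This is a generic illustration, not the authors' inference code: `model` is a hypothetical callable mapping a `(C, H, W)` array to a `(K, H, W)` score map, the `scales` tuple is an arbitrary placeholder, and nearest-neighbour resizing stands in for the bilinear interpolation that is normally used.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array."""
    _, h, w = x.shape
    rows = np.minimum((np.arange(out_h) * h / out_h).astype(int), h - 1)
    cols = np.minimum((np.arange(out_w) * w / out_w).astype(int), w - 1)
    return x[:, rows[:, None], cols[None, :]]

def predict_with_tta(model, image, scales=(0.75, 1.0, 1.25), flip=True):
    """Average score maps over rescaled and horizontally flipped inputs."""
    _, h, w = image.shape
    total, count = 0.0, 0
    for s in scales:
        sh, sw = max(1, int(round(h * s))), max(1, int(round(w * s)))
        scaled = resize_nearest(image, sh, sw)
        # Prediction at this scale, resized back to the original resolution.
        total = total + resize_nearest(model(scaled), h, w)
        count += 1
        if flip:
            # Predict on the mirrored input, then mirror the scores back.
            flipped_scores = model(scaled[:, :, ::-1])[:, :, ::-1]
            total = total + resize_nearest(flipped_scores, h, w)
            count += 1
    return total / count

# Sanity check with an identity "model": the averaged output equals the input.
image = np.arange(2 * 4 * 6, dtype=np.float64).reshape(2, 4, 6)
scores = predict_with_tta(lambda x: x, image, scales=(1.0, 2.0))
```

For an RGB-D network, the same procedure would be applied jointly to the RGB and HHA inputs so that both are scaled and flipped consistently.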
4.3 Efficiency Analysis
To verify whether the proposed cross-modality feature propagation helps and is efficient, we compare the final model with the RGB-D baseline. The RGB-D baseline averages the predictions of two parallel DeepLab V3+ networks. As shown in Table 1, the proposed method achieves better performance with significantly lower memory requirements and computational cost than the baseline. The results indicate that aimlessly adding parameters to a multi-modality network does not bring extra representational power for recognizing objects. In contrast, a well-designed cross-modality mechanism, such as the proposed cross-modality feature propagation, helps to learn more powerful representations and improves performance more efficiently.
4.4 Ablation Study
We perform ablation studies on our design choices under the same hyperparameters.
Feature Separation. We employ the FS operation before feature aggregation in the SA-Gate to filter out noisy features for the bi-directional recalibration step. To verify the effectiveness of this operation, we ablate each design of FS in Table 2. Note that we ablate four different architectures and replace all FS parts in the network for comparison. ‘Concat’ means that we concatenate the and feature maps and directly pass them to the feature aggregation part. ‘Self-global’ means that we filter single-modality features with their own global information. ‘Cross-global’ means that the filtered RGB is added to the input RGB and vice versa, with the filtering guidance coming from cross-modality global information. ‘Product’ means that we multiply by and vice versa. We see that, from column to , not using cross-modality information to filter noisy features, or refining features without explicit cross-modality recalibration, leads to about drop. On the other hand, the last two columns indicate that the cross-modality guidance (Eq. 4) is more appropriate and effective than cross-modality re-weighting for cross-modality recalibration. Overall, these results show that the proposed FS operator effectively filters incorrect messages and recalibrates feature responses, achieving the best performance among all compared designs.
Feature Aggregation. We employ the SA-Gating mechanism to adaptively select features from the cross-modal data according to their different characteristics at each spatial location. This gate can effectively control the information flow of multi-modal data. To evaluate the validity of the design, we perform an ablation study on feature aggregation, as shown in Table 3. The experimental setting is kept the same as above. ‘Addition’ means directly adding the recalibrated RGB and HHA feature maps. ‘Conv’ means applying a convolution to the concatenated feature map. ‘Proposed’ denotes the FA operator. We see that the FA operator leads to the best result, since it considers the spatial-wise relationship between the two modalities and can better explore the complementary information.
|Method||mIoU (%)|
|Res50 (Average of Dual Path)||45.9|
|Res50 + SA-Gate||47.4|
|Res50 + BMP||47.8|
|Res50 + BMP + SA-Gate||48.6|
|Method||RGB(%mIoU)||RGB-D(%mIoU)||RGB-D w SA-Gate(%mIoU)|
|DeepLab V3 [v3]||44.7||46.5||49.1|
|DeepLab V3+ [v3+]||44.3||46.7||50.4|
RGB-D: the simple method that only averages the final score maps of the RGB path and the HHA path. Note that we reproduce these methods using the official open-source code, and all experiments use the same settings as our method.
|Shu Kong [kong2018recurrent]||-||-||-||-||-||-||-||-||-||-||-||-||-||-||-||-||-||-||-||78.2|
RGB baseline (Deeplab V3+ [v3+])
Design of Encoder. We verify and analyze the effectiveness of the proposed BMP in our encoder, and how it functions together with the SA-Gate. To this end, we conduct two ablation studies, as shown in Tables 4 and 5. We use ResNet-50 as our backbone here and directly upsample the final score map by a factor of 16, without using a segmentation decoder. The first row in Tables 4 and 5 is the baseline that averages the score maps generated by two ResNet-50 networks (RGB & D).
For the first ablation, we gradually embed the SA-Gate unit behind different layers of ResNet-50. Note that we generate score maps for both streams and average them as the final segmentation result. This setting is different from those above, because the last block of ResNet may not be equipped with an SA-Gate in this part, i.e., no fused feature is generated from the last block. From Table 4, we observe that embedding the SA-Gate into a higher stage leads to relatively worse performance. Besides, when stacking SA-Gates stage by stage, the additional gain continuously decreases. These two phenomena show that the features of different modalities differ more in lower stages and that early fusion achieves better performance. Table 5 shows the results of the second experiment. We observe that both SA-Gate and BMP boost performance. Meanwhile, they complement each other and perform better in the presence of the other component. Moreover, when comparing Tables 5 and 2, we see that the SA-Gate helps BMP propagate valid information better than other gate mechanisms. This demonstrates the effectiveness and importance of a more accurate representation for feature propagation.
The Plug-and-Play Property of Proposed Encoder. We conduct an ablation study to validate the flexibility and effectiveness of our method for different types of decoders. Following recent RGB-based semantic segmentation algorithms, we splice their decoders with our model to form modified RGB-D versions (i.e., RGB-D w SA-Gate), as shown in Table 6. We see that our method consistently achieves significant improvements over the original RGB versions. Besides, compared with the naive RGB-D modifications, our method also boosts performance by at least 1.5% mIoU. In particular, with the decoder of DeepLab V3+ [v3+], our method achieves a 3.7% mIoU improvement. The results verify both the flexibility and effectiveness of our method for various decoders.
4.5 Visualization of SA-Gate
We visualize the first SA-Gate in our model to see what it has learned, as shown in Fig. 4. Note that the black region in GT represents pixels that are ignored when calculating IoU. We reproduce RDFNet-101 [park2017rdfnet] in PyTorch with 48.7% mIoU on NYU Depth V2, which is close to the result in the original paper (49.1%). Red represents a higher weight assigned to RGB and blue represents a higher weight assigned to HHA. From column 4, we can see that RGB has a stronger response at boundaries and HHA responds well in glare and dark areas. This is reasonable, since RGB features contain more detail in high-contrast areas and HHA features are not affected by lighting conditions. In row 1, details inside the yellow boxes are lost in HHA but obvious in RGB. Our method successfully identifies the chair legs and distinguishes the table that looks similar to the chair. In row 2, glare blurs the border of the photo frame. Since our model focuses more on HHA in this area, it predicts the photo frame more completely than RDFNet. Besides, our model captures more details than RDFNet on the clothes stand. In row 3, the cabinet in dark red is hard to recognize in RGB but has identifiable features in HHA. Improper fusion of RGB and HHA leads to erroneous semantics for this area (column ). In contrast, our model pays more attention to HHA in this area and achieves more precise results.
4.6 Comparing with State-of-the-arts
NYU Depth V2. Results are shown in Table 7. Our model achieves leading performance. For a fair comparison with [pap, hu2019acnet, padnet], which use ResNet-50 as the backbone, we also use the same backbone and achieve mIoU, which is still better than these methods. Specifically, [park2017rdfnet, hu2019acnet] use channel-wise attention or vanilla convolution to extract complementary features, which is more implicit than our model in selecting valid features from complementary information. Besides, we can see that using depth data as extra supervision (such as [pap, padnet]) can make a network more robust than general RGB-D methods that take both RGB and depth as input sources [park2017rdfnet, cheng2017locality, 3dgnn]. However, our results demonstrate that once the input RGB-D information is effectively recalibrated and aggregated, higher performance can be obtained.
CityScapes. We achieve % mIoU on the validation set and % mIoU on the test set, which are both leading performances. Table 8 shows the results on the test set. We observe that, due to the serious noise of the depth measurements in this dataset, most previous RGB-D based methods perform even worse than RGB-based methods. However, our method effectively distills the depth features, extracts the valid information in them and boosts performance. Note that [choi2020cars] is a contemporary work and we outperform it by . We exclude the results of GSCNN [gatedscnn] for a fair comparison, since it uses a stronger backbone, WideResNet, instead of ResNet-101. However, we still outperform GSCNN by mIoU on the validation set and achieve the same performance on the test set.
In this work, we propose a cross-modality guided encoder along with SA-Gate and BMP modules to address two key challenges in RGB-D semantic segmentation, i.e., an effective unified representation for different modalities and robustness to low-quality depth sources. Meanwhile, our proposed encoder can act as a plug-and-play module that can be easily injected into current state-of-the-art RGB semantic segmentation frameworks to boost their performance.
Acknowledgments: This work is supported by the National Key Research and Development Program of China (2017YFB1002601, 2016QY02D0304), National Natural Science Foundation of China (61375022, 61403005, 61632003), Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision.
This supplementary material presents: more implementation details based on the main paper; additional experimental analysis and qualitative results of our approach on NYU Depth V2, CityScapes val set and SUN-RGBD dataset.
2 Implementation Details
We use PyTorch framework to implement our experiments. We set batch size to 16 for all experiments. We adopt mini-batch SGD with momentum to train our model. The momentum is fixed as and the weight decay is set to . We employ a poly learning rate policy where the initial learning rate is multiplied by .
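The poly learning rate policy is commonly written as the base learning rate multiplied by (1 - iter / max_iter) raised to a power. The sketch below shows this common form; the function name and the default `power=0.9` are assumptions made for illustration, not values taken from the text above.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Learning rate at iteration cur_iter under the poly policy."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# The rate starts at base_lr and decays smoothly to zero at max_iter.
schedule = [poly_lr(0.01, it, 100) for it in range(0, 101, 25)]
```

With `power=1.0` the policy reduces to a linear decay, and values of `power` below 1 keep the learning rate higher for longer before it drops towards zero.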
For NYU Depth V2, we randomly crop the image to and train epochs with base learning rate set to . We employ cross-entropy loss on both the final output and the intermediate feature map output from ResNet-101 block4, where the weight over the final loss is and the auxiliary loss is .
For SUN-RGBD, we randomly crop the image to and train epochs with base learning rate set to . Cross-entropy loss is used for the final output.
For CityScapes, we randomly crop the image to and train epochs with base learning rate set to . We use OHEM loss for better learning. For data augmentation, we use random horizontal flipping and random scaling with scale . When comparing with the state-of-the-art methods, we adopt flipping and multi-scale inference strategies as a test-time augmentation to boost the performance.
3 Experimental Results
Besides the results analyzed in the main paper, we also conduct experiments on the CityScapes val set and the SUN-RGBD dataset to further verify the effectiveness and generalization ability of our approach. Meanwhile, we conduct more ablation studies on NYU Depth V2 to verify the robustness of the proposed method.