Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation

04/09/2020 ∙ by Lin-Zhuo Chen, et al. ∙ Nankai University

3D spatial information is known to be beneficial to the semantic segmentation task. Most existing methods take 3D spatial data as an additional input, leading to a two-stream segmentation network that processes RGB and 3D spatial information separately. This solution greatly increases the inference time and severely limits its scope for real-time applications. To solve this problem, we propose Spatial information guided Convolution (S-Conv), which allows efficient integration of RGB features and 3D spatial information. S-Conv infers the sampling offset of the convolution kernel guided by the 3D spatial information, helping the convolutional layer adjust the receptive field and adapt to geometric transformations. S-Conv also incorporates geometric information into the feature learning process by generating spatially adaptive convolutional weights. The capability of perceiving geometry is largely enhanced with little effect on the number of parameters and the computational cost. We further embed S-Conv into a semantic segmentation network, called Spatial information Guided convolutional Network (SGNet), resulting in real-time inference and state-of-the-art performance on NYUDv2 and SUNRGBD datasets.




I Introduction

With the development of 3D sensing technologies, RGBD data with spatial information (depth, 3D coordinates) is easily accessible. As a result, RGBD semantic segmentation for high-level scene understanding becomes extremely important, benefiting a wide range of applications such as automatic driving [icnet], SLAM [bescos2018dynaslam], and robotics. Due to the effectiveness of Convolutional Neural Network (CNN) and additional spatial information, recent advances demonstrate enhanced performance on indoor scene segmentation tasks [fcn, deeplab]. Nevertheless, there remains a significant challenge caused by the complexity of the environment and the extra efforts for considering spatial data, especially for applications that require real-time inference.


Fig. 1: The network architecture of different approaches. (a) The conventional two-stream structure. (b) The proposed SGNet. It can be seen that the approach in (a) largely increases parameter number and inference time due to processing spatial information, thus less suitable for real-time applications. We replace the convolution with our S-Conv in (b) where the kernel distribution and weights of the convolution are adaptive to the spatial information. S-Conv greatly enhances the spatial awareness of the network with few additional parameters and computations, thus can efficiently utilize spatial information. Best viewed in color.

A common approach treats 3D spatial information as an additional input of the network to extract features, followed by combining the features of RGB images [rdfnet, fcn, eigen2015predicting, ma2017multi, fusenet, wang2016learning], as shown in Fig. 1 (a). This approach achieves promising results at the cost of significantly increasing the parameter number and computational time, thus being unsuitable for real-time tasks. Meanwhile, several works [fcn, fusenet, gupta2014learning, lstmcf, rdfnet] encode raw spatial information into three channels (HHA) composed of horizontal disparity, height above ground, and the angle between the surface normal and the gravity direction. However, the conversion from raw data to HHA is also time-consuming [fusenet].

It is worth noting that indoor scenes have more complex spatial relations than outdoor scenes. This requires a stronger adaptive ability of the network to deal with geometric transformations. However, due to the fixed structure of the convolution kernel, the 2D convolution in the aforementioned methods cannot well adapt to spatial transformation and adjust the receptive field inherently, limiting the accuracy of semantic segmentation. Although alleviation can be made by revised pooling operation and prior data augmentation [deform, deformablev2], a better spatially adaptive sampling mechanism for conducting convolution is still desirable.

Moreover, the color and texture of objects in indoor scenes are not always representative. Instead, the geometry structure often plays a vital role in semantic segmentation. For example, to recognize the fridge and wall, the geometric structure is the primary cue due to the similar texture. However, such spatial information is ignored by 2D convolution on RGB data. The depth-aware convolution [dcnn] is proposed to address this problem. It forces pixels with similar depths as the center of the kernel to have higher weight than others. Nevertheless, this prior is handcrafted and may lead to sub-optimal results.

Fig. 2: The illustration of the Spatial information guided Convolution (S-Conv). Firstly, the input 3D spatial information is projected by the spatial projector to match the input feature map. Secondly, the adaptive convolution kernel distribution is generated by the offset generator. Finally, the projected spatial information is sampled according to the kernel distribution and fed into the weight generator to generate adaptive convolution weights.

It can be seen that there is a contradiction between the fixed structure of 2D convolution and the varying spatial transformation, along with the efficiency bottleneck of separately processing RGB and spatial data. To overcome the limitations mentioned above, we propose a novel operation, called Spatial information guided Convolution (S-Conv), which adaptively changes according to the spatial information (see Fig. 1 (b)).

Specifically, this operation can generate convolution kernels with different sampling distributions adapting to spatial information, boosting the spatial adaptability and the receptive field regulation of the network. Furthermore, S-Conv establishes a link between the convolution weights and the underlying spatial relationship with their corresponding pixel, incorporating the geometric information into the convolution weights to better capture the spatial structure of the scene.

The proposed S-Conv is light yet flexible and achieves significant performance improvements with only a few additional parameters and little extra computation, making it suitable for real-time applications. We conduct extensive experiments to demonstrate the effectiveness and efficiency of S-Conv. We first design an ablation study and compare S-Conv with deformable convolution [deform, deformablev2] and depth-aware convolution [dcnn], exhibiting the advantages of S-Conv. We also verify the applicability of S-Conv to spatial transformations by testing its influence on different types of spatial data: depth, HHA, and 3D coordinates. We demonstrate that spatial information is more suitable for generating offsets than the RGB features used by deformable convolution [deform, deformablev2]. Finally, benefiting from the adaptability to spatial transformation and the effectiveness of perceiving spatial structure, our network equipped with S-Conv, named Spatial information Guided convolutional Network (SGNet), achieves high-quality results with real-time inference on NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.

We highlight our contributions as follows:

  • We propose a novel S-Conv operator that can adaptively adjust receptive field while effectively adapting to spatial transformation, and can perceive intricate geometric patterns with low cost.

  • Based on S-Conv, we propose a new SGNet that achieves competitive RGBD segmentation performance in real-time on NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.

II Related Work

II-A Semantic Segmentation

The recent advances of semantic segmentation benefit a lot from the development of convolutional neural network (CNN) [imagenet, deep]. FCN [fcn] is the pioneer of leveraging CNN for semantic segmentation. It leads to convincing results and serves as the basic framework for many tasks. With the research efforts in the field, the recent methods can be classified into two categories according to the network architecture, including atrous convolution-based methods [deeplab, multi], and encoder-decoder based methods [refinenet, deeplabv3plus, segnet, deconvnet].

Atrous convolution: The standard approach relies on strided convolutions or pooling to down-sample the features of the CNN backbone and obtain a large receptive field. However, the resolution of the resulting feature map is reduced [deeplabv3], and many details are lost. One approach exploits atrous convolution to alleviate the conflict by enlarging the receptive field while keeping the resolution of the feature map [deeplabv3, deeplab, deeplabv3plus, multi, denseaspp]. We use an atrous convolution based backbone in the proposed SGNet.
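To make the trade-off concrete, the following is a minimal sketch in plain Python. The helper names `kernel_offsets` and `effective_kernel_size` are hypothetical and not taken from the paper. The sketch assumes the usual definition of atrous convolution, where a rate r places r - 1 gaps between neighboring kernel taps, and an odd kernel size k.

```python
def kernel_offsets(k=3, rate=1):
    """Sampling offsets (dy, dx) of a k x k kernel with the given dilation rate.

    Assumes an odd kernel size k so that the kernel has a central tap.
    """
    half = k // 2
    return [(dy * rate, dx * rate)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]


def effective_kernel_size(k=3, rate=1):
    """Spatial extent covered by the dilated kernel: k + (k - 1) * (rate - 1)."""
    return k + (k - 1) * (rate - 1)


if __name__ == "__main__":
    for rate in (1, 2, 4):
        offsets = kernel_offsets(3, rate)
        # The number of taps (and weights) stays at 9; only the extent grows.
        print(rate, len(offsets), effective_kernel_size(3, rate))
```

The number of weights is unchanged by the rate, which is why the receptive field can grow without further down-sampling of the feature map.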

Encoder-decoder architecture: The other approach utilizes the encoder-decoder structure [deconvnet, segnet, refinenet, deeplabv3plus, psp], which learns a decoder to recover the prediction details gradually. DeconvNet [deconvnet] employs a series of deconvolutional layers to produce a high-resolution prediction. SegNet [segnet] achieves better results by using pooling indices in the encoder to guide the recovery process in the decoder. RefineNet [refinenet] fuses low-level features in the encoder with the decoder to refine the prediction. While these methods can achieve more precise results, they require longer inference time.

II-B RGBD Semantic Segmentation

How to effectively use the extra geometry information (depth, 3D coordinates) is the key of RGBD semantic segmentation. A number of works focus on how to extract more information from geometry, which is treated as additional input in [eigen2015predicting, ma2017multi, fusenet, wang2016learning, hu2019acnet]. A two-stream network is used in [ma2017multi, fusenet, wang2016learning, lstmcf, rdfnet] to process the RGB image and geometry information separately and combine the two results in the last layer. These methods achieve promising results at the expense of doubling the parameters and computational cost. 3D CNNs or 3D KNN graph networks are also used to take geometry information into account [song2017semantic, song2016deep, qi20173d]. Besides, various deep learning methods on 3D point clouds [pointnet, pointnet++, chen2019lsanet, spidercnn, spectral_graph_conv, pointcnn] are also explored. However, these methods cost a lot of memory and are computationally expensive. Another stream incorporates geometric information into explicit operations. Cheng et al. [local] use geometry information to build a feature affinity matrix acting in average pooling and up-pooling. Lin et al. [cascaded] split the image into different branches based on geometry information. Wang et al. [dcnn] propose Depth-aware CNN, which adds a depth prior to the convolutional weights. Although it improves feature extraction by convolution, the prior is handcrafted rather than learned from data. Other approaches, such as multi-task learning [jiao2019geometry, wang2015towards, hoffman2016learning, kokkinos2017ubernet, eigen2015predicting, Zhang_2019_CVPR] or spatial-temporal analysis [he2017std2p], are further used to improve segmentation accuracy. The proposed S-Conv aims to efficiently utilize spatial information to improve the feature extraction ability. It can significantly enhance the performance with high efficiency because it uses only a small number of additional parameters.

II-C Dynamic structure in CNN

Using dynamic structure to deal with varying input of CNN has also been explored. Dilated convolution is used in [multi, deeplab] to increase the receptive field size without reducing feature map resolution. Spatial transformer network [stn] adapts to spatial transformation by warping the feature map. Dynamic filter [dynamicfilter] adaptively changes its weights according to the input. Besides, self-attention based methods [selective, nonlocal, senet] generate attention maps from the intermediate feature map to adjust the response at each location or capture long-range contextual information adaptively. Some generalizations of convolution from 2D image to 3D point cloud are also presented. PointCNN [pointcnn] is a seminal work that enables CNN on a set of unordered 3D points. There are other improvements [chen2019lsanet, spidercnn, spectral_graph_conv] on utilizing neural networks to effectively extract deep features from 3D point sets. Deformable convolution [deform, deformablev2] can generate different distributions with adaptive weights. Nevertheless, its input is an intermediate feature map rather than spatial information. Our experiments in Sec. IV verify that better results can be obtained based on spatial information.

III S-Conv and SGNet

In this section, we first elaborate on the details of Spatial information guided Convolution (S-Conv), which is a generalization of conventional RGB-based convolution by involving spatial information in the RGBD scenario. Then, we discuss the relation between our S-Conv and other approaches. Finally, we describe the network architecture of Spatial information Guided convolutional Network (SGNet), which is equipped with S-Conv for RGBD semantic segmentation.

III-A Spatial information guided Convolution

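For completeness, we first review the conventional convolution operation. We use $\mathbf{A}_i(\mathbf{j})$ to denote a tensor, where $i$ is the index corresponding to the first dimension, and $\mathbf{j} \in \mathbb{R}^2$ indicates the two indices for the second and third dimensions. Non-scalar values are highlighted in bold for convenience.

Consider an input feature map $\mathbf{X} \in \mathbb{R}^{c \times h \times w}$. We describe it in 2D for simplicity, thus $\mathbf{X} \in \mathbb{R}^{1 \times h \times w}$. Note that it is straightforward to extend to the 3D case. The conventional convolution applied on $\mathbf{X}$ to get $\mathbf{Y}$ can be formulated as the following:

$$\mathbf{Y}(\mathbf{p}) = \sum_{i=1}^{K} \mathbf{W}_i \cdot \mathbf{X}(\mathbf{p} + \mathbf{d}_i), \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{K}$ is the convolution kernel with size $K = k_h \times k_w$, $\mathbf{p} \in \mathbb{R}^2$ is the 2D convolution center, and $\mathbf{d} \in \mathbb{R}^{K \times 2}$ denotes the kernel distribution around $\mathbf{p}$. For a $3 \times 3$ convolution:

$$\mathbf{d} = \{[-1, -1], [-1, 0], \ldots, [0, 1], [1, 1]\}. \tag{2}$$

From the above equation, we can see that the convolution kernel is constant over $\mathbf{X}$. In other words, $\mathbf{W}$ and $\mathbf{d}$ are fixed, meaning the convolution is content-agnostic.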
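The content-agnostic nature of the conventional convolution can be illustrated with a minimal sketch in plain Python. The function name `conv2d_at`, the constant `D3`, and the zero-padding behavior at the border are illustrative assumptions, not part of the paper. The same weights and the same kernel grid are reused at every center.

```python
def conv2d_at(X, W, d, p):
    """Evaluate one output value: the sum over i of W[i] * X(p + d[i]).

    X is a 2D list (h x w), W a list of K weights, d a list of K (dy, dx)
    grid offsets, and p = (row, col) the convolution center.
    Positions outside X contribute zero (zero padding, an assumption).
    """
    h, w = len(X), len(X[0])
    total = 0.0
    for w_i, (dy, dx) in zip(W, d):
        y, x = p[0] + dy, p[1] + dx
        if 0 <= y < h and 0 <= x < w:
            total += w_i * X[y][x]
    return total


# Fixed 3 x 3 kernel distribution, identical for every center p.
D3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

if __name__ == "__main__":
    X = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
    W = [1.0] * 9  # the same weights are used at every position
    print(conv2d_at(X, W, D3, (1, 1)))
    print(conv2d_at(X, W, D3, (0, 0)))
```

Neither `W` nor `D3` depends on the center `p` or on the content of `X`, which is the limitation that S-Conv addresses.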

Fig. 3: The illustration of weights in 2D convolution and in S-Conv. The yellow dot indicates the point whose spatial position changes along the arrow. Illustration of 2D convolution is on the top, and S-Conv is on the bottom. The conventional 2D convolution operation orderly places local points in a regular grid with fixed weights, while ignoring the spatial information. We can see that the spatial position variation of the yellow point cannot be reflected in the weight. Our S-Conv can be regarded as placing a local patch into a weight space, which is generated by the spatial guidance of that patch. Hence the weight of each point establishes a link with its spatial location, effectively capturing the spatial variation of the local patch. The spatial relationship between the yellow point and other points can be reflected in the adaptive weights.

Fig. 4: The network architecture of SGNet equipped with S-Conv for RGBD semantic segmentation. The SGNet consists of a backbone network and a decoder. The deep supervision is added between layer 3 and layer 4 to improve network optimization.

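In the RGBD context, we want to involve 3D spatial information efficiently by using adaptive convolution kernels.

We first generate the offset according to the spatial information, then use the spatial information corresponding to the given offset to generate new spatially adaptive weights. Our S-Conv requires two inputs. One is the feature map $\mathbf{X}$, which is the same as in conventional convolution. The other is the spatial information $\mathbf{S} \in \mathbb{R}^{c' \times h \times w}$. In practice, $\mathbf{S}$ can be HHA ($c' = 3$), 3D coordinates ($c' = 3$), or depth ($c' = 1$). The method of encoding depth into 3D coordinates and HHA is the same as [qi20173d]. Note that the input spatial information is not included in the feature map.

As the first step of S-Conv, we project the input spatial information into a high-dimensional feature space, which can be expressed as:

$$\mathbf{S}' = \phi(\mathbf{S}), \tag{3}$$

where $\phi$ is a spatial transformation function, and $\mathbf{S}' \in \mathbb{R}^{64 \times h \times w}$, which has a higher dimension than $\mathbf{S}$.

Then, we take the transformed spatial information into consideration, perceive its geometric structure, and generate the distribution (offset of pixel coordinate in the $x$ and $y$ axes) of convolution kernels at different centers $\mathbf{p}$. This process can be expressed as:

$$\Delta\mathbf{d} = \eta(\mathbf{S}'), \tag{4}$$

where $\Delta\mathbf{d} \in \mathbb{R}^{h' \times w' \times k_h \times k_w \times 2}$, $h'$ and $w'$ represent the feature map size after convolution, and $k_h$ and $k_w$ are the kernel size. $\eta$ is a non-linear function which can be implemented by a series of convolutions.

After generating the distribution of the kernel for each possible $\mathbf{p}$ using $\Delta\mathbf{d}(\mathbf{p})$, we boost its feature extraction ability by establishing the link between the geometric structure and the convolution weight. More specifically, we sample the geometric information of the pixels corresponding to the convolution kernel after shifting:

$$\mathbf{S}^*(\mathbf{p}) = \{\mathbf{S}'(\mathbf{p} + \mathbf{d}_i + \Delta\mathbf{d}_i(\mathbf{p})) \mid i = 1, 2, \ldots, K\}, \tag{5}$$

where $\mathbf{d}_i$ is the $i$-th position of the regular kernel grid with $K = k_h \times k_w$ positions, and $\Delta\mathbf{d}(\mathbf{p})$ is the spatial distribution of convolution kernels at $\mathbf{p}$. $\mathbf{S}^*(\mathbf{p}) \in \mathbb{R}^{64 \times K}$ is the spatial information corresponding to the feature map of the convolution kernel centered on $\mathbf{p}$ after transformation.

Finally, we generate convolution weights according to the final spatial information as the following:

$$\mathbf{W}^*(\mathbf{p}) = f(\mathbf{S}^*(\mathbf{p})) \odot \mathbf{W}, \tag{6}$$

where $f$ is a non-linear function that can be implemented as a series of convolution layers with non-linear activation function, $\mathbf{W} \in \mathbb{R}^{K}$ indicates the convolution weights, which can be updated by the gradient descent algorithm, and $\mathbf{W}^*(\mathbf{p})$ denotes the spatially adaptive weights for the convolution centered at $\mathbf{p}$.

Overall, our generalized S-Conv is formulated as:

$$\mathbf{Y}(\mathbf{p}) = \sum_{i=1}^{K} \mathbf{W}^*_i(\mathbf{p}) \cdot \mathbf{X}(\mathbf{p} + \mathbf{d}_i + \Delta\mathbf{d}_i(\mathbf{p})). \tag{7}$$

We can see that $\mathbf{W}^*_i(\mathbf{p})$ establishes the correlation between spatial information and convolution weights. Moreover, the convolution kernel distribution is also relevant to the spatial information through $\Delta\mathbf{d}$. Note that $\mathbf{W}^*_i(\mathbf{p})$ and $\Delta\mathbf{d}_i(\mathbf{p})$ are not constant, meaning the generalized convolution is adaptive to different $\mathbf{p}$. Also, as $\Delta\mathbf{d}$ is typically fractional, we use bilinear interpolation to compute $\mathbf{X}(\mathbf{p} + \mathbf{d}_i + \Delta\mathbf{d}_i(\mathbf{p}))$ as in [deform, stn]. The main formulae discussed above are labeled in Fig. 2.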
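The steps described above (offset generation, sampling of the projected spatial information at the shifted positions, weight generation, and the final weighted sum with bilinear interpolation) can be sketched for a single output location in plain Python. This is only an illustrative sketch under stated assumptions: the names `bilinear`, `s_conv_at`, `offset_fn`, and `weight_fn` are hypothetical, the feature map and the projected spatial information are reduced to one channel each, and the two generators are passed in as plain functions rather than learned convolution layers.

```python
import math


def bilinear(F, y, x):
    """Bilinearly sample the 2D grid F at a fractional (y, x); zero outside F."""
    h, w = len(F), len(F[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * F[yy][xx]
    return val


def s_conv_at(X, S_proj, W, d, offset_fn, weight_fn, p):
    """One output value of the S-Conv sketch at center p = (row, col).

    offset_fn(S_proj, p) -> K fractional offsets (dy, dx)  (stand-in offset generator)
    weight_fn(samples)   -> K modulation factors           (stand-in weight generator)
    """
    delta = offset_fn(S_proj, p)
    # Shifted sampling positions: center + regular grid offset + predicted offset.
    pos = [(p[0] + dy + oy, p[1] + dx + ox)
           for (dy, dx), (oy, ox) in zip(d, delta)]
    # Projected spatial information gathered at the shifted positions.
    s_star = [bilinear(S_proj, y, x) for y, x in pos]
    # Spatially adaptive weights: the learned weights W modulated per position.
    w_star = [m * w_i for m, w_i in zip(weight_fn(s_star), W)]
    return sum(w_i * bilinear(X, y, x) for w_i, (y, x) in zip(w_star, pos))


D3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

if __name__ == "__main__":
    X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    depth = [[1.0, 1.0, 3.0], [1.0, 1.0, 3.0], [1.0, 1.0, 3.0]]
    no_shift = lambda S, p: [(0.0, 0.0)] * 9
    # Toy weight generator: down-weight taps whose spatial value differs from the center tap.
    toy_weights = lambda s: [math.exp(-abs(v - s[4])) for v in s]
    print(s_conv_at(X, depth, [1.0] * 9, D3, no_shift, toy_weights, (1, 1)))
```

With zero offsets and unit modulation factors, `s_conv_at` reduces to the conventional convolution on the regular grid, which matches the special case discussed in the text. The toy weight function here is handcrafted purely for demonstration; in S-Conv both generators are learned.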

III-B Relation to other approaches

2D convolution is a special case of the proposed S-Conv without geometry information and the corresponding offsets. Specifically, without geometry information, the center point and its neighboring points have a fixed positional relation in image space, so there is no varying spatial relation to capture. This also shows that S-Conv is compatible with the 2D case. For the RGBD case, S-Conv can extract features at the point level and is not limited to the discrete grid, thanks to the spatially adaptive weights shown in Fig. 3. Deformable convolution [deform, deformablev2] also alleviates this problem by generating different distribution weights. Nevertheless, its distributions are inferred from 2D feature maps instead of 3D spatial information as in our case. We verify through experiments that our method achieves better results than deformable convolution [deform, deformablev2]. Compared with 3D KNN graph-based methods, S-Conv selects neighboring pixels adaptively instead of using a KNN graph, which is inflexible and computationally expensive.

III-C SGNet architecture

Our semantic segmentation network, called SGNet, is equipped with S-Conv and consists of a backbone and a decoder. The structure of SGNet is illustrated in Fig. 4. We use ResNet101 [resnet] as our backbone, and replace the first and the last two conventional convolutions (3 × 3 filter) of each layer with our S-Conv. We add a series of convolutions to extract the feature further and then use bilinear up-sampling to produce the final segmentation probability map, which corresponds to the decoder part of the SGNet. The spatial projector in Equ. (3) is implemented as three convolution layers, i.e. Conv(3, 64) - Conv(64, 64) - Conv(64, 64) with non-linear activation function. The offset generator in Equ. (4) and the weight generator in Equ. (6) are implemented as a single convolution layer and two convolution layers, respectively. The S-Conv implementation is modified from deformable convolution [deformablev2, deform]. We add deep supervision between layer 3 and layer 4 to improve the network optimization capability, which is the same as PSPNet [psp].
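As a rough illustration of why the spatial projector is light, the sketch below counts the parameters of a Conv(3, 64) - Conv(64, 64) - Conv(64, 64) stack in plain Python. The kernel size and the presence of bias terms are not stated in the text, so both are assumptions here (1 × 1 kernels with bias by default), and the function names `conv_params` and `projector_params` are hypothetical.

```python
def conv_params(c_in, c_out, k=1, bias=True):
    """Parameter count of one k x k convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def projector_params(channels=(3, 64, 64, 64), k=1, bias=True):
    """Total parameters of a chain of convolutions with the given channel widths."""
    return sum(conv_params(a, b, k, bias)
               for a, b in zip(channels, channels[1:]))


if __name__ == "__main__":
    print(projector_params(k=1))  # assumed 1 x 1 kernels
    print(projector_params(k=3))  # assumed 3 x 3 kernels
```

Under either assumption a single projector stays well below 0.1M parameters, which is small next to the tens of millions of parameters of the ResNet101 backbone reported in Tab. I.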

IV Experiments

In this section, we first validate the performance of S-Conv by analyzing its usage in different layers, conducting an ablation study and a comparison with its alternatives, and evaluating the results of using different input information to generate the offset. Then we compare our SGNet equipped with S-Conv with other state-of-the-art semantic segmentation methods on NYUDv2 and SUNRGBD datasets. Finally, we visualize the depth adaptive receptive field in each layer and the segmentation results, demonstrating that the proposed S-Conv can well exploit spatial information.

Fig. 5: Per-category IoU improvement of S-Conv on NYUDv2 dataset.
S-Conv layer3_0 layer3_1 layer3_2 layer3_20 layer3_21 layer3_22 other layers mIoU(%) param(M) FPS
43.0 56.8 37
47.0 56.9 37
Baseline 46.6 57.2 36
(ResNet101) 46.5 57.2 36
47.8 57.2 36
49.0 58.3 28
TABLE I: The results of replacing convolution (of 3 × 3 filter) of different layers with S-Conv on NYUDv2 dataset. "layerx_y" means the convolution of the y-th residual block in the x-th layer.

Datasets and metrics: We evaluate the performance of S-Conv operator and SGNet segmentation method on two public datasets:

  • NYUDv2 [nyud] : This dataset has 1,449 RGB images with corresponding depth maps and pixel-wise labels. 795 images are used for training, while 654 images are used for testing as in [split]. The 40-class settings are used for experiments.

  • SUN-RGBD [sunrgbd, sunrgbd2]: This dataset contains 10,335 RGBD images with semantic labels organized in 37 categories. 5,285 images are used for training, and 5,050 images are used for testing.

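We use three common metrics for evaluation, including pixel accuracy (Acc), mean accuracy (mAcc), and mean intersection over union (mIoU). The three metrics are defined as the following:

$$\mathrm{Acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}, \qquad \mathrm{mAcc} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i}, \qquad \mathrm{mIoU} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}},$$

where $n_{ij}$ is the amount of pixels which are predicted as class $j$ with ground truth $i$, $n_{cl}$ is the number of classes, and $t_i = \sum_j n_{ij}$ is the number of pixels whose ground truth class is $i$. The depth map is used as the default format of spatial information unless specified otherwise.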
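The three metrics can be computed from a confusion matrix, as in the following plain Python sketch. The function name `segmentation_metrics` is hypothetical, the standard definitions of Acc, mAcc, and mIoU are assumed, and classes that never occur in the ground truth are skipped when averaging (an assumption made here to avoid division by zero).

```python
def segmentation_metrics(conf):
    """Return (Acc, mAcc, mIoU) from a square confusion matrix.

    conf[i][j] is the number of pixels with ground truth class i
    that are predicted as class j.
    """
    n_cl = len(conf)
    gt = [sum(row) for row in conf]                                  # pixels per ground-truth class
    pred = [sum(conf[i][j] for i in range(n_cl)) for j in range(n_cl)]  # pixels per predicted class
    acc = sum(conf[i][i] for i in range(n_cl)) / sum(gt)
    valid = [i for i in range(n_cl) if gt[i] > 0]                    # skip absent classes (assumption)
    macc = sum(conf[i][i] / gt[i] for i in valid) / len(valid)
    miou = sum(conf[i][i] / (gt[i] + pred[i] - conf[i][i]) for i in valid) / len(valid)
    return acc, macc, miou


if __name__ == "__main__":
    conf = [[3, 1],
            [1, 5]]
    print(segmentation_metrics(conf))
```

For the two-class example, class 0 has an IoU of 3 / (4 + 4 - 3) and class 1 has an IoU of 5 / (6 + 6 - 5); mIoU is their average.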

Implementation details: We use dilated ResNet101 [resnet] pretrained on ImageNet [imagenet-c] as the backbone network for feature extraction following [deeplab], and the output stride is 16 by default. The whole system is implemented based on PyTorch. The SGD optimizer is adopted for training with the "poly" learning rate policy as used in [deeplab, deeplabv3plus]: $lr = lr_{init} \times (1 - \frac{iter}{total\_iter})^{0.9}$, where the initial learning rate is 5e-3 for NYUDv2 and 1e-3 for SUNRGBD, and the weight decay is 5e-4. This learning policy updates the learning rate every 40 epochs for NYUDv2 and every 10 epochs for SUNRGBD. We use the ReLU activation function, and the batch size is 8. Following [rdfnet], we employ general data augmentation strategies, including random scaling, random cropping, and random flipping. The crop size is . During testing, we down-sample the image to the training crop size, and its prediction map is up-sampled to the original size. We use cross-entropy loss on both datasets, and reweight [jiang2018rednet] the training loss of each class in SUNRGBD due to its extremely unbalanced label distribution. We train the network for 500 epochs on the NYUDv2 dataset and 200 epochs on the SUNRGBD dataset on two NVIDIA 1080Ti GPUs.
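The learning rate schedule can be sketched as follows. This is a hedged illustration, not the authors' training code: it assumes the "poly" policy of DeepLab with power 0.9, and it interprets "updates the learning rate every 40 epochs" as holding the rate fixed between updates. The function names `poly_lr` and `stepped_poly_lr` are hypothetical.

```python
def poly_lr(initial_lr, step, total_steps, power=0.9):
    """Polynomial ("poly") decay: initial_lr * (1 - step / total_steps) ** power."""
    return initial_lr * (1 - step / total_steps) ** power


def stepped_poly_lr(initial_lr, epoch, total_epochs, update_every, power=0.9):
    """Poly decay that is only re-evaluated every `update_every` epochs (assumed reading)."""
    last_update = (epoch // update_every) * update_every
    return poly_lr(initial_lr, last_update, total_epochs, power)


if __name__ == "__main__":
    # NYUDv2 settings from the text: initial rate 5e-3, 500 epochs, update every 40 epochs.
    for epoch in (0, 39, 40, 200, 480):
        print(epoch, stepped_poly_lr(5e-3, epoch, 500, 40))
```

For SUNRGBD the same sketch would be called with an initial rate of 1e-3, 200 epochs, and an update interval of 10 epochs.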

IV-A Analysis of S-Conv

We design ablation studies on NYUDv2 [nyud] dataset. The ResNet101 with a simple decoder and deep supervision is used as the baseline.

Replace convolution with S-Conv: We evaluate the effectiveness of S-Conv by replacing the conventional convolution (of 3 × 3 filter) in different layers. We first replace convolution in layer 3, then extend the explored rules to other layers. The FPS (Frames per Second) is tested on NVIDIA 1080Ti with input image size 425 × 560 following [dcnn]. The results are shown in Tab. I.

Model                     Acc.   mAcc.  mIoU
Baseline                  72.1   54.6   43.0
Baseline+OG               73.9   58.2   46.3
Baseline+SP+OG            75.2   60.0   48.4
Baseline+SP+WG            74.5   58.4   46.8
Baseline+SP+OG+WG         75.5   60.9   49.0
TABLE II: Ablation study of SGNet on NYUDv2 [nyud] dataset. DCV2: Deformable Convolution V2 [deformablev2], OG: Offset generator of S-Conv, WG: Weight generator of S-Conv, SP: Spatial projection of S-Conv.

Model                     Acc.   mAcc.  mIoU
Baseline                  72.1   54.6   43.0
Baseline+DCV2             73.0   56.1   44.5
Baseline+HHANet           73.5   56.8   45.4
Baseline+DAC              73.8   57.1   45.4
Baseline+SP+WG            74.5   58.4   46.8
Baseline+S-Conv(SGNet)    75.5   60.9   49.0
TABLE III: The comparison results on NYUDv2 test dataset. DAC: Depth-aware Convolution [dcnn]. SP: Spatial projector in S-Conv. WG: Weight generator in S-Conv.

Information               Acc.   mAcc.  mIoU
Depth                     75.5   60.9   49.0
RGB Feature               73.9   58.5   46.4
HHA                       75.7   60.8   48.9
3D coordinates            75.3   61.2   48.5
TABLE IV: Comparison of using different types of spatial information on NYUDv2 Dataset.

Network Backbone MS SI Acc. mAcc. mIoU fps param (M)
FCN [fcn] 2×VGG16 HHA 65.4 46.1 34.0 8 272.2
LSD-GF [local] 2×VGG16 HHA 71.9 60.7 45.9 - -
RefineNet [refinenet] ResNet152 - 73.6 58.9 46.5 16 129.5
RDFNet [rdfnet] 2×ResNet152 HHA 76.0 62.8 50.1 9 200.1
RDFNet [rdfnet] 2×ResNet101 HHA 75.6 62.2 49.1 11 169.1
CFNet [cascaded] 2×ResNet152 HHA - - 47.7 - -
3DGNN [qi20173d] VGG16 HHA - 55.2 42.0 5 47.2
D-CNN [dcnn] 2×VGG16 HHA - 56.3 43.9 13 92.0
D-CNN [dcnn] VGG16 Depth - 53.6 41.0 26 47.0
D-CNN [dcnn] 2×ResNet152 Depth - 61.1 48.4 - -
ACNet [hu2019acnet] 2×ResNet50 Depth - - 48.3 18 116.6
SGNet ResNet101 depth 75.5 60.9 49.0 28 58.3
SGNet-8s ResNet101 depth 76.4 62.7 50.3 12 58.3
SGNet ResNet101 depth 76.4 62.1 50.3 28 58.3
SGNet-8s ResNet101 depth 76.8 63.1 51.0 12 58.3
TABLE V: Comparison results on NYUDv2 test dataset. MS: Multi-scale test; SI: Spatial information. The input image size for forward speed testing is 425 × 560 using NVIDIA 1080Ti.

We can draw the following two conclusions from the results in Tab. I. 1) The inference speed of the baseline network is fast, but its performance is poor. Replacing convolution with S-Conv improves the results of the baseline network at the cost of slightly more parameters and computational time. 2) Apart from the first convolution in layer 3, whose stride is 2, replacing the later convolutions works better than replacing the earlier ones. The main reason for the exception would be that spatial information can better guide the down-sampling operation in the first convolution. Thus we choose to replace the first convolution and the last two convolutions of each layer with S-Conv. We generalize the rules found in layer 3 to other layers and achieve better results. The above experiments show that our S-Conv can significantly improve network performance with only a few additional parameters. It is worth noting that our network has no spatial information stream. The spatial information only affects the distribution and weight of the convolution kernel.

We also show the IoU improvement of S-Conv on most categories in Fig. 5. It’s obvious that our S-Conv improves IoU in most categories, especially for objects lacking representative texture information such as mirror, board and bathtub. There are also clear improvements for objects with rich spatial transformation, such as chairs and tables. This shows that our S-Conv can make good use of spatial information during the inference process.

Architecture ablation: To evaluate the effectiveness of each component in our proposed S-Conv, we design ablation studies. The results are shown in Tab. II. By default, we replace the first convolution and the last two convolutions of each layer according to Tab. I. We can see that the offset generator, spatial projection module, and weight generator of S-Conv all contribute to the improvement of the results.

Comparison with alternatives: Most methods [fusenet, jiang2018rednet, rdfnet, hu2019acnet] use a two-stream network to extract features from two different modalities and then combine them. Our S-Conv focuses on advancing the feature extraction process of the network by utilizing spatial information. Here we compare our S-Conv with two-stream network, deformable convolution [deformablev2, deform], and depth-aware convolution [dcnn]. We use a simple baseline which consists of a ResNet101 network with deep supervision and a simple decoder.

Fig. 6: FPS, mIoU, and the number of parameters of different methods on NYUDv2 test dataset with input image size 425 × 560 using NVIDIA 1080Ti. The radius of the circle corresponds to the number of parameters of the model. For the model marked with , its mIoU is obtained by multi-scale test, and its inference time is obtained by single-scale forward. The results of DCNet [dcnn] and 3DGNN [qi20173d] are from [dcnn]. Our SGNet can achieve real-time inference and competitive performance. SGNet-8s, whose output stride is 8, obtains better results than RDFNet, which uses more parameters and multi-scale test.

We add an additional ResNet101 network, called HHANet, to extract HHA features and fuse them with our baseline features at the final layer of a two-stream network. To compare with depth-aware convolution and deformable convolution, similar to SGNet, we replace the first convolution and the last two convolutions of each layer in the baseline. The results are shown in Tab. III. We find that our S-Conv achieves better results than the two-stream network, deformable convolution, and depth-aware convolution. This demonstrates that our S-Conv can effectively utilize spatial information. The baseline equipped with the weight generator also achieves better results than depth-aware convolution, indicating that learning weights from spatial information is necessary.

Fig. 7: The visualization of receptive field in S-Conv. Columns: RGB, Depth, Layer1_1, Layer1_2, Layer1_3, Layer2_1; rows: (a) to (e).

Fig. 8: The qualitative semantic segmentation comparison results on NYUDv2 test dataset. Columns: RGB, Depth, GT, Baseline, SGNet, SGNet-8s; rows: (a) to (e).

Network Backbone MS SI Acc. mAcc. mIoU fps param (M)
LSD-GF [local] 2×VGG16 HHA - 58.0 - - -
RefineNet [refinenet] ResNet152 - 80.6 58.5 45.9 16 129.5
3DGNN [qi20173d] VGG16 HHA - 57.0 45.9 5 47.2
RDFNet [rdfnet] 2×ResNet152 HHA 81.5 60.1 47.7 9 200.1
CFNet [cascaded] 2×ResNet152 HHA - - 48.1 - -
D-CNN [dcnn] 2×VGG16 HHA - 53.5 42.0 12.5 92.0
ACNet [hu2019acnet] 2×ResNet50 HHA - - 48.1 8 272.2
SGNet ResNet101 depth 81.0 59.6 47.1 28 58.3
SGNet ResNet101 depth 81.8 60.9 48.5 28 58.3
TABLE VI: Comparison results on SUNRGBD test dataset. MS: Multi-scale test; SI: Spatial information.

Fig. 9: The qualitative semantic segmentation comparison results on SUNRGBD test dataset. Recoverable panel titles: RGB, GT, Baseline; rows: (a) to (d).

Spatial information comparison: We also evaluate the impact of different formats of spatial information on S-Conv. The results are shown in Tab. IV. We can see that depth information leads to comparable results with HHA and 3D coordinates, and better results than intermediate RGB features which are used by deformable convolution [deform, deformablev2]. This shows the advantage of using spatial information for offset and weight generation over RGB features. However, converting depth to HHA is time-consuming [fusenet]. Hence 3D coordinates and depth map are more suitable for real-time segmentation using SGNet.

IV-B Comparison with state-of-the-art

We compare our SGNet with other state-of-the-art methods on NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets. The architecture of SGNet is shown in Fig. 4.

NYUDv2 dataset: The comparison results can be found in Tab. V and Fig. 6. The input image size for single-scale speed testing follows [dcnn]. We tested the single-scale speed of other methods under the same conditions on an NVIDIA 1080Ti. Note that some methods in Tab. V do not report parameter counts or release source code, so we list only the mIoU of these methods. Instead of using additional networks to extract spatial features, our SGNet achieves competitive performance and real-time inference. This benefits from S-Conv, which makes use of spatial information efficiently with only a small amount of extra parameters and computation cost. Moreover, S-Conv achieves good results without using HHA information, making it suitable for real-time tasks. With the same multi-scale test as RDFNet [rdfnet] and CFNet [cascaded], we exceed all methods based on complicated data fusion and achieve state-of-the-art performance using fewer parameters. This verifies the efficiency of S-Conv in utilizing spatial information. At the expense of slightly more inference time, changing the output stride of SGNet to 8 (denoted SGNet-8) yields better results than other methods and than RDFNet, which uses multi-scale test, HHA information, and two ResNet152 backbones. With the multi-scale test used by other methods, SGNet's performance can be further improved.
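Since mIoU is the metric reported for every method in the comparison, a short sketch of how it is commonly computed may help. This is a generic confusion-matrix formulation, not the authors' evaluation script; the function names and the `ignore_label` value of 255 are assumptions made for illustration.

```python
import numpy as np


def confusion_matrix(gt, pred, num_classes, ignore_label=255):
    """Accumulate a (num_classes, num_classes) matrix; rows are ground truth."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    # Drop pixels without a valid label (ignore_label is an assumed convention).
    valid = gt != ignore_label
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(cm):
    """Mean intersection-over-union over classes that appear in gt or pred."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union  # NaN for classes absent from both gt and pred
    return float(np.nanmean(iou))
```

In practice the confusion matrix would be summed over all test images before `mean_iou` is called once, so that each class contributes a single dataset-level IoU.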

SUNRGBD dataset: The comparison results on the SUNRGBD dataset are shown in Tab. VI. It is worth noting that some methods in Tab. V did not report results on the SUNRGBD dataset. The inference time and parameter number of models in Tab. VI are the same as those in Tab. V. Our SGNet runs in real time and achieves results competitive with models that do not have real-time performance.

IV-C Qualitative Performance

Visualization of receptive field in S-Conv: An appropriate receptive field is very important for scene recognition. We visualize the input-adaptive receptive field of SGNet in different layers generated by S-Conv. Specifically, we obtain the receptive field of each pixel by summing up the norms of its offsets during the S-Conv operation, then normalize each value to [0, 255] and visualize the result as a gray-scale image. The results are shown in Fig. 7. The brighter the pixel, the larger the receptive field. We observe that the receptive fields of different convolutions vary adaptively with the depth of the input image. For example, in layer1_1, the receptive field is inversely proportional to the depth, which is opposite to layer1_2. The combination of the adaptive receptive fields learned at each layer can help the network better resolve indoor scenes with complex spatial relations.
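The visualization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the offsets are taken to be stored as a `(K, 2, H, W)` array (K sampling points per kernel, each with a vertical and horizontal component), the norm is taken to be the L2 norm, and the mapping to [0, 255] is taken to be min-max scaling. None of these details are confirmed by the text, and `receptive_field_map` is a hypothetical name.

```python
import numpy as np


def receptive_field_map(offsets):
    """Turn S-Conv sampling offsets into a gray-scale receptive-field image.

    offsets: array of shape (K, 2, H, W), an assumed layout in which K is the
    number of kernel sampling points and axis 1 holds the (dy, dx) components.
    Returns an (H, W) uint8 image; brighter means a larger receptive field.
    """
    offsets = np.asarray(offsets, dtype=np.float64)
    # L2 norm of every sampling offset, shape (K, H, W).
    norms = np.linalg.norm(offsets, axis=1)
    # Sum over the K sampling points of each pixel, shape (H, W).
    rf = norms.sum(axis=0)
    lo, hi = rf.min(), rf.max()
    if hi == lo:
        # A constant map carries no contrast; return black to avoid dividing by zero.
        return np.zeros(rf.shape, dtype=np.uint8)
    # Min-max scaling to [0, 255] (assumed normalization).
    scaled = (rf - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)
```

The resulting array can be written directly as an 8-bit gray-scale image with any image library, one image per S-Conv layer.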

Qualitative comparison results: We show qualitative comparison results on the NYUDv2 test dataset in Fig. 8. In Fig. 8 (a), the bathtub and the wall have insufficient texture and cannot be easily distinguished by the baseline method. Some objects may have reflections, such as the table in Fig. 8 (b), which is also challenging for the baseline. SGNet, however, can recognize it well by incorporating spatial information with the help of S-Conv. The chairs in Fig. 8 (c, d) are hard to recognize from RGB data due to the low contrast and confusing texture, while they can be easily recovered by SGNet thanks to S-Conv. In addition, SGNet can recover an object's geometric shape well, as demonstrated by the chairs of Fig. 8 (e). We also show qualitative results on the SUNRGBD test dataset in Fig. 9. It can be seen that SGNet also achieves precise segmentation on SUNRGBD.

V Conclusion

In this paper, we propose a novel Spatial information guided Convolution (S-Conv) operator. Compared with conventional 2D convolution, it can adaptively adjust the convolution weights and distributions according to the input spatial information, resulting in better awareness of the geometric structure with only a small amount of additional parameters and computation cost. We also propose Spatial information Guided convolutional Network (SGNet) equipped with S-Conv, which yields real-time inference speed and achieves competitive results on NYUDv2 and SUNRGBD datasets for RGBD semantic segmentation. We also compare the performance of using different inputs to generate offsets, demonstrating the advantage of using spatial information over RGB features. Furthermore, we visualize the depth-adaptive receptive field in each layer to show its effectiveness. In the future, we will investigate the fusion of different modal information and the adaptive change of the S-Conv structure simultaneously, making these two approaches benefit each other. We will also explore the application of S-Conv in different fields, such as pose estimation and 3D object detection.