With the development of 3D sensing technologies, RGBD data with spatial information (depth, 3D coordinates) is easily accessible. As a result, RGBD semantic segmentation for high-level scene understanding becomes increasingly important, benefiting a wide range of applications such as autonomous driving [icnet], SLAM [bescos2018dynaslam], and robotics. Due to the effectiveness of Convolutional Neural Networks (CNNs) and the additional spatial information, recent advances demonstrate enhanced performance on indoor scene segmentation tasks [fcn, deeplab]. Nevertheless, a significant challenge remains, caused by the complexity of the environment and the extra effort required to handle spatial data, especially for applications that require real-time inference.
A common approach treats 3D spatial information as an additional input of the network to extract features, followed by combining these features with those of RGB images [rdfnet, fcn, eigen2015predicting, ma2017multi, fusenet, wang2016learning], as shown in Fig. 1 (a). This approach achieves promising results at the cost of significantly increasing the number of parameters and the computational time, making it unsuitable for real-time tasks. Meanwhile, several works [fcn, fusenet, gupta2014learning, lstmcf, rdfnet] encode raw spatial information into three channels (HHA) composed of horizontal disparity, height above ground, and the angle between the surface normal and the gravity direction. However, the conversion from raw data to HHA is also time-consuming [fusenet].
It is worth noting that indoor scenes have more complex spatial relations than outdoor scenes. This requires a stronger adaptive ability of the network to deal with geometric transformations. However, due to the fixed structure of the convolution kernel, the 2D convolution in the aforementioned methods cannot adapt well to spatial transformations or inherently adjust its receptive field, limiting the accuracy of semantic segmentation. Although this can be alleviated by revised pooling operations and prior data augmentation [deform, deformablev2], a better spatially adaptive sampling mechanism for conducting convolution is still desirable.
Moreover, the color and texture of objects in indoor scenes are not always representative. Instead, the geometric structure often plays a vital role in semantic segmentation. For example, to distinguish a fridge from a wall, the geometric structure is the primary cue due to their similar textures. However, such spatial information is ignored by 2D convolution on RGB data. The depth-aware convolution [dcnn] is proposed to address this problem. It forces pixels with depths similar to the center of the kernel to have higher weights than others. Nevertheless, this prior is handcrafted and may lead to sub-optimal results.
It can be seen that there is a contradiction between the fixed structure of 2D convolution and the varying spatial transformation, along with the efficiency bottleneck of separately processing RGB and spatial data. To overcome the limitations mentioned above, we propose a novel operation, called Spatial information guided Convolution (S-Conv), which adaptively changes according to the spatial information (see Fig. 1 (b)).
Specifically, this operation can generate convolution kernels with different sampling distributions adapting to spatial information, boosting the spatial adaptability and the receptive field regulation of the network. Furthermore, S-Conv establishes a link between the convolution weights and the underlying spatial relationship with their corresponding pixel, incorporating the geometric information into the convolution weights to better capture the spatial structure of the scene.
The proposed S-Conv is lightweight yet flexible and achieves significant performance improvements with only a few additional parameters and computation costs, making it suitable for real-time applications. We conduct extensive experiments to demonstrate the effectiveness and efficiency of S-Conv. We first design an ablation study and compare S-Conv with deformable convolution [deform, deformablev2] and depth-aware convolution [dcnn], exhibiting the advantages of S-Conv. We also verify the applicability of S-Conv to spatial transformations by testing its influence on different types of spatial data: depth, HHA, and 3D coordinates. We demonstrate that spatial information is more suitable for generating offsets than the RGB features used by deformable convolution [deform, deformablev2]. Finally, benefiting from the adaptability to spatial transformation and the effectiveness of perceiving spatial structure, our network equipped with S-Conv, named Spatial information Guided convolutional Network (SGNet), achieves high-quality results with real-time inference on the NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.
We highlight our contributions as follows:
We propose a novel S-Conv operator that can adaptively adjust its receptive field while effectively adapting to spatial transformations, and can perceive intricate geometric patterns at low cost.
Based on S-Conv, we propose a new network, SGNet, that achieves competitive RGBD segmentation performance in real time on the NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets.
II Related Work
II-A Semantic Segmentation
The recent advances of semantic segmentation benefit greatly from the development of convolutional neural networks (CNNs) [imagenet, deep]. FCN [fcn] is the pioneer of leveraging CNNs for semantic segmentation. It leads to convincing results and serves as the basic framework for many tasks. With the research efforts in the field, recent methods can be classified into two categories according to the network architecture: atrous convolution-based methods [deeplab, multi] and encoder-decoder based methods [refinenet, deeplabv3plus, segnet, deconvnet].
Atrous convolution: The standard approach relies on strided convolutions or pooling to reduce the output stride of the CNN backbone and enable a large receptive field. However, the resolution of the resulting feature map is reduced [deeplabv3], and many details are lost. One line of work exploits atrous convolution to alleviate this conflict by enlarging the receptive field while keeping the resolution of the feature map [deeplabv3, deeplab, deeplabv3plus, multi, denseaspp]. We use an atrous convolution based backbone in the proposed SGNet.
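The trade-off above can be made concrete: a kernel with dilation rate d spreads its taps d pixels apart, so the receptive field grows without adding parameters or reducing resolution. A minimal sketch (the function name is ours, for illustration only):

```python
def effective_kernel_size(k, d):
    """Extent covered by a k x k kernel with dilation rate d:
    the k taps are spaced d pixels apart along each axis,
    so the footprint is k + (k - 1) * (d - 1) pixels."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel with dilation 2 covers a 5x5 window at 3x3 cost.
```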
Encoder-decoder architecture: The other approach utilizes the encoder-decoder structure [deconvnet, segnet, refinenet, deeplabv3plus, psp], which learns a decoder to recover the prediction details gradually. DeconvNet [deconvnet] employs a series of deconvolutional layers to produce a high-resolution prediction. SegNet [segnet] achieves better results by using pooling indices in the encoder to guide the recovery process in the decoder. RefineNet [refinenet] fuses low-level features in the encoder with the decoder to refine the prediction. While this method can achieve more precise results, it requires longer inference time.
II-B RGBD Semantic Segmentation
How to effectively use the extra geometric information (depth, 3D coordinates) is the key to RGBD semantic segmentation. A number of works focus on extracting more information from geometry, which is treated as an additional input in [eigen2015predicting, ma2017multi, fusenet, wang2016learning, hu2019acnet]. A two-stream network is used in [ma2017multi, fusenet, wang2016learning, lstmcf, rdfnet] to process the RGB image and geometric information separately, combining the two results in the last layer. These methods achieve promising results at the expense of doubling the parameters and computational cost. 3D CNNs or 3D KNN graph networks are also used to take geometric information into account [song2017semantic, song2016deep, qi20173d]. Besides, various deep learning methods on 3D point clouds [pointnet, pointnet++, chen2019lsanet, spidercnn, spectral_graph_conv, pointcnn] have been explored. However, these methods consume a lot of memory and are computationally expensive. Another stream incorporates geometric information into explicit operations. Cheng et al. [local] use geometric information to build a feature affinity matrix for average pooling and up-pooling. Lin et al. [cascaded] split the image into different branches based on geometric information. Wang et al. [dcnn] propose Depth-aware CNN, which adds a depth prior to the convolutional weights. Although it improves feature extraction by convolution, the prior is handcrafted rather than learned from data. Other approaches, such as multi-task learning [jiao2019geometry, wang2015towards, hoffman2016learning, kokkinos2017ubernet, eigen2015predicting, Zhang_2019_CVPR] or spatial-temporal analysis [he2017std2p], are further used to improve segmentation accuracy. The proposed S-Conv aims to efficiently utilize spatial information to improve the feature extraction ability. It can significantly enhance performance with high efficiency, since it uses only a small number of additional parameters.
II-C Dynamic Structure in CNNs
Using dynamic structures to deal with varying inputs of CNNs has also been explored. Dilated convolution is used in [multi, deeplab] to increase the receptive field size without reducing feature map resolution. The spatial transformer network [stn] adapts to spatial transformations by warping the feature map. Dynamic filters [dynamicfilter] adaptively change their weights according to the input. Besides, self-attention based methods [selective, nonlocal, senet] generate attention maps from the intermediate feature map to adjust the response at each location or capture long-range contextual information adaptively. Some generalizations of convolution from 2D images to 3D point clouds have also been presented. PointCNN [pointcnn] is a seminal work that enables CNNs on sets of unordered 3D points. There are further improvements [chen2019lsanet, spidercnn, spectral_graph_conv] on utilizing neural networks to effectively extract deep features from 3D point sets. Deformable convolution [deform, deformablev2] can generate different sampling distributions with adaptive weights. Nevertheless, its input is an intermediate feature map rather than spatial information. Our work experimentally verifies in Sec. IV that better results can be obtained from spatial information.
III S-Conv and SGNet
In this section, we first elaborate on the details of Spatial information guided Convolution (S-Conv), which is a generalization of conventional RGB-based convolution by involving spatial information in the RGBD scenario. Then, we discuss the relation between our S-Conv and other approaches. Finally, we describe the network architecture of Spatial information Guided convolutional Network (SGNet), which is equipped with S-Conv for RGBD semantic segmentation.
III-A Spatial information guided Convolution
For completeness, we first review the conventional convolution operation. We use $\mathbf{T}[i]$ to denote a tensor indexed along its first dimension, and $\mathbf{T}[u, v]$ to indicate the two indices of the second and third dimensions. Non-scalar values are highlighted in bold for convenience.

For an input feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, we describe it in 2D for simplicity, thus $\mathbf{F} \in \mathbb{R}^{H \times W}$. Note that it is straightforward to extend to the 3D case. The conventional convolution applied on $\mathbf{F}$ to get $\mathbf{F}'$ can be formulated as the following:

$$\mathbf{F}'[p] = \sum_{p_i \in \mathcal{R}} \mathbf{W}[p_i] \cdot \mathbf{F}[p + p_i], \tag{1}$$

where $\mathbf{W}$ is the convolution kernel with size $K_h \times K_w$, $p$ is the 2D convolution center, and $\mathcal{R}$ denotes the kernel distribution around $p$. For a $3 \times 3$ convolution:

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}. \tag{2}$$

From the above equation, we can see that the convolution kernel is constant over $\mathbf{F}$. In other words, $\mathbf{W}$ and $\mathcal{R}$ are fixed, meaning the convolution is content-agnostic.
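The fixed sampling grid in Equ. (1) can be made explicit with a minimal single-channel sketch (function and variable names are ours, not from the paper's code):

```python
def conv2d(F, W):
    """Single-channel, stride-1, no-padding 2D convolution:
    out[p] = sum over p_i in R of W[p_i] * F[p + p_i].

    F: H x W feature map (list of lists), W: Kh x Kw kernel.
    The sampling grid R is fixed relative to the center p,
    which is what makes the operation content-agnostic.
    """
    Kh, Kw = len(W), len(W[0])
    H, Wd = len(F), len(F[0])
    # R: fixed offsets around the kernel center, e.g. for 3x3:
    # {(-1,-1), (-1,0), ..., (1,1)}
    R = [(i - Kh // 2, j - Kw // 2) for i in range(Kh) for j in range(Kw)]
    out = []
    for y in range(Kh // 2, H - Kh // 2):
        row = []
        for x in range(Kw // 2, Wd - Kw // 2):
            v = sum(W[dy + Kh // 2][dx + Kw // 2] * F[y + dy][x + dx]
                    for dy, dx in R)
            row.append(v)
        out.append(row)
    return out
```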
In the RGBD context, we want to involve 3D spatial information efficiently by using adaptive convolution kernels.
We first generate the offsets according to the spatial information, then use the spatial information corresponding to the given offsets to generate new spatially adaptive weights. Our S-Conv requires two inputs. One is the feature map $\mathbf{F}$, the same as conventional convolution. The other is the spatial information $\mathbf{S}$. In practice, $\mathbf{S}$ can be HHA ($\mathbf{S} \in \mathbb{R}^{3 \times H \times W}$), 3D coordinates ($\mathbf{S} \in \mathbb{R}^{3 \times H \times W}$), or depth ($\mathbf{S} \in \mathbb{R}^{1 \times H \times W}$). The method of encoding depth into 3D coordinates and HHA is the same as [qi20173d]. Note that the input spatial information is not included in the feature map.
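The paper follows [qi20173d] for encoding depth into 3D coordinates. As a reference point, such encodings are built on the standard pinhole back-projection sketched below; the function name and intrinsics (fx, fy, cx, cy) are illustrative assumptions, not values from the paper:

```python
def depth_to_xyz(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W) to per-pixel 3D camera
    coordinates with a pinhole model:
        X = (u - cx) * d / fx,  Y = (v - cy) * d / fy,  Z = d,
    where (u, v) is the pixel column/row and d the depth value."""
    H, W = len(depth), len(depth[0])
    return [[((u - cx) * depth[v][u] / fx,
              (v - cy) * depth[v][u] / fy,
              depth[v][u])
             for u in range(W)] for v in range(H)]
```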
As the first step of S-Conv, we project the input spatial information into a high-dimensional feature space, which can be expressed as:

$$\mathbf{S}' = \mathcal{G}(\mathbf{S}), \tag{3}$$

where $\mathcal{G}$ is a spatial transformation function, and $\mathbf{S}' \in \mathbb{R}^{C_s \times H \times W}$, which has a higher dimension than $\mathbf{S}$.
Then, we take the transformed spatial information into consideration, perceive its geometric structure, and generate the distribution (offsets of the pixel coordinates along the $x$ and $y$ axes) of convolution kernels at different centers $p$. This process can be expressed as:

$$\Delta = \mathcal{H}(\mathbf{S}'), \tag{4}$$

where $\Delta \in \mathbb{R}^{2 K_h K_w \times H' \times W'}$; $H'$ and $W'$ represent the feature map size after convolution, and $K_h$ and $K_w$ are the kernel size. $\mathcal{H}$ is a non-linear function which can be implemented by a series of convolutions.
After generating the distribution of the kernel for each possible center $p$ using $\Delta$, we boost its feature extraction ability by establishing the link between the geometric structure and the convolution weights. More specifically, we sample the geometric information of the pixels corresponding to the convolution kernel after shifting:

$$\mathbf{S}''[p] = \{\mathbf{S}'[p + p_i + \Delta[p, p_i]] \mid p_i \in \mathcal{R}\}, \tag{5}$$

where $\Delta[p, p_i]$ is the spatial distribution of the convolution kernel at $p$, and $\mathbf{S}''[p]$ is the spatial information corresponding to the convolution kernel centered on $p$ after the transformation.
Finally, we generate the convolution weights according to the sampled spatial information as the following:

$$\mathbf{W}_s[p] = \mathcal{F}(\mathbf{S}''[p]), \tag{6}$$

where $\mathcal{F}$ is a non-linear function that can be implemented as a series of convolution layers with non-linear activation functions, $\mathbf{W}$ indicates the convolution weights, which can be updated by the gradient descent algorithm, and $\mathbf{W}_s[p]$ denotes the spatially adaptive weights for the convolution centered at $p$.
Overall, our generalized S-Conv is formulated as:

$$\mathbf{F}'[p] = \sum_{p_i \in \mathcal{R}} \mathbf{W}_s[p, p_i] \cdot \mathbf{W}[p_i] \cdot \mathbf{F}[p + p_i + \Delta[p, p_i]]. \tag{7}$$

We can see that $\mathbf{W}_s$ establishes the correlation between spatial information and convolution weights. Moreover, the convolution kernel distribution is also relevant to the spatial information through $\Delta$. Note that $\mathbf{W}_s$ and $\Delta$ are not constant, meaning the generalized convolution is adaptive to different inputs. Also, as $\Delta$ is typically fractional, we use bilinear interpolation to compute $\mathbf{F}[p + p_i + \Delta[p, p_i]]$ as in [deform, stn]. The main formulae discussed above are labeled in Fig. 2.
III-B Relation to other approaches
2D convolution is a special case of the proposed S-Conv without geometric information and the corresponding offsets. Specifically, without geometric information, the center point and its neighboring points have a fixed positional relation in image space, and there is no varying spatial relation to capture. This also shows that our S-Conv is fully compatible with the 2D case. For the RGBD case, our S-Conv can extract features at the point level and is not limited to the discrete grid, thanks to the introduced spatially adaptive weights, as shown in Fig. 3. Deformable convolution [deform, deformablev2] also alleviates this problem by generating different sampling distributions with adaptive weights. Nevertheless, its distributions are inferred from 2D feature maps instead of from 3D spatial information as in our case. We verify through experiments that our method achieves better results than deformable convolution [deform, deformablev2]. Compared with 3D KNN graph-based methods, our S-Conv selects neighboring pixels adaptively instead of using a KNN graph, which is inflexible and computationally expensive.
III-C SGNet architecture
Our semantic segmentation network, called SGNet, is equipped with S-Conv and consists of a backbone and a decoder. The structure of SGNet is illustrated in Fig. 4. We use ResNet101 [resnet] as our backbone, and replace the first and the last two conventional convolutions ($3 \times 3$ filter) of each layer with our S-Conv. We add a series of convolutions to further extract features and then use bilinear up-sampling to produce the final segmentation probability map, which corresponds to the decoder part of SGNet. The $\mathcal{G}$ in Equ. (3) is implemented as three convolution layers, i.e., Conv(3, 64) - Conv(64, 64) - Conv(64, 64) with non-linear activation functions. The $\mathcal{H}$ in Equ. (4) and the $\mathcal{F}$ in Equ. (6) are implemented as one and two convolution layers, respectively. The S-Conv implementation is modified from deformable convolution [deformablev2, deform]. We add deep supervision between layer 3 and layer 4 to improve the network optimization capability, the same as PSPNet [psp].

IV Experiments
In this section, we first validate the performance of S-Conv by analyzing its usage in different layers, conducting an ablation study and comparison with its alternatives, and evaluating the results of using different input information to generate offsets. Then we compare our SGNet equipped with S-Conv with other state-of-the-art semantic segmentation methods on the NYUDv2 and SUNRGBD datasets. Finally, we visualize the depth-adaptive receptive field in each layer and the segmentation results, demonstrating that the proposed S-Conv can exploit spatial information well.
Datasets and metrics: We evaluate the performance of the S-Conv operator and the SGNet segmentation method on two public datasets:
NYUDv2 [nyud]: This dataset has 1,449 RGB images with corresponding depth maps and pixel-wise labels. 795 images are used for training and 654 images for testing, as in [split]. The 40-class setting is used for experiments.
SUN-RGBD [sunrgbd, sunrgbd2]: This dataset contains 10,335 RGBD images with semantic labels organized in 37 categories. 5,285 images are used for training, and 5,050 images are used for testing.
We use three common metrics for evaluation, including pixel accuracy (Acc), mean accuracy (mAcc), and mean intersection over union (mIoU). The three metrics are defined as the following:

$$\mathrm{Acc} = \frac{\sum_i n_{ii}}{\sum_i t_i}, \quad \mathrm{mAcc} = \frac{1}{N} \sum_i \frac{n_{ii}}{t_i}, \quad \mathrm{mIoU} = \frac{1}{N} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}},$$

where $n_{ij}$ is the amount of pixels which are predicted as class $j$ with ground truth $i$, $N$ is the number of classes, and $t_i = \sum_j n_{ij}$ is the number of pixels whose ground truth class is $i$. The depth map is used as the default format of spatial information unless specified otherwise.
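The three metrics can be computed directly from the confusion matrix; a minimal sketch (function and variable names are ours) following the definitions above:

```python
def segmentation_metrics(n, N):
    """Compute (Acc, mAcc, mIoU) from confusion matrix n, where
    n[i][j] is the number of pixels with ground truth class i
    predicted as class j, and N is the number of classes."""
    t = [sum(n[i]) for i in range(N)]                 # pixels with GT class i
    acc = sum(n[i][i] for i in range(N)) / sum(t)     # overall pixel accuracy
    macc = sum(n[i][i] / t[i] for i in range(N)) / N  # mean per-class accuracy
    # IoU per class: TP / (GT + predicted - TP)
    miou = sum(n[i][i] / (t[i] + sum(n[j][i] for j in range(N)) - n[i][i])
               for i in range(N)) / N
    return acc, macc, miou
```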
Implementation details: We use dilated ResNet101 [resnet] pretrained on ImageNet [imagenet-c] as the backbone network for feature extraction following [deeplab], and the output stride is 16 by default. The whole system is implemented in PyTorch. The SGD optimizer is adopted for training with the "poly" learning rate policy used in [deeplab, deeplabv3plus]: $lr = lr_{init} \cdot (1 - \frac{iter}{iter_{max}})^{power}$ with $power = 0.9$, where the initial learning rate is 5e-3 for NYUDv2 and 1e-3 for SUNRGBD, and the weight decay is 5e-4. The learning rate is updated every 40 epochs for NYUDv2 and every 10 epochs for SUNRGBD. We use the ReLU activation function, and the batch size is 8. Following [rdfnet], we employ general data augmentation strategies, including random scaling, random cropping, and random flipping. During testing, we down-sample the image to the training crop size, and the prediction map is up-sampled to the original size. We use cross-entropy loss on both datasets, and reweight [jiang2018rednet] the training loss of each class on SUNRGBD due to its extremely unbalanced label distribution. We train the network for 500 epochs on the NYUDv2 dataset and 200 epochs on the SUNRGBD dataset on two NVIDIA 1080Ti GPUs.
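The "poly" schedule referenced above decays the learning rate smoothly from its initial value to zero; a short sketch (the function name is ours, and the power of 0.9 is the value commonly used in [deeplab]):

```python
def poly_lr(base_lr, it, max_it, power=0.9):
    """'Poly' learning rate policy from [deeplab]:
    lr = base_lr * (1 - it / max_it) ** power,
    decaying from base_lr at it=0 to 0 at it=max_it."""
    return base_lr * (1 - it / max_it) ** power
```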
IV-A Analysis of S-Conv
We design ablation studies on NYUDv2 [nyud] dataset. The ResNet101 with a simple decoder and deep supervision is used as the baseline.
Replace convolution with S-Conv: We evaluate the effectiveness of S-Conv by replacing the conventional convolution ($3 \times 3$ filter) in different layers. We first replace convolutions in layer 3, then extend the explored rules to other layers. The FPS (frames per second) is tested on an NVIDIA 1080Ti with the input image size following [dcnn]. The results are shown in Tab. I.
We can draw the following two conclusions from the results in Tab. I. 1) The inference speed of the baseline network is fast, but its performance is poor. Replacing convolution with S-Conv improves the results of the baseline network with only slightly more parameters and computational time. 2) Apart from the first convolution in layer 3, whose stride is 2, replacing the later convolutions is more effective. For the first convolution, the main reason would be that spatial information can better guide the down-sampling operation. Thus we choose to replace the first convolution and the last two convolutions of each layer with S-Conv. We generalize the rules found in layer 3 to other layers and achieve better results. The above experiments show that our S-Conv can significantly improve network performance with only a few additional parameters. It is worth noting that our network has no separate spatial information stream: the spatial information only affects the distribution and weights of the convolution kernels.
We also show the per-category IoU improvement of S-Conv in Fig. 5. Our S-Conv improves IoU in most categories, especially for objects lacking representative texture information such as mirror, board, and bathtub. There are also clear improvements for objects with rich spatial transformations, such as chairs and tables. This shows that our S-Conv can make good use of spatial information during inference.
Architecture ablation: To evaluate the effectiveness of each component in our proposed S-Conv, we design ablation studies. The results are shown in Tab. II. By default, we replace the first convolution and the last two convolutions of each layer according to Tab. I. We can see that the offset generator, spatial projection module, and weight generator of S-Conv all contribute to the improvement of the results.
Comparison with alternatives: Most methods [fusenet, jiang2018rednet, rdfnet, hu2019acnet] use a two-stream network to extract features from two different modalities and then combine them. Our S-Conv focuses on advancing the feature extraction process of the network by utilizing spatial information. Here we compare our S-Conv with the two-stream network, deformable convolution [deformablev2, deform], and depth-aware convolution [dcnn]. We use a simple baseline consisting of a ResNet101 network with deep supervision and a simple decoder.
For the two-stream setting, we add an additional ResNet101 network, called HHANet, to extract HHA features and fuse them with our baseline features at the final layer. To compare with depth-aware convolution and deformable convolution, similar to SGNet, we replace the first convolution and the last two convolutions of each layer in the baseline. The results are shown in Tab. III. We find that our S-Conv achieves better results than the two-stream network, deformable convolution, and depth-aware convolution. This demonstrates that our S-Conv can effectively utilize spatial information. The baseline equipped with only the weight generator also achieves better results than depth-aware convolution, indicating that learning weights from spatial information is necessary.
Spatial information comparison: We also evaluate the impact of different formats of spatial information on S-Conv. The results are shown in Tab. IV. We can see that depth information leads to results comparable to HHA and 3D coordinates, and better results than the intermediate RGB features used by deformable convolution [deform, deformablev2]. This shows the advantage of using spatial information for offset and weight generation over RGB features. However, converting depth to HHA is time-consuming [fusenet]. Hence 3D coordinates and the depth map are more suitable for real-time segmentation with SGNet.
IV-B Comparison with state-of-the-art
We compare our SGNet with other state-of-the-art methods on NYUDv2 [nyud] and SUNRGBD [sunrgbd, sunrgbd2] datasets. The architecture of SGNet is shown in Fig. 4.
NYUDv2 dataset: The comparison results can be found in Tab. V and Fig. 6. The input image size for single-scale speed testing follows [dcnn]. We tested the single-scale speed of other methods under the same conditions using an NVIDIA 1080Ti. Note that some methods in Tab. V do not report parameter quantities or release source code, so we only list their mIoU. Instead of using additional networks to extract spatial features, our SGNet achieves competitive performance and real-time inference. This benefits from S-Conv, which makes use of spatial information efficiently with only a small amount of extra parameters and computation. Moreover, our S-Conv achieves good results without using HHA information, making it suitable for real-time tasks. Using the same multi-scale test as RDFNet [rdfnet] and CFNet [cascaded], we exceed all methods based on complicated data fusion and achieve state-of-the-art performance with fewer parameters, which verifies the efficiency of our S-Conv in utilizing spatial information. At the expense of slightly more inference time, changing the output stride of SGNet to 8 (denoted SGNet-8) yields better results than other methods, including RDFNet, which uses a multi-scale test, HHA information, and two ResNet152 backbones. With the multi-scale test adopted by other methods, SGNet's performance can be further improved.
SUNRGBD dataset: The comparison results on the SUNRGBD dataset are shown in Tab. VI. It is worth noting that some methods in Tab. V did not report results on the SUNRGBD dataset. The inference times and parameter numbers of the models in Tab. VI are the same as those in Tab. V. Our SGNet achieves competitive results in real time compared with models that do not have real-time performance.
IV-C Qualitative Performance
Visualization of receptive field in S-Conv: An appropriate receptive field is very important for scene recognition. We visualize the input-adaptive receptive field of SGNet in different layers generated by S-Conv. Specifically, we estimate the receptive field of each pixel by summing up the norms of its offsets during the S-Conv operation; we then normalize each value to [0, 255] and visualize the result as a gray-scale image. The results are shown in Fig. 7. The brighter the pixel, the larger the receptive field. We observe that the receptive fields of different convolutions vary adaptively with the depth of the input image. For example, in layer1_1 the receptive field is inversely proportional to depth, which is the opposite of layer1_2. The combination of the adaptive receptive fields learned at each layer can help the network better parse indoor scenes with complex spatial relations.
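The visualization procedure described above can be sketched as follows; the function name and data layout are ours, assuming per-pixel lists of (dy, dx) offsets collected from an S-Conv layer:

```python
def offset_magnitude_map(offsets):
    """Proxy receptive-field map: offsets[y][x] is the list of
    (dy, dx) sampling offsets at pixel (y, x). Sum the offset
    norms per pixel, then normalize the map to [0, 255] for
    display as a gray-scale image (brighter = larger field)."""
    H, W = len(offsets), len(offsets[0])
    mag = [[sum((dy * dy + dx * dx) ** 0.5 for dy, dx in offsets[y][x])
            for x in range(W)] for y in range(H)]
    lo = min(min(row) for row in mag)
    hi = max(max(row) for row in mag)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[round((v - lo) * scale) for v in row] for row in mag]
```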
Qualitative comparison results: We show qualitative comparison results on the NYUDv2 test set in Fig. 8. For the visual results in Fig. 8 (a), the bathtub and the wall have insufficient texture and cannot be easily distinguished by the baseline method. Some objects may have reflections, such as the table in Fig. 8 (b), which is also challenging for the baseline. SGNet, however, can recognize them well by incorporating spatial information with the help of S-Conv. The chairs in Fig. 8 (c, d) are hard to recognize from RGB data due to the low contrast and confusing texture, while they can be easily recovered by SGNet, benefiting from the equipped S-Conv. In the meantime, SGNet can recover the object's geometric shape nicely, as demonstrated by the chairs in Fig. 8 (e). We also show qualitative results on the SUNRGBD test set in Fig. 9. It can be seen that our SGNet also achieves precise segmentation on SUNRGBD.
V Conclusion
In this paper, we propose a novel Spatial information guided Convolution (S-Conv) operator. Compared with conventional 2D convolution, it can adaptively adjust the convolution weights and distributions according to the input spatial information, resulting in better awareness of the geometric structure with only a few additional parameters and computation cost. We also propose the Spatial information Guided convolutional Network (SGNet), equipped with S-Conv, which yields real-time inference speed and achieves competitive results on the NYUDv2 and SUNRGBD datasets for RGBD semantic segmentation. We also compare the performance of using different inputs to generate offsets, demonstrating the advantage of using spatial information over RGB features. Furthermore, we visualize the depth-adaptive receptive field in each layer to show its effectiveness. In the future, we will investigate the fusion of different modal information and the adaptive change of the S-Conv structure simultaneously, making the two approaches benefit each other. We will also explore applications of S-Conv in other fields, such as pose estimation and 3D object detection.