Log In Sign Up

3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.


page 4

page 10


Submanifold Sparse Convolutional Networks

Convolutional network are the de-facto standard for analysing spatio-tem...

Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation

Convolutional neural networks are designed for dense data, but vision da...

Putting 3D Spatially Sparse Networks on a Diet

3D neural networks have become prevalent for many 3D vision tasks includ...

Tangent Convolutions for Dense Prediction in 3D

We present an approach to semantic scene analysis using deep convolution...

Complete Label: A Domain Adaptation Approach to Semantic Segmentation of LiDAR Point Clouds

We study an unsupervised domain adaptation problem for the semantic labe...

LatticeNet: Fast Spatio-Temporal Point Cloud Segmentation Using Permutohedral Lattices

Deep convolutional neural networks (CNNs) have shown outstanding perform...

Making a Case for 3D Convolutions for Object Segmentation in Videos

The task of object segmentation in videos is usually accomplished by pro...

1 Introduction

Convolutional networks (ConvNets) constitute the state-of-the art method for a wide range of tasks that involve the analysis of data with spatial and/or temporal structure, such as photos, videos, or 3D surface models. While such data frequently comprises a densely populated (2D or 3D) grid, other datasets are naturally sparse. For instance, handwriting is made up of one-dimensional lines in two-dimensional space, pictures made by RGB-D cameras are three-dimensional point clouds, and polygonal mesh models form two-dimensional surfaces in 3D space.

The curse of dimensionality applies, in particular, to data that lives on grids that have three or more dimensions: the number of points on the grid grows exponentially with its dimensionality. In such scenarios, it becomes increasingly important to exploit data sparsity whenever possible in order to reduce the computational resources needed for data processing. Indeed, exploiting sparsity is paramount when analyzing,

e.g., RGB-D videos which are sparsely populated 4D structures.

Figure 1: Examples of 3D point clouds of objects from the ShapeNet part-segmentation challenge [23]. The colors of the points represent the part labels.

Traditional convolutional network implementations are optimized for data that lives on densely populated grids, and cannot process sparse data efficiently. More recently, a number of convolutional network implementations have been presented that are tailored to work efficiently on sparse data [4, 18, 3]. Mathematically, some of these implementations are identical to regular convolutional networks, but they require fewer computational resources in terms of FLOPs and/or memory [4, 3]. Prior work uses a sparse version of the im2col operation that restricts computation and storage to “active” sites [4], or uses the voting algorithm from [22] to prune unnecessary multiplications by zeros [3]. OctNets [18] modify the convolution operator to produce “averaged” hidden states in parts of the grid that are outside the region of interest.

One of the downsides of prior sparse implementations of convolutional networks is that they “dilate” the sparse data in every layer by applying “full” convolutions. In this work, we show that it is possible to create convolutional networks that keep the same level of sparsity throughout the network. To this end, we develop a new implementation for performing sparse convolutions (SCs) and introduce a novel convolution operator termed submanifold sparse convolution (SSC).111These operators appeared earlier in an unpublished technical report [5]. We use these operators as the basis for submanifold sparse convolutional networks (SSCNs) that are optimized for efficient semantic segmentation of 3D point clouds, e.g., on the examples shown in Figure 1.

In Table 1, we present the performance of SSCNs on the test set of a recent part-based segmentation competition [23] and compare it to some of the top-performing entries in the competition: SSCNs outperform all of these entries. Source code for our library is publicly available online222

Method Average IoU
NN matching with Chamfer distance 77.57%
Synchronized Spectral CNN [11] 84.74%
Pd-Network (extension of Kd-Network [10]) 85.49%
Densely Connected PointNet (extension of [17]) 84.32%
PointCNN 82.29%
Submanifold SparseConvNet (Section 6.5) 85.98%
Table 1: Average intersection-over-union (IoU) of six approaches on the test set of a recent part-based segmentation competition on ShapeNet [23]. Higher is better. Our SSCNs outperform all alternative approaches.

2 Related Work

Our work primarily builds upon previous literature on sparse convolutional networks [3, 4], and image segmentation using dense convolutional networks [14, 19, 24]. Examples of applications of dense 3D convolutions on volumetric data include classification [15] and segmentation [2]; these methods suffer from high memory usage and slow inference, limiting the size of models that can be used.

Methods for processing 3D point clouds without voxelization have also been developed [10, 17]. This may seem surprising given the dominance of ConvNets for processing 2D inputs; it is likely due to the computational obstacles involved in using dense 3D convolutional networks.

Prior work on sparse convolutions implements a convolutional operator that increases the number of active sites with each layer [3, 4]. In [4], all sites that have at least one “active” input site are considered as active. In [3], a greater degree of sparsity is attained after

the convolution has been calculated by using ReLUs and a special loss function. In contrast, we introduce

submanifold sparse convolutions that fix the location of active sites so that the sparsity remains unchanged for many layers. We show that this makes it practical to train deep and efficient networks similar to VGG networks [20] or ResNets [7], and that it is well suited for the task of point-wise semantic segmentation.

OctNets [18] are an alternative form of sparse convolution. Sparse voxels are stored in oct-trees: a data structure in which the grid cube is progressively subdivided into smaller sub-cubes until the sub-cubes are either empty or contain a single active site. OctNet operates on the surfaces of empty regions, so a size- OctNet convolution on an empty cube of size requires 23% of the calculation of a dense 3D convolution. Conversely, submanifold convolutions require no calculations in empty regions.

Another approach to segmenting point clouds is to avoid voxelizing the input, which may lead to a loss of information due to the finite resolution. This can be done by either using carefully selected data structures such as Kd-trees [10], or by directly operating on the unordered set of points [17]. Kd-Networks [10] build a Kd-tree by recursively partitioning the space along the axis of largest variation until each leaf of the tree represents one input point. This takes O time for input points. PointNet [17]

uses a pooling operation to produce a global feature vector.

Fully convolutional networks (FCNs) were proposed in [14] as a method of 2D image segmentation; FCNs make use of information at multiple scales to preserve low-level information to accurately delineate object boundaries. U-Nets [19] extend FCNs by using convolutions to more accurately merge together the information from the different scales before the final classification stage; see Figure 4.

3 Spatial sparsity for ConvNets

We define a -dimensional convolutional network as a network that takes as input a

-dimensional tensor: the input tensor contains

spatio-temporal dimensions (such as length, width, height, time, etc.) and one additional feature-space dimension (e.g., RGB color channels or surface normal vectors). The input corresponds to a -dimensional grid of sites, each of which is associated with a feature vector. We define a site in the input to be active if any element in the feature vector is not in its ground state, e.g., if it is non-zero333Note that the ground state does not necessarily have to be zero, in particular, when convolutions with a bias term are used.. In many problems, thresholding may be used to eliminate input sites at which the feature vector is within a small distance from the ground state. Note that even though the input tensor is -dimensional, activity is a -dimensional phenomenon: entire lines along the feature dimension are either active or inactive.

Similarly, the hidden layers of a -dimensional convolutional network are represented by -dimensional grids of feature-space vectors. When propagating the input data through the network, a site in a hidden layer is active if any of the sites in the layer that it takes as input is active. (Note that when using size-3 convolutions, each site is connected to sites in the hidden layer below.) Activity in a hidden layer thus follows an inductive definition in which each layer determines the set of active states in the next layer. In each hidden layer, inactive sites all have the same feature vector: the one corresponding to the ground state. The value of the ground state only needs to be calculated once per forward pass at training time, and only once for all forward passes at test time. This allows for substantial savings in computational and memory requirements; the exact savings depend on data sparsity and network depth.

However, we argue that the framework described above is unduly restrictive, in particular, because the convolution operation has not been modified to accommodate the sparsity of the input data. If the input data contains a single active site, then after applying a convolution, there will be active sites. Applying a second convolution of the same size will yield active sites, and so on. This rapid growth of the number of active sites is a poor prospect when implementing modern convolutional network architectures that comprise tens or even hundreds of convolutional layers, such as VGG networks, ResNets, or DenseNets [8, 9, 20].

Of course, convolutional networks are not often applied to inputs that only have a single active site, but the aforementioned dilation problems are equally problematic when the input data comprises one-dimensional curves in spaces with two or more dimensions, or two-dimensional surfaces in three or more dimensions. We refer to this problem as the “submanifold dilation problem”, which is illustrated in Figure 2. The figure illustrates that even when we apply small convolutions on this grid, the sparsity of the grid rapidly disappears.

Figure 2: Example of “submanifold” dilation. Left: Original curve. Middle: Result of applying a regular convolution with weights . Right: Result of applying the same convolution again. The example shows that regular convolutions substantially reduce the sparsity of the features with each convolutional layer.

4 Submanifold Convolutional Networks

We explore a simple solution to the submanifold dilation problem that restricts the output of the convolution only to the set of active input points. A potential problem of this approach is that hidden layers in the network may not receive all the information they require to classify the input data: in particular, two neighboring connected components are treated completely independently. We resolve this problem by using convolutional networks that incorporate some form of pooling, or use strided convolutions. Such operations are essential in the sparse convolutional networks

444By “sparse convolutional networks”, we mean networks designed to operate on spatially-sparse input data. We do not mean networks that have sparse parameter matrices [12, 13]. we investigate, as they allow information to flow between disconnected components in the input. The closer the components are spatially, the fewer strided operations are necessary for the components to “communicate” in the intermediate representations.

4.1 Sparse Convolutional Operations

We define a sparse convolution SC( with input feature planes, output feature planes, a filter size of , and stride . An SC convolution computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size . If the input has size then the output will have size . Unlike a regular convolution (and the sparse convolution from [4]), an SC convolution discards the ground state for non-active sites by assuming that the input from those sites is zero. This seemingly small change to the convolution brings computational benefits in practice.

Figure 3: SSC receptive field centered at different active spatial locations. Active locations in the field are shown in green. Red locations are ignored by SSC so the pattern of active locations remains unchanged.

Submanifold sparse convolution.

Next, we define a second type of sparse convolution, which forms the main contribution of this paper. Let

denote an odd number. We define a submanifold sparse convolution SSC

as a modified SC

convolution. First, we pad the input with

zeros on each side, so that the output will have the same size as the input. Next, we restrict an output site to be active if and only if the site at the corresponding site in the input is active (i.e., if the central site in the receptive field is active). Whenever an output site is determined to be active, its output feature vector is computed by the SSC convolution; see Figure 3 for an illustration. Table 2 shows the computational and memory requirements of a regular convolution (C) and our SC and SSC convolutions.

Submanifold sparse convolutions are similar to OctNets [18] in that they preserve the sparsity structure. However, unlike OctNets, empty space imposes no computational or memory overhead in the implementation of submanifold sparse convolutions.

Other operators.

To construct convolutional networks using SC and SSC, we also need activation functions, batch normalization, and pooling. Activation functions are defined as usual, but are restricted to the set of active sites. Similarly, we define batch normalization in terms of regular batch normalization applied over the set of active sites. Max-pooling MP

and average-pooling AP operations are defined as a variant of SC. MP takes the maximum of the zero vector and the input feature vectors in the receptive field. AP calculates times the sum of the active input vectors. We also define a deconvolution [25] operation DC as an inverse of the SC convolution. The set of active output sites from a DC convolution is exactly the same as the set of input active sites to the corresponding SC convolution: the connections between input and output sites are simply inverted.

Active Type C SC SSC
No, FLOPs 0
Memory 0
No, FLOPs 0 0
Memory 0 0
Table 2: Computational and memory requirements of three convolutional operations: regular convolution (C), sparse convolution (SC), and submanifold sparse convolution (SSC). We consider convolutions with size and padding at a single location in dimensions. Notation: is the number of active inputs to the spatial location, the number of input feature planes, and the number of output feature planes.

4.2 Implementation

To implement (S)SC convolutions efficiently, we store the state of a input/hidden layer in two parts: a hash table555 and a matrix. The matrix has size and contains one row for each of the active sites. The hash table contains (location, row) pairs for all active sites: the location is a tuple of integer coordinates, and the row number indicates the corresponding row in the feature matrix. Given a convolution with filter size , let denote the spatial size of the convolutional filter. Define a rule book to be a collection of integer matrices each with two columns. To implement an SC( convolution, we:

  1. Iterate once through the input hash-table. We build the output hash table and rule book on-the-fly by iterating over points in the input layers, and all the points in the output layer that can see them. When an output site is visited for the first time, a new entry is created in the output hash table. For each active input point located at point in the receptive field of an output point , add a row (input-hash, output-hash) to element of the rule book.

  2. Initialize the output matrix to all zeros. For each , there is a parameter matrix with size . For each row in , multiply the -th row of the input feature matrix by and add it to the -th row of the the output feature matrix. This can be implemented very efficiently on GPUs because it is a matrix-matrix multiply-add operation.

To implement an SSC convolution, we re-use the input hash table for the output, and construct an appropriate rule book. Note that because the sparsity pattern does not change, the same rule book can be re-used in VGG/ResNet/DenseNet networks until a pooling or subsampling layer is encountered in the architecture.

If there are active points in the input layer, the cost of building the input hash-table is . For FCN and U-Net networks, assuming the number of active sites reduces by a multiplicative factor with each downsampling operation, the cost of building all the hash-tables and rule-books is also regardless of the depth of the network.

The above implementation differs from [4] in that the cost of calculating an output site is proportional to the number of active inputs, rather than the size of the receptive field. For SC convolutions this is similar to the voting algorithm [22, 3] - the filter weights are never multiplied with inactive input locations - but for SSC convolutions, this implementation is less computationally intensive than voting as there is no interaction between active input locations and inactive neighboring output locations.

5 Submanifold FCNs and U-Nets
for Semantic Segmentation

Three-dimensional semantic segmentation involves the segmentation of 3D objects or scenes represented as point clouds into their constituent parts; each point in the input cloud must be assigned a part label. As substantial progress has been made in the segmentation of 2D images using convolutional neural networks 

[14, 19, 24], interest in the problem of 3D semantic segmentation has grown recently. This interest was fueled, in particular, by the release of a new dataset for the part-based segmentation of 3D objects, and an associated competition [23].

We use a sparse voxelized input representation similar to [4, 3], and a combination of SSC convolutions and strided SC convolutions to construct versions of the popular FCN [14] and U-Net [2] architectures. The resulting convolutional network architectures are illustrated in Figure 4; see the associated caption for details. We refer to these networks as submanifold sparse convolutional networks (SSCNs), because they process low-dimensional data living in a space of higher dimensionality.666We note that this is a slight abuse of the term “submanifold”. We emphasize that the data on which these networks are applied may contain multiple connected components, and even a mixture of 1D and 2D objects embedded in 3D space.

The basic building block for our models are “pre-activated” SSC

convolutions. Each convolution is preceded by batch normalization and a ReLU non-linearity. In addition to FCN and U-Nets with standard convolutional layers, we also experiment with variants of these networks that use pre-activated residual blocks

[8] that contain two SSC

convolutions. Herein, the residual connections are identity functions: the number of input and output features are equal. Whenever the networks reduce the spatial scale by a factor of two, we use SC

convolutions. Our implementation of FCNs upsamples feature maps to their original resolution rather than performing deconvolutions using residual blocks. This substantially reduces the number of parameters and the number of multiplication-addition operations of the FCN.

(a) Submanifold sparse FCN.
(b) Submanifold sparse U-Net.
Figure 4: Illustrations of our submanifold sparse FCN (a) and U-Net (b) architectures. Dark blue boxes represents one or more “pre-activated” SSC

convolutions, which may have residual connections. Red boxes represent size-2, stride-2 downsampling convolutions; green deconvolutions “invert” these convolutions. Purple upsampling boxes perform “nearest-neighbor” upsampling. The final linear and softmax layers are applied separately on each active input voxel.

6 Experiments

In this section, we perform experiments with SSCNs on the ShapeNet competition dataset from [23]. We compare SSCNs against three strong baseline models in terms of performance and computational cost. Specifically, we consider shape contexts [1], dense 3D convolutional networks, and 2D multi-view convolutional networks [21] as baselines. Throughout our experimental evaluation, we focus on the trade-off between segmentation accuracy and computational efficiency measured in FLOPs777We ignore the FLOPs from the final classification layer.. In a second set of experiments, we also study SSCN performance on the NYU Depth (v2) dataset [16].

6.1 Dataset

The ShapeNet segmentation dataset [23] comprises 16 different object categories (plane, chair, hat, etc.), each of which is composed of up to 6 different parts. For instance, a “plane” is segmented into “wings”, “engine”, “body”, and “tail”. Across all object categories, the dataset contains a total of 50 different object part classes. Each object is represented as a 3D point cloud that was obtained by sampling points uniformly from the surface of the underlying CAD model. Each point cloud contains between and points. To increase the size of the validation set, we re-split the training and validation sets using the first bit of the MD5 hash of the point cloud files to obtain a training set with 6,955 examples and a validation set with 7,052 examples. The test set contains 2,874 examples.

In the original dataset, the objects are axis-aligned: for instance, rockets always point along the -axis. To make the problem more challenging, we perform a random 3D translation and rotation on each point cloud before classifying it. The results in Table 3 show that removing the alignment, indeed, makes the segmentation task more challenging.

View type IoU accuracy
Aligned 63.5%
Random pose 47.8%
Table 3: Accuracy of segmentation classifiers based on shape-context features on (1) the original ShapeNet dataset and (2) a variant of the dataset in which objects are randomly rotated. The results show that removing the alignment of the ShapeNet objects via random 3D rotations makes the segmentation problem more challenging.

To evaluate the accuracy of our models, we adopt the intersection-over-union (IoU) metric of [23]. The IoU is computed for each part per object category and averaged over parts and examples for the category to produce a “per-category IoU”. This way of averaging the IoU scores rewards models that make accurate predictions even for object-parts that are very small: small parts have the same weight in the final accuracy measure as larger parts. The final accuracy measure is obtained by taking a weighted average of the per-category IoUs, using the fraction of training examples per category as weights.

6.2 Details of Experimental Setup

In all experiments, the same data pre-processing procedure is used. Specifically, each point cloud is centered and re-scaled to fit into a sphere with diameter ; scale determines the size of the voxelized representation. We use in our experiments. At scale , the voxels are approximately 99% sparse. In experiments with dense convolutional networks, we place the sphere randomly in a grid of size . For submanifold sparse convolutional networks, we place the sphere randomly in a grid of size . To voxelize the point cloud, we measure the number of points per voxel and normalize them so that non-empty voxels have a mean density of one.

Networks are trained using the same optimization hyperparameters, unless otherwise noted. We use stochastic gradient descent (SGD) with a momentum of 0.9, Nesterov updates, and

weight decay of . The initial learning rate is set to , and the learning rate is decayed by a factor of

after every epoch. We train all networks for

epochs using a batch size of . We train a single network on all object categories jointly using a multi-class negative log-likelihood loss function over all part labels.

For our SSCNs, we experiment with two types of network architectures. The first architecture (C3) operates on a single spatial resolution by stacking SSC convolutions; we use with , , , or filters per layer, and , , or layers. The second architecture type comprises FCNs and U-Nets with three layers of downsampling. These networks have , , , or filters in the first layer, and double the number of filters each time the data is downsampled. For the convolutional blocks in these networks, we use stacks of , , or SSC convolutions, or stacks of , , or residual blocks.

Details on testing.

At test time, we only compute softmax probabilities for part labels that actually appear in the object that is being segmented,

i.e., we assume the models know the category of the object they are segmenting. Softmax probabilities for irrelevant part classes are set to zero (and the distribution over part labels is re-normalized).

For each of the three network types (C3, FCN, and U-Net), we train a range of models with varying sizes, as described above, and monitor their accuracy on the validation set. For each network type, we select the networks that correspond to local maxima on the accuracy vs. FLOPs curve, and report test set accuracies for those networks. Akin to multi-crop testing that is commonly used with 2D convolutional networks, we measure test accuracies using “multi-view” testing. In particular, we generate different views of the object with by randomly rotating them, and average the model predictions for each point over the different views of the object.

6.3 Baselines

In addition to SSCNs, we consider three baseline models in our experiments: (1) shape contexts [1]; (2) dense 3D convolutional networks; and (3) 2D multi-view convolutional networks [21]. Details of the four baseline models are described separately below.

Shape contexts.

Inspired by [1], we define a voxelized shape context vector. Specifically, we define a ShapeContext layer as a special case of the SSC submanifold convolution operator: we set the weight matrix of the operator to be a identity matrix so that it accumulates the voxel intensities in its neighborhood.

We scale the data using average pooling with sizes 2, 4, 8 and 16 to create four additional views. Combined this gives each voxel a 135-dimensional feature vector. The feature vector is then fed into a non-convolutional multi-layer perceptron (MLP) with two hidden layers, followed by a 50-class softmax classifier. The MLPs have 32, 64, 128, 256, or 512 units per layer. At test time, we use multi-view testing with


(a) Comparison with baseline methods.
(b) Comparison between architectures (see 6.2).
(c) SSCN with different scales, .
Figure 5: Average interaction-over-union (IoU) on the test set of SSCNs trained for 3D semantic segmentation on the ShapeNet competition data set (higher is better).

Dense 3D convolutional networks.

For dense 3D convolutional networks, we simply considered dense versions of the SSCN networks. Due to computational constraints, we restricted the FCN and U-Net convolutional blocks to a single C3-layer. We trained some of the models with a reduced learning rate due to numerical instabilities we observed during training. Again, we use multi-view testing.

Convolutional networks on multi-view 2D projections.

This baseline model discards the inherent 3D structure of the data by projecting the point cloud into a two-dimensional view by assuming infinite focal length, applying a 2D convolutional network on this projection, and averaging the predictions over multiple views. An immediate advantage of this approach is that well-studied models from 2D vision can be used out-of-the-box without further adaptations. Moreover, the computational cost scales with the surface area, rather than the volume of the point cloud.

In our implementation of this approach, we first convert the point clouds into a 3D grid of size as we did for the previous baseline. We then project to a plane of size , i.e. a face of the cube, with two feature channels. One feature channel is the first visible, non-zero voxel along the corresponding column. The second channel is the distance to the visible voxel, normalized to the range ; this is analogous to the depth channel of an RGB-D image. Our network architectures are two-dimensional versions of the dense 3D convolutional networks described above.

During training, a random projection of the point cloud is passed into the model. Points in the point cloud that fall into the same voxel are given the same prediction. Some voxels are occluded by others—the network receives no information on the occluded voxels. We modify the multi-view testing procedure to take into account the occlusion of voxels. Similar to before, predictions are performed using a weighted sum over random projections. We found that 2D networks require more views to obtain high accuracy and therefore use . Voxels that are observed in the 2D projection are given a weight of . The weight of occluded voxels decays exponentially with the distance to the voxel that occludes them.

6.4 Results

In Figure 5, we report the average IoU of a range of differently sized variants of (1) the three baseline models and (2) submanifold C3, FCN and U-Nets on the ShapeNet test set. The average IoU is shown as a function of the number of multiplication-addition operations (FLOPs) required by the models for computing the predictions. Note that these results are not directly comparable with those in [23] because we are testing the models in the more challenging “random-pose” setting.

SSCNs vs. baselines.

Figure 5(a) compares SSCNs with the three baselines.888The number of FLOPs reported for shape contexts may be slightly misleading: the computational costs of calculating shape context features is not reflected in the number of FLOPs, as it involves integer arithmetic. The results show that shape context features, multi-view 2D ConvNets, and dense 3D ConvNets perform roughly on par in terms of accuracy per FLOP. SSCN networks outperform all baseline models by a substantial margin. For instance, at FLOPs, the average IoU of SSCNs is - higher than that of the baselines. Importantly, our results show that restricting information to travel along submanifolds in the data does not hamper the performance of SSCNs, whilst it does lead to considerable computational and memory savings that can be exploited to train larger models with better accuracies.


In Figure 5(b), we compare the three SSCN architectures presented in Section 6.2. We observe that SSCNs involving downsampling and upsampling operations (FCNs and U-Nets) outperform SSCNs operating on a single spatial resolution and we conjecture that this is due to the increased receptive field obtained by downsampling.

Figure 5(c) shows the performance of SSCNs at three different scales (using all three architectures; C3, FCN, and U-Net). We observe that the performance of SSCNs is similar for different values of , particularly for a low number of FLOPs. At a higher number of FLOPs, the models operating at a larger scale perform slightly better.

6.5 Results on Competition Data

To compare SSCNs with the entries to the competition in [23], we also trained a FCN SSCN on the aligned point clouds. In this experiment, we performed data augmentation using random affine transforms. We set and use filters in the input layer, three levels of downsampling, and two residual blocks per spatial resolution. The results of 10-view testing are compared with the competition entries in Table 1. With a test error of 85.98%, our network outperforms other methods by IoU.

6.6 Semantic Segmentation of Scenes

We also performed experiments on the NYU Depth dataset (v2) [16] for semantic segmentation of scenes rather than objects. The dataset contains RGB-D images, which are semantically segmented into 894 different classes. Figure 6 shows two pairs of an RGB image and the associated depth map from the dataset. Following [6, 14], we crop the images and reduce the number of classes to . To assess the performance of our models, we measure their pixel-wise classification accuracy. We compare our models to a 2D FCN [14].

Figure 6: Two examples of RGB-D images from the NYU Depth dataset (v2) [16].

We perform experiments with two differently sized SSCN-FCN networks. Network A has 16 filters in the input layer, and one SSC convolution per level. Network B has 24 filters in the input layer, and two SSC convolutions per level. Both networks use eight levels of downsampling; we increase the number of filters in the networks when downsampling, adding 16 (A) or 24 (B) features with each reduction of scale.

We use the depth information to convert the RGB-D images into a 3D point cloud. Each point in the cloud has the three (RGB) features that were normalized to the range , and a fourth indicator features that is set to for each point in the point cloud. During training, we perform data augmentation by applying random affine transformations to the point cloud. Before voxelizing the point cloud, we downscale by a factor of two, and place the points into the model’s receptive field. We form voxels by averaging the feature vectors of the points corresponding to the voxel. At test time, we perform multi-view testing with .

The results of our experiments on the NYU Depth dataset (v2) are presented in Table 4. The results in the table show that SSCNs outperform 2D FCN in terms of pixel accuracy by up to , whilst also substantially reducing the computational costs of the model.

To verify that SSCN-FCN A actually uses depth information, we repeat the experiment whilst setting all the depth values to zero; this prevents the SSCN from exploiting depth information. We observe: (1) a reduction of FLOPs by 60%, as there are fewer active voxels; and (2) a drop in accuracy from 64.1% to 50.8%, which demonstrates that SSCNs do use 3D structure when performing segmentation.

Network Accuracy FLOPs Memory
2D FCN [14] 1 61.5% 28.50G 135.7M
SSCN-FCN A 1 64.1% 1.09G 5.2M
4 66.9% 4.36G 20.7M
SSCN-FCN B 1 66.4% 4.50G 11.6M
4 68.5% 17.90G 46.4M
Table 4: Semantic segmentation performance of five different convolutional networks on the NYU Depth test set (v2) on 40 classes. We report, the pixel-wise classification accuracy, the computational costs (in FLOPs), and the memory requirements (c.f. Table 2).

7 Conclusions

In this paper, we presented submanifold sparse convolutional networks (SSCNs) for the efficient processing of high-dimensional, sparse input data. We demonstrated the efficacy of SSCNs in a series of experiments on semantic segmentation of three-dimensional point clouds. Specifically, our SSCN networks outperform a range of state-of-the-art approaches for this problem, both when identifying parts within an object and when recognizing objects in a larger scene. Moreover, SSCNs are computationally efficient compared to alternative approaches.