Spherical Kernel for Efficient Graph Convolution on 3D Point Clouds
We propose a spherical kernel for efficient graph convolution of 3D point clouds. Our metric-based kernels systematically quantize the local 3D space to identify distinctive geometric relationships in the data. Similar to the regular grid CNN kernels, the spherical kernel maintains translation-invariance and asymmetry properties, where the former guarantees weight sharing among similar local structures in the data and the latter facilitates fine geometric learning. The proposed kernel is applied to graph neural networks without edge-dependent filter generation, making it computationally attractive for large point clouds. In our graph networks, each vertex is associated with a single point location and edges connect the neighborhood points within a defined range. The graph gets coarsened in the network with farthest point sampling. Analogous to the standard CNNs, we define pooling and unpooling operations for our network. We demonstrate the effectiveness of the proposed spherical kernel with graph neural networks for point cloud classification and semantic segmentation using ModelNet, ShapeNet, RueMonge2014, ScanNet and S3DIS datasets. The source code and the trained models can be downloaded from https://github.com/hlei-ziyan/SPH3D-GCN.
Convolutional Neural Networks (CNNs) are known for accurately solving a wide range of Computer Vision problems. Classification [2, 3, 4, 5], image segmentation [6, 7, 8], object detection [9, 10, 11], and face recognition [12, 13] are just a few examples of the tasks for which CNNs have recently become the default modeling technique. The success of CNNs is mainly attributed to their impressive representational power. However, their representation is only amenable to data defined over regular grids, e.g. the pixel arrays of images and videos. This is problematic for applications where the data is inherently irregular, e.g. 3D Vision, Computer Graphics and Social Networks.
In particular, point clouds produced by 3D vision scanners (e.g. LiDAR, Matterport) are highly irregular. Recent years have seen a surge of interest in deep learning for 3D vision due to self-driving vehicles. This has also resulted in multiple public repositories of 3D point clouds [15, 16]. Early attempts at exploiting CNNs for point clouds applied regular grid transformations (e.g. voxel grids [20, 21], multi-view images) to point clouds for processing them with 3D-CNNs or enhanced 2D-CNNs [3, 5]. However, this line of action does not fully exploit the sparse nature of point clouds, leading to an unnecessarily large memory footprint and computational overhead. Riegler et al. addressed the memory issue of dense 3D-CNNs with an octree-based network, termed OctNet. However, redundant computation over empty space remains a shortcoming of OctNet.
Computational graphs are able to capitalize on the sparse nature of point clouds much better than volumetric or multi-view representations. However, designing effective modules such as convolution, pooling and unpooling layers becomes a major challenge for graph-based convolutional networks. These modules are expected to perform point operations analogous to the pixel operations of CNNs, albeit for irregular data. Earlier instances of such modules exist in theoretical works [24, 25, 26], which can be exploited to form Graph Convolutional Networks (GCNs). Nevertheless, these primitive GCNs are yet to be seen as a viable solution for point cloud processing due to their inability to effectively handle real-world point clouds.
Based on the convolution operation, GCNs can be divided into two groups, namely the spectral networks [24, 25, 26, 27] and the spatial networks [28, 29, 30, 31, 32]. The former perform convolutions using the graph Laplacian and adjacency matrices, whereas the latter perform convolutions directly in the spatial domain. For the spectral networks, careful alignment of the graph Laplacians of different samples is necessary, which is not easily achieved for real-world point clouds. Consequently, the spatial networks are generally considered more attractive in practical applications.
The spatial GCNs are challenged by the unavailability of discrete convolutional kernels in the 3D metric space. To circumvent the problem, mini-networks [28, 29, 31] are often applied to dynamically generate edge-wise filters. This incurs significant computational overhead, which can be avoided with discrete kernels. However, the design and application of discrete kernels in this context is not straightforward. Besides effective discretization of the metric space, the kernel application must exhibit the properties of (a) translation-invariance, which allows identification of similar local structures in the data, and (b) asymmetry in vertex pair processing, which ensures that the overall representation remains compact.
Owing to the intricate requirements of discrete kernels for irregular data, many existing networks altogether avoid the convolution operation for point cloud processing [34, 35, 36]. Although these techniques report decent performance on benchmark datasets, they do not contribute towards harnessing the power of convolutional networks for point clouds. PointCNN is a notable exception that uses a convolutional kernel for point cloud processing. However, its kernel is again defined using mini-networks, incurring a high computational cost. Moreover, it is sensitive to the order of the neighborhood points, implying that the underlying operation is not permutation-invariant, which is an undesired kernel property for point clouds.
In this work, we introduce a discrete metric-based spherical convolutional kernel that systematically partitions a 3D region into multiple volumetric bins, as shown in Fig. 1. The kernel is directly applied to point clouds for convolution. Each bin of the kernel specifies learnable parameters to convolve the points falling in it. The convolution defined by our kernel preserves the properties of translation-invariance and asymmetry, as well as permutation-invariance. The proposed kernel is applied to point clouds using graph networks. To that end, we construct the networks with the help of range search and farthest point sampling. The former defines the edges of the underlying graph, whereas the latter coarsens the graph as we go deeper into the network layers. We also define pooling and unpooling modules for our graph networks to downsample and upsample the vertex features. The novel convolutional kernel and its application to graph networks are thoroughly evaluated for the tasks of 3D point cloud classification and semantic segmentation. We achieve highly competitive performance on a wide range of benchmark datasets, including ModelNet, ShapeNet, RueMonge2014, ScanNet and S3DIS. Owing to the proposed kernel, the resulting graph networks are efficient in both memory and computation. This leads to fast training and inference on high-resolution point clouds.
This work is a significant extension of our preliminary findings presented in IEEE CVPR 2019. Below, we summarize the major directions along which the technique is extended beyond the preliminary work.
Graph architecture. Instead of the octree-guided network of our preliminary work, we use a more flexible graph-based technique to design our network architectures. This allows us to exploit convolution blocks and define pooling/unpooling operations independent of convolution. In contrast to convolution-based down/upsampling, specialized modules for these operations are highly desirable for processing large point clouds. Moreover, this strategy also brings our network architectures closer to the standard CNNs.
Comprehensive evaluation on real-world data. Compared to the preliminary work, we present a more thorough evaluation on real-world data. Highlights include a 4.2% performance gain over the preliminary approach on the RueMonge2014 dataset, and comprehensive evaluation on two additional datasets, ScanNet and S3DIS. The presented results ascertain the computational efficiency of our technique along with highly competitive performance on the popular benchmarks.
Tensorflow implementation. While the preliminary work was implemented in Matconvnet, with this article we release CUDA implementations of the spherical convolution and the pooling/unpooling operations for Tensorflow. The source code is available on Github (https://github.com/hlei-ziyan/SPH3D-GCN) for the broader research community.
PointNet is one of the first techniques to directly process point clouds with deep networks. It uses the raw 3D coordinates of points as input features. The network learns point-wise features with shared MLPs, and extracts a global feature with max pooling. One limitation of this technique is that it does not explore the geometric context of points in representation learning. PointNet++ addresses that by applying max pooling to local regions hierarchically. However, lacking convolution, both networks must rely on max pooling to aggregate context information.
SO-Net builds a rectangular map from the point cloud, and hierarchically learns node-wise features within the map using mini-PointNets. However, similar to the original PointNet, it also fails to exploit any convolution modules. KCNet learns features with kernel correlation between the local neighboring points and a template of learnable points, which can be optimized during training similar to convolutional kernels. In contrast to the image-like map used by SO-Net, KCNet is based on a graph representation. Kd-network is a prominent contribution that processes point clouds with tree-structured networks. This technique also uses point coordinates as the input, and computes the feature of a parent node by concatenating the features of its children in a balanced tree. Despite their varied network architectures, none of the above methods contribute towards developing convolutional networks for point clouds. Approaches that advance research in that direction can be divided into two broad categories, discussed below.
At the advent of 3D deep learning, researchers predominantly extracted features with 3D-CNN kernels using volumetric representations. The earlier attempts in this direction could only process voxel grids of low resolution (e.g. 30×30×30 in ShapeNets, 32×32×32 in VoxNet), even with modern GPUs. This issue also transcended to subsequent works along this direction [42, 43, 44, 45]. The limitation of low input resolution was a natural consequence of the cubic growth in memory and computational requirements associated with dense volumetric inputs. Different solutions later appeared to address these issues. For example, Engelcke et al. introduced sparsity in the input and hidden neural activations. Their solution is effective in reducing the number of convolutions, but not the amount of required memory. Li et al. proposed a field probing neural network, which transforms 3D data into intermediate representations with a small set of probing filters. Although this network is able to reduce the computational and memory costs of fully connected layers, the probing filters fail to support weight sharing. Later, Riegler et al. proposed the octree-based OctNet, which represents point clouds with a hybrid of shallow grid octrees (depth = 3). Compared to its dense peers, OctNet reduces the computational and memory costs to a large degree, and is applicable to high-resolution inputs of up to 256×256×256. However, it still has to perform unnecessary computations in the empty spaces around the objects. Other recent techniques transform the original point cloud into regular representations such as tangent images or high-dimensional lattices so that standard CNNs can be applied to the transformed data.
The demand for irregular data processing with CNN-like architectures has resulted in a recent rise of graph convolutional networks. In general, the broader graph-based deep learning literature also includes techniques besides convolutional networks that update vertex features recurrently to propagate context information (e.g. [50, 51, 52, 53, 54]). However, our focus here is on graph convolutional networks, which relate to our work more closely.
As noted earlier, graph convolutional networks can be divided into spectral and spatial networks. The spectral networks perform convolution on spectral vertex signals obtained through the graph Fourier transform, while the spatial networks perform convolution directly on the spatial vertices. A major limitation of the spectral networks is that they require the graph structure to be fixed, which makes their application to data with varying graph structures (e.g. point clouds) challenging. Yi et al. attempted to address this issue with the Spectral Transformer Network (SpecTN), similar to STN in the spatial domain. However, the signal transformation between the spatial and spectral domains has quadratic computational complexity in the number of points, resulting in prohibitive requirements for large point clouds.
ECC is among the pioneering works for point cloud analysis with graph convolution in the spatial domain. Inspired by dynamic filter networks, it adapts MLPs to dynamically generate convolution filters between the connected vertices. The dynamic generation of filters naturally comes with a computational overhead. DGCNN, Flex-Conv and SpiderCNN subsequently explore different parameterizations to generate the edge-dependent filters. Instead of generating filters for the edges individually, a few networks also generate a complete local convolution kernel at once using mini-networks [37, 31]. Li et al. recently introduced PointCNN, which uses a convolution module named X-Conv for point cloud processing. The network achieves good performance on the standard benchmarks (e.g. ShapeNet and S3DIS). However, the generated kernels are sensitive to the order of the neighborhood points, indicating that the underlying representation is not permutation-invariant. Moreover, the strategy of dynamic kernel generation makes the technique computationally inefficient.
More recently, Wang et al. inserted an attention mechanism into graph convolutional networks to develop GACNet. Such an extension is particularly helpful for semantic segmentation, as it encourages neighborhood vertices to have consistent semantic labels, similar to a CRF. Besides the convolution operation, graph coarsening and edge construction are two essential parts of graph network architectures. We briefly review methods along these aspects below.
Graph coarsening: Point cloud sampling methods are useful for graph coarsening. PointNet++ utilizes farthest point sampling (FPS) to coarsen the point cloud, while Flex-Conv samples the point cloud with inverse density sampling (IDS). Random sampling is the simplest alternative to FPS and IDS, but it does not perform as well for challenging tasks like semantic segmentation. Recently, researchers have also started to explore learning the sampling itself with deep neural networks. In this work, we exploit FPS as the sampling strategy for graph coarsening, as it does not need training and it reduces the point cloud resolution relatively uniformly.
Point neighborhood search can be used to build the edge connections of a graph. KNN search generates a fixed number of neighborhood points for a given point, which results in a regular graph. Range search generates a flexible number of neighborhood points, which may result in irregular graphs. Tree structures can also be seen as special kinds of graphs [34, 32]; however, the default absence of intra-layer connections in trees drastically limits their potential as graph networks. In a recent example, Rao et al. proposed to employ spherical lattices for regular graph construction. Their technique relies on convolution and max pooling to aggregate the geometric context between neighbouring points.
Given an arbitrary point cloud of n points P = {x_1, x_2, ..., x_n}, we represent the neighborhood of each point x_i as N(x_i). To achieve graph convolution on the target point x_i, the more common 'continuous' filter approaches [28, 29, 59, 58, 37, 31] parameterize the convolution as a function of local point coordinates. For instance, suppose w_c is the filter that computes the output feature of channel c. These techniques may represent the filter as w_c = g(x_j - x_i), where g is a continuous function (e.g. an MLP) and x_j ∈ N(x_i). However, compared to the continuous filters, a discrete kernel is predefined and does not need the above-mentioned (or similar) intermediate computations. This makes a discrete kernel computationally more attractive.
Following the standard CNN kernels, a primitive discrete kernel for point clouds can be defined similar to the 3D-CNN kernels [20, 21]. For a given resolution, this kernel comprises one weight filter per voxel bin. By incorporating the notion of separable convolution into this design, each weight filter is transformed from a vector to a scalar. It is noteworthy that the application of a discrete kernel to a 'graph' representation is significantly different from its volumetric counterpart. Hence, to differentiate, we refer to such a kernel for graphs as a CNN3D kernel. A CNN3D kernel indexes the bins and, for each bin, uses the corresponding scalar weight to propagate features from all neighboring points in that bin to the target point (see Fig. 2). It performs convolutions only at the point locations, avoiding unnecessary computations at empty spaces, in contrast to 3D-CNN kernels.
We make the following observation in relation to improving the CNN3D kernels. For images, the more primitive constituents, i.e. patches, have traditionally been used to extract hand-crafted features [63, 64]. The same principle transcended to the receptive fields of automatic feature extraction with CNNs, which compute feature maps using the activations of well-defined rectangular regions of images. Whereas rectangular regions are an intuitive choice for images, spherical regions are better suited to processing unstructured 3D data such as point clouds. Spherical regions are inherently amenable to computing geometrically meaningful features for such data [65, 66, 67]. Inspired by this natural kinship, we introduce the concept of a spherical convolution kernel (termed SPH3D kernel) that considers a 3D sphere as the basic geometric shape to perform the convolution operation. (The term spherical in Spherical CNN is used for surfaces, i.e. images, not the ambient 3D space. Our notion of a spherical kernel is widely dissimilar and is used in a different context. Also note that, different from the preliminary work, here the spherical kernel is only used to perform depth-wise spatial convolutions.) We explain the proposed discrete spherical kernel in Section 3.1, and later contrast it to the existing CNN3D kernels in Section 3.2.
We define the convolution kernel with the help of a sphere of radius ρ (see Fig. 2). For a target point x_i, we consider its neighborhood N(x_i) to be the set of points within the sphere centered at x_i, i.e. N(x_i) = {x : d(x, x_i) <= ρ}, where d(., .) is a distance metric (the L2 distance in this work). We divide the sphere into 'bins' by partitioning the occupied space uniformly along the azimuth (θ) and elevation (φ) dimensions. We allow the partitions along the radial (r) dimension to be non-uniform, because the cubic volume growth for large radius values can be undesirable. Our quantization of the spherical region is mainly inspired by 3DSC. We also define an additional bin corresponding to the origin of the sphere to allow for self-convolution of points on the graph. To produce an output feature map, we define a learnable weight parameter w_k for each bin, where w_0 relates to self-convolution. Combined, the weight values specify a single spherical convolution kernel.
To compute the activation value for a target point x_i, we first identify the relevant weight values for its neighboring points x_j ∈ N(x_i). It is straightforward to associate w_0 to x_i for self-convolution. For the non-trivial cases, we first represent the neighboring points in terms of their spherical coordinates referenced with x_i as the origin. That is, for each x_j we compute (θ_j, φ_j, r_j) = Ψ(x_j - x_i), where Ψ defines the transformation from Cartesian to spherical coordinates. Supposing that the bins of the quantized sphere are respectively indexed by i_θ, i_φ and i_r along the azimuth, elevation and radial dimensions, the weight value associated with each spherical kernel bin can then be indexed as w_k, where k = k(i_θ, i_φ, i_r). Using this indexing, we relate the relevant weight value w_{k_j} to each x_j, and hence to x_i. In the l-th network layer, the activation for the point x_i in channel c gets computed as:

a_i^{l,c} = σ( (1 / |N(x_i)|) Σ_{x_j ∈ N(x_i)} w_{k_j} a_j^{l-1,c} ),
where a_j^{l-1,c} is the feature of the neighboring point x_j from layer l-1, w_{k_j} is the associated weight value, and σ is the non-linear activation function (ELU in our experiments). By applying the spherical convolution multiple times for each input channel, we produce the corresponding number of output features for the target convolution point x_i.
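For concreteness, the bin indexing and depth-wise convolution described above can be sketched in NumPy. This is a minimal illustration under assumed settings (8 azimuth bins, 2 elevation bins, a hand-picked radial split, and mean normalization over the neighborhood); the function names are ours, not those of the released package.

```python
import numpy as np

def spherical_bin_index(delta, radial_edges, n_azim=8, n_elev=2, eps=1e-12):
    """Map a relative offset (neighbor minus target) to a kernel-bin index.

    Bin 0 is reserved for self-convolution; the remaining bins tile the
    sphere uniformly in azimuth/elevation and non-uniformly in radius.
    `radial_edges` are the upper radial boundaries, e.g. [1.0, 1.414, 1.732].
    """
    x, y, z = delta
    r = np.sqrt(x * x + y * y + z * z)
    if r < eps:                                      # coincides with the target
        return 0
    theta = np.arctan2(y, x) % (2 * np.pi)           # azimuth in [0, 2*pi)
    phi = np.arcsin(np.clip(z / r, -1.0, 1.0))       # elevation in [-pi/2, pi/2]
    i_t = min(int(theta / (2 * np.pi / n_azim)), n_azim - 1)
    i_p = min(int((phi + np.pi / 2) / (np.pi / n_elev)), n_elev - 1)
    i_r = int(np.searchsorted(radial_edges, r))      # non-uniform radial split
    return 1 + i_t + n_azim * (i_p + n_elev * i_r)   # flattened bin index

def depthwise_spherical_conv(points, feats, neighbors, weights, radial_edges):
    """Depth-wise convolution: one scalar weight per (bin, channel).

    points: (N, 3); feats: (N, C); neighbors: list of index arrays;
    weights: (n_bins, C). The output is the bin-weighted neighborhood mean.
    """
    out = np.zeros_like(feats)
    for i, nbr in enumerate(neighbors):
        acc = weights[0] * feats[i]                  # self-convolution term
        for j in nbr:
            k = spherical_bin_index(points[j] - points[i], radial_edges)
            acc = acc + weights[k] * feats[j]
        out[i] = acc / (len(nbr) + 1)
    return out
```

Each point pair thus selects a pre-existing weight by geometry alone, with no per-edge filter generation.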
To elaborate on the characteristics of the spherical convolution kernel, we denote the bin boundaries along the θ, φ and r dimensions as Θ = {θ_0, ..., θ_n}, Φ = {φ_0, ..., φ_p} and R = {r_0, ..., r_q}, respectively. The constraint of uniform splitting along the azimuth and elevation results in θ_{t+1} - θ_t = 2π/n and φ_{t+1} - φ_t = π/p. Lemma 2.1: Under this uniform splitting of the azimuth and elevation, for any two distinct points within the spherical convolution kernel, the weight values w_k are applied asymmetrically.
Proof (sketch): Consider two distinct points x_i and x_j within the kernel. Their relative vectors x_j - x_i and x_i - x_j are negatives of each other. Under the Cartesian-to-spherical transformation Ψ, the two vectors have equal radii, azimuths that differ by π, and elevations of opposite sign. For the same weight w_k to be applied symmetrically, both relative vectors would have to fall into the same bin indexed by (i_θ, i_φ, i_r). With the uniform splitting of the azimuth and elevation described above, the two vectors can only share a bin when the relative vector vanishes, i.e. when both points coincide. Thus, no weight value can be applied to any two distinct points symmetrically.
The kernel asymmetry forbids symmetric weight sharing between point pairs in the convolution operation, which leads to learning fine geometric details of the point clouds. Lemma 2.1 also provides guidelines on how to divide the spherical space into kernel bins such that the asymmetry is always preserved. The resulting division also ensures translation-invariance of the kernel, similar to the standard CNN kernels. Additionally, unlike the convolution operation of PointCNN, the proposed kernel is invariant to point permutations because it explicitly incorporates the geometric relationships between the point pairs.
We apply the spherical convolution kernel to learn depth-wise features in the spatial domain. The point-wise convolution can be readily achieved with a shared MLP or 1×1 convolution using any modern deep learning library. More precisely, the two convolutions together make our kernel perform a separable convolution. However, we generally refer to it as spherical convolution for simplicity.
A CNN3D kernel rasterizes 3D data into uniform voxel grids, where a size of 3×3×3 is prevalently used. This size splits the space into 1 voxel at radius 0 (self-convolution), 6 voxels at radius 1, 12 voxels at radius √2, and 8 voxels at radius √3. An analogous spherical convolution kernel for the same region can be specified with a radius ρ = √3, using the radial boundaries {1, √2, √3} as edges for the bins. This division results in a kernel size (i.e. total number of bins) of 8×2×3+1 = 49, which is one of the coarsest multi-scale quantizations allowed by Lemma 2.1.
Notice that, moving radially from the center to the periphery of the spherical kernel, we encounter an identical number of bins (16 in this case) after each radial edge, with fine-grained bins located close to the origin that can encode detailed local geometric information of the points. This is in sharp contrast to CNN3D kernels, which must keep the size of all cells constant and rely on increased resolution to capture finer details. This makes their number of parameters grow cubically, harming scalability. The multi-scale granularity of the spherical kernel (SPH3D) allows for a more compact representation.
To corroborate, we briefly touch upon classification with the CNN3D and SPH3D kernels using the popular benchmark dataset ModelNet40 in Table I. We give further details on the dataset and experimental settings in Section 5. Here, we focus on the single aspect of representation compactness resulting from the non-uniform granularity of the bins in SPH3D. In the table, the only difference between the networks is in the kernels used; all other experimental details are exactly the same. Network-1 and Network-2 use CNN3D kernels that partition the space at a coarse and a fine resolution, respectively, while Network-3 uses the SPH3D kernel with non-uniform bins. Consequently, the SPH3D kernel requires more parameters than the Network-1 kernel, but far fewer than the Network-2 kernel. Nevertheless, the performance of Network-3 easily matches that of Network-2. Such an advantage is a natural consequence of the non-uniform partitioning allowed by our kernel.
In this work, we employ a graph neural network to process point clouds. Compared to the inter-layer connectivity of the octree-guided network of our preliminary work, a graph representation additionally allows for intra-layer connections. This is beneficial in defining effective convolutional blocks as well as pooling/unpooling modules in the network. Let us consider a graph G = (V, E) constructed from a point cloud P, where V and E respectively represent the sets of vertices and edges. It is straightforward to associate each vertex v_i of the graph to a point location x_i and its corresponding feature a_i. However, the edge set E must be carefully established based on the neighborhoods of the points.
Edge construction: We use range search with a specified radius to get the spatial neighborhood of each point and construct the edge connections of each graph vertex. In the range search, neighborhood computations are independent of each other, which makes the search suitable for parallel processing on GPUs. The time complexity of the search is linear in the number of vertices |V|. One potential problem of range search is that the large number of neighborhood points in dense clouds can cause memory issues. We sidestep this problem by restricting the number of neighboring points to at most K, randomly sub-sampling the neighborhood if required. The edges are finally built on the sampled points. As a result, the neighborhood indices of the vertex v_i can be denoted as N(v_i), with |N(v_i)| <= K. With these sets identified, we can later compute features for vertices with the spherical convolution.
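A brute-force version of this edge construction with the random K-cap can be sketched as follows. The sketch is illustrative only: the released package implements range search as a CUDA kernel, and `build_edges` is our own name.

```python
import numpy as np

def build_edges(points, radius, max_neighbors, rng=None):
    """Range search: connect each vertex to all points within `radius`.

    If a neighborhood exceeds `max_neighbors` (K in the text), it is
    randomly sub-sampled to cap memory use on dense clouds. Brute force
    O(N^2) for clarity; each row is independent, hence GPU-friendly.
    """
    rng = rng or np.random.default_rng(0)
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    edges = []
    for i in range(n):
        nbr = np.flatnonzero((d2[i] <= radius ** 2) & (np.arange(n) != i))
        if len(nbr) > max_neighbors:
            nbr = rng.choice(nbr, size=max_neighbors, replace=False)
        edges.append(nbr)
    return edges
```

Because the cap is enforced per vertex, the resulting graph may be irregular, which the spherical kernel handles without modification.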
Graph coarsening: We use Farthest Point Sampling (FPS) to coarsen the point graph in our network layer by layer. The FPS algorithm selects one random seed vertex, and then iteratively adds the point that is farthest from the previously selected points. The algorithm terminates when the desired number of sampled points is acquired; these points form the coarsened graph. By alternately constructing the edges and coarsening the graph L times, we construct a graph pyramid composed of L+1 graphs, i.e. G_0, G_1, ..., G_L. As compared to the octree-structure-based graph coarsening adopted in the preliminary work, FPS coarsening has the advantage of keeping the number of vertices of each layer fixed across different samples, which is conducive to a more systematic application of convolutional kernels.
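The FPS iteration above admits a simple O(n·m) implementation by maintaining, for every point, its distance to the nearest already-selected point (a standard sketch; the function name is ours):

```python
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Select m points, each iteration picking the point farthest from
    the set already selected (distance-to-set is updated incrementally)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    selected = [int(rng.integers(n))]            # random seed vertex
    dist = ((points - points[selected[0]]) ** 2).sum(-1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                 # farthest from selected set
        selected.append(nxt)
        dist = np.minimum(dist, ((points - points[nxt]) ** 2).sum(-1))
    return np.array(selected)
```

Since already-selected points have distance zero, they are never re-picked, and the selected set spreads relatively uniformly over the cloud.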
Pooling: Once a graph is coarsened, we still need to compute the features associated with its vertices. To that end, we define max pooling and average pooling operations to sample features for the coarsened graph vertices. Inter-layer graph connections facilitate these operations. To be consistent, we denote the graphs before and after a pooling layer as G_l and G_{l+1} respectively, where the vertices of G_{l+1} form a subset of those of G_l. Let v_i in G_l and u_i in G_{l+1} be two vertices associated with the same point location. The inter-layer neighborhood N(u_i) of u_i can be readily constructed from the graph G_l. We denote the features of u_i and its neighborhood points as a_{u_i} and a_{v_j} respectively. The max pooling operation then computes the feature of the vertex u_i as

a_{u_i} = max_{v_j ∈ N(u_i)} a_{v_j},

while the average pooling computes it as

a_{u_i} = (1 / |N(u_i)|) Σ_{v_j ∈ N(u_i)} a_{v_j}.
Decoder architectures with increasing neuron resolution are important for element-wise predictions in semantic segmentation, dense optical flow, etc. We build the graph decoder by inverting the graph pyramid as G_L, G_{L-1}, ..., G_0. The coarsest graph is omitted in the reversed pyramid because it is shared between the encoder and the decoder. We denote the graphs before and after an unpooling layer as G_{l+1} and G_l respectively. To upsample the features from G_{l+1} to G_l, we define two types of feature interpolation operations, namely uniform interpolation and weighted interpolation. Notice that the neighborhood sets in Eqs. (4), (5) are readily available because the pooled vertices form a subset of the finer graph. For unpooling the relation is reversed, so we have to additionally construct the inter-layer neighborhood N(u_i) of each fine-graph vertex u_i from the coarse graph. For that, we again use range search. The features of u_i and its neighborhood points can be consistently denoted as a_{u_i} and a_{v_j}. The uniform interpolation computes the feature of vertex u_i as the average of the features of its inter-layer neighborhood points, i.e.

a_{u_i} = (1 / |N(u_i)|) Σ_{v_j ∈ N(u_i)} a_{v_j}.

The weighted interpolation computes the feature of vertex u_i by weighting its neighborhood features based on their distances to u_i. Mathematically,

a_{u_i} = Σ_{v_j ∈ N(u_i)} w_j a_{v_j}, with w_j = (1 / d(x_i, x_j)) / Σ_{v_k ∈ N(u_i)} (1 / d(x_i, x_k)),

where d(., .) is the distance function, and the points x_i and x_j are associated to the vertices u_i and v_j, respectively. In our source code, we provide both types of interpolation for upsampling. However, the experiments in Section 5 are performed with uniform interpolation for its computational efficiency.
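The two interpolation schemes can be sketched together. The sketch assumes normalized inverse-distance weights for the weighted variant, and all names (`unpool`, `eps`) are ours:

```python
import numpy as np

def unpool(coarse_feats, coarse_pts, fine_pts, neighbors, weighted=False,
           eps=1e-8):
    """Interpolate coarse-graph features back onto the fine graph.

    neighbors[i] indexes coarse vertices near fine vertex i (found by
    range search). Uniform interpolation averages their features;
    weighted interpolation uses normalized inverse distances.
    """
    out = np.zeros((len(fine_pts), coarse_feats.shape[1]))
    for i, nbr in enumerate(neighbors):
        f = coarse_feats[nbr]
        if weighted:
            d = np.linalg.norm(coarse_pts[nbr] - fine_pts[i], axis=1)
            w = 1.0 / (d + eps)                     # closer => larger weight
            out[i] = (w[:, None] * f).sum(0) / w.sum()
        else:
            out[i] = f.mean(0)                      # uniform interpolation
    return out
```

Uniform interpolation skips the distance computation entirely, which is why it is the cheaper of the two.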
In Fig. 3, we illustrate an encoder-decoder graph neural network constructed by our technique for a toy example. In the shown network, a graph of 12 vertices gets coarsened to 8 and then 4 vertices in the encoder network, and later gets expanded in the decoder network. The pooling/unpooling operations are applied to learn features of the structurally altered graphs, while the graph structure remains unchanged during the convolution operation. Notice that we apply consecutive spherical convolutions to form convolution blocks in our networks. In the figure, the varying width of the feature maps depicts different numbers of channels (e.g. 128, 256 and 384) for the features. The shown U-shaped architecture for the task of semantic segmentation also exploits skip connections similar to U-Net [7, 8]. These connections copy features from the encoder and concatenate them to the decoder features. For the classification task, these connections and the decoder part are removed, and a global feature representation is fed to a classifier comprising fully connected layers. The simple architecture in Fig. 3 graphically illustrates the application of the above-mentioned concepts to our networks in Section 5, where we provide details of the architectures used in our experiments.
Software for Tensorflow: With this article, we also release a CUDA-enabled implementation of the presented concepts. The package is Tensorflow compatible. As compared to the Matconvnet source code of the preliminary work, Tensorflow compatibility is chosen due to the popularity of the framework. In the package, we provide CUDA implementations of the spherical convolution, range search, max pooling, average pooling, uniform interpolation and weighted interpolation. The provided spherical kernel implementation can be used for convolutions on both regular and irregular graphs. Unlike existing methods (e.g. [37, 31]), we do not impose any constraint on the vertex degree of the graph, allowing the graphs to be more flexible, similar to ECC. In our implementation, the spherical convolutions are all followed by batch normalization. In the preliminary work, the implemented spherical convolution does not separate the depth-wise convolution from the point-wise convolution, thereby performing the two simultaneously similar to a typical convolution operation. Additionally, the previous implementation is specialized to octree structures, and hence not applicable to general graph architectures. The newly released Tensorflow implementation improves on all of these aspects. The source code and further details of the released package can be found at https://github.com/hlei-ziyan/SPH3D-GCN.
We evaluate our technique for classification and semantic segmentation tasks using clean CAD point clouds and large-scale noisy point clouds of real-world scenes. The datasets used in our experiments include ModelNet, ShapeNet, RueMonge2014, ScanNet and S3DIS, for which representative samples are illustrated in Fig. 4. We only use the coordinates of points to train our networks, except when color values are also available. In that case, we additionally use those values after rescaling them into a fixed range. We note that a few existing methods also take advantage of normals as input features [33, 35, 36, 62]. However, normals are not directly sensed by 3D sensors and must be computed separately, entailing an additional computational burden. Hence, we avoid using normals as input features, except for RueMonge2014, which already provides them.
Throughout the experiments, we apply the spherical convolution with a fixed kernel size. Our network training is conducted on a single Titan Xp GPU with 12 GB memory. We use the Adam optimizer with an initial learning rate of 0.001 and momentum 0.9 to train the network. The batch size is kept fixed to 32 for ModelNet and ShapeNet, and 16 for the remaining datasets. The maximum number of neighborhood connections for each vertex is also fixed. These hyper-parameters are empirically optimized with cross-validation. We also employ data augmentation in our experiments. For that, we use random sub-sampling to drop points, and random rotations, which include azimuth rotation and small arbitrary perturbations, to change the view of the point clouds. We also apply random scaling, shifting and noisy translation of points with std. dev. = 0.01. These operations are common in the related literature. We apply them on the fly in each training epoch of the network.
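The augmentation pipeline described above can be sketched as follows. The specific parameter ranges below (scaling and shifting bounds, drop ratio) are our assumptions for illustration; only the 0.01 standard deviation of the noisy translation comes from the text:

```python
import math, random

# Minimal on-the-fly augmentation sketch: azimuth rotation, random scaling,
# shifting, noisy translation (std. dev. 0.01), and random point dropping.

def augment(points, rng):
    theta = rng.uniform(0.0, 2.0 * math.pi)             # azimuth rotation
    c, s = math.cos(theta), math.sin(theta)
    scale = rng.uniform(0.9, 1.1)                       # assumed range
    shift = [rng.uniform(-0.1, 0.1) for _ in range(3)]  # assumed range
    out = []
    for x, y, z in points:
        x, y = c * x - s * y, s * x + c * y             # rotate about z-axis
        p = [scale * x + shift[0], scale * y + shift[1], scale * z + shift[2]]
        out.append([v + rng.gauss(0.0, 0.01) for v in p])  # noisy translation
    return out

def random_drop(points, keep_ratio, rng):
    """Random sub-sampling used to drop points during training."""
    k = max(1, int(len(points) * keep_ratio))
    return rng.sample(points, k)
```

Applying these transforms per epoch, rather than precomputing them, keeps every epoch's view of the data distinct at negligible cost.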
Network Configuration: Table II summarizes the network configurations used in our experiments for the classification and segmentation tasks. We use identical configurations for semantic segmentation on the realistic datasets RueMonge2014, ScanNet and S3DIS, but a different one for the part segmentation of the synthetic ShapeNet. To put the input size of our network for the realistic datasets into perspective, it is four times the number of points accepted by PointCNN. Further discussion on network configuration is also provided in the related sections below.
The benchmark ModelNet40 dataset is used to demonstrate the promise of our technique for object classification. The dataset comprises object meshes for 40 categories with a 9,843/2,468 training/testing split. To train our network, we create the point clouds by sampling the mesh surfaces. Compared to the existing methods (e.g. [33, 35, 41, 28]), the convolutions performed in our network enable processing large input point clouds. Hence, our network is trained with 10K input points. The channel settings of the first MLP and the six SPH3D layers are 32 and 64-64-64-128-128-128, respectively. We use the same 512-256-40 classifier as the previous works [33, 41, 32]. The Encoder4 in Table II indicates that the network learns a global representation of the point cloud using G-SPH3D. For that, we create a virtual vertex whose coordinates are computed as the average coordinates of the real vertices in the graph. We connect all the real vertices to the virtual vertex, and use a spherical kernel for feature computation. G-SPH3D computes the feature only at the virtual vertex, which becomes the global representation of the point cloud for the classifier.
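The global G-SPH3D step described above can be sketched as below; the aggregation over the virtual vertex is simplified here to a plain average (the actual layer applies a learned spherical kernel), so names and the reduction are our assumptions:

```python
# Sketch of the global representation step: a virtual vertex is placed at
# the mean of all real vertices and connected to every real vertex; its
# feature becomes the global representation. Averaging stands in for the
# learned kernel weighting of the real implementation.

def virtual_vertex(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(3)]

def global_feature(points, feats):
    v = virtual_vertex(points)
    C = len(feats[0])
    pooled = [sum(f[c] for f in feats) / len(feats) for c in range(C)]
    return v, pooled
```

Computing the feature only at the virtual vertex makes this final layer cost linear in the number of remaining graph vertices.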
Following our preliminary work on Ψ-CNN, we boost the performance of the classification network by applying max pooling to the intermediate layers, i.e. Encoder1, Encoder2 and Encoder3. We concatenate these max-pooled features with the global feature representation of Encoder4 to form a more effective representation for the classifier. We use weight decay in the end-to-end network training, where a dropout of 0.5 is also applied to the fully connected layers of the classifier to alleviate overfitting.
Table III benchmarks the performance of our technique, abbreviated as SPH3D-GCN. All the tabulated techniques use point coordinates as the raw input features. We also report the training and inference time of PointNet (https://github.com/charlesq34/pointnet), PointNet++ (https://github.com/charlesq34/pointnet2), Ψ-CNN and SPH3D-GCN on our local Titan Xp GPU. The timings for PointCNN are taken from the original work, and are based on a more powerful Tesla P100 GPU; Titan Xp and Tesla P100 performance can be compared using [76, 77]. As shown in the table, SPH3D-GCN and Ψ-CNN, our preliminary work, achieve very competitive results. Regarding the computational and memory advantage of SPH3D-GCN over Ψ-CNN, for 10K input points, SPH3D-GCN requires far fewer parameters (0.78M vs. 3.0M) and runs much faster (18.1/8.4ms vs. 84.3/34.1ms). We also report the performance of SPH3D-GCN with a reduced number of input points, for which the training/inference time becomes comparable to PointNet++, while the performance does not deteriorate much. It is worth mentioning that, relative to PointCNN, the slightly higher number of parameters for our technique results from the classifier. In fact, our parameter size for learning the global feature representation is 0.2M, which is much less than the 0.5M of PointCNN.
The ShapeNet part segmentation dataset contains 16,881 synthetic models from 16 categories. The models in each category have two to five annotated parts, amounting to 50 parts in total. The point clouds are created with uniform sampling from well-aligned 3D meshes. This dataset provides the coordinates of the points as raw features, and has a 14,007/2,874 training/testing split defined. Following the existing works [78, 16, 32], we train independent networks to segment the parts of each category. The configuration of our U-shape graph network is shown in Table II. The number of output classes of the classifier is determined by the number of parts in each category. We standardize the input models of ShapeNet by normalizing the input point clouds to the unit sphere with zero mean. Among other ground-truth labelling issues pointed out by existing works [32, 41, 49], some samples in the dataset contain parts represented by only one point. Differentiating these points with only geometric information is misleading for deep models, from both training and testing perspectives. Hence, after normalizing each model, we also remove such points from the point cloud (specifically, we remove parts represented with a single point in the range 0.3).
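The standardization described above can be sketched as follows. This is a minimal reading of the procedure (zero-mean the cloud, scale to the unit sphere, drop parts with a single point); the exact range-based removal criterion in the footnote is not reproduced:

```python
import math
from collections import Counter

# Sketch: normalize a model to a zero-mean unit sphere and drop parts that
# are represented by only one labelled point.

def normalize_unit_sphere(points):
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(3)]
    centered = [[p[d] - mean[d] for d in range(3)] for p in points]
    r = max(math.sqrt(sum(c * c for c in p)) for p in centered) or 1.0
    return [[c / r for c in p] for p in centered]

def drop_single_point_parts(points, labels):
    counts = Counter(labels)
    kept = [(p, l) for p, l in zip(points, labels) if counts[l] > 1]
    return [p for p, _ in kept], [l for _, l in kept]
```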
In Table IV, we compare our results with popular techniques that also take irregular point clouds as input, using the part-averaged IoU (mIoU) metric. In the table, techniques like PointNet, PointNet++ and SO-Net also exploit normals besides point coordinates as input features, which is not the case for the proposed SPH3D-GCN. In our experiments, SPH3D-GCN not only achieves the same instance mIoU as Ψ-CNN, but also outperforms the other approaches on 9 out of 16 categories, resulting in the highest class mIoU of 84.9%. We also trained a single network with the configuration shown in Table II to segment the 50 parts of all categories together. In that case, the obtained instance and class mIoUs are 85.4% and 82.7%, respectively. These results are very close to the highly competitive SFCNN. In all segmentation experiments, we apply the random sampling operation multiple times to ensure that every point in the test set is evaluated.
We test our technique for semantic segmentation of real-world outdoor scenes using the RueMonge2014 dataset. This dataset contains 700 meters of Haussmannian-style facades along a European street, annotated with point-wise labels. There are 7 classes in total: window, wall, balcony, door, roof, sky and shop. The point clouds are provided with normals and color features. We use the coordinates as well as the normals and color values to form a 9-dim input feature for each point. The detailed network configuration used in this experiment is shown in Table II. The original point clouds are split into smaller point cloud blocks following the pcl_split.mat indexing file provided with the dataset. We randomly sample points from each block and use the sampled point clouds for training and testing. To standardize the points, we force their x and y dimensions to have zero mean values, while the z dimension is kept non-negative. In the real-world applications (here and in the following sections), we use data augmentation but no weight decay or dropout. As compared to the preliminary work, we do not perform pre-processing in terms of alignment of the facade plane and gravitational axis correction. Moreover, the processed blocks are also mostly much larger. Under the standard evaluation protocol, Table V compares our current approach SPH3D-GCN with the recent methods, including Ψ-CNN. It can be seen that SPH3D-GCN achieves very competitive performance, using only 0.4M parameters.
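The per-block standardization can be sketched as follows, assuming x and y are the horizontal dimensions and z the vertical one (we shift z by its minimum to keep it non-negative; the exact mechanism in the original pipeline may differ):

```python
# Sketch: zero-mean the x and y dimensions of a block and keep z
# non-negative by shifting it to start at zero (our assumption).

def standardize_block(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    mz = min(p[2] for p in points)
    return [[p[0] - mx, p[1] - my, p[2] - mz] for p in points]
```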
ScanNet is an RGB-D video dataset of indoor environments that contains reconstructed indoor scenes with rich annotations for 3D semantic labelling. It provides separate sets of scenes for training and testing, and researchers are required to submit their test results to an online server for performance evaluation. The dataset provides 40 class labels, of which only 20 are used for performance evaluation. For this dataset, we keep the network configuration identical to that used for RueMonge2014, as shown in Table II. To process each scene, we first downsample the point cloud with the VoxelGrid algorithm. Then, we split each scene into blocks, padding each side with context points. The context points themselves are neither used in the loss computation nor in the final prediction. Following prior work, the split is only applied to the x and y dimensions, whereas both spatial coordinates and color values are used as the input features. Here, the coordinates refer to those obtained after aligning the x and y of each block to its center, while keeping z unchanged. We compare our approach with PointConv, PointCNN, Tangent-Conv, SPLATNet, PointNet++ and ScanNet in Table VI. These algorithms report their performance using the coordinates and color values as input features, similar to our method. A common evaluation protocol is followed by all the techniques in Table VI. As can be noticed, SPH3D-GCN outperforms the other approaches on 16 out of 20 categories, resulting in a significant overall improvement in mIoU. The low performance of our method on the picture class can be attributed to the lack of rich 3D structures; we observed that the network often confuses pictures with walls.
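The block splitting with context padding can be sketched as below. The block and context sizes here are our assumptions (the actual values are dataset-specific); the key point is that context points surround each block but are excluded from the loss and the prediction:

```python
# Sketch: split a scene over x and y into blocks, and pad each block with
# surrounding context points. Block/context sizes are illustrative.

def split_into_blocks(points, block=1.5, context=0.3):
    blocks = {}
    for idx, (x, y, z) in enumerate(points):
        blocks.setdefault((int(x // block), int(y // block)), []).append(idx)
    padded = {}
    for (bx, by), core in blocks.items():
        x0, y0 = bx * block, by * block
        ctx = [i for i, (x, y, z) in enumerate(points)
               if x0 - context <= x < x0 + block + context
               and y0 - context <= y < y0 + block + context
               and i not in core]
        padded[(bx, by)] = (core, ctx)  # ctx: no loss, no final prediction
    return padded
```

Note that only x and y participate in the split, matching the text; z is left untouched.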
The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset comprises colored 3D point clouds collected from 6 large-scale indoor areas of three different buildings using the Matterport scanner. The segmentation task defined on this dataset aims at labelling 13 semantic elements, namely: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter, where elements not among the first 12 are considered clutter. We use the same network configuration for this dataset as used for RueMonge2014 and ScanNet. Following the convention [33, 82, 83, 37], we perform a 6-fold experiment using the six areas, and additionally experiment with Area 5 explicitly. It is a common practice to separately analyze the performance on Area 5 because it relates to a building not covered by the other areas. The evaluation metrics used include the Overall Accuracy (OA), the mean Accuracy over all 13 categories (mAcc), the Intersection over Union (IoU) for each category, and their mean (mIoU).
Most of the scenes in S3DIS have millions of points. We use the same downsampling and block splitting strategy as for ScanNet. The input features also comprise the 3D coordinates and color values, standardized similarly to those of ScanNet. The results of our experiments are summarized in Table VII. With 0.4M parameters, the proposed SPH3D-GCN achieves much better performance than the other convolutional networks (e.g. [83, 37]). For the experiments on Area 5, we also report the results of an additional experiment with SPH3D-GCN (9-dim) that follows PointNet in creating the input feature. The 9-dim input feature comprises the coordinate and color values plus the relative location of the point in the scene. Compared to the proposed network that uses the 6-dim input feature, we notice that removing the relative locations actually benefits the performance, which can be attributed to the sensitivity of the relative locations to the scene scale. Finally, we visualize two representative prediction examples generated by our technique for the segmentation of Area 5 in Fig. 5. As can be noticed, despite the complexity of the scenes, SPH3D-GCN is able to segment the points effectively.
Scalability: The combination of a discrete kernel, separable convolution and graph-based architecture adds to the scalability of the proposed SPH3D-GCN. In Table VIII, we compare our network on computational and memory grounds with PointCNN, a highly competitive convolutional network that accepts large input point clouds. The reported values are for S3DIS, using the configuration in Table II for our network, where we vary the input point size. We show the memory consumption and training/testing time of our network. With a batch size of 16 on a 12GB GPU, our network can process point clouds whose maximum size is comparable to the number of pixels in a regular image. It is worth mentioning that the memory consumption of our 'segmentation' network is slightly lower than that of the PointNet++ 'classification' network with the same batch size of 16 (8.45GB vs. 8.57GB). Our 0.4M parameters are more than 10 times fewer than the 4.4M of PointCNN. Considering that we use a larger batch size than PointCNN, we include both the per-batch and per-sample training/testing times for a fair comparison. It can be seen that our per-sample running time across the tested input sizes is less than or comparable to that of PointCNN. We refer the reader to [76, 77] for a speed comparison between the Tesla P100 and Titan Xp. Although SPH3D-GCN can take larger inputs, we use a smaller point cloud size for S3DIS in Table VII in the interest of time.
Graph coarsening visualization: We coarsen the point cloud along our network with Farthest Point Sampling (FPS), which reduces the graph resolution layer-by-layer, similar to the image resolution reduction in standard CNNs. We visualize the coarsening effect of FPS in Fig. 6 (top), using a chair from ModelNet40 as an example. The point clouds from left to right correspond to the vertices of the graphs in the ModelNet40 network, with the resolution systematically reducing from left to right according to the sizes given in Table II.
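FPS itself admits a compact greedy formulation: repeatedly pick the point farthest from the already selected set. A minimal sketch (starting from the first point; the released CUDA version parallelizes the distance updates):

```python
# Greedy farthest point sampling: select k indices such that each new point
# maximizes its distance to the set already chosen.

def farthest_point_sampling(points, k):
    def d2(a, b):
        return sum((a[i] - b[i]) ** 2 for i in range(3))
    chosen = [0]
    dist = [d2(p, points[0]) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], d2(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen
```

Each iteration costs O(N) distance updates, so sampling k points is O(Nk), which is why the coarsening remains practical for large clouds.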
Kernel visualization: We also visualize a few learned spherical kernels in Fig. 6 (bottom). The two rows correspond to the spherical kernels of the two SPH3D layers in Encoder2 of the network for the S3DIS dataset. As can be noticed, the weights of different kernels are distributed differently over the bins. For example, the third kernel in the first row contains predominantly positive weights in its upper hemisphere, but negative weights in the lower hemisphere, while the kernel directly below it is mainly composed of negative weights. These differences indicate that different kernels can identify different features for the same neighborhoods. For better visualization, we color each bin only on the sphere surface, not the 3D volume. Moreover, we do not show the weight of the self-loop.
We introduced a separable spherical convolutional kernel for point clouds and demonstrated its utility with graph pyramid architectures. We built the graph pyramids with range search and farthest point sampling. By applying a spherical convolution block at each graph resolution, the resulting graph convolutional networks are able to learn increasingly effective features over larger contexts, similar to standard CNNs. To perform the convolutions, the spherical kernel partitions the space it occupies into multiple bins and associates a learnable parameter with each bin. The parameters are learned with network training. We down/upsample the vertex features of different graphs with pooling/unpooling operations. The proposed convolutional network is shown to be efficient in processing high-resolution point clouds, achieving highly competitive performance on classification and semantic segmentation tasks over synthetic and large-scale real-world datasets.
This research is supported by the Australian Research Council (ARC) grant DP190102443. The Titan Xp GPU used for this research is donated by the NVIDIA Corporation.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Y. Zhang, M. Bai, P. Kohli, S. Izadi, and J. Xiao, "DeepContext: Context-encoding neural pathways for 3D holistic scene understanding," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1192–1201.
X. Ye, J. Li, H. Huang, L. Du, and X. Zhang, "3D recurrent neural networks with context fusion for point cloud semantic segmentation," in Proceedings of the European Conference on Computer Vision, 2018, pp. 403–417.
, "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
in International Conference on Machine Learning, 2001, pp. 282–289.