We propose an octree guided neural network architecture and spherical convolutional kernel for machine learning from arbitrary 3D point clouds. The network architecture capitalizes on the sparse nature of irregular point clouds, and hierarchically coarsens the data representation with space partitioning. At the same time, the proposed spherical kernels systematically quantize point neighborhoods to identify local geometric structures in the data, while maintaining the properties of translation-invariance and asymmetry. We specify spherical kernels with the help of network neurons that in turn are associated with spatial locations. We exploit this association to avert dynamic kernel generation during network training, which enables efficient learning with high resolution point clouds. The effectiveness of the proposed technique is established on the benchmark tasks of 3D object classification and segmentation, achieving new state-of-the-art on ShapeNet and RueMonge2014 datasets.
Convolutional Neural Networks (CNNs) are known to learn highly effective features from data. However, standard CNNs are only amenable to data defined over regular grids, e.g. pixel arrays. This limits their ability to process 3D point clouds, which are inherently irregular. Point cloud processing has recently gained significant research interest and large repositories for this data modality have started to emerge [1, 4, 12, 39, 40]. Recent literature has also seen many attempts to exploit the representation prowess of standard convolutional networks for point clouds by adaptation [23, 39]. However, these attempts have often led to excessively large memory footprints that restrict the allowed input data resolution [29, 33]. A more attractive choice is to combine the power of the convolution operation with graph representations of irregular data. The resulting Graph Convolutional Networks (GCNs) offer convolutions either in the spectral domain [3, 7, 15] or in the spatial domain.
In GCNs, the spectral domain methods require the Graph Laplacian to be aligned, which is not straightforward to achieve for point clouds. On the other hand, the only prominent approach in the spatial domain is the Edge Conditioned filters in CNNs for graphs (ECC) that, in contrast to the standard CNNs, must generate convolution kernels dynamically, entailing a significant computational overhead. Additionally, ECC relies on range searches for graph construction and coarsening, which can become prohibitively expensive for large point clouds. One major challenge in applying convolutional networks to irregular 3D data is in specifying geometrically meaningful convolution kernels in the 3D metric space. Naturally, such kernels are also required to exhibit translation-invariance to identify similar local structures in the data. Moreover, they should be applied to point pairs asymmetrically for a compact representation. Owing to such intricate requirements, a few existing techniques avoid the use of convolution kernels altogether in computational graphs that process unstructured data [16, 27, 28]. Although still attractive, these methods do not contribute towards harnessing the potential of convolutional neural networks for point clouds.
In this work, we introduce the notion of spherical convolutional kernel that systematically partitions a spherical 3D region into multiple volumetric bins, see Fig. 1. Each bin of the kernel specifies a matrix of learnable parameters that weights the points falling within that bin for convolution. We apply these kernels between the layers of a Neural Network (Ψ-CNN) that we propose to construct by exploiting octree partitioning of the 3D space. The sparsity guided octree structuring determines the locations to perform the convolutions in each layer of the network. The network architecture itself is guided by the hierarchy of the octree, having the same number of hidden layers as the tree depth. By exploiting space partitioning, the network avoids K-NN/range search and efficiently consumes high resolution point clouds. It also avoids dynamic generation of the proposed kernels by associating them to its neurons. At the same time, the kernels are able to share weights between similar local structures in the data. We theoretically establish that the spherical kernels are applied asymmetrically to points in our network just as the kernels in standard CNNs are applied asymmetrically to image pixels. This ensures compact representation learning by the proposed network in the point cloud domain. We demonstrate the effectiveness of our method for 3D object classification, part segmentation and large-scale semantic segmentation. The major contributions of this work are summarized below:
A novel concept of translation-invariant and asymmetric convolutional kernel is proposed and analyzed for point-wise feature learning from irregular point clouds.
The resulting convolutional kernel is exploited with an octree guided neural network that, in contrast to the previous voxelization applications of octree to point clouds [riegler2016octnet], hierarchically coarsens the data and constructs point neighborhoods using space partitioning to avoid time-consuming K-NN/range search.
PointNet is one of the first instances of exploiting neural networks to represent point clouds. It directly uses $xyz$-coordinates of points as input features. The network learns point-wise features with shared MLPs, and extracts a global feature with max pooling. A major limitation of PointNet is that it explores no geometric context in point-wise feature learning. This was later addressed by PointNet++ with hierarchical application of max-pooling to the local regions. The enhancement builds local regions using K-NN search as well as range search. Nevertheless, both PointNets [27, 28] aggregate the context information with max pooling, and no convolution modules are explored in the networks. With regard to processing point clouds with deep learning using tree structures, Kd-network is among the pioneering contributions. Kd-network also uses point coordinates as its input, and computes the feature of a parent node by concatenating the features of its children in a balanced tree. However, its performance depends heavily on the randomization of the tree construction. This is in sharp contrast to our approach that uses deterministic geometric relationships between the points. Another technique, SO-Net, reorganizes the irregular point cloud into a 2D rectangular map, and uses the PointNet architecture to learn node-wise features for the map. Similarly, KCNet also builds on PointNet and introduces a point-set template to learn geometric correlations of local points in the point cloud. PointCNN extracts permutation-invariant features by reordering the local points canonically with a learnable $\mathcal{X}$-transformation. All of these methods relate to our work in terms of directly accepting the spatial coordinates of points as input. However, they do not contribute towards the use of convolutional networks for processing 3D point clouds. Approaches advancing that research direction can be divided into two broad categories, discussed below.
Graph convolutional networks can be grouped into spectral networks [3, 7, 15] and spatial networks. The spectral networks perform convolution in the spectral domain relying on the graph Laplacian and adjacency matrices, while the spatial networks perform convolution in the spatial domain. A major limitation of spectral networks is that they demand the graph structure to be fixed, which makes their application to data with varying graph structures (e.g. point clouds) challenging. Yi et al. attempted to address this issue with the Spectral Transformer Network (SpecTN), similar to STN in the spatial domain. However, the signal transformation from the spatial to the spectral domain and vice-versa adds computational complexity. ECC is among the pioneering works for point cloud analysis with graph convolution in the spatial domain. Inspired by the dynamic filter networks, it adapts MLPs to generate convolution filters between the connected vertices dynamically. The dynamic generation of filters comes with computational overhead. Additionally, the neighborhood construction and graph coarsening in ECC must rely on range searches, which is not efficient. We achieve coarsening and neighborhood construction directly from the octree partitioning, thereby avoiding expensive range searching. Moreover, our spherical convolutional kernel effectively explores the geometric context of each point without requiring dynamic filter generation.
3D-CNNs are applied to volumetric representations of 3D data. In the earlier attempts in this direction, only low input resolution could be processed, e.g. 30×30×30, 32×32×32. This issue carried over to subsequent works as well [13, 31, 42, 43]. The limitation of low input resolution was a natural consequence of the cubic growth of memory and computational requirements associated with the volumetric input data. Later methods [8, 20] mainly aim at addressing these issues. Most recently, Riegler et al. proposed OctNet, which represents point clouds with a hybrid of shallow grid octrees (depth=3). Compared to its dense peers, OctNet reduces the computational and memory costs to a large degree, and is applicable to high-resolution inputs up to 256×256×256. Whereas OctNet also exploits octrees, there are major differences between OctNet and our method. Firstly, OctNet must process point clouds as regular 3D volumes due to its 3D-CNN kernels. No such constraint is applicable to our technique due to the proposed spherical kernels. Secondly, we are able to learn point cloud representation with a single deep octree instead of using a hybrid of shallow trees.
Our network derives its main strength from spherical convolutional kernels. Thus, it is imperative to first understand the proposed kernel before delving into the network details. This section explains our convolutional kernel for 3D point cloud processing.
For images, hand-crafted features have traditionally been computed over more primitive constituents, i.e. patches. In effect, the same principle carried over to automatic feature extraction with the standard CNNs that compute feature maps using the activations of well-defined rectangular regions. Whereas rectangular regions are a common choice to process data of 2D nature, spherical regions are more suited to process unstructured 3D data such as point clouds. Spherical regions are inherently amenable to computing geometrically meaningful features for such data [9, 34, 35]. Inspired by this natural kinship, we introduce the concept of spherical convolutional kernel that uses a 3D sphere as the basic geometric shape to perform the convolution. (Footnote 1: Note that the term spherical in Spherical CNN is used for spherical surfaces (i.e. images), not the ambient 3D space. Our concept of spherical kernel differs widely from that work, and it is used in a different context.)
Given an arbitrary point cloud $P = \{x_i \in \mathbb{R}^3\}_{i=1}^{m}$, where $m$ is the number of points, we define the convolution kernel with the help of a sphere of radius $\rho$. For a target point $x_i$, we consider its neighborhood $\mathcal{N}(x_i)$ to comprise the points within the sphere centered at $x_i$, i.e. $\mathcal{N}(x_i) = \{x : d(x, x_i) \le \rho\}$, where $d(\cdot,\cdot)$ is a distance metric, the $\ell_2$ distance in this work. We divide the sphere into $n \times p \times q$ 'bins' (see Fig. 1) by partitioning the occupied space uniformly along the azimuth ($\theta$) and elevation ($\phi$) dimensions. We allow the partitions along the radial dimension to be non-uniform because cubic volume growth for large radius values can become undesirable. Our quantization of the spherical region is mainly inspired by 3DSC. We also define an additional bin corresponding to the origin of the sphere to allow the case of self-convolution of points. For each bin, we define a weight matrix $W_\kappa \in \mathbb{R}^{C_{out} \times C_{in}}$, $\kappa \in \{0, 1, \dots, n \times p \times q\}$, of learnable parameters, where $C_{out}$ and $C_{in}$ are the numbers of output and input channels and $W_0$ relates to self-convolution. Together, the $n \times p \times q + 1$ weight matrices specify a single spherical convolutional kernel.
To compute the activation value for a target point $x_i$, we must identify the relevant weight matrices of the kernel for each of its neighboring points $x_j \in \mathcal{N}(x_i)$. It is straightforward to associate $x_i$ with $W_0$ for self-convolution. For the non-trivial cases, we first represent the neighboring points in terms of their spherical coordinates that are referenced using $x_i$ as the origin. That is, for each $x_j$ we compute $\psi_{ji} = T(\Delta_{ji})$, where $T(\cdot)$ defines the transformation from Cartesian to Spherical coordinates and $\Delta_{ji} = x_j - x_i$. Assuming that the bins of the quantized sphere are indexed by $k_\theta$, $k_\phi$ and $k_r$ along the azimuth, elevation and radial dimensions respectively, the weight matrices associated with the spherical kernel bins can be indexed as $\kappa = k_\theta + (k_\phi - 1)\, n + (k_r - 1)\, n\, p$, where $k_\theta \in \{1, \dots, n\}$, $k_\phi \in \{1, \dots, p\}$ and $k_r \in \{1, \dots, q\}$. Using this indexing, we assign each $\psi_{ji}$, and hence $x_j$, to its relevant weight matrix. In the $l$-th network layer, the activation for the point $x_i$ can then be computed as:
\[ z_i^{l} = \frac{1}{|\mathcal{N}(x_i)|} \sum_{x_j \in \mathcal{N}(x_i)} W_\kappa^{l}\, a_j^{l-1} + b^{l}, \qquad a_i^{l} = f(z_i^{l}), \]
where $a_j^{l-1}$ is the activation value of a neighboring point from layer $l-1$, $W_\kappa^{l}$ is the weight matrix, $b^{l}$ is the bias vector, and $f(\cdot)$ is the non-linear activation function, ReLU [25] in our experiments.
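The bin lookup and the per-point activation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' CUDA implementation: the function names, the `EDGES` example (8 azimuth, 2 elevation and 2 radial bins), the flattening order of the bin index, and the mean normalization over the neighborhood are all assumptions made for the sketch.

```python
import numpy as np

def spherical_bin_index(x_i, x_j, theta_edges, phi_edges, r_edges):
    """Kernel bin index of neighbor x_j as seen from target x_i (0 = self-convolution)."""
    delta = np.asarray(x_j, dtype=float) - np.asarray(x_i, dtype=float)
    r = float(np.linalg.norm(delta))
    if r == 0.0:
        return 0  # origin bin
    theta = np.arctan2(delta[1], delta[0])             # azimuth in [-pi, pi]
    phi = np.arcsin(np.clip(delta[2] / r, -1.0, 1.0))  # elevation in [-pi/2, pi/2]
    n, p, q = len(theta_edges) - 1, len(phi_edges) - 1, len(r_edges) - 1
    # 1-based bin numbers; clipping folds boundary and out-of-range values into the end bins
    k_theta = int(np.clip(np.searchsorted(theta_edges, theta, side="right"), 1, n))
    k_phi = int(np.clip(np.searchsorted(phi_edges, phi, side="right"), 1, p))
    k_r = int(np.clip(np.searchsorted(r_edges, r, side="left"), 1, q))
    return k_theta + (k_phi - 1) * n + (k_r - 1) * n * p

def spherical_conv(x_i, neighbors, feats, weights, bias, edges):
    """Average the bin-selected linear maps over the neighborhood, add bias, apply ReLU.

    weights has shape (num_bins + 1, C_out, C_in); feats has shape (num_neighbors, C_in).
    """
    z = np.zeros(weights.shape[1])
    for x_j, a_j in zip(neighbors, feats):
        z += weights[spherical_bin_index(x_i, x_j, *edges)] @ a_j
    return np.maximum(z / len(neighbors) + bias, 0.0)

# Hypothetical quantization used only for this example: 8 x 2 x 2 bins plus the origin bin.
EDGES = (np.linspace(-np.pi, np.pi, 9),
         np.array([-np.pi / 2, 0.0, np.pi / 2]),
         np.array([0.0, 0.5, 1.0]))
```

Because the radial index is clipped, a point that lies slightly outside the sphere falls into the outer-most radial bin of its azimuth and elevation sector.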
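To elaborate on the characteristics of the proposed spherical convolutional kernel, let us denote the edges of the kernel bins along the $\theta$, $\phi$ and $r$ dimensions respectively as:
\[ \Theta = [\Theta_1, \dots, \Theta_{n+1}],\ \Theta_k < \Theta_{k+1},\ \Theta_k \in [-\pi, \pi]; \quad \Phi = [\Phi_1, \dots, \Phi_{p+1}],\ \Phi_k < \Phi_{k+1},\ \Phi_k \in [-\tfrac{\pi}{2}, \tfrac{\pi}{2}]; \quad R = [R_1, \dots, R_{q+1}],\ R_k < R_{k+1},\ R_k \in (0, \rho], \]
where $n$, $p$ and $q$ are the numbers of bins along the azimuth, elevation and radial dimensions, and $\rho$ is the kernel radius.
Due to the constraint of uniform splitting along the azimuth and elevation, we can write $\Theta_{k+1} - \Theta_k = \frac{2\pi}{n}$ and $\Phi_{k+1} - \Phi_k = \frac{\pi}{p}$.
Lemma 2.1: If $\Theta_k \cdot \Theta_{k+1} \ge 0$, $\Phi_k \cdot \Phi_{k+1} \ge 0$ and $n > 2$, then for any two points $x_a \neq x_b$ within the spherical convolutional kernel, the weight matrices $W_\kappa$ of the non-origin bins ($\kappa > 0$) are applied asymmetrically.
Proof: Let $\Delta_{ab} = x_a - x_b = [\delta_x, \delta_y, \delta_z]^\top$, then $\Delta_{ba} = x_b - x_a = [-\delta_x, -\delta_y, -\delta_z]^\top$. Under the Cartesian to Spherical coordinate transformation, we have $T(\Delta_{ab}) = \psi_{ab} = [\theta_{ab}, \phi_{ab}, r]^\top$, and $T(\Delta_{ba}) = \psi_{ba} = [\theta_{ba}, \phi_{ba}, r]^\top$. We assert that $\psi_{ab}$ and $\psi_{ba}$ fall in the same bin indexed by $\kappa$, i.e. $W_\kappa$ is applied symmetrically to the points $x_a$ and $x_b$. In that case, under the inverse transformation $T^{-1}$, we have $\delta_z = r \sin\phi_{ab}$ and $-\delta_z = r \sin\phi_{ba}$. The condition $\Phi_k \cdot \Phi_{k+1} \ge 0$ entails that $-\delta_z^2 = r^2 \sin\phi_{ab} \sin\phi_{ba} \ge 0$, i.e. $\delta_z = 0$. Similarly, $\Theta_k \cdot \Theta_{k+1} \ge 0$ entails $\delta_y = 0$. Since $x_a \neq x_b$, for $\delta_x \neq 0$ we have $|\theta_{ab} - \theta_{ba}| = \pi$. However, if $\psi_{ab}$, $\psi_{ba}$ fall into the same bin, we have $|\theta_{ab} - \theta_{ba}| \le \frac{2\pi}{n} < \pi$, which entails $\delta_x = 0$. Thus, the assertion can not hold, and $W_\kappa$ can not be applied to any two points symmetrically unless both points are the same.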
The asymmetry property of the spherical kernel is significant because it restricts the sharing of the same weights between point pairs, which facilitates learning more effective features with finer geometric details. Lemma 2.1 also provides guidelines for the division of the convolution kernel into bins such that the asymmetry is always preserved. Note that asymmetric application of kernel weights to pixels comes naturally in standard CNN kernels. The proposed kernel ensures the same property in the point cloud domain.
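The asymmetry property can also be probed numerically. The sketch below checks whether the offset from one point to another and the reversed offset land in the same angular bin; the radial bin is always shared because both offsets have the same length. The function names and the two example quantizations are illustrative assumptions: one has a bin edge at zero in both azimuth and elevation with four azimuth bins, the other is a single bin covering the whole sphere.

```python
import numpy as np

def angular_bin(delta, theta_edges, phi_edges):
    """Return the 1-based (azimuth, elevation) bin pair of a non-zero offset vector."""
    r = np.linalg.norm(delta)
    theta = np.arctan2(delta[1], delta[0])
    phi = np.arcsin(np.clip(delta[2] / r, -1.0, 1.0))
    k_theta = int(np.clip(np.searchsorted(theta_edges, theta, side="right"),
                          1, len(theta_edges) - 1))
    k_phi = int(np.clip(np.searchsorted(phi_edges, phi, side="right"),
                        1, len(phi_edges) - 1))
    return k_theta, k_phi

def shares_bin(x_a, x_b, theta_edges, phi_edges):
    """True if x_a seen from x_b and x_b seen from x_a would use the same weight matrix."""
    d = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    return angular_bin(d, theta_edges, phi_edges) == angular_bin(-d, theta_edges, phi_edges)

# Quantization with a bin edge at zero in both angles and four azimuth bins.
THETA_OK = np.array([-np.pi, -np.pi / 2, 0.0, np.pi / 2, np.pi])
PHI_OK = np.array([-np.pi / 2, 0.0, np.pi / 2])
# Degenerate quantization: one bin for everything, so weights are shared symmetrically.
THETA_BAD = np.array([-np.pi, np.pi])
PHI_BAD = np.array([-np.pi / 2, np.pi / 2])

rng = np.random.default_rng(0)
pairs = rng.normal(size=(200, 2, 3))
symmetric_hits = sum(shares_bin(a, b, THETA_OK, PHI_OK) for a, b in pairs)
```

With the first quantization no random pair shares a bin, while the single-bin quantization shares it for every pair.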
Relation to 3D-CNN: Here, we briefly relate the proposed notion of spherical kernel to the existing techniques that exploit CNNs for 3D data. Pioneering works in this direction rasterize the raw data into uniform voxel grids, and then extract features using 3D-CNNs from the resulting volumetric representations [23, 39]. In 3D-CNNs, the convolution kernel of size 3×3×3 is prevalently used, which splits the space into 1 cell/voxel for radius 0 (self-convolution); 6 cells for radius 1; 12 cells for radius $\sqrt{2}$; and 8 cells for radius $\sqrt{3}$. An analogous spherical convolution kernel for the same region can be specified by choosing a radius and the following edges for the bins:
This division results in a kernel size (i.e. total number of bins) , which is the coarsest multi-scale quantization allowed by Lemma 2.1.
Notice that, if we move radially from the center to the periphery of spherical kernel, we encounter identical number of bins (16 in this case) after each edge defined by , where fine-grained bins are located close to the origin that can encode detailed local geometric information of the points. This is in sharp contrast to 3D-kernels that must keep the size of all cells constant and rely on increased input resolution of the data to capture finer details - generally entailing memory issues. The multi-scale granularity of spherical kernel makes it a natural choice for raw point clouds.
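The cell counts quoted for the 3×3×3 voxel kernel (1, 6, 12 and 8 cells at increasing distance from the center) follow from grouping the integer voxel offsets by their squared distance to the center. A short sketch, with `shell_counts` as a name chosen here for illustration:

```python
from collections import Counter
from itertools import product

def shell_counts(size=3):
    """Count voxels of a size^3 kernel by squared distance from the central voxel."""
    half = size // 2
    counts = Counter()
    for dx, dy, dz in product(range(-half, half + 1), repeat=3):
        counts[dx * dx + dy * dy + dz * dz] += 1
    return dict(sorted(counts.items()))

# Keys are squared radii: 0 (center), 1 (faces), 2 (edges), 3 (corners).
print(shell_counts())
```

Unlike the spherical kernel, the number of cells here changes from shell to shell, and every cell has the same volume regardless of its distance from the center.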
Most of the existing attempts to process point clouds with neural networks [18, 19, 28, 32, 33] rely on K-NN or range searches to define local neighborhoods of points, which are subsequently used to perform operations like convolution or pooling. However, to process large point clouds, these search strategies become computationally prohibitive. For unstructured data, an efficient mechanism to define point neighborhoods is tree-structuring, e.g. Kd-tree. The hierarchical nature of tree structures also provides guidelines for neural network architectures that can be used to process the point cloud. More importantly, tree-structured data also possess the much desired attributes of permutation and translation invariance for neural networks.
We exploit octree structuring  of point clouds and design a neural network based on the resulting trees. Our choice of using octree comes from its amenability to neural networks as the base data structure , and its ability to account for more data in point neighborhoods compared to, for example, Kd-tree. We illustrate 3D space partitioning under octree, the resulting tree, and the formation of neural network using the proposed strategy of network construction in Fig. 2 for a toy example. For an input point cloud , we construct an octree of depth ( in the figure). In the construction, the splitting of nodes is fixed to use a maximum capacity of one point, with the exception of the last layer leaf nodes. The point in a parent node is computed as the Expected value of the points in its children. The allocation of multiple points in the last layer nodes directly results from the allowed finest partitioning of the space. For the sub-volumes in 3D space that are not densely populated, our splitting strategy can result in leaf nodes before the tree reaches its maximum depth. In such cases, to facilitate mapping of the tree to a neural network, we replicate the leaf nodes to the maximum depth of the tree. We safely ignore the empty nodes while implementing the network, resulting in computational and memory benefits.
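The splitting rule described above (split a cell until it holds a single point or the maximum depth is reached, skip empty cells, and give every parent the mean of its children's points) can be sketched recursively. This is an illustrative assumption of how such a tree might be built, not the authors' implementation; the dictionary layout and the name `build_octree` are hypothetical, and the replication of early leaves down to the maximum depth is omitted for brevity.

```python
import numpy as np

def build_octree(points, lo, hi, depth, max_depth):
    """Return a nested dict for the cell [lo, hi), or None if the cell is empty."""
    if len(points) == 0:
        return None  # empty cells are ignored
    if len(points) == 1 or depth == max_depth:
        # Leaf: a single point, or the mean of all points left at the finest level.
        return {"point": points.mean(axis=0), "children": [], "depth": depth}
    mid = (lo + hi) / 2.0
    # Octant code: bit 0 for x, bit 1 for y, bit 2 for z.
    octant = ((points >= mid) * np.array([1, 2, 4])).sum(axis=1)
    children = []
    for k in range(8):
        bits = np.array([(k >> axis) & 1 for axis in range(3)])
        child = build_octree(points[octant == k],
                             np.where(bits, mid, lo), np.where(bits, hi, mid),
                             depth + 1, max_depth)
        if child is not None:
            children.append(child)
    # Parent location: mean of the locations of its non-empty children.
    parent_point = np.mean([c["point"] for c in children], axis=0)
    return {"point": parent_point, "children": children, "depth": depth}
```

A call such as `build_octree(pts, pts.min(axis=0), pts.max(axis=0) + 1e-9, 0, 6)` would produce a 6-level tree over an `(N, 3)` array `pts`; the small offset on the upper bound is only there so that the maximal point falls inside the half-open root cell.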
Based on the hierarchical tree structure, our neural network also has hidden layers. Notice that, in Fig. 2 we use for the first hidden layer that corresponds to Depth for the tree. We will use the same convention in the text to follow. For each non-empty node in the tree, there is a corresponding neuron in our neural network. Recall that a spherical convolutional kernel is specified with a target point over whose neighborhood the convolution is performed. Therefore, to facilitate convolutions, we associate a single 3D point with each neuron, except for the leaf nodes at the maximum depth of the tree. For a leaf node, the associated point is the mean value of data points allocated to that node. A neuron uses its associated point/location to select the appropriate spherical kernel and later applies the non-linear activation (not shown in Fig. 2). In our network, all convolution layers before the last layer are followed by batch normalization and ReLU activations.
We denote the location associated with the neuron in the layer of the network as . From to , we can represent the locations associated with all neurons as , , . Denoting the raw input points as , is numerically computed by our network as:
where contains locations of the relevant children nodes in the octree. It is worth noting that the strategy used for specifying the network layers also entails that . Thus, from the first layer to the last, the features learned by our network move from lower to higher level of abstraction similar to the standard CNNs.
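The averaging rule described in the text (a neuron's location is the expected value of the locations of its children in the octree, with the first layer averaging raw input points) can be written compactly as below. The symbols $x_i^{l}$ for the location of the $i$-th neuron in layer $l$ and $\mathcal{C}_i^{l}$ for the set of its children's locations are notation assumed for this sketch, not necessarily the authors' own.

```latex
x_i^{l} \;=\; \frac{1}{\lvert \mathcal{C}_i^{l} \rvert} \sum_{x \,\in\, \mathcal{C}_i^{l}} x ,
\qquad
\mathcal{C}_i^{l} \subseteq
\begin{cases}
  \text{raw input points}, & l = 1,\\[2pt]
  \{\, x_j^{\,l-1} \,\}_j , & l > 1 .
\end{cases}
```

Since every neuron in layer $l$ has at least one child in layer $l-1$ and no child is shared between parents, the number of locations cannot grow from one layer to the next under this rule.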
In relating the spherical nature of point neighborhood considered in our network to the cubic partitioning of space by octree, a subtle detail is worth considering. Say = , and = determine the range of point coordinates in a given cubic volume resulting from our space partitioning. The spherical neighborhood associated with a neuron in the layer is defined with the radius . This neighbourhood may not strictly circumscribe all points of the corresponding cubic volume at this level due to shape dissimilarity. Although the number of such points is minuscule in practice, we still take those into account by assigning them to the outer-most bins of our kernels based on their azimuth and elevation values.
Our neural network performs inter-layer convolutions instead of intra-layer convolutions. This drastically reduces the operations required to process large point clouds when compared with graph-based networks [3, 7, 15, 33, 41]. We note that for all nodes with a single child, only self-convolutions are performed in the network. Note that due to its unconventional nature, the spherical convolutional kernel is not readily implemented using the existing deep learning libraries, e.g. matconvnet. Therefore, we implement it ourselves with CUDA C++ and the mex interface (the implementation will be made public). For the other modules such as ReLU, batch normalization etc., we use matconvnet.
OctNet  also makes use of octree structure. However, OctNet processes point clouds as regular 3D volumes - a 3D-CNN. In contrast, we process point clouds following their unstructured nature. Our network learns features for each point in the sets from to , which is in contrast to OctNet that must account for occupied and unoccupied voxels, entailing complexity. We exploit octree structure to simultaneously construct neighborhoods of all points and coarsen the original point cloud layer-by-layer, while OctNet uses this structure to voxelize the point cloud into different resolutions.
The classification and segmentation networks are basically variants of the same core architecture shown in Fig 2. However, we additionally insert an MLP layer prior to the octree structure to obtain more expressive point-wise features. This concept is inspired from Kd-Net . Figure 3 shows the complete architectures for classification and segmentation. To fully exploit the hierarchical features learned at different octree levels, we use features from all octree layers. For classification, we max pool the features from intermediate layers, including the raw features, and concatenate them with the features at the root node to form a global representation of the complete point cloud. For segmentation, we need point wise features. The feature of each point is the concatenation of raw features, MLP features and layer-wise features without any pooling. The final classification or segmentation is performed using three fully connected layers.
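For the classification branch, the aggregation described above amounts to a per-layer max pool followed by a concatenation with the root feature. A minimal NumPy sketch, assuming each layer's features are stored as a `(num_nodes, channels)` array and that the root layer has a single row; the name `global_descriptor` and the list layout are hypothetical:

```python
import numpy as np

def global_descriptor(layer_feats):
    """Build a global feature from per-layer features.

    layer_feats: list of (num_nodes_l, channels_l) arrays ordered from the raw
    point features to the root; the last entry is the root node feature.
    """
    *intermediate, root = layer_feats
    pooled = [f.max(axis=0) for f in intermediate]  # max pool over nodes in each layer
    return np.concatenate(pooled + [root.reshape(-1)])

feats = [np.arange(6, dtype=float).reshape(2, 3),  # e.g. raw features of two points
         np.array([[1.0, 5.0], [4.0, 2.0]]),       # an intermediate octree layer
         np.array([[7.0, 8.0]])]                   # root node
descriptor = global_descriptor(feats)
```

The segmentation branch differs in that no pooling is applied: each point would instead gather the features of its own ancestors at every level, which requires the point-to-node assignment from the octree.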
We conduct experiments on clean CAD models as well as noisy point clouds to evaluate the performance of our method for the tasks of 3D object classification, part segmentation and semantic segmentation. Throughout the experiments, we keep the size of our convolution kernel fixed, with the radial dimension split uniformly. We use three fully connected layers (512-256-$C$) followed by softmax as the classifier for both the classification and segmentation tasks. Here, $C$ denotes the number of classes/parts. The training of our network is conducted using a Titan Xp GPU with 12 GB memory. We use Stochastic Gradient Descent with momentum to train the network. The batch size is kept fixed to 16 in all our experiments. These hyper-parameters were empirically optimized using cross-validation. We use only the $xyz$ coordinates of points provided by point clouds to train our network, and additionally the RGB values when the color information is provided. A few existing methods in the literature also take advantage of normals, and use them as input features. However, normals are not directly sensed by 3D sensors and must be computed using the point coordinates. This also entails additional computational burden. Hence, we avoid using normals as input features. In our experiments, we follow the standard practice of taking advantage of data augmentation. To that end, we used random sub-sampling of the original point clouds, performed random azimuth rotation within a bounded angle, and also applied noisy translation (std. dev = 0.02) to increase the number of training examples. These operations were performed on the fly in each training epoch of the network.
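The three augmentation steps (random sub-sampling, random azimuth rotation, noisy translation) can be sketched as a single function applied on the fly. Several details are assumptions made for this illustration: the vertical axis is taken to be z, the rotation bound is exposed as the hypothetical parameter `max_angle`, and the noisy translation is interpreted as one Gaussian shift per point cloud rather than per-point jitter.

```python
import numpy as np

def augment(points, num_samples, max_angle, sigma=0.02, rng=None):
    """Randomly sub-sample, rotate about the z axis, and translate a point cloud."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=num_samples, replace=False)  # sub-sampling
    angle = rng.uniform(-max_angle, max_angle)                      # azimuth rotation
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    shift = rng.normal(0.0, sigma, size=3)                          # noisy translation
    return points[idx] @ rot.T + shift
```

Calling `augment(cloud, 10000, max_angle=0.2)` once per sample in each epoch would yield a different view of the same shape every time; the value 0.2 is only a placeholder.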
We use the benchmark datasets ModelNet10 and ModelNet40  to evaluate our technique for the classification task. These datasets are created using clean CAD models. ModelNet10 contains 10 categories of object meshes, and the samples are split into 3,991 training examples and 908 test instances. ModelNet40 contains object meshes for 40 categories with 9,843/2,468 training/testing split.
Compared to existing works (e.g. [27, 28, 32, 33]), the convolutions performed in our network allow the proposed method to consume large input point clouds. Hence, we train our network using 10K input points. For the classification task, we adopted a network with 6 levels of octree, where the numbers of feature channels are kept as MLP(32)-Octree(64-64-64-128-128-128). The network comprises two components: the octree based architecture for feature extraction and the classification stage. We train the whole network in an end-to-end fashion. We standardize the input models by normalizing the 3D point clouds to fit into a cube with zero mean.
Table 1 benchmarks the object classification performance of our approach that is abbreviated as Ψ-CNN (a Greek letter is chosen as the prefix to avoid duplication with other OCNNs and SCNNs, e.g. [21, 26, 37]). Our method uses coordinates of points as raw features to achieve these results. As can be seen, Ψ-CNN consistently achieves the best performance on ModelNets. We note that, like our method, Kd-Net and OctNet are also tree structure based networks. However, they require twice the number of parametric layers as required by our method to achieve the reported performance. This is a direct consequence of effective exploration of geometric information by the proposed kernel.
The ShapeNet part segmentation dataset contains 16,881 CAD models from 16 categories. The models in each category have two to five annotated parts, amounting to 50 parts in total. The point clouds are created with uniform sampling from 3D meshes. This dataset provides coordinates of the points as raw features, and has a 14,007/2,874 training/testing split defined. We use a 6-level octree for the segmentation network, with configuration MLP(64)-Octree(128-128-256-256-512-512). The output class number of the classifier is determined by the number of parts in each category. We use the part-averaged IoU (mIoU) to report the performance in Table 2. Similar to the classification task, we also standardize the input models of ShapeNet by normalizing input point clouds to a cube with zero mean.
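As a reference for how a part-averaged IoU is commonly computed for a single shape, the sketch below averages the intersection-over-union over all parts of the shape's category. The convention that a part absent from both prediction and ground truth counts as IoU 1 is an assumption borrowed from common ShapeNet evaluation practice, not something stated in the text, and `shape_miou` is a name chosen for this example.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Mean IoU over the parts of one shape's category.

    pred, gt: integer label arrays of shape (num_points,).
    part_ids: all part labels that belong to the shape's category.
    """
    ious = []
    for part in part_ids:
        p, g = pred == part, gt == part
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(1.0)  # assumed convention: part absent in both counts as perfect
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

gt = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
score = shape_miou(pred, gt, part_ids=[0, 1, 2])
```

A dataset-level figure would then be obtained by averaging these per-shape scores, either over all shapes or per category first.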
In Table 2, we compare our results with the popular methods that also take irregular point clouds as input. Yet, to achieve their results, some of these methods exploit normals besides coordinates as input features, e.g. PointNet, PointNet++, SO-Net. It can be seen that Ψ-CNN not only achieves the highest mIoU, but also outperforms the other approaches on 11 out of 16 categories. To the best of our knowledge, Ψ-CNN records the new state-of-the-art performance on this part segmentation dataset, which is higher than that of the specialized segmentation networks SSCN and SGPN.
In Fig. 4, we show a few representative segmentation results. High mIoU is achieved by Ψ-CNN for the high-quality results, whereas the mIoU value is low for the other case. Examining the low-quality results, we found that most of these cases are caused by one of two conditions. (1) Confusing ground truth labelling: e.g. the axle in Skateboard is labelled as a separate segment in most of the ground truth samples but as part of the wheels in a few other samples. Hence, the network learns the more dominant segmentation. The same holds for the legs of Chairs. (2) Small parts without clear boundaries: e.g. handles of a Bag are considered separate segments in the ground truth. Overall, these results indicate the success of Ψ-CNN for the part segmentation task.
[Figure 4: Representative part segmentation results. Panels: High-Quality Segmentation, Low-Quality Segmentation.]
We also test our model for Semantic Segmentation of real world data with the RueMonge2014 dataset. This dataset contains 700 meters of facades along a street annotated with point-wise labelling. The classes include window, wall, balcony, door, roof, sky and shop. The point clouds are provided with color features. To train our network, we split both the training and testing data into blocks. We align the facade plane of all the blocks into the same plane, and adjust the gravitational axis to be upright. We only force the two non-vertical dimensions to have zero mean, but not the vertical (height) axis. This processing strategy is adopted to avoid losing the height information. We use $xyz$ + RGB as raw input features to train our network. The network configuration used is MLP(64)-Octree(64-64-128-128-256-256). Table 3 compares the results of our approach with the current state-of-the-art on this dataset, under the same evaluation protocol. With 7 parametric layers, we achieve better performance than OctNet, which uses 20 parametric layers to learn the final representation of each point. These results demonstrate the promise of Ψ-CNN in practical applications.
For geometrically meaningful convolutions, knowledge of the local neighborhood of points is imperative. A related approach, ECC, exploits range search for this purpose. Another obvious choice is K-NN clustering. However, with tree structures such as an octree, the point neighborhood information is readily available, which adds to the computational efficiency of Ψ-CNN. In Fig. 5, we report the timings of computing neighborhoods under different choices, and compare them to octree construction. As can be seen, for larger numbers of input points, octree structuring is more efficient than K-NN and range searching. Moreover, it is also more efficient than Kd-tree for large input sizes because the binary split in Kd-tree forces it to be much deeper than the octree.
Running our classification network on 1K randomly selected samples from ModelNets, we compute the test time of our network for point clouds of sizes 10K, and report timings in Table 4. The test time for a sample consists of time required to construct the octree and performing the forward pass. We also show the time of normal computation in the table for reference. Our approach does not compute normals to achieve the results reported in the previous section. To put these timings into perspective, PointNet++  requires roughly 115ms for a forward pass of input with 1024 points on the same machine. In Fig. 6, we also show a representative example of point cloud coarsening by our method under octree structuring. Our network gradually sparsifies the point cloud by applying spherical convolutional kernel at each level.
We introduced the notion of spherical convolutional kernels for point cloud processing and demonstrated its utility with a neural network guided by octree structure. The network successively performs convolutions in the neighborhood of its neurons, the locations of which are governed by the nodes of the underlying octree. To perform the convolutions, our spherical kernel divides its occupied space into multiple bins and associates a weight matrix to each bin. These matrices are learned with network training. We have shown that the resulting network can efficiently process large 3D point clouds, achieving excellent performance on the tasks of 3D classification and segmentation on synthetic and real data.
Acknowledgments This research was supported by ARC Discovery Grant DP160101458 and partially by DP190102443. We also thank NVIDIA corporation for donating the Titan XP GPU used in our experiments.