Point clouds are unstructured and unordered data, as opposed to images. Thus, most machine learning approaches developed for images cannot be directly transferred to point clouds. Doing so usually requires a data transformation such as voxelization, inducing a possible loss of information. In this paper, we propose a generalization of discrete convolutional neural networks (CNNs) able to deal with sparse input point clouds. We replace the discrete kernels by continuous ones. The formulation is simple, does not fix the input point cloud size and can easily be used for neural network design, similarly to 2D CNNs. We present experimental results, competitive with the state of the art, on shape classification, part segmentation and semantic segmentation for large-scale clouds.
From 3D surface reconstruction for historical heritage preservation to autonomous driving, a large range of applications make use of 3D point clouds. The point clouds are either the direct input, e.g. lidar acquisitions, or intermediary products, e.g. photogrammetry.
These point sets are sparse samples of the underlying surface of the scene or objects. With the exception of structured acquisition (e.g. Lidars), point clouds are generally unordered and not spatially structured. They cannot be sampled on a regular grid and processed as image pixels. Moreover, most of the time, the points do not hold colorimetric features and the observation is consequently only defined by the relative positions of the points.
Due to these considerations, processing point clouds is difficult and methods developed for image processing do not apply directly. Applying them requires data transformations and problem reformulation. Consider, for example, convolutional neural networks (CNNs), which have reached the state of the art in many image processing tasks. These networks make extensive use of the grid relations between pixels by using small convolution kernels. This grid assumption does not hold for point clouds. A first way to adapt the approach is to voxelize the space so that the usual convolution applies. However, voxelization may induce information loss and directional bias.
In this paper, we propose a generalization of discrete CNNs for unstructured data. Starting from these networks, we build a new continuous convolutional framework able to ingest sparse point data. The contribution of this paper is two-fold. First, we introduce convolutional kernels for points. They can be used with points that are not necessarily sampled on a grid. It is a simple and straightforward extension of the usual discrete convolutions for grid-sampled data. Second, we design neural networks using a hierarchical data representation structure based on a search tree. By progressively reducing the number of observed points in the point cloud space, we end up with a structure very similar to the usual network architectures used with grid-structured input data such as voxels or images. The layer and network architectures are then trained for point cloud classification, part segmentation and semantic segmentation. For each task we show that our method is competitive with the state of the art.
The paper is organized as follows: section 2 presents related work, section 3 describes the continuous convolutional layer, and section 4 is dedicated to the spatial representation of the data and the description of the networks. Finally, section 5 shows experiments on different datasets for classification and semantic segmentation.
Point cloud processing is a widely discussed topic. We focus here on machine learning techniques for point cloud classification or local attribute estimation.
Most methods use handcrafted features defined using a point and its neighborhood or a local estimate of the underlying surface around the point. These features describe statistical properties of the shape and are designed to be invariant to rigid or non-rigid transformations of the shape [JH99, ASC11, BK10, LJ07].
In recent years, the release of large annotated point cloud databases has allowed the development of deep neural network methods able to learn both descriptors and decision function.
The direct adaptation to 3D of CNNs developed for image processing is to use 3D convolutions. Methods like [WSK15, MS15] apply 3D convolutions on a voxel grid. Even though recent hardware advances allow us to use these networks on larger scenes, they are still time consuming and require a relatively low voxel resolution, which may result in a loss of information and undesirable bias due to grid axis alignment. In order to avoid these drawbacks, [GvdM17, Gra15] use sparse convolutional kernels, while [LPS16, WP15] observe the 3D space to focus computation where objects are located.
A second class of algorithms avoids 3D convolutions by creating 2D representations of the shape, applying 2D CNNs and projecting the results back to 3D. This approach has been used for object classification [SMKLM15], jointly with voxels [QSN16] or for semantic segmentation [BGLSA17]. One of the main issues when using multi-view frameworks is to define an efficient and robust view generation procedure, which can be very difficult depending on the data.
The previous methods are based on 2D and 3D CNNs. They therefore require the data to be organized (as a 3D grid or a 2D image) before it can be processed. A third class of machine learning approaches considers unstructured data: PointNet [QSMG17] and its derivatives. The key idea is to construct a transfer function that is invariant to permutation of the input features, obtained by using the symmetric function max pooling. The coordinates of the points are given as input features and geometric operations (affine transformations) are obtained with small auxiliary networks. However, PointNet fails to capture local structures. Its improvement, PointNet++ [QYSG17], uses a cascade of PointNet networks from local scale to global scale.
Convolutional neural layers are widely used in machine learning for image processing and more generally for data sampled on a regular grid (such as 3D voxels). However, the use of CNNs when data is missing or not sampled on a regular grid is not straightforward. Several works have studied the generalization of such powerful schemes to other data.
, the authors are able to deal with point clouds. The input signal is interpolated on the convolutional kernel, the convolution is applied, and the output is interpolated back to the input shape. The proposed approach shares ideas with these works but does not depend on a convolution kernel designed on a grid. The kernel element locations are optimized as in [DQX17] and the input points are weighted according to their distance to the kernel elements, as in [SJS18].
In [LBS18], an X-transform is applied to the point coordinates to create geometrical features that are combined with the input point features. This is a major difference with our approach: as in [QYSG17], [LBS18] makes use of the input geometry as features. We want the geometry to behave as in discrete convolutions, i.e. to weight the relation between kernel and input. In other words, the geometric aspects are defined in the network structure (convolution strides, pooling layers) and pixel coordinates are not a network input.
In [WSPU18], the authors propose a convolution layer defined as a weighted sum of the input features, the weights being computed by a multi-layer perceptron (MLP). Our approach is close to [WSPU18] in both concept and implementation but differs in two respects. First, our approach computes a dense weighting function that takes into account the whole kernel. Second, the derived kernel is an explicit set of points associated with weights, as opposed to an implicit version using an MLP.
We build our continuous convolutional layer by starting from the discrete convolution formulation used for grid-sampled data such as images.
In discrete convolutions, the kernel (where is an integer sequence and is the number of elements in ) is convoluted with an input of same size , being the feature vector associated with the -th element of . Given a bias , the output is:
where is the indicator function such that if , otherwise. This is a one to one relation between kernel elements and input elements. This is illustrated on figure 1 top for a convolution on grid-sampled data.
If we now consider , resp. the spatial locations of the kernels elements, resp. the input elements (for an image it would be the pixel coordinates of the patch), then and . We can rewrite the expression:
where we added a normalization according to the input set size, for robustness to input variation in size.
In this study, we consider points in dimension , without spatial structure. In the general case, using will be too restrictive, taking almost all the time value. More generally, we need a continuous function , to establish a relation between the s and . is a geometrical weighting function that takes into account the relations between the elements of and . In a more general case may not only consider but the whole kernel configuration. To this end, we choose a function, specific to the -th kernel element, function of the relative position between the kernel elements and the input points, . We note the set :
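As a reading aid, the three formulations discussed in this section can be sketched as follows. The notation is an assumption made for this sketch and may differ from the authors' symbols: $w_j$ are the kernel weights, $x_i$ the input feature vectors, $c_j$ and $p_i$ the spatial locations of kernel and input elements, $\beta$ the bias, $\mathbf{1}$ the indicator function and $\phi_j$ the continuous weighting function of the $j$-th kernel element.

```latex
% (1) Discrete convolution: kernel K = {w_j}, input X = {x_i}, |K| = |X|.
y = \beta + \sum_{j=1}^{|K|} w_j^{\top} x_j
  = \beta + \sum_{j=1}^{|K|} \sum_{i=1}^{|X|} \mathbf{1}(i, j)\, w_j^{\top} x_i
\tag{1}

% (2) Same relation written on spatial locations, normalized by the input set size.
y = \beta + \frac{1}{|X|} \sum_{j=1}^{|K|} \sum_{i=1}^{|X|}
    \mathbf{1}(c_j, p_i)\, w_j^{\top} x_i
\tag{2}

% (3) Continuous case: the indicator is replaced by a weighting function of the
%     relative positions between the input point and all kernel elements.
y = \beta + \frac{1}{|X|} \sum_{j=1}^{|K|} \sum_{i=1}^{|X|}
    \phi_j\!\left(\{p_i - c_k\}_{k=1}^{|K|}\right) w_j^{\top} x_i
\tag{3}
```

Under this reading, (2) reduces to (1) up to the constant factor $1/|X|$ when each input location coincides with exactly one kernel location, and (3) reduces to (2) when $\phi_j$ is chosen as the indicator on $p_i - c_j = 0$.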
The main difference in formulation with [WSPU18] resides in the explicit formulation of the kernel, in a way similar to the discrete formulation. In [WSPU18], the authors directly use an MLP to weight the input features centered around the reference point used for computing the neighborhood. It is a particular case of our formulation with and is the reference point. Moreover, as opposed to approaches like PointNet [QSMG17] and PointNet++ [QYSG17], the geometrical space and the feature space are separated, i.e. point spatial coordinates are not input features.
In practice, designing such functions by hand is not easy. Intuitively, it would decrease with the norm of . We tested several functions including Gaussian functions. They require handcrafted parameters that are difficult to tune. Instead, we choose to learn this function with a simple multi-layer perceptron (MLP). Such an approach does not presume the behavior of the function.
The whole convolutional layer is represented on figure 2.
At training, both parameters of and MLP parameters are optimized using gradient descent.
In addition, even though we can hope for MLP to partly deal with this issue, setting kernel element locations remains a problem. As an example, setting positions on a regular grid could induce bias. To this end, we randomly select the locations of in the unit sphere and also optimize these parameters, thanks to the fact that the formula (3) is differentiable.
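The layer described above can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration under stated assumptions, not the authors' implementation: the function names, the two-layer MLP with a hidden width of 16, the initialization scales and the sampling of kernel locations inside the unit ball are all choices made for this sketch. In a real setting the kernel weights, the MLP parameters and the kernel locations would all be optimized by gradient descent.

```python
import numpy as np

def init_layer(rng, n_kernel, dim, c_in, c_out, hidden=16):
    """Create the parameters of one continuous convolution (illustrative only)."""
    # Kernel element locations: random points drawn inside the unit ball.
    dirs = rng.normal(size=(n_kernel, dim))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = rng.uniform(size=(n_kernel, 1)) ** (1.0 / dim)
    return {
        "centers": dirs * radii,                                   # (K, dim)
        "W": rng.normal(scale=0.1, size=(n_kernel, c_in, c_out)),  # kernel weights
        "bias": np.zeros(c_out),
        # Small MLP mapping relative positions to one weight per kernel element.
        "mlp_w1": rng.normal(scale=0.5, size=(n_kernel * dim, hidden)),
        "mlp_b1": np.zeros(hidden),
        "mlp_w2": rng.normal(scale=0.5, size=(hidden, n_kernel)),
        "mlp_b2": np.zeros(n_kernel),
    }

def continuous_conv(points, feats, layer):
    """points: (N, dim) positions, feats: (N, c_in) features -> (c_out,) output."""
    n = points.shape[0]
    # Relative position of every input point to every kernel element.
    rel = points[:, None, :] - layer["centers"][None, :, :]       # (N, K, dim)
    # The MLP sees the whole kernel configuration for each input point.
    h = np.maximum(rel.reshape(n, -1) @ layer["mlp_w1"] + layer["mlp_b1"], 0.0)
    phi = h @ layer["mlp_w2"] + layer["mlp_b2"]                    # (N, K)
    # Weighted sum of input features per kernel element, then kernel weights.
    weighted = phi.T @ feats                                       # (K, c_in)
    out = np.einsum("kc,kco->o", weighted, layer["W"])
    # Normalize by the input set size and add the bias.
    return layer["bias"] + out / n

rng = np.random.default_rng(0)
layer = init_layer(rng, n_kernel=27, dim=3, c_in=4, c_out=8)
pts = rng.uniform(-1.0, 1.0, size=(16, 3))
x = rng.normal(size=(16, 4))
print(continuous_conv(pts, x, layer).shape)  # (8,)
```

Because the geometry only enters through the weighting function, the point coordinates never appear as input features, which is the separation between geometrical space and feature space discussed above.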
Permutation invariance. As stated in [QSMG17], operators on points must be invariant to permutation of the points. In the general case, the points are not ordered. For robustness, it is important that permuting two inputs has no influence on the extracted features. The output is a sum over the input points, so a permutation of the point indices has no effect on the result.
Translation invariance. As the geometric relations are relative between the points and the kernel elements, applying a global translation to the point cloud and the kernel does not change the result.
Insensitivity to input point cloud scale. Many point clouds, such as photogrammetric point clouds, have no metric scale information. In order to make the convolution robust to the input scale, the input geometric points are normalized.
Reduced sensitivity to input point cloud size. Dividing the result by the input set size makes the output less sensitive to the input size. E.g., using as input does not change the result.
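These four properties can be checked numerically on a toy version of the layer. The sketch below is an illustration under assumptions: it uses a fixed Gaussian as a stand-in for the learned weighting function, a single output channel, and a normalization that centers the points on their centroid and rescales them to the unit ball. The helper names are hypothetical, and the duplicated-input check is one plausible reading of the size property.

```python
import numpy as np

def normalize(points):
    """Center the points and rescale them to fit in the unit ball (assumed scheme)."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def toy_conv(points, feats, centers, weights, sigma=0.5):
    """Single-channel continuous convolution with a Gaussian weighting stand-in."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    phi = np.exp(-d2 / sigma**2)
    # Sum over inputs and kernel elements, normalized by the input set size.
    return ((phi.T @ feats) * weights).sum() / len(points)

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(32, 3))
x = rng.normal(size=(32, 4))
centers = rng.uniform(-0.5, 0.5, size=(8, 3))
w = rng.normal(size=(8, 4))
ref = toy_conv(pts, x, centers, w)

# Permutation invariance: reordering the inputs leaves the sum unchanged.
perm = rng.permutation(len(pts))
assert np.isclose(ref, toy_conv(pts[perm], x[perm], centers, w))

# Translation invariance: shifting points and kernel together changes nothing.
t = np.array([3.0, -2.0, 1.0])
assert np.isclose(ref, toy_conv(pts + t, x, centers + t, w))

# Scale: after normalization, a rescaled cloud gives the same geometry.
assert np.allclose(normalize(pts), normalize(3.0 * pts))

# Size: feeding every point twice doubles both the sum and the divisor.
assert np.isclose(ref, toy_conv(np.vstack([pts, pts]), np.vstack([x, x]), centers, w))
```

Note that the size normalization gives exact invariance only for exact duplication; for a denser resampling of the same surface the output is only approximately preserved.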
Each convolutional layer operates a projection of the input features of point cloud on the output point cloud .
For each point of the output point cloud, the convolutional kernels are applied on the nearest neighbors of in . Doing that ensures the spatial locality of the convolution operation.
In practice, choosing (or more generally ) leads to a dimension reduction by the convolutional layer (figure 3(a)). It is similar to the convolution with stride for discrete convolutions.
Using (or more generally ) does not change the point cloud size (figure 3(b)). It can be compared with the discrete convolution with stride 1.
Finally, using (or more generally ) is a dimension augmentation, similar to the up-convolutional layer in discrete pipelines (figure 3(c)).
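The three configurations can be illustrated by computing, for each output point, the indices of its nearest input points. The sketch below is a simplified illustration: the names `P` (input cloud) and `Q_down` (reduced cloud) are assumptions, the output points of the reduction case are drawn by uniform random subsampling, and the neighbor search is a brute-force distance computation used here for brevity in place of a search tree.

```python
import numpy as np

def knn_indices(queries, points, k):
    """Indices, shape (len(queries), k), of the k nearest `points` per query."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
P = rng.uniform(size=(256, 3))  # input point cloud

# Reduction: fewer output points than input points (comparable to a stride > 1).
Q_down = P[rng.choice(len(P), size=64, replace=False)]
idx_down = knn_indices(Q_down, P, k=16)   # shape (64, 16)

# Same size: the output points are the input points (comparable to stride 1).
idx_same = knn_indices(P, P, k=16)        # shape (256, 16)

# Augmentation: features living on the coarse cloud are projected on the finer one.
idx_up = knn_indices(P, Q_down, k=4)      # shape (256, 4)

print(idx_down.shape, idx_same.shape, idx_up.shape)
```

Each row of an index array defines the local neighborhood on which the convolution kernel is applied for one output point; the brute-force search costs quadratic memory and would be replaced by a tree-based search on large clouds.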
The similarity between our definition of the convolutional layer and the discrete convolution allows network designs comparable to image analysis networks. Our architectures are largely inspired by the literature on CNNs for 2D computer vision tasks, particularly LeNet-like models [LBBH98] for classification and U-net [RFB15] for segmentation and regression.
The classification network is presented on figure 4 (top). It is a stack of five convolutional layers followed by a linear layer with a number of outputs corresponding to the number of classes. After the first convolutional layer, the point cloud is reduced to 1024 points, and then downsized by a factor 4 at each layer, except the last one where information is compressed to one output with 96 features. This output is then used by a linear layer with output number equal to the number of classes.
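The point counts implied by this description can be worked out with a short calculation. The sketch below derives the sizes from the text alone (1024 points after the first layer, a division by 4 afterwards, a single output point at the last layer); the function name and its parameters are hypothetical, and the intermediate values are an inference rather than numbers read from figure 4.

```python
def classification_schedule(first=1024, factor=4, n_layers=5):
    """Number of output points after each of the convolutional layers."""
    sizes = [first]
    # Intermediate layers divide the number of points by `factor`.
    for _ in range(n_layers - 2):
        sizes.append(sizes[-1] // factor)
    # The last layer compresses the information to a single output point.
    sizes.append(1)
    return sizes

print(classification_schedule())  # [1024, 256, 64, 16, 1]
```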
The segmentation network (figure 4) has an encoder-decoder structure, similar to U-net. The encoder has the same structure as the classification network and the decoder is composed of a stack of five convolutions and a linear layer. In the decoder the points used for upsampling are the same as the points at corresponding size in the encoder. The features from the encoder and the decoder are concatenated at the input of the convolutional layers of the decoder. Finally, the last layer is a point-wise linear layer, with the same purpose as in the previous layer.
For all the convolutions, we choose 27 kernel elements. This number has been chosen with reference to the number of elements of a 3D convolution kernel (3×3×3). Future work will include an extensive study of the influence of the number of kernel elements. The number of neighbors used for convolution is noted on figure 4. It varies from 4 for deconvolution to 16 for the first convolutions.
We train the networks using stochastic gradient descent with the Adam optimizer. In order to make training compatible with mini-batches, we use the same input point cloud size for each sample and the neighborhood sizes are fixed for each convolutional layer. This ensures that even though the spatial structure is randomly generated (random selection of the points in layers with dimension reduction), the global structure is similar for each sample.
The first task for experimentation is shape classification. We use the network defined in figure 4 (top).
In order to show the flexibility of the framework, we experiment in both 2D and 3D.
The 2D experiment is done on the MNIST dataset. We consider the input image as 2D points (pixel coordinates) associated with color features (grey value). Results are presented in table 1 (left). For comparison, we also report the results of two usual 2D CNNs, LeNet [LBBH98] and Network in Network [LCY13], and of two methods working directly on points, PointNet++ [QYSG17] and PointCNN [LBS18].
The classification task is also evaluated in 3D, on the ModelNet40 dataset. This dataset is a set of meshes from 40 various classes (planes, cars, chairs, tables…). We generated point clouds by sampling points on the triangular faces of the meshes. In our experiments, we use an input size of 1024 points. Table 1 (right) presents the results.
In both 2D and 3D, our approach is competitive with the state of the art methods. With this experiment, we show that implicit coding of the geometry (in the spatial representation) is as efficient as using point positions as features like in [QYSG17] or [LBS18].
Because spatial downsampling is a stochastic process, different runs on the same shape may produce different outputs: the points picked may differ at each level, which implies that neighborhoods and resulting features may be different. To increase robustness, we run the spatial sampling several times, predict for each sampling and average the output scores. This is referred to as the number of samplings (1, 8 or 16) in table 1. Performance increases with the number of samplings. In practice, we only present up to 16 samplings because we observed that a larger number would not increase the score significantly.
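The averaging over several spatial samplings can be sketched as follows. This is an illustration only: `predict_averaged` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for the trained network that simply draws a random subset of the cloud and returns a score vector depending on that subset.

```python
import numpy as np

def predict_averaged(predict_fn, cloud, n_samplings, rng):
    """Run a stochastic predictor several times and average its class scores."""
    scores = [predict_fn(cloud, rng) for _ in range(n_samplings)]
    return np.mean(scores, axis=0)

def toy_predict(cloud, rng, n_classes=40, n_points=256):
    """Stand-in for the network: the output depends on the random subsample."""
    subset = cloud[rng.choice(len(cloud), size=n_points, replace=False)]
    logits = np.sin(np.arange(n_classes) * subset.mean())
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax scores

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(1024, 3))
for n in (1, 8, 16):
    scores = predict_averaged(toy_predict, cloud, n, rng)
    print(n, int(scores.argmax()))
```

Averaging scores rather than voting on predicted labels keeps the confidence information of each run, which matches the description above of averaging the output scores.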
Given a point cloud, the objective of part segmentation is to recognize the different parts of the underlying shape. In practice, it is a semantic segmentation at shape level. We use the ShapeNet [YKC16] dataset. It is composed of 16680 models split into train/test sets, belonging to 16 shape categories. Each category is annotated with 2 to 6 part labels, for a total of 50 part classes. As in [LBS18], we consider the part annotation problem as a 50-class semantic segmentation problem. The scores are then computed at shape level.
We use the semantic segmentation network from figure 4 (bottom). The provided samples have various sizes. We randomly subsample the point clouds to 1024 points and predict the labels for each input point. As the points do not come with particular features, we set the input features to one. As not all points may have been selected for labelling, the final label of each point is obtained from its 16 nearest labeled neighbors, whose scores are summed.
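The final labelling step can be sketched as a nearest-neighbor score aggregation. The code below is a minimal illustration with assumed names (`propagate_labels`, `labeled_scores`); it uses a brute-force distance matrix, which is adequate for clouds of a few thousand points but would be replaced by a tree-based search on larger data.

```python
import numpy as np

def propagate_labels(all_points, labeled_points, labeled_scores, k=16):
    """Label every point from the summed scores of its k nearest labeled points."""
    d2 = ((all_points[:, None, :] - labeled_points[None, :, :]) ** 2).sum(axis=-1)
    k = min(k, len(labeled_points))
    nearest = np.argsort(d2, axis=1)[:, :k]        # (M, k) neighbor indices
    summed = labeled_scores[nearest].sum(axis=1)   # (M, n_classes)
    return summed.argmax(axis=1)

# Toy example: two labeled clusters with one-hot scores for two classes.
labeled_points = np.array([[0.0, 0, 0], [0.1, 0, 0], [10.0, 0, 0], [10.1, 0, 0]])
labeled_scores = np.array([[1.0, 0], [1.0, 0], [0.0, 1], [0.0, 1]])
all_points = np.array([[0.05, 0, 0], [9.9, 0, 0]])
print(propagate_labels(all_points, labeled_points, labeled_scores, k=2))  # [0 1]
```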
The results are presented in table 2. The scores are the part intersection over union (pIoU) and the mean part intersection over union (mpIoU). The first is the global IoU computed over all points; the second computes the IoU at shape level and averages the scores over the shapes.
Our approach outperforms the state of the art by 6% on the pIoU score and is among the best methods according to the mpIoU. The difference may be explained by the fact that when the model recognises a shape, it produces a very accurate segmentation. In contrast, some other shapes are more poorly segmented due to a confusion between the main classes, leading to an mpIoU of 82.6%.
|1 scale 2m||84.05||-||55.36||90.49||92.19||75.84||35.90||20.53||60.99||45.86||59.35||68.27||15.68||52.48||48.94||53.21|
|2 scales 2m,1m||84.93||-||58.54||90.93||92.99||76.72||40.17||33.05||62.13||48.34||61.53||72.95||23.53||53.20||50.62||54.85|
|Method||AvIoU||OA||Man made||Natural||High veg.||Low veg.||Buildings||Hard scape||Artefacts||Cars|
We then experiment on the Stanford 2D-3D-Semantics dataset [ASZS17]. It is an indoor point cloud dataset for semantic segmentation. It is composed of six scenes, each corresponding to an office floor. The points are labeled according to 13 classes: 6 building element classes (floor, ceiling…), 6 office equipment classes (tables, chairs…) and a stuff class grouping all the small equipment (computers, screens…) and rare items.
For each scene considered as a test set, we train on the other five. At training and testing time, we randomly select a point and all the points in a box of given side size (a scale) along the horizontal axes and unbounded along the vertical axis. From these points, we sample the 4096 points that form the input of the network. The input features are the RGB colors. At test time, we follow the same procedure as for part segmentation: for a given point, the final label is inferred from the predictions of the neighboring labeled points.
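The column selection described above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the box is taken to be centered on the randomly selected point, the vertical axis is assumed to be the third coordinate, and sampling falls back to drawing with replacement when the column holds fewer points than requested.

```python
import numpy as np

def sample_column(points, scale, n_points, rng):
    """Return indices of `n_points` points inside a vertical column of side `scale`."""
    # Pick a random point and use its horizontal position as the column center.
    center = points[rng.integers(len(points)), :2]
    # Keep every point inside the horizontal box; the vertical axis is unbounded.
    in_box = np.all(np.abs(points[:, :2] - center) <= scale / 2.0, axis=1)
    candidates = np.flatnonzero(in_box)
    # Draw with replacement only if the column is too sparse (assumed behavior).
    replace = len(candidates) < n_points
    return rng.choice(candidates, size=n_points, replace=replace)

rng = np.random.default_rng(0)
scene = rng.uniform(low=[0, 0, 0], high=[10, 10, 3], size=(200_000, 3))
idx = sample_column(scene, scale=2.0, n_points=4096, rng=rng)
print(idx.shape)  # (4096,)
```

The returned indices can be used to gather both the coordinates and the RGB features of the selected points, so that each network input has the same fixed size.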
Table 3 gathers the results of our method and of state-of-the-art methods. We trained the network on a two-meter scale (column selection size is 2m); the scores are presented on the first line of results. We obtain competitive results on most of the classes, even getting the top score on walls. However, the training mostly fails on a few classes, e.g. columns, which are mistaken for walls.
One explanation could be that looking at the data at a 2m scale is too coarse, so that details such as the boundaries between walls and columns are poorly estimated. Direct training at a finer scale (1m) did not give good results, showing a loss of performance on walls and ceiling.
To refine the results, we trained a second model at 1m scale. The inputs of this second model are the output scores of the first model concatenated with the RGB features. By doing so we include knowledge of the 2m scale in this training. The training is sequential: the first network is trained, its parameters are frozen, and its predictions are used as input features for the next network. The scores increase in each category, with an improvement of 13% on the column category.
The last segmentation experiment is on the Semantic8 dataset [HSL17]. In this experiment we explore the ability of the training process to scale to very large scenes (up to hundreds of millions of points) with high variation in point density across the scenes. The Semantic8 dataset is composed of 30 ground lidar scenes, 15 for training and 15 for evaluation. The test labels are unknown and predictions are evaluated on an online server. Here, we train the network with a neighborhood column of 8 meters.
The results of the evaluation are presented in table 4 and figure 6. We report the state of the benchmark leaderboard at the time of writing (for entries that are not anonymous). PointNet++ has two entries in the benchmark; we only report the best one. Our convolutional network for segmentation ranks third, behind Super Point Graph (SPGraph) [LS18] and SnapNet [BGLSA17]. It is the first among the direct point processing methods, as SPGraph relies on a pre-segmentation of the point cloud and SnapNet uses a 2D segmentation network on virtual pictures of the scene to produce the segmentation. We surpass PointNet++ by 3% on the average IoU. We perform particularly well on car detection, where other methods except SPGraph get relatively low results. In contrast, we obtain poor performance on the vegetation classes (high and low). Looking at the results, this is due to a confusion between these two classes. In our understanding, this is a consequence of the absence of absolute scale in the pipeline. As all neighborhoods are rescaled to the unit sphere, which makes the process robust to scale variation of models (useful for mesh classification without scale), a small tree or a hedge (low vegetation) may be confused with a large one (high vegetation).
One of the main advantages of a convolution using a dense formulation and normalizing the final weights by the cardinality of the input set is that it can accept any input point cloud size. To test the robustness to point cloud size variation, we use the classification network trained with 1024 points. We run the predictions on the test set (with one spatial sampling) for different input sizes.
We observe that overall accuracy on ModelNet40 is maintained even if we test with a number of points greater than the size used for training. In our opinion, this is related to the size needed to estimate local features. The first layer of the network computes local features (e.g. normals or curvatures) and these do not vary much once a given point density is reached (above 1024 points). Starting from the second layer, the number of points is fixed and the behavior of the network is the same.
Conversely, with small point clouds (fewer than 1024 points), two phenomena tend to lower classification scores. First, a small point cloud induces a loss of information on the surface compared to a large one. Second, the local features may not be well estimated as neighborhoods become larger. Moreover, the 1024-point sampling of the second convolutional layer implies using the same points several times, which could magnify bad feature estimation in the first convolution. Nevertheless, even with 128 points we still reach 48% correct classification over 40 classes. This rises to 88% with 512 points, half of the training size.
For timing at inference, we use an Nvidia GTX1070 8Gb. On S3DIS, the segmentation network, with a batch size of 2, processes around 31 point clouds per second. The point cloud size is 4096 and this timing includes the spatial sampling (operated on 4 separate threads). One point cloud is thus processed in 0.032s.
At training, on the Semantic8 dataset, with a batch size of 2 and the Adam optimizer, one batch takes 0.41s (forward and backward). The input features have size 3 (RGB only).
One of the main limitations is that our networks are agnostic to object scale. This is of interest when dealing with non-metric data such as the ModelNet40 or ShapeNet datasets, where objects are CAD models designed without scale. It would also be interesting to experiment on photogrammetric point clouds, where scale is not always available. Conversely, in metric scans such as Semantic8 or S3DIS, object sizes are valuable information that can be used for accurate segmentation. For example, a beam or column may be considered as wall or ceiling if its size is not taken into account. A natural extension of the proposed framework would be to use fixed-radius neighborhoods instead of a fixed number of neighbors. As pointed out, this would forbid batch training but the network would have to deal at each layer with objects at a predefined size.
Another perspective is to explore the use of precomputed features as inputs. In this study, we only use raw data for network inputs: RGB colors when available, features set to one otherwise. In the future, we will work on feeding the networks with features such as normals or curvatures.
We will also deepen the multiscale approach. The use of two scales for S3DIS only scratches the surface of multiscale approaches for 3D semantic segmentation. So far, we restricted the multiscale approach to sequential learning of two models in order to stick to moderate GPU usage (see the implementation details for hardware information). We will explore joint learning of the scales and investigate which features are the most relevant to pass to the next scale.
Finally, we proposed two network architectures largely inspired by computer vision models. It would be interesting to explore various network configurations. As the formulation generalizes the discrete convolution, it is possible to explore more recent architectures such as residual networks.
In this paper, we presented a new CNN framework for point cloud processing. The objective was to provide an extension of the discrete convolution for sparse, unstructured data. From this convolutional layer, we built two networks, one for shape classification and one for semantic segmentation. Through several experiments on various benchmark datasets, real and simulated, we have shown the method to be efficient and flexible. We also showed that, even without particular extension, the approach scales well to very large datasets such as Semantic8.
We implemented our method using the PyTorch framework. All operations are implemented using the native autograd functions and run with Nvidia CUDA. For the experiments, we used an Nvidia Titan Xp for training and a GTX1070 for inference.
The code is available under an open source licence in the following repository: https://github.com/aboulch/ConvPoint.
This work is supported by the ONERA project Delta. This project aims at developing innovative machine learning approaches for aerospace applications.
Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints, February 2017.
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1704–1711. IEEE, 2010.
PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.