Point cloud data is becoming more widespread than ever: anyone can create a point cloud from a set of photos with easy-to-use photogrammetry software or capture a point cloud directly with one of many consumer-grade depth sensors available worldwide. These sensors will soon be used in most aspects of our daily lives, with autonomous cars recording streets and city environments and VR and AR devices recording our home environment on a regular basis. The resulting data represents a great opportunity for computer vision research: it complements image data with depth information and opens up new fields of research.
However, point cloud data itself is unstructured. This leads to a variety of problems: (a) point clouds have no fixed cardinality, varying their size depending on the recorded scene. They are also not ‘registered’ in the sense that it is not trivial to find correspondences between points across recordings of the same or of a similar scene. (b) Point clouds have no notion of neighborhood. This means that it is not clear how convolutions, one of the critical operations in deep learning, should be performed.
In this paper, we present a novel solution to the aforementioned problems, in particular the varying cloud cardinality. For an illustration, see Fig. 1. We propose to encode point clouds as minimal distances to a fixed set of points, which we refer to as basis point set. This representation is vastly more efficient than classic extensive occupancy grids: it reduces every point cloud to a relatively small fixed-length vector. The vector length can be adjusted to meet computational constraints for specific applications and represents a trade-off between fidelity of the encoding and computational efficiency. Compared to other encodings of point clouds, the proposed representation also has an advantage in being more efficient with the number of values needed to preserve high frequency information of surfaces.
Given its fixed length, the presented encoding can be used with most standard machine learning techniques. In this paper, we mostly apply artificial neural networks to build models with it, due to their popularity and accuracy. In particular, we analyze the performance of the encoding in two applications: point cloud classification and mesh registration over noisy 3D scans (see Fig. 2).
For point cloud classification, we achieve the same accuracy on the ModelNet40 [wu20153d] shape classification benchmark as PointNet [qi2017pointnet], while using an order of magnitude fewer parameters and three orders of magnitude fewer floating point operations. To demonstrate the versatility of the encoding, we show how it can be used for the task of mesh registration. We use the encoded vectors as input to a neural network that directly predicts mesh vertex positions. While showing competitive performance with state-of-the-art methods on the FAUST dataset [bogo2014faust], the main advantage of our method is its ability to produce an aligned high resolution mesh from a noisy scan in a single feed-forward pass. This can be executed in real time even on a non-GPU laptop computer and requires no additional post-processing steps. We make our code for both presented tasks available, as well as a library for use in other projects (https://github.com/sergeyprokudin/bps).
2 Related Work
In this section, we describe existing 3D data representations and models and put them in relation to the presented method. We focus on representations that are compatible with deep learning models, due to their high performance on a variety of 3D shape analysis tasks.
Point clouds. Numerous methods [qi2017pointnet, qi2017pointnet++, shen2018mining, zaheer2017deep, li2018so] have been proposed that process 3D point clouds directly, amongst which the PointNet family of models has gained the most popularity. This approach processes each point separately with a small neural network, followed by an aggregation step with a pooling operation to reason about the whole point cloud. Similar pooling-based approaches for achieving feature invariance on general unordered sets have been proposed in other works as well [zaheer2017deep]. Other methods working directly on point clouds organize the data in kd-trees and other graphs [klokov2017escape, gadelha2017shape, landrieu2018large]. These structures define a neighborhood, so convolution operations can be applied. Vice versa, specific convolutional filters can be designed for sparse 3D data [tatarchenko2018tangent, shen2018mining].
We borrow several ideas from these works, such as using kNN-methods for searching efficiently through local neighborhoods or achieving order invariance through the use of pooling operations over computed distances to basis points. However, we believe that the proposed encoding and model architectures offer two main advantages over existing point cloud networks: (a) higher computational efficiency and (b) conceptually simpler, easy-to-implement algorithms that do not rely on a specific network architecture or require custom neural network layers.
Occupancy grids. Similar to pixels for 2D images, occupancy grids are a natural way of encoding 3D information. Numerous deep models have been proposed that work with occupancy grids as inputs [maturana2015voxnet, qi2016volumetric, moon2018v2v]. However, the main disadvantage of this encoding is its cubic complexity, which results in a large amount of data being needed to accurately represent a surface. Even grids that are relatively large by current memory standards are not sufficient for an accurate representation of high frequency surfaces like human bodies. At the same time, this type of voxelization results in very sparse volumes when used to represent 3D surfaces: most of the volume measurements are zeros. This makes the representation an inefficient surface descriptor in multiple ways. A number of methods have been proposed to overcome this problem [wang2017cnn, riegler2017octnet]. However, the problem of representing high frequency details remains, together with a large memory footprint and low computational efficiency for running convolutions.
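To make the sparsity argument concrete, the occupancy-grid baseline discussed above can be sketched in a few lines. This is a minimal illustration, assuming a cloud already normalized to $[-1, 1]^3$; the function name and grid size are ours, not from a reference implementation:

```python
import numpy as np

def occupancy_grid(points, n=32):
    """Binary n x n x n voxel occupancy for points in [-1, 1]^3."""
    # Map coordinates from [-1, 1] to voxel indices in [0, n-1].
    idx = np.clip(((points + 1.0) * 0.5 * n).astype(int), 0, n - 1)
    grid = np.zeros((n, n, n), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```

For a typical surface point cloud, the fraction of non-zero voxels in such a grid is tiny, which is exactly the inefficiency discussed above.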
Signed distance fields. Truncated signed distance fields (TSDFs) [curless1996volumetric, newcombe2011kinectfusion, riegler2017octnetfusion, song2017semantic, zeng20173dmatch, dai2018scan2mesh, park2019deepsdf] can be viewed as a natural extension of occupancy grids: they store distance-to-surface information in grid cells instead of a simple occupancy flag. While this partially resolves the problem of representing surface information, the cubic requirement for memory and the low computational efficiency for convolutions remains. In comparison, our method can be viewed as one that uses an arbitrary subset of points from the distance field. The crucial difference is that the distance field we sample from is unsigned and non-truncated, and the number of samples is proportional to the number of points in the original cloud. We further investigate the connection between occupancy grids, TSDFs and BPS in Sec. 4.1.
2D projections. Another common strategy is to project 3D shapes to 2D surfaces and then apply standard frameworks for 2D input processing. This includes depth maps [wu20153d], height maps [sarkar2018learning], as well as a variety of multi-view models [su2015multi, kanezaki2018rotationnet, feng2018gvcnn]. Closely related are approaches that project 3D shapes into spheres and apply spherical convolutions to achieve rotational invariance [esteves2018learning, cohen2018spherical]. While projection-based approaches show high accuracy in discriminative tasks (classification, shape retrieval), they are fundamentally limited in representing shapes that have multiple ‘folds’, invisible from external views. In comparison, our encoding scheme can accurately preserve surface information of objects with arbitrary topology as we show in our experiments in Sec. 4.
We now describe the algorithm for constructing the proposed basis point representation from a given point cloud.
The presented encoding algorithm takes a set of point clouds $\{X_1, \dots, X_m\}$ as input. Every point cloud $X_i$ can have a different number of points $n_i$:

$$X_i = \{x_{i1}, \dots, x_{in_i}\}, \quad x_{ij} \in \mathbb{R}^d, \qquad (1)$$

where $d = 3$ for the case of 3D point clouds. In a first step, we normalize all point clouds to fit a unit ball:

$$\bar{X}_i = f_{\mathrm{norm}}(X_i), \quad \text{such that } \|x\|_2 \le 1 \;\; \forall x \in \bar{X}_i. \qquad (2)$$
Next, we form a basis point set. For this task, we sample $k$ random points from a ball of a given radius $r$:

$$B = [b_1, \dots, b_k], \quad b_j \in \mathbb{R}^d, \; \|b_j\|_2 \le r. \qquad (3)$$
It is important to mention that this set is arbitrary but fixed for all point clouds in the dataset. $k$ and $r$ are hyperparameters of the method, and $k$ can be used to determine the trade-off between computational complexity and the fidelity of the representation.
Next, we form a feature vector for every point cloud in the dataset by computing the minimal distance from every basis point to the nearest point in the point cloud under consideration:

$$x_i^B = \Big[\min_{x \in \bar{X}_i} \|b_1 - x\|_2, \, \dots, \, \min_{x \in \bar{X}_i} \|b_k - x\|_2\Big]^T \in \mathbb{R}^k. \qquad (4)$$
Alternatively, it is possible to store the full directional information in the form of delta vectors from each basis point to the nearest point in the original point cloud:

$$\Delta_i^B = [\delta_{i1}, \dots, \delta_{ik}]^T \in \mathbb{R}^{k \times d}, \quad \delta_{ij} = x^*_{ij} - b_j, \quad x^*_{ij} = \operatorname*{arg\,min}_{x \in \bar{X}_i} \|b_j - x\|_2. \qquad (5)$$
Other information about the nearest points (e.g., RGB values, surface normals) can be saved as part of this fixed representation. The feature computation is illustrated in Fig. 1. Formulas (4) and (5) give us fixed-length representations of the point clouds that can be readily used as input for learning algorithms.
BPS selection strategies.
We investigate a number of basis point selection strategies and provide details of these experiments in Sec. 4.2. Overall, random sampling from a uniform distribution in the unit ball provides a good trade-off between efficiency, universality of the generation process, and surface reconstruction quality, and we apply it throughout the experiments in this paper. Alternatively, an extensive 3D grid of basis points can be used in tandem with any existing 3D convolutional neural network to achieve maximum performance at the cost of increased computational complexity.
In this work, we use Euclidean distances between points for creating our encoding, but other metrics could be used in principle. Since we are working with 3D point clouds (i.e., a small value of $d$), the nearest neighbor search can be made efficient by using data structures like ball trees [omohundro1989five]. Asymptotically, $O(n \log n)$ operations are needed for constructing a ball tree from a point cloud with $n$ points, and $O(k \log n)$ operations are needed to run nearest neighbor queries for $k$ basis points. This leads to an overall encoding complexity of $O((n + k) \log n)$ per point cloud. The kNN search step can also be efficiently implemented as part of an end-to-end deep learning pipeline [kaiser2017learning]. In practice, we benchmark our encoding scheme for different values of $n$ and $k$ and show real-time encoding performance for values relevant to current real-world applications. Please refer to the supplementary materials for further details.
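The whole encoding pipeline (Eqs. 2, 4 and 5) can be sketched with a standard kNN structure. The following is a minimal illustration using scipy's kd-tree rather than a ball tree; function names are ours and the basis is assumed to be given:

```python
import numpy as np
from scipy.spatial import cKDTree

def normalize_to_unit_ball(points):
    """Center the cloud and scale it to fit inside the unit ball (cf. Eq. 2)."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def bps_encode(points, basis):
    """BPS features: for every basis point, the distance to its nearest
    cloud point (cf. Eq. 4) and the corresponding delta vector (cf. Eq. 5)."""
    tree = cKDTree(points)               # O(n log n) construction
    dists, idx = tree.query(basis, k=1)  # O(k log n) nearest-neighbor queries
    deltas = points[idx] - basis
    return dists, deltas
```

A cloud of any cardinality $n$ is thus reduced to a fixed-length vector of $k$ distances (or a $k \times d$ delta matrix), ready for standard learning machinery.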
4.1 Comparison to occupancy grids, TSDFs and plain point clouds
Compared to occupancy grids and TSDFs, the efficiency and superiority of the proposed BPS encoding rest on two key observations. First, it is beneficial for both surface reconstruction and learning to store continuous global information (e.g., the Euclidean distance to the nearest point) in every cell of the grid instead of simple binary flags or local distances. In the latter case, most of the voxels remain empty and, moreover, the feature vector changes dramatically when slight translations or rotations are applied to an object. In comparison, every BPS cell always stores some information about the encoded object, and the feature vector changes smoothly with respect to affine transformations. From this also stems the second important observation: when every cell stores global information, we can use a much smaller number of cells to represent the shape accurately, thus avoiding the cubic complexity of the extensive grid representation. This can be seen in Fig. 1 and the bottom right of Fig. 3, where basis points are able to capture the outline of the original cloud.
We will now validate this intuition by comparing the aforementioned representations in terms of surface reconstruction and actual learning capabilities.
Surface reconstruction experiments.
Independent of any specific learning task, how well does the encoding capture the details of an object? To answer this question, we take random CAD models from the ModelNet40 [wu20153d] dataset and construct synthetic point clouds by sampling points from each surface. We compare three approaches to encoding the resulting point clouds: storing them as is (raw point cloud), an occupancy grid, and the proposed encoding via basis point sets as suggested in Eq. 5.
For all methods, we define a fixed allowed description length (in floating point values) and compare the normalized bidirectional Chamfer distance between the original point cloud $X$ and the reconstructed point cloud $\tilde{X}$ for the different encodings:

$$d_{\mathrm{CD}}(X, \tilde{X}) = \frac{1}{|X|} \sum_{x \in X} \min_{\tilde{x} \in \tilde{X}} \|x - \tilde{x}\|_2 + \frac{1}{|\tilde{X}|} \sum_{\tilde{x} \in \tilde{X}} \min_{x \in X} \|x - \tilde{x}\|_2. \qquad (6)$$
With the same description length, we can either store a subset of the original points, binary occupancy flags, or basis points with the delta-vector matrix defined in Eq. 5. From this matrix, a subset of the original points can be reconstructed by simply adding the corresponding basis point coordinates to every delta vector. For the occupancy grid encoding, we use the centers of occupied grid cells; note that although a full floating point value is not necessary to store a binary flag, in practice the majority of machine learning methods will operate on floating point encoded occupancy grids, and we assume this representation.
Fig. 4 shows the encoding length and the reconstruction quality measured as Chamfer distance (Eq. 6). The proposed encoding produces less than half of the encoding error of occupancy grids for point clouds up to a certain cardinality (see Fig. 3 for a qualitative comparison). This is an indicator of its superiority for preserving shape information. The error curve for the basis point sets is close to that of the subsampled point cloud representation. The basis point set representation is less accurate than the raw point cloud, since the extracted points are not necessarily unique. However, the basis point set is an ordered, fixed-length vector encoding that is well suited for machine learning methods.
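The reconstruction and evaluation steps above can be sketched as follows; this is an illustrative implementation of Eq. 6 and of the delta-vector reconstruction from Eq. 5, with scipy assumed and names of our choosing:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(x, y):
    """Bidirectional Chamfer distance (Eq. 6): average nearest-neighbor
    distance from x to y plus average nearest-neighbor distance from y to x."""
    d_xy, _ = cKDTree(y).query(x, k=1)
    d_yx, _ = cKDTree(x).query(y, k=1)
    return d_xy.mean() + d_yx.mean()

def reconstruct_from_bps(basis, deltas):
    """Recover a subset of the original points from the BPS delta matrix
    (Eq. 5): nearest point = basis point + delta vector."""
    return basis + deltas
```

Note that the reconstructed points all lie on the original cloud, so the reconstruction error is driven entirely by the points the basis fails to reach.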
4.2 Basis point selection strategies
We investigate four different variants of selecting basis points visualized in Fig. 5.
Rectangular grid basis.
A basic approach to basis set construction is to simply arrange points on a rectangular grid. In that case, the basis point set representation resembles the truncated signed distance field [curless1996volumetric] representation. However, one important difference is that we do not truncate the distances for far-away basis points, allowing every point in the set to store some information about the object surface. We show in Sec. 5.1 that this small conceptual difference has an important effect on performance. We also allow the full directional information to be stored in the cell, as defined in Eq. 5. Finally, BPS does not require the point clouds to be converted into watertight surfaces, since unsigned distances are used.
Ball grid basis.
Since all point clouds are normalized to fit in the unit ball by the transformation defined in Eq. 2, the basis points at the corners of a rectangular grid are located far away from the point cloud. These corner points in fact constitute roughly half of all the samples: the ratio of the volume of a unit ball to that of its bounding cube is $\pi/6 \approx 0.52$. Hence, we can improve our sampling efficiency by simply trimming the corners of the grid and using more sampling locations within the unit ball.
Random uniform ball sampling.
One simple, generic strategy for selecting points lying inside a $d$-dimensional ball is uniform sampling. This can be done either by rejection sampling from a $d$-dimensional cube or by other efficient methods, summarized in [harman2010decompositional].
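As an illustration, one rejection-free alternative draws a uniformly distributed direction from a normalized Gaussian and a radius as $u^{1/d}$; this is a standard construction, sketched here with names of our choosing:

```python
import numpy as np

def uniform_ball(k, d=3, radius=1.0, seed=0):
    """Draw k points uniformly from a d-dimensional ball of given radius."""
    rng = np.random.RandomState(seed)
    # Uniform directions: normalized isotropic Gaussian samples.
    directions = rng.randn(k, d)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii distributed as u^(1/d) so that volume is covered uniformly.
    radii = radius * rng.uniform(0.0, 1.0, size=(k, 1)) ** (1.0 / d)
    return directions * radii
```

For $d = 3$ the expected distance of a sample from the origin is $3/4$ of the radius, a quick sanity check on the implementation.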
Hexagonal close packing (HCP).
We also experiment with hexagonal close packing [conway2013sphere] of basis points. The informal intuition behind this selection strategy is that it optimally covers the unit ball with equally sized balls centered at the basis points [hales2005proof].
We show a comparison of reconstruction errors of ModelNet objects using the different sampling strategies in Fig. 4. Overall, the random uniform and HCP selection strategies provide the best reconstruction results. Using regular grids opens up possibilities for applying convolution operations and adds the possibility to learn translation and rotation invariant features.
We now evaluate the different encodings and basis point selection strategies with respect to their applicability with machine learning algorithms.
5 Learning with Basis Point Sets
5.1 3D Shape Classification
[Tab. 1: ModelNet40 classification accuracy and computational cost. Rows: VoxNet (r. 1); Occ-MLP at two grid resolutions (r. 2-3); TDF-MLP at two grid resolutions (r. 4-5); BPS-MLP with grid (r. 6-7), ball (r. 8), random (r. 9), and HCP (r. 10) bases; BPS-Conv3D on a grid (r. 11); directional-vector variants of r. 9 and r. 11 (r. 12-13); BPS-ERT [geurts2006extremely] on a grid (r. 14, parameter/FLOP counts not applicable).]
One of the classic tasks to perform on point clouds is classification. We present results for this task on the ModelNet40 [wu20153d] dataset. We benchmark several deep learning architectures that use the proposed point cloud representation and compare them to existing methods that use alternative encodings. The dataset consists of CAD models from 40 different categories, split into training and test sets. We use the same procedure for obtaining point clouds from the CAD models as in [qi2017pointnet], i.e., we sample points from the mesh faces, followed by the normalization process defined in Eq. 2.
Comparison to occupancy grids and VoxNet.
To show the superiority of BPS features and to disentangle contributions (i.e., the BPS encoding itself vs. the proposed network architectures), we fix a simple generic MLP architecture with 2 blocks of [fully-connected, ReLU, batchnorm, dropout] layers and perform training with rectangular grids of occupancy maps, truncated distance fields (TDFs), and BPS as inputs.
Results are summarized in Tab. 1, rows 1-7. Using global distances as features instead of occupancy flags with the same network clearly improves accuracy, outperforming an architecture that was specifically designed for processing this type of input, VoxNet [maturana2015voxnet] (row 1). TDFs store only local distances within a grid cell and suffer from the same locality problem as voxels (r. 4). It is also important to note that reducing the grid size affects these methods dramatically (rows 3 and 5), while the effect on the BPS is marginal (r. 6).
We also compare different BPS selection strategies in rows 7-10 of Tab. 1. In the absence of network operators exploiting the point ordering (e.g., 3D convolutions), the random and HCP strategies give a slight boost in performance. When the point order in a rectangular BPS grid is exploited with 3D convolutional deep learning models like VoxNet, performance improves at the cost of increased computational complexity (approximately two orders of magnitude more FLOPs; Tab. 1, r. 11).
Substituting Euclidean distances with the full directional information defined by Eq. 5 negatively affects the performance of a plain fully connected network (Tab. 1, r. 12), whereas it improves the performance of a 3D convolutional model (Tab. 1, r. 13).
To show the versatility of the proposed representation, we also use the same BPS features as input to an ensemble of extremely randomized trees (ERT [geurts2006extremely]) and to the XGBoost [chen2016xgboost] framework.
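Because BPS features are plain fixed-length vectors, plugging them into off-the-shelf learners is a one-liner. The sketch below uses scikit-learn's ERT implementation on stand-in features; the toy dimensions and helper name are illustrative, not the benchmark setup:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def fit_bps_ert(features, labels, n_trees=100, seed=0):
    """Fit extremely randomized trees on BPS feature vectors.

    `features` is an (n_clouds, k) array of BPS encodings (Eq. 4) and
    `labels` the per-cloud class indices.
    """
    clf = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(features, labels)
    return clf
```

The same feature matrix can be passed unchanged to XGBoost or any other tabular learner, which is precisely the point of a fixed-length encoding.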
Comparison to other methods.
Finally, we combine these findings with other enhancements (e.g., augmenting the data with a few fixed rotations, improving the learning schedule and regularization; please refer to the supplementary material and the corresponding repository for further details) and compare our two best-performing models to other methods in Tab. 2.
In summary, a simple fully connected network, trained on BPS features in several minutes on a single GPU, reaches the performance of PointNet [qi2017pointnet], one of the most widely used networks for point cloud analysis. A 3D-convolutional model trained on a rectangular BPS grid matches the performance of PointNet++ [qi2017pointnet++], while still being computationally more efficient. Finally, crude ensembling of 10 such models allows us to match state-of-the-art performance [klokov2017escape] among methods working only on point clouds as inputs (i.e., without using surface normals, which are available in CAD models but rarely in real-world scenarios).
[Tab. 2: ModelNet40 accuracy comparison. View-based methods: RotationNet 20x [kanezaki2018rotationnet]; MVCNN 80x [su2015multi], 90.1%; Spherical CNNs [esteves2018learning], 88.9%. Point cloud based methods: DeepSets (micro) [zaheer2017deep], 82.0%; Ours (BPS-Conv3D, 10x).]
5.2 Single-Pass Mesh Registration from 3D Scans
We showcase a second experiment with a different, generative task to demonstrate the versatility and performance of the encoding. For this, we pick the challenging problem of human point cloud registration, in which correspondences are found between an observed, unstructured point cloud and a deformable body template. Traditionally, human point cloud registration has been approached with iterative methods [hirshberg2012coregistration, zuffi2015stitched]. However, they are typically computationally expensive and require the use of a deformable model at application time. Machine learning based methods [3dcoded] remove this dependency by replacing the model with a sufficiently large training corpus. However, current solutions like [3dcoded] rely on multistage models with complex internal representations, which makes them slow to train and test. We encourage the reader interested in human mesh registration to review the excellent summary of previous work provided in [3dcoded].
We use a simple DenseNet-like [huang2017densely] architecture with two blocks (see Fig. 2), where the input is a BPS encoding of a point cloud and the output is the location of each vertex in the common template. Note that there is no deformable model in our system and that we do not estimate deformable model parameters or displacements; the network learns to reproduce coherent bodies solely from its training data.
| Method | Intra (mm) | Inter (mm) |
| Stitched puppets [zuffi2015stitched] | 1.568 | 3.126 |
| Deep functional maps [FMNet] | 2.436 | 4.826 |
To generate this training data, we use the SMPL body model [loper2015smpl]. SMPL is a reshapeable, reposable model that takes as input pose parameters related to posture and shape parameters related to the intrinsic characteristics of the underlying body (e.g., height, weight, arm length). We sample shape parameters from the CAESAR [Robinette2002] dataset, which contains a wide variety of ages, body constitutions and ethnicities. For sampling poses we use two sources: the CMU dataset [cmu] and a small set of poses inferred from a 3D scanner. Since the CMU dataset is heavily populated with walking and running sequences, we perform weighted sampling of poses, using the inverse Mahalanobis distance from each sample to the CMU distribution as weight. We roughly align the CMU poses to be frontal. To increase the variation of the training data, we add noise sampled from the covariance of all the considered poses to half of the data points. From these meshes, a set of points is sampled uniformly from the surface of the posed and shaped SMPL template. These point clouds are then used to compute the BPS encoding. We train the alignment network for 1000 epochs in only 4 hours, and its inference time is less than 1 ms on a non-GPU laptop.
To evaluate our method, we process the test set of the FAUST [bogo2014faust] dataset, which compares mesh correspondence algorithms via a list of scan points in correspondence. To find correspondences between two point clouds, we process each of them with our network, obtaining two registered mesh templates. The templates then define the dense correspondences between the point clouds.
We obtain competitive average errors in both the intra-subject and the inter-subject challenge (see Tab. 3). These numbers are comparable to, but higher than, state-of-the-art methods like [3dcoded] or [zuffi2015stitched]. However, we note that the two methods outperforming BPS in the FAUST intra challenge are orders of magnitude slower than our system. The two-stage procedure in [3dcoded] takes multiple minutes and the particle optimization in [zuffi2015stitched] takes hours, while our system produces alignments in 1 ms (for qualitative results, see Fig. 6). This enables real-time processing of 3D scans, which was previously impossible, and can serve as a first step for faster multistage systems that refine the accuracy of this single-stage method. We also provide a qualitative evaluation on the Dynamic FAUST [dfaust:CVPR:2017] dataset in the supplementary video (https://youtu.be/kc9wRoI5JbY).
6 Conclusion and Future Work
In this paper, we introduced basis point sets for obtaining a compact, fixed-length representation of point clouds. BPS computation can be used as a pre-processing step for a variety of machine learning models. In two applications and with different models, we demonstrated the computational superiority of our approach, with an orders-of-magnitude advantage in processing time over existing methods while remaining competitive in accuracy. We have shown the advantage of using a rectangular BPS grid in combination with standard 3D-convolutional networks. In future work, it would be interesting to consider other types of BPS arrangements and corresponding convolutions [hoogeboom2018hexaconv, cohen2018spherical, esteves2018learning, graham2017submanifold] for improved efficiency and learning rotation-invariant representations.