1 Introduction
Point cloud data is becoming more ubiquitous than ever: anyone can create a point cloud from a set of photos with easy-to-use photogrammetry software, or capture a point cloud directly with one of the many consumer-grade depth sensors available worldwide. These sensors will soon be used in most aspects of our daily lives, with autonomous cars recording streets and city environments and VR and AR devices recording our home environments on a regular basis. The resulting data represents a great opportunity for computer vision research: it complements image data with depth information and opens up new fields of research.
However, point cloud data itself is unstructured. This leads to a variety of problems: (a) Point clouds have no fixed cardinality and vary in size depending on the recorded scene. They are also not ‘registered’ in the sense that it is not trivial to find correspondences between points across recordings of the same or of a similar scene. (b) Point clouds have no notion of neighborhood. This means that it is not clear how convolutions, one of the critical operations in deep learning, should be performed.
In this paper, we present a novel solution to the aforementioned problems, in particular the varying cloud cardinality. For an illustration, see Fig. 1. We propose to encode point clouds as minimal distances to a fixed set of points, which we refer to as a basis point set. This representation is vastly more efficient than classic extensive occupancy grids: it reduces every point cloud to a relatively small fixed-length vector. The vector length can be adjusted to meet computational constraints for specific applications and represents a trade-off between the fidelity of the encoding and computational efficiency. Compared to other encodings of point clouds, the proposed representation also has the advantage of being more efficient in the number of values needed to preserve high-frequency surface information.
Given its fixed length, the presented encoding can be used with most standard machine learning techniques. In this paper we mostly apply artificial neural networks to build models on top of it, due to their popularity and accuracy. In particular, we analyze the performance of the encoding in two applications: point cloud classification and mesh registration over noisy 3D scans (see Fig. 2).
For point cloud classification, we achieve the same accuracy on the ModelNet40 [wu20153d] shape classification benchmark as PointNet [qi2017pointnet], while using an order of magnitude fewer parameters and three orders of magnitude fewer floating point operations. To demonstrate the versatility of the encoding, we show how it can be used for the task of mesh registration. We use the encoded vectors as input to a neural network that directly predicts mesh vertex positions. While showing competitive performance to state-of-the-art methods on the FAUST dataset [bogo2014faust], the main advantage of our method is the ability to produce an aligned high-resolution mesh from a noisy scan in a single feed-forward pass. This can be executed in real time even on a non-GPU laptop computer, requiring no additional post-processing steps. We make our code for both presented tasks available, as well as a library for usage in other projects (https://github.com/sergeyprokudin/bps).
2 Related Work
In this section, we describe existing 3D data representations and models and put them in relation to the presented method. We focus on representations that are compatible with deep learning models, due to their high performance on a variety of 3D shape analysis tasks.
Point clouds. Numerous methods [qi2017pointnet, qi2017pointnet++, shen2018mining, zaheer2017deep, li2018so] were proposed that process 3D point clouds directly, amongst which the PointNet family of models has gained the most popularity. This approach processes each point separately with a small neural network, followed by an aggregation step with a pooling operation to reason about the whole point cloud. Similar pooling-based approaches for achieving feature invariance on general unordered sets were proposed in other works as well [zaheer2017deep]. Other methods working directly on point clouds organize the data in kd-trees and other graphs [klokov2017escape, gadelha2017shape, landrieu2018large]. These structures define a neighborhood, and thus convolution operations can be applied. Conversely, specific convolutional filters can be designed for sparse 3D data [tatarchenko2018tangent, shen2018mining].
We borrow several ideas from these works, such as using kNN methods to search efficiently through local neighborhoods or achieving order invariance through pooling operations over computed distances to basis points. However, we believe that the proposed encoding and model architectures offer two main advantages over existing point cloud networks: (a) higher computational efficiency and (b) conceptually simpler, easy-to-implement algorithms that do not rely on a specific network architecture or require custom neural network layers.
Occupancy grids. Similar to pixels in 2D images, occupancy grids are a natural way of encoding 3D information. Numerous deep models were proposed that work with occupancy grids as inputs [maturana2015voxnet, qi2016volumetric, moon2018v2v]. However, the main disadvantage of this encoding is its cubic complexity, which results in a large amount of data being needed to accurately represent a surface. Even grids that are relatively large by current memory standards are not sufficient for an accurate representation of high-frequency surfaces like human bodies. At the same time, this type of voxelization results in very sparse volumes when used to represent 3D surfaces: most of the volume measurements are zeros. This makes the representation an inefficient surface descriptor in multiple ways. A number of methods have been proposed to overcome this problem [wang2017cnn, riegler2017octnet]. However, the problem of representing high-frequency details remains, together with a large memory footprint and low computational efficiency for running convolutions.
Signed distance fields. Truncated signed distance fields (TSDFs) [curless1996volumetric, newcombe2011kinectfusion, riegler2017octnetfusion, song2017semantic, zeng20173dmatch, dai2018scan2mesh, park2019deepsdf] can be viewed as a natural extension of occupancy grids: they store distance-to-surface information in grid cells instead of a simple occupancy flag. While this partially resolves the problem of representing surface information, the cubic memory requirement and the low computational efficiency for convolutions remain. In comparison, our method can be viewed as one that uses an arbitrary subset of points from the distance field. The crucial difference is that the distance field we sample from is unsigned and non-truncated, and the number of samples is proportional to the number of points in the original cloud. We further investigate the connection between occupancy grids, TSDFs and BPS in Sec. 4.1.
2D projections. Another common strategy is to project 3D shapes onto 2D surfaces and then apply standard frameworks for 2D input processing. This includes depth maps [wu20153d], height maps [sarkar2018learning], as well as a variety of multi-view models [su2015multi, kanezaki2018rotationnet, feng2018gvcnn]. Closely related are approaches that project 3D shapes onto spheres and apply spherical convolutions to achieve rotational invariance [esteves2018learning, cohen2018spherical]. While projection-based approaches show high accuracy in discriminative tasks (classification, shape retrieval), they are fundamentally limited in representing shapes that have multiple ‘folds’, invisible from external views. In comparison, our encoding scheme can accurately preserve surface information of objects with arbitrary topology, as we show in our experiments in Sec. 4.
3 Method
We now describe the algorithm for constructing the proposed basis point representation from a given point cloud.

Normalization.
The presented encoding algorithm takes a set of point clouds as input, $\mathcal{D} = \{X_1, \dots, X_m\}$. Every point cloud $X_i$ can have a different number of points $n_i$:

$$X_i = \{x_{i1}, \dots, x_{in_i}\}, \quad x_{ij} \in \mathbb{R}^d, \qquad (1)$$

where $d = 3$ for the case of 3D point clouds. In a first step, we normalize all point clouds to fit a unit ball, e.g. by centering at the mean and rescaling by the maximum norm:

$$\bar{X}_i = \Big\{\, \frac{x_{ij} - c_i}{s_i} \,\Big\}_{j=1}^{n_i}, \quad c_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \quad s_i = \max_{j} \|x_{ij} - c_i\|_2. \qquad (2)$$
BPS construction.
Next, we form a basis point set. For this task, we sample $k$ random points from a ball of a given radius $r$:

$$B = [b_1, \dots, b_k], \quad b_t \in \mathbb{R}^d, \ \|b_t\|_2 \le r. \qquad (3)$$
It is important to mention that this set is arbitrary but fixed for all point clouds in the dataset. The number of basis points $k$ and the radius $r$ are hyper-parameters of the method, and $k$ can be used to control the trade-off between computational complexity and the fidelity of the representation.

Feature calculation.
Next, we form a feature vector for every point cloud in the dataset by computing the minimal distance from every basis point to the nearest point in the point cloud under consideration:

$$x_i^B = \big[\, \min_{x \in \bar{X}_i} d(b_1, x), \ \dots, \ \min_{x \in \bar{X}_i} d(b_k, x) \,\big]^T \in \mathbb{R}^k. \qquad (4)$$

Alternatively, it is possible to store the full directional information in the form of delta vectors from each basis point to the nearest point in the original point cloud:

$$\Delta_i^B = \big[\, \delta_{i1}, \dots, \delta_{ik} \,\big]^T \in \mathbb{R}^{k \times d}, \quad \delta_{it} = \operatorname*{arg\,min}_{x \in \bar{X}_i} d(b_t, x) - b_t. \qquad (5)$$
Other information about the nearest points (e.g., RGB values, surface normals) can be saved as part of this fixed representation. The feature computation is illustrated in Fig. 1. Formulas (4) and (5) give us fixed-length representations of point clouds that can be readily used as input for learning algorithms.
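As a concrete illustration of formulas (4) and (5), the following brute-force NumPy sketch computes both the distance features and the delta vectors; function and variable names are ours, not taken from the released library, and in practice the nearest-neighbor search would use a ball tree rather than dense pairwise distances:

```python
import numpy as np

def bps_encode(x, basis):
    """Sketch of the BPS encoding (Eq. 4 and 5).

    x:     (n, d) normalized point cloud
    basis: (k, d) fixed basis point set
    Returns distance features (k,) and delta vectors (k, d).
    Brute-force O(n*k) for clarity only.
    """
    # Pairwise distances between every basis point and every cloud point.
    diff = basis[:, None, :] - x[None, :, :]            # (k, n, d)
    dist = np.linalg.norm(diff, axis=-1)                # (k, n)
    nearest = dist.argmin(axis=1)                       # index of closest cloud point
    features = dist[np.arange(len(basis)), nearest]     # Eq. 4: minimal distances
    deltas = x[nearest] - basis                         # Eq. 5: delta vectors
    return features, deltas
```

Note that the output is ordered by basis point, so two clouds encoded with the same basis are directly comparable component-wise.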
BPS selection strategies.
We investigate a number of basis point selection strategies and provide details of these experiments in Sec. 4.2. Overall, random sampling from a uniform distribution in the unit ball provides a good trade-off between efficiency, universality of the generation process and surface reconstruction quality, and we apply it throughout the experiments in this paper. Alternatively, an extensive 3D grid of basis points can be used in tandem with any existing 3D convolutional neural network in order to achieve maximum performance at the cost of increased computational complexity.
Complexity.
In this work, we use Euclidean distances between points for creating our encoding, but other metrics could be used in principle. Since we are working with 3D point clouds (i.e., a small value of $d$), the nearest-neighbor search can be made efficient with data structures like ball trees [omohundro1989five]. Asymptotically, $O(n \log n)$ operations are needed to construct a ball tree from a point cloud of $n$ points, and $O(k \log n)$ operations to run nearest-neighbor queries for $k$ basis points. This leads to an overall encoding complexity of $O\big((n + k) \log n\big)$ per point cloud. The kNN search step can also be efficiently implemented as part of an end-to-end deep learning pipeline [kaiser2017learning]. Practically, we benchmark our encoding scheme for different values of $n$ and $k$ and show real-time encoding performance for values relevant to current real-world applications. Please refer to the supplementary materials for further details.
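For a tree-based variant of the encoding, a sketch using SciPy's `cKDTree` is shown below (a kd-tree rather than a ball tree, but with the same asymptotic behavior in low dimensions; the sizes and the crude rescaling step are illustrative assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n, k = 10_000, 512                              # illustrative cloud and basis sizes

cloud = rng.normal(size=(n, 3))
cloud /= np.linalg.norm(cloud, axis=1).max()    # crude rescale into the unit ball
basis = rng.uniform(-1.0, 1.0, size=(k, 3))

tree = cKDTree(cloud)                           # O(n log n) construction
features, idx = tree.query(basis)               # k nearest-neighbor queries (Eq. 4)
```

`tree.query` returns both the minimal distances (the BPS features) and the indices of the nearest cloud points, from which the delta vectors of Eq. 5 can be formed as `cloud[idx] - basis`.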
4 Analysis
4.1 Comparison to occupancy grids, TSDFs and plain point clouds
Informal intuition.
Compared to occupancy grids and TSDFs, the efficiency and superiority of the proposed BPS encoding is based on two key observations. First, it is beneficial for both surface reconstruction and learning to store some continuous global information (e.g., the Euclidean distance to the nearest point) in every cell of the grid instead of simple binary flags or local distances. In the latter case, most of the voxels remain empty and, moreover, the feature vector changes dramatically when slight translations or rotations are applied to an object. In comparison, every BPS cell always stores some information about the encoded object, and the feature vector changes smoothly with respect to affine transformations. From this also stems the second important observation: when every cell stores some global information, we can use a much smaller number of cells to represent the shape accurately, thus avoiding the cubic complexity of the extensive grid representation. This can be seen in Fig. 1 and in the bottom right of Fig. 3, where basis points are able to capture the outline of the original cloud.
We will now validate this intuition by comparing the aforementioned representations in terms of surface reconstruction and actual learning capabilities.
Surface reconstruction experiments.
Independent of a certain point cloud at hand, how well does the encoding capture the details of an object? To answer this question, we take random CAD models from the ModelNet40 [wu20153d] dataset and construct synthetic point clouds by sampling points from each surface. We compare three approaches to encoding the resulting point clouds: storing them as-is (raw point cloud), an occupancy grid, and the proposed encoding via basis point sets as suggested in Eq. 5.
For all methods we define a fixed allowed description length (in floating point values) and compare the normalized bidirectional Chamfer distance between the original point cloud and the reconstructed point cloud for the different encodings:

$$d_{CD}(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \|x - y\|_2 \; + \; \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} \|x - y\|_2. \qquad (6)$$
With the same description length, we can either store points from the original point cloud, binary occupancy flags, or basis points with the delta-vector matrix defined in Eq. 5. From this matrix, a subset of the original points can be reconstructed by simply adding the corresponding basis point coordinates to every delta vector. For the occupancy grid encoding, we use the centers of occupied grid cells; note that although a full floating-point value is not necessary to store a binary flag, in practice the majority of machine learning methods operate on floating-point encoded occupancy grids, and we assume this representation.
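The reconstruction and the error metric described above can be sketched as follows (brute-force NumPy with hypothetical helper names; `chamfer` corresponds to Eq. 6):

```python
import numpy as np

def chamfer(x, y):
    """Normalized bidirectional Chamfer distance (Eq. 6), brute force.

    x: (n, d) and y: (m, d) point sets.
    """
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (n, m) distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def reconstruct_from_bps(basis, deltas):
    """Recover a subset of cloud points by adding each basis point to its delta."""
    return basis + deltas
```

The recovered set contains at most $k$ unique points (several basis points may share the same nearest neighbor), which is why the BPS error curve lies slightly above the subsampled raw cloud in Fig. 4.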
Fig. 4 shows the encoding length and the reconstruction quality measured as Chamfer distance (Eq. 6). The proposed encoding produces less than half of the encoding error compared to occupancy grids for point clouds up to roughly points (see Fig. 3 for a qualitative comparison). This is an indicator of its superiority for preserving shape information. The error curve for the basis point sets is close to the one of the subsampled point cloud representation. The basis point set representation is less accurate than the raw point cloud since the extracted points are not necessarily unique. However, the basis point set is an ordered, fixed-length vector encoding well-suited for machine learning methods.
4.2 Basis point selection strategies
We investigate four different variants of selecting basis points visualized in Fig. 5.
Rectangular grid basis.
A basic approach to basis set construction is to simply arrange points on a rectangular grid. In that case, the basis point set representation resembles the truncated signed distance field (TSDF) [curless1996volumetric] representation. However, one important difference is that we do not truncate the distances for far-away basis points, allowing every point in the set to store some information about the object surface. We show in Sec. 5.1 that this small conceptual difference has an important effect on performance. We also allow the full directional information to be stored in each cell, as defined in Eq. 5. Finally, BPS does not require the point clouds to be converted into watertight surfaces, since unsigned distances are used.
Ball grid basis.
Since all point clouds are normalized to fit in the unit ball by the transformation defined in Eq. 2, the basis points at the corners of a rectangular grid are located far away from the point cloud. These corner points in fact constitute approximately 48% of all the samples (this can be derived by comparing the volume of the unit ball to that of the enclosing cube: $1 - \pi/6 \approx 0.476$). Hence, we can improve our sampling efficiency by simply trimming the corners of the grid and using more sampling locations within the unit ball.
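The corner fraction can be checked numerically; assuming a unit ball inscribed in the cube $[-1, 1]^3$, the fraction of cube volume lying outside the ball is $1 - \pi/6$:

```python
import numpy as np

# Analytic fraction outside the inscribed ball in d = 3:
# ball volume 4/3*pi*r^3 vs cube volume (2r)^3 gives 1 - pi/6.
analytic = 1 - np.pi / 6

# Monte Carlo check with uniform samples in the cube [-1, 1]^3.
rng = np.random.default_rng(42)
pts = rng.uniform(-1.0, 1.0, size=(200_000, 3))
outside = np.linalg.norm(pts, axis=1) > 1.0
estimate = outside.mean()
```

With 200k samples the Monte Carlo estimate agrees with the analytic value to within about a percent.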
Random uniform ball sampling.
One simple, generic strategy for selecting points inside a $d$-dimensional ball is uniform sampling. This can be done either by rejection sampling from a $d$-dimensional cube or by other efficient methods summarized in [harman2010decompositional].
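One such rejection-free method (a normalized Gaussian direction scaled by a $U^{1/d}$ radius, among those surveyed in [harman2010decompositional]) can be sketched as:

```python
import numpy as np

def sample_ball(k, d=3, r=1.0, seed=0):
    """Draw k points uniformly from a d-ball of radius r.

    Directions come from normalized Gaussian samples (uniform on the
    sphere); radii are scaled by U^(1/d) to get uniform volume density.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(k, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(k, 1)) ** (1.0 / d)
    return dirs * radii
```

Unlike rejection sampling, this produces exactly `k` samples in one pass, independent of the dimension.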
Hexagonal close packing (HCP).
We also experiment with hexagonal close packing [conway2013sphere] of the basis points. The informal intuition behind this selection strategy is that it optimally covers the unit ball with equally sized balls centered at the basis points [hales2005proof].
We show a comparison of reconstruction errors of ModelNet objects using the different sampling strategies in Fig. 4. Overall, the random uniform and HCP selection strategies provide the best reconstruction results. Using regular grids, on the other hand, opens up the possibility of applying convolution operations and of learning translation- and rotation-invariant features.
We now evaluate the different encodings and basis point selection strategies with respect to their applicability with machine learning algorithms.
5 Learning with Basis Point Sets
5.1 3D Shape Classification
Table 1: ModelNet40 classification with a fixed generic MLP and different input encodings.

id | Method | acc. | FLOPs | params
1 | VoxNet [maturana2015voxnet] | | |
2 | Occ-MLP (grid) | | |
3 | Occ-MLP (grid) | | |
4 | TDF-MLP (grid) | | |
5 | TDF-MLP (grid) | | |
6 | BPS-MLP (grid) | | |
7 | BPS-MLP (grid) | | |
8 | BPS-MLP (ball) | | |
9 | BPS-MLP (rand) | | |
10 | BPS-MLP (HCP) | | |
11 | BPS-Conv3D (grid) | | |
12 | 9 + direct. vect. | | |
13 | 11 + direct. vect. | | |
14 | BPS-ERT [geurts2006extremely] (grid) | | N/A | N/A
15 | BPS-XGBoost (grid) | | N/A | N/A
One of the classic tasks to perform on point clouds is classification. We present results for this task on the ModelNet40 [wu20153d] dataset. We benchmark several deep learning architectures that use the proposed point cloud representation and compare them to existing methods that use alternative encodings. The dataset consists of CAD models from 40 different categories, of which a standard subset is used for training. We use the same procedure for obtaining point clouds from CAD models as in [qi2017pointnet], i.e., we sample points from mesh faces, followed by the normalization process defined in Eq. (2).
Comparison to occupancy grids and VoxNet.
To show the superiority of BPS features and to disambiguate contributions (i.e., the BPS encoding itself vs. the proposed network architectures), we fix a simple generic MLP architecture with 2 blocks of [fully-connected, ReLU, batch-norm, dropout] layers and perform training with rectangular grids of occupancy maps, truncated distance fields (TDFs) and BPS as inputs. Results are summarized in Tab. 1, rows 1–7. Using global distances as features instead of occupancy flags with the same network clearly improves accuracy, outperforming an architecture that was specifically designed for processing this type of input, VoxNet [maturana2015voxnet] (row 1). TDFs store only local distances within the grid cell and suffer from the same locality problem as voxels (r. 4). It is also important to note that reducing the grid size affects these methods dramatically (rows 3 and 5), while the effect on the BPS encoding is marginal (r. 6).
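For reference, the generic MLP baseline can be sketched in NumPy as follows (inference mode, so dropout is the identity and batch statistics stand in for running statistics; the layer widths are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def mlp_block(x, w, b, gamma, beta, eps=1e-5):
    """One [fully-connected, ReLU, batch-norm] block in inference mode."""
    h = np.maximum(x @ w + b, 0.0)                        # FC + ReLU
    mu, var = h.mean(axis=0), h.var(axis=0)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta   # batch-norm

rng = np.random.default_rng(0)
k, hidden, classes = 512, 256, 40                         # illustrative sizes
x = rng.normal(size=(8, k))                               # batch of BPS feature vectors

h = mlp_block(x, rng.normal(size=(k, hidden)) * 0.05, np.zeros(hidden),
              np.ones(hidden), np.zeros(hidden))
h = mlp_block(h, rng.normal(size=(hidden, hidden)) * 0.05, np.zeros(hidden),
              np.ones(hidden), np.zeros(hidden))
logits = h @ rng.normal(size=(hidden, classes)) * 0.05    # final classifier layer
```

The point of the experiment is that this entirely generic architecture, with no point-cloud-specific operators, already becomes competitive once the input is BPS-encoded.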
We also compare different BPS selection strategies in rows 7–10 of Tab. 1. In the absence of network operators exploiting the point ordering (e.g., 3D convolutions), the random and HCP strategies give a slight boost in performance. When the point order in a rectangular BPS grid is exploited by 3D convolutional deep learning models like VoxNet, performance improves at the cost of increased computational complexity (approximately two orders of magnitude more FLOPs, Tab. 1, r. 11).
Substituting Euclidean distances with the full directional information defined by Eq. 5 negatively affects the performance of a plain fully-connected network (Tab. 1, r. 12), whereas it improves the performance of a 3D convolutional model (Tab. 1, r. 13).
To show the versatility of the proposed representation, we also use the same BPS features as input to an ensemble of extremely randomized trees (ERT [geurts2006extremely]) and to the XGBoost [chen2016xgboost] framework (Tab. 1, rows 14 and 15).
Comparison to other methods.
Finally, we combine these findings with other enhancements (e.g., augmenting the data with a few fixed rotations, improving the learning schedule and regularization; please refer to the supplementary material and the corresponding repository for further details) and compare our two best-performing models to other methods in Tab. 2.
In summary, a simple fully-connected network, trained on BPS features in several minutes on a single GPU, reaches the performance of PointNet [qi2017pointnet], one of the most widely used networks for point cloud analysis. A 3D-convolutional model trained on a rectangular BPS grid matches the performance of PointNet++ [qi2017pointnet++], while still being computationally more efficient. Finally, crude ensembling of 10 such models allows us to match state-of-the-art performance [klokov2017escape] among methods working only on point clouds as inputs (i.e., without using surface normals, which are available in CAD models but rarely in real-world scenarios).
Table 2: Comparison to other methods on ModelNet40.

Method | acc. | FLOPs | params
RotationNet 20x [kanezaki2018rotationnet] | | > |
MVCNN 80x [su2015multi] | 90.1% | |
VoxNet [maturana2015voxnet] | 83.0% | > |
Spherical CNNs [esteves2018learning] | 88.9% | |
point cloud based methods:
KD-networks [klokov2017escape] | | > | >
KCNet [shen2018mining] | 91.0% | > |
SO-Net [li2018so] | 90.9% | > | >
DeepSets [zaheer2017deep] | 90.0% | |
PointNet++ [qi2017pointnet++] | 90.7% | |
PointNet [qi2017pointnet] | 89.3% | |
PointNet (vanilla) [qi2017pointnet] | 87.2% | |
DeepSets (micro) [zaheer2017deep] | 82.0% | |
Ours (BPS-MLP) | 89.0% | |
Ours (BPS-Conv3D) | | |
Ours (BPS-Conv3D, 10x) | | |
5.2 Single-Pass Mesh Registration from 3D Scans
We showcase a second experiment with a different, generative task to demonstrate the versatility and performance of the encoding. For this, we pick the challenging problem of human point cloud registration, in which correspondences are found between an observed, unstructured point cloud and a deformable body template. Traditionally, human point cloud registration has been approached with iterative methods [hirshberg2012coregistration, zuffi2015stitched]. However, these are typically computationally expensive and require the use of a deformable model at application time. Machine-learning-based methods [3dcoded] remove this dependency by replacing it with a sufficiently large training corpus. However, current solutions like [3dcoded] rely on multi-stage models with complex internal representations, which makes them slow to train and test. We encourage the reader interested in human mesh registration to review the excellent summary of previous work provided in [3dcoded].
We use a simple DenseNet-like [huang2017densely] architecture with two blocks (see Fig. 2), where the input is a BPS encoding of a point cloud and the output is the location of each vertex in the common template. Note that there is no deformable model in our system and that we do not estimate deformable model parameters or displacements; the network learns to reproduce coherent bodies based purely on its training data.
Table 3: Registration errors on the FAUST test set.

Method | Intra (mm) | Inter (mm)
Stitched puppets [zuffi2015stitched] | 1.568 | 3.126
3D-CODED [3dcoded] | 1.985 | 2.878
Ours | 2.327 | 4.529
Deep functional maps [FMNet] | 2.436 | 4.826
FARM [FARM] | 2.81 | 4.123
ConvexOpt [CONVEXOPT] | 4.86 | 8.304
To generate this training data, we use the SMPL body model [loper2015smpl]. SMPL is a reshapeable, reposable model that takes as input pose parameters related to posture, and shape parameters related to the intrinsic characteristics of the underlying body (e.g., height, weight, arm length). We sample shape parameters from the CAESAR [Robinette2002] dataset, which contains a wide variety of ages, body constitutions and ethnicities. For sampling poses we use two sources: the CMU dataset [cmu] and a small set of poses inferred from a 3D scanner. Since the CMU dataset is heavily populated with walking and running sequences, we perform weighted sampling of poses, using the inverse Mahalanobis distance from each sample to the CMU distribution as weight. We roughly align the CMU poses to be frontal. To increase the variation of the training data, we add noise sampled from the covariance of all considered poses to half of the data points. From these meshes, a set of points is sampled uniformly from the surface of the posed and shaped SMPL template. These point clouds are then used to compute the BPS encoding. We train the alignment network for 1000 epochs in only 4 hours, and its inference time is less than 1 ms on a non-GPU laptop.
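The inverse-Mahalanobis pose weighting described above can be sketched as follows (a hypothetical helper of our own naming; the exact weighting and regularization used in the paper may differ):

```python
import numpy as np

def inverse_mahalanobis_weights(poses, eps=1e-6):
    """Weight each pose by the inverse of its Mahalanobis distance to the
    dataset distribution, so over-represented (e.g. walking) poses are
    down-weighted. `poses` is an (n, p) array of pose parameter vectors."""
    mu = poses.mean(axis=0)
    cov = np.cov(poses, rowvar=False) + eps * np.eye(poses.shape[1])
    prec = np.linalg.inv(cov)
    diff = poses - mu
    # Per-sample Mahalanobis distance: sqrt(diff_i^T * prec * diff_i).
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, prec, diff))
    w = 1.0 / (d + eps)
    return w / w.sum()      # normalized sampling probabilities
```

These weights can then be passed to any categorical sampler to draw a rebalanced pose set.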
To evaluate our method, we process the test set of the FAUST [bogo2014faust] dataset, which compares mesh correspondence algorithms using lists of scan points in correspondence. To find correspondences between two point clouds, we process each of them with our network, obtaining as a result two registered mesh templates. The templates then define dense correspondences between the point clouds.
We obtain an average error of 2.327 mm in the intra-subject challenge and 4.529 mm in the inter-subject challenge (see Tab. 3). These numbers are comparable to, but higher than, state-of-the-art methods like [3dcoded] or [zuffi2015stitched]. However, we note that the two methods outperforming BPS in the FAUST intra challenge are orders of magnitude slower than our system. The two-stage procedure in [3dcoded] takes multiple minutes and the particle optimization in [zuffi2015stitched] takes hours, while our system produces alignments in about 1 ms (for qualitative results, see Fig. 6). This enables real-time processing of 3D scans, which was previously impossible, or can serve as a first step for faster multi-stage systems that refine the accuracy of this single-stage method. We also provide a qualitative evaluation on the Dynamic FAUST [dfaust:CVPR:2017] dataset in the supplementary video (https://youtu.be/kc9wRoI5JbY).
6 Conclusion and Future Work
In this paper, we introduced basis point sets for obtaining a compact, fixed-length representation of point clouds. BPS computation can be used as a pre-processing step for a variety of machine learning models. In our experiments, we demonstrated in two applications and with different models the computational superiority of our approach, with orders-of-magnitude advantages in processing time compared to existing methods, while remaining competitive in accuracy. We have shown the advantage of using a rectangular BPS grid in combination with standard 3D-convolutional networks. However, in future work it would be interesting to consider other types of BPS arrangements and corresponding convolutions [hoogeboom2018hexaconv, cohen2018spherical, esteves2018learning, graham2017submanifold] for improved efficiency and for learning rotation-invariant representations.