In recent years, interest in 3D scanning technologies has grown steadily in the computer vision community. The benefits of combining 3D and semantic information are fundamental for robotic applications or autonomous driving. To assign the right label to every point in a 3D scene, semantic classification algorithms need to understand the geometry of the scene. Among the ways to achieve such an understanding, two paradigms stand out. In the first, the point cloud is segmented and then a label is given to each segment [7, 22, 19]. The weakness of this strategy is that it depends on a prior segmentation which does not use semantic information. In the second, each point is considered individually and is given a semantic label or class probabilities (see Figure 1) [27, 9]. Without any prior segmentation, classifying 3D points relies only on the appearance of point neighborhoods. Thus, we need a set of expressive features to describe the geometry in a point neighborhood. Demantké et al. [6] proposed a description based on local covariance. To complement this local shape description, Weinmann et al. added measures of verticality and height distribution. More complex descriptors can be found in the literature, like Spin Images [11] or Fast Point Feature Histograms [21]. However, we chose to use a multiscale approach with simple features, which has been proven to be more expressive.
Standard machine learning techniques are used to classify 3D points described by geometric features. According to Weinmann et al.’s extensive work, Random Forest is the most suitable classifier. In that case, spatial relations between points are ignored. Features can instead be used as unary potentials in Markov Random Fields to ensure spatial coherence in the classification. These two techniques can also be combined by using class probabilities given by a standard classifier as unary potentials. In that case, the random fields can be seen as a subsequent semantic segmentation, and the first point-wise classification problem remains. We focus on this first classification, because the better its results are, the better further processing will perform.
3D neural networks have recently been used for 3D semantic segmentation. Huang et al. [10] used a 3D version of fully convolutional neural networks (FCNN) to label point clouds from voxel-wise predictions. An original Multilayer Perceptron (MLP) architecture, named Pointnet, has been presented by Qi et al. [17], and is able to extract both global and local features from a 3D point cloud. Its extension, Pointnet++ [18], achieves even better results by aggregating local features in a hierarchical manner. 3D neural networks can also be combined with graph-based segmentation methods to design more elaborate 3D semantic segmentation algorithms. Tchapmi et al. [25] used a CRF on 3D FCNN predictions to enforce global consistency and provide fine-grained semantics. On the other hand, Landrieu and Simonovsky [12] prefer to segment the cloud into a Superpoint Graph first, and then use the Pointnet architecture and graph convolutions to classify each Superpoint. Even though hand-crafted features rarely perform at the level of deep learning architectures, our multiscale features compete with most of these methods.
Although we propose a slightly different set of features than Hackel et al. [9], the originality of our method lies in the selection of point neighborhoods. In the case of 3D point classification, the two most commonly used neighborhood definitions are the spherical neighborhood and the k-nearest neighbors (KNN). For a given point, the spherical neighborhood comprises the points situated less than a fixed radius away from it, and the k-nearest neighbors comprise a fixed number of closest points to it. We can also add a third definition, mostly used for airborne lidar data [5, 15]: the cylindrical neighborhood, which comprises the points situated less than a fixed radius from the given point on a 2D projection of the cloud (frequently on the horizontal plane). Whatever definition is chosen, the scale of the neighborhood has to be determined. Using a fixed scale across the scene is inadequate because most scenes contain objects of various sizes. Weinmann et al. explored a way to adapt the scale to each point of the cloud. However, using a multiscale approach has proven to be more effective, whether it is used with KNN [16, 9], with spherical/cylindrical neighborhoods [4, 15], or with a combination of all neighborhood types. The major drawback of multiscale neighborhoods is their computational time, but Hackel et al. [9] suggested a simple and efficient solution to implement them, based on iterative subsamplings of the cloud. However, their definition of multiscale neighborhoods, using KNN, lacks geometrical meaning. Section 2 describes our definition of multiscale spherical neighborhoods, which keeps the features undistorted while ensuring sufficient density at each scale.
We chose to evaluate our multiscale spherical neighborhood definition on a semantic classification task. The features we use and our learning strategy are described in Section 3. We conduct several experiments, detailed in Section 4, on various datasets. First, we validate that our multiscale features outperform state-of-the-art features in the same experimental conditions on two small outdoor datasets. Then we compare our classification results to more elaborate semantic segmentation methods on three bigger datasets. Finally, the influence of the parameters is highlighted in the last subsection.
2 Multiscale Spherical Neighborhoods
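Our new definition of multiscale neighborhoods is inspired by [9], with spherical neighborhoods instead of KNN. This section highlights the differences between both definitions. Let $C \subset \mathbb{R}^3$ be a point cloud; the spherical neighborhood of point $p_0 \in \mathbb{R}^3$ in $C$ with radius $r \in \mathbb{R}$ is defined by:

$$\mathcal{S}_r(p_0, C) = \{\, p \in C \;:\; \lVert p - p_0 \rVert \leq r \,\}$$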
Unlike KNN, this neighborhood corresponds to a fixed part of the space, as shown in Figure 2. This property is the key to giving a more consistent geometrical meaning to the features. However, in that fixed part, the number of points can vary according to the cloud density. As Hackel et al. [9] explained, radius search should be the correct procedure from a purely conceptual viewpoint, but it is impractical if the point density exhibits strong variations. Two phenomena appear in particular:
- Having too many points when the neighborhood scale is too big or the density too high.
- Having too few points when the neighborhood scale is too small or the density too low.
The first phenomenon has computational consequences, as retrieving a large number of neighbors in a larger set of points takes a lot of time. The computational cost of multiscale features does not come from the fact that we have to compute the features $S$ times, where $S$ is the number of scales. The real limiting factor is the number of points contained in the biggest scales. Furthermore, not all of those points are required to compute relevant features. Our features capture a global shape in the neighborhood and do not need fine details. The solution proposed by Hackel et al. [9], subsampling the cloud proportionally to the scale of the neighborhood, can be adapted to the spherical definition. This solution suits spherical neighborhoods better than KNN. With a uniform density, the number of points in the neighborhood becomes a feature itself, describing the neighborhood occupancy rate. We chose to subsample the cloud with a grid, keeping the barycenter of the points contained in each cell. Let $l$ be the size of the grid cells; for any radius $r$ of a neighborhood, we can control the maximum number of points in the neighborhood with the parameter $\rho = r / l$. If $\rho$ is too low, we will not have enough points and the features will not be discriminative, but the higher its value, the longer the computations.
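The grid subsampling described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `grid_subsample`, the `cell_size` argument, and the toy coordinates are assumptions made for the example.

```python
from collections import defaultdict
from math import floor


def grid_subsample(points, cell_size):
    """Keep one point per occupied grid cell: the barycenter of its points."""
    cells = defaultdict(list)
    for p in points:
        # Integer cell index along each axis (floor also handles negatives).
        key = tuple(floor(c / cell_size) for c in p)
        cells[key].append(p)
    # Average the coordinates of the points that fell into the same cell.
    return [
        tuple(sum(coords) / len(pts) for coords in zip(*pts))
        for pts in cells.values()
    ]


# Toy cloud (hypothetical values): two points share a cell, one is alone.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.0)]
sub = grid_subsample(points, cell_size=1.0)
print(len(sub))  # number of occupied cells
```

Because the subsampled cloud holds at most one point per cell, a radius search on it returns a bounded number of points whatever the original density.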
The impact of the second phenomenon, caused by low densities, should be limited by the use of multiple scales. As illustrated in Figure 2, if the density is too low, the points will remain the same after subsampling between two consecutive scales (neighborhood C). With spherical neighborhoods, the small scale might not contain enough points for a good description, but the large scale will deliver the information. In the same case, the KNN behave differently, giving exactly the same information at both scales. Regardless of the neighborhoods, there is no information to get from the data at the smaller scale. However, the KNN give a false description of the smaller scale without any measure of its reliability, whereas spherical neighbors give the number of points, which is an indication of the robustness of the description.
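The difference in behaviour under low density can be illustrated with a small sketch. The helper names, the toy cloud, the radii, and the value of `k` are all hypothetical; a brute-force search stands in for whatever spatial index a real implementation would use.

```python
from math import dist


def spherical_neighbors(query, cloud, radius):
    """Points of `cloud` within `radius` of `query` (fixed region of space)."""
    return [p for p in cloud if dist(p, query) <= radius]


def k_nearest_neighbors(query, cloud, k):
    """The `k` points of `cloud` closest to `query` (fixed number of points)."""
    return sorted(cloud, key=lambda p: dist(p, query))[:k]


# Sparse toy cloud: one point every meter along the x axis.
sparse_cloud = [(float(x), 0.0, 0.0) for x in range(6)]
query = (0.0, 0.0, 0.0)

# The density is so low that subsampling would leave the cloud unchanged,
# so both scales search the same set of points.
small_sphere = spherical_neighbors(query, sparse_cloud, radius=0.5)
large_sphere = spherical_neighbors(query, sparse_cloud, radius=2.0)
small_knn = k_nearest_neighbors(query, sparse_cloud, k=3)
large_knn = k_nearest_neighbors(query, sparse_cloud, k=3)

print(len(small_sphere), len(large_sphere))  # point counts differ per scale
print(small_knn == large_knn)  # KNN returns the same neighbors at both scales
```

The point count returned by the spherical search is what signals that the small scale is poorly supported; the KNN result carries no such indication.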
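The scales are defined by three parameters: the radius of the smallest neighborhood $r_0$, the number of scales $S$, and the ratio between the radii of consecutive neighborhoods $\varphi$. We can then define the neighborhood at each scale $s \in \{0, \dots, S-1\}$ around point $p_0$ as:

$$\mathcal{N}_s(p_0) = \{\, p \in C_s \;:\; \lVert p - p_0 \rVert \leq r_s \,\}$$

with $r_s = r_0 \, \varphi^s$ being the radius at scale $s$ and $C_s$ being the cloud subsampled with a grid size of $r_s / \rho$, where $\rho$ is the density parameter, i.e. the ratio between the neighborhood radius and the grid cell size.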
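As a small worked example of how the scales unfold, the sketch below lists the radius and grid cell size at each scale. It assumes that the radius is multiplied by a constant ratio from one scale to the next and that the grid cell size is the radius divided by the density parameter; the function name, argument names, and numerical values are illustrative and are not the settings used in the experiments.

```python
def multiscale_parameters(r0, num_scales, ratio, rho):
    """Return one (radius, grid_cell_size) pair per scale."""
    scales = []
    for s in range(num_scales):
        radius = r0 * ratio ** s  # radius grows geometrically with the scale
        cell_size = radius / rho  # assumed: cell size = radius / rho
        scales.append((radius, cell_size))
    return scales


# Illustrative values only.
params = multiscale_parameters(r0=0.5, num_scales=3, ratio=2.0, rho=4.0)
for radius, cell_size in params:
    print(f"radius={radius} m, grid cell={cell_size} m")
```

Since the cell size scales with the radius, every scale sees roughly the same maximum number of points, which is what keeps the largest scales affordable.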
Despite its similarity with the definition proposed by Hackel et al. [9], our multiscale neighborhood definition stands out by its geometrical meaning. With spherical neighborhoods instead of KNN, the features always describe a part of the space of the same size at each scale. Moreover, the number of points in the neighborhood is now a feature itself, adding even more value to this definition. Beyond good theoretical behaviour, this leads to better feature performance, as shown in Section 4.
3 Point-wise Semantic Classification
3.1 Geometric and Color Features
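Table 1: The two sets of features, computed at each scale.

| Feature set | Features |
|---|---|
| Geometric | Sum of eigenvalues; omnivariance; eigenentropy; linearity; planarity; sphericity; change of curvature; verticality (x2); absolute moment (x6); vertical moment (x2); number of points |
| Color | Average color (x3); color variance (x3) |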
For benchmarking purposes, we divide our features into two sets, described in Table 1. The first set does not use any additional information like intensity, color, or multispectral measures, in order to match the conditions of previous work [27, 9] in our first experiment (Section 4.1). In the other experiments, additional color features are used when available. We use covariance-based features that simply derive from the eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3$ and corresponding eigenvectors $e_1, e_2, e_3$ of the neighborhood covariance matrix, defined by:

$$\mathrm{cov}(\mathcal{N}) = \frac{1}{|\mathcal{N}|} \sum_{p \in \mathcal{N}} (p - \bar{p})(p - \bar{p})^{T}$$

where $\bar{p}$ is the centroid of the neighborhood $\mathcal{N}$. From the eigenvalues, we can compute several features: sum of eigenvalues, omnivariance, eigenentropy, linearity, planarity, sphericity, anisotropy, and change of curvature. However, among those commonly used features, we eliminate anisotropy, defined by $(\lambda_1 - \lambda_3) / \lambda_1$, as it is strictly equivalent to sphericity. Thanks to the nature of our neighborhoods, we do not need to normalize the eigenvalues as in previous works. Their values do not vary with the original point cloud density, which means that the features that do not involve ratios, e.g. sum of eigenvalues, omnivariance, and eigenentropy, make more sense. Our feature set is completed by verticality that we redefined as . Unlike Hackel et al. [9], we keep the verticality for the first and the last eigenvectors. The first one encodes the verticality of linear objects, and the last one the verticality of the normal vector of planar objects. We also use first- and second-order moments around all three eigenvectors, but in absolute value, as the eigenvectors have random orientations. Following our assumption that the vertical direction plays an important role, additional vertical moments are computed around the vertical vector, in relative value, as the upward direction is always the same. Finally, as explained in Section 2, the number of points in a neighborhood completes our first set of features, which contains 18 values at each scale. In Section 4.2, we use colors as previous works did, because some objects like closed doors or windows are indistinguishable in 3D. We chose simple features, the mean and the variance of each color channel, bringing the total number of features per scale to 24.
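The eigenvalue-based features can be sketched as below. The formulas are the definitions commonly used in the literature cited here and are an assumption, since the exact expressions of Table 1 are not reproduced in this text; the function name and the sample eigenvalues are hypothetical. Eigenvalues are left unnormalized, as the paragraph above describes.

```python
from math import log


def eigen_features(l1, l2, l3):
    """Shape features from eigenvalues sorted as l1 >= l2 >= l3 > 0."""
    total = l1 + l2 + l3
    return {
        "sum": total,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "eigenentropy": -sum(l * log(l) for l in (l1, l2, l3)),
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "anisotropy": (l1 - l3) / l1,  # shown only to check the redundancy
        "change_of_curvature": l3 / total,
    }


feats = eigen_features(4.0, 2.0, 1.0)
# Anisotropy carries no extra information: it equals 1 - sphericity.
print(feats["anisotropy"], 1.0 - feats["sphericity"])
```

With these definitions, dropping anisotropy loses nothing, which is consistent with the choice made in the text.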
3.2 Learning Strategy
There are two difficulties when classifying a point cloud. First, its size is generally huge; second, the classes are heavily unbalanced. To address those problems, one can take a subset of the training data, small enough to allow reasonable training times, and balance the classes in that subset. The scope of the results also depends on the test set. With small datasets, the rest of the points are used as the test set, even if they represent the same scene. The results from such experiments would be questionable as a measure of classification performance; however, they may still be used to compare the descriptive power of different features. The recent appearance of bigger point cloud datasets allowed the separation of the training set and the test set. With such point clouds, it is possible to get a relevant measure of how well the classification generalizes to unseen data.
In Section 4.1, we compare our multiscale features to state-of-the-art features [27, 9] in the same experimental conditions. The same number of points is randomly picked in each class to train a classifier, and this classifier is tested on the rest of the cloud. We go further than previous works by computing our results several times with different training sets. We cannot ensure that the comparison is valid without checking the distribution of results over a large number of trials. The quality of our multiscale features can be assessed more reliably in these conditions.
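The balanced random selection, repeated over several trials, might look like the sketch below. The function name `balanced_split`, the labels, and the sample counts are hypothetical and far smaller than in the experiments; the classifier itself is left out.

```python
import random
from collections import defaultdict


def balanced_split(labels, per_class, seed):
    """Pick `per_class` random indices per class; the rest is the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train = []
    for indices in by_class.values():
        # Guard against classes with fewer points than requested.
        train.extend(rng.sample(indices, min(per_class, len(indices))))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test


# Heavily unbalanced toy labels.
labels = ["ground"] * 10 + ["car"] * 3
# One split per trial, each with its own seed, to study result variability.
splits = [balanced_split(labels, per_class=2, seed=trial) for trial in range(5)]
train, test = splits[0]
print(len(train), len(test))
```

Training and scoring a classifier on each split then gives a distribution of results rather than a single number.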
On bigger datasets, we use a different learning strategy. We iteratively add points to the training set with a trial-and-error procedure. A classifier is trained on a set of points $T$ from the training clouds $C_{train}$, then the classifier is tested on $C_{train}$, and we randomly add some of the misclassified points to $T$. After some iterations, the classifier is used on the test clouds. The experiments in Section 4.2 use this learning strategy, which only consists of a smart choice of the training points. We cannot use this strategy on small datasets, because the test scene is the same as the training scene and our classification would show overfitted results.
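The iterative selection of training points can be sketched as a simple loop. Everything specific here is an assumption made for illustration: the 1-nearest-neighbor rule on a scalar feature is only a stand-in for the random forest, and the function names, pool, and iteration counts are hypothetical.

```python
import random


def nearest_label(x, train):
    """Stand-in classifier: label of the closest training sample."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def grow_training_set(pool, initial_size, n_iterations, n_added, seed=0):
    """Start from random samples, then add misclassified ones iteratively."""
    rng = random.Random(seed)
    train = rng.sample(pool, initial_size)
    for _ in range(n_iterations):
        # Test on the whole training pool and collect the errors.
        wrong = [s for s in pool if nearest_label(s[0], train) != s[1]]
        if not wrong:
            break
        # Randomly add some of the misclassified samples.
        train.extend(rng.sample(wrong, min(n_added, len(wrong))))
    return train


# Toy pool of (feature, label) samples with distinct feature values.
pool = [(float(x), "a" if x < 10 else "b") for x in range(20)]
train = grow_training_set(pool, initial_size=2, n_iterations=5, n_added=2)
print(len(train))
```

The loop only changes which points are used for training; the classifier and the features stay the same.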
4.1 State of the art features comparison
Table 2: Class scores (with standard deviation) on the Rue Madame (top) and Rue Cassette (bottom) datasets. Results for [27] and [9] are converted from the corresponding articles.
The goal of our first experiment is to assess the performance of our multiscale features against other state-of-the-art features [27, 9]; thus, we keep the same experimental conditions as Weinmann et al. [27] and Hackel et al. [9]. We use the Paris-Rue-Madame dataset [23], a 160-meter street scan containing 20 million points, and the Paris-Rue-Cassette dataset [26], a 200-meter street scan containing 12 million points. To focus the comparison on the features, we use a random forest classifier trained on 1000 random points per class for each dataset, as previous works did. We chose the parameters , , and . The first three parameters were chosen so that the scales of our neighborhoods cover a range from the smallest object size to the order of magnitude of a facade, and the last parameter was chosen empirically (see Section 4.3). With an average personal computer setup (32 GB RAM, Intel Core i7-3770, 3.4 GHz), our feature extraction took about 319 seconds on Rue Cassette, which is the same order of magnitude as the method in [9] (191 s) and much faster than the method in [27] (23000 s).
where $TP$, $FP$, and $FN$ respectively denote true positives, false positives, and false negatives for each class. As stated in Section 3.2, we repeat our experiment 500 times to ensure the validity of the comparison despite the random factor in the choice of the training set. In Table 2, we report the average class scores to compare with previous results, and the standard deviations to show the consistency of the classification. The performances of our multiscale features exceed previous results by mean points on Rue Madame and mean points on Rue Cassette. We can also note that our results do not vary much, the standard deviations being limited to a few percent even for the hardest classes with fewer points.
Table 3, footnote on PointNet [17]: As Pointnet was evaluated in a k-fold strategy in the original paper, we obtained the results on this particular split from the authors.
In conclusion, the low standard deviation validates the random selection of the training set and legitimizes the comparison of the different sets of features. Our multiscale features thus proved to be superior to state-of-the-art features. The difference between Hackel et al.’s multiscale features [9] and ours may seem like an implementation detail, with radius neighborhoods instead of KNN. However, the results show that the type of local neighborhood definition has a great impact on the robustness of the features.
4.2 Results on large scale data
Our second experiment shows how our classification method generalizes to unseen data. As shown in Table 4, we chose three large-scale datasets from different environments and acquisition methods. With these datasets, we use the smart choice of training points described in Section 3.2 and a random forest classifier. This simple classification algorithm is designed to focus on the point-wise descriptive power of our features; we call it RF_MSSF, for "Random Forest with Multi-Scale Spherical Features."
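Table 4: Acquisition method of the three large-scale datasets.

| Dataset | S3DIS | Semantic3D | Paris-Lille-3D |
|---|---|---|---|
| Acquisition | Cameras | Fixed lidar | Mobile lidar |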
The Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [1] was acquired by 3D cameras and covers six large-scale indoor areas from three different buildings for a total of and 273 million points. To keep the same experimental conditions as Tchapmi et al. [25], we use the fifth area as the test set and train on the rest of the data. The original annotation comprises 12 semantic elements which pertain to the categories of structural building elements (ceiling, floor, wall, beam, column, window, and door) and commonly found furniture (table, chair, sofa, bookcase, and board). A clutter class exists as well for all other elements. This last class has no semantic meaning, like the "unclassified" points in the other datasets, and will not be considered during training and testing. As the object scales in this dataset are smaller than the object scales in a street, we adapt the parameters to , , , and . The classifier is trained on 50000 sample points chosen with the procedure described in Section 3.2. Table 3 shows that our classification method outperforms the deep learning architectures of [17, 25] but is unable to compete with the cutting-edge algorithm of Landrieu and Simonovsky [12].
Semantic3D [8] is an online benchmark comprising several fixed lidar scans of different outdoor places. This is currently the dataset with the highest number of points (more than 4 billion), and the greatest covered area (around ). We kept the parameters used in the previous outdoor experiment: , , , and . Table 5 provides our results on the reduced-8 challenge. Our classification method ranked second at the time of the submission. Once again, it beats several deep learning architectures and is only outperformed by the same algorithm [12]. We can notice that our results exceed Hackel et al.'s results by a large margin, consolidating the conclusion in Section 4.1.
Paris-Lille-3D [20] is a recent dataset that was acquired with a Mobile Laser Scanning system in two cities in France: Lille and Paris. Overall, the scans contain more than 140 million points on of streets, covering a area, which is much bigger than other mobile mapping datasets like Rue Madame and Rue Cassette. This dataset, fully annotated by hand, comprises 50 classes unequally distributed in three scenes: Lille1, Lille2, and Paris. Following the authors’ guideline, we designed 10 coarser classes defining meaningful groups: Unclassified, Ground, Building, Signage, Bollard, Trash cans, Barriers, Pedestrians, Cars, and Vegetation. We provide an XML file in the supplementary materials, which maps the original classes to our coarse classes. Among our ten classes, the first one, Unclassified, is ignored during training and testing. We choose to train our classifier on the two scenes Lille1 and Lille2 and to use Paris as the test fold. This dataset does not include colors, so we only use our first set of features and choose the parameters used in the other outdoor environments: , , , and . Our results are shown in Table 6. Although this dataset is recent and does not have any other baseline result for now, we find it very interesting because of its cross-city split. We see that our classifier can transfer knowledge from one city to another and is particularly efficient on buildings. This is remarkable given that the architectural styles of Lille and Paris are very different.
Figures 3, 4, and 5 show some examples of classified scenes. First, we can notice that the classification has no object coherence as some unstructured patches appear, for example on the columns in Figure 3 or on the facades in Figure 5. This highlights the particularity of our method to focus on points independently, not using any segmentation scheme. Another very interesting pattern appears on the second scene in Figure 5: when a car is close to a tree, it is misclassified and we can actually see the influence area of the tree on the car. We can assume that the classifier relies more on the large scales to distinguish those two particular classes.
Overall, our classification algorithm ranks among the best approaches, beating nearly every other elaborate method apart from Superpoint Graphs [12] on these datasets. However, this has to be considered in light of the fact that we do not use any segmentation or regularization process and only focus on the descriptive power of our features. We showed that our features beat state-of-the-art features in terms of classification performance, and that they could, alone, compete with complex classification schemes, including deep learning methods.
4.3 Density parameter influence
Finally, we evaluate the influence of the density parameter $\rho$ in our classification method. As a reminder, this parameter controls the number of subsampled points that a neighborhood can contain. A high value means better features but slower computations. In this experiment, we chose to use Paris-Lille-3D for two reasons. First, we want to focus on the 3D descriptors and, thus, do not need color information. Second, the results generalize well because they are cross-city, tested on Paris after being trained on Lille. With the parameters previously used on this dataset, we compute average scores across all classes for different values of $\rho$. Figure 6 shows the evolution of the results along with the feature computation speed for every split of the dataset. We can note that average scores rise quickly up to and do not increase much for higher values of $\rho$. Depending on the application, one can choose to optimize the results or the processing speed with this parameter. Although our performances could be slightly increased with a higher value, we chose to keep in our work because it is a trade-off between performance and computation speed.
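A rough estimate helps to see why computation time grows quickly with this parameter. Assume, as a working hypothesis, that the parameter (written $\rho$ here) is the ratio between the neighborhood radius $r$ and the grid cell size $l$, and that subsampling keeps at most one point per cell. The number of points in a neighborhood is then bounded approximately by the number of cells that fit in the sphere:

```latex
N_{\max} \approx \frac{\tfrac{4}{3}\pi r^{3}}{l^{3}} = \frac{4}{3}\pi \rho^{3}
```

Under these assumptions the bound does not depend on the scale and grows cubically with $\rho$: doubling the parameter multiplies the number of neighbors to process by about eight.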
This paper presents a 3D point cloud semantic classification approach articulated around new multiscale features. The use of spherical neighborhoods instead of KNN increases the discriminating power of our features, leading to better performance than state-of-the-art features in the same experimental conditions. We also showed that the performance of our algorithm is consistent on three datasets acquired with different technologies in different environments. Finally, we showed that our approach outperforms recent and complex classification schemes, including deep learning methods, on large-scale datasets. Deep learning is becoming the standard for several classification tasks, but there is room for improvement with handcrafted methods. Furthermore, the ideas that come out of such methods, like our new multiscale neighborhood definition, could benefit other frameworks, including deep learning.
- [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
- [2] R. Blomley, B. Jutzi, and M. Weinmann. Classification of airborne laser scanning data using geometric multi-scale features and different neighbourhood types. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 3(3), 2016.
- [3] A. Boulch, B. L. Saux, and N. Audebert. Unstructured point cloud semantic labeling using deep segmentation networks. In Eurographics Workshop on 3D Object Retrieval, volume 2, page 1, 2017.
- [4] N. Brodu and D. Lague. 3D terrestrial lidar data classification of complex natural scenes using a multi-scale dimensionality criterion: Applications in geomorphology. ISPRS Journal of Photogrammetry and Remote Sensing, 68:121–134, 2012.
- [5] N. Chehata, L. Guo, and C. Mallet. Airborne lidar feature selection for urban classification using random forests. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 38(Part 3):W8, 2009.
- [6] J. Demantke, C. Mallet, N. David, and B. Vallet. Dimensionality based scale selection in 3D lidar point clouds. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 38(Part 5):W12, 2011.
- [7] A. Golovinskiy, V. G. Kim, and T. Funkhouser. Shape-based recognition of 3D point clouds in urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2154–2161. IEEE, 2009.
- [8] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. Semantic3D.net: A new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847, 2017.
- [9] T. Hackel, J. D. Wegner, and K. Schindler. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Prague, Czech Republic, 3:177–184, 2016.
- [10] J. Huang and S. You. Point cloud labeling using 3D convolutional neural network. In Proc. of the International Conf. on Pattern Recognition (ICPR), volume 2, 2016.
- [11] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, 1999.
- [12] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. arXiv preprint arXiv:1711.09869, 2017.
- [13] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg. Deep projective 3D semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pages 95–107. Springer, 2017.
- [14] D. Munoz, N. Vandapel, and M. Hebert. Onboard contextual classification of 3-D point clouds with learned high-order Markov random fields. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on. IEEE, 2009.
- [15] J. Niemeyer, F. Rottensteiner, and U. Soergel. Contextual classification of lidar data and building object detection in urban areas. ISPRS Journal of Photogrammetry and Remote Sensing, 87:152–165, 2014.
- [16] M. Pauly, R. Keiser, and M. Gross. Multi-scale feature extraction on point-sampled surfaces. In Computer Graphics Forum, volume 22, pages 281–289. Wiley Online Library, 2003.
- [17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
- [18] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
- [19] X. Roynard, J.-E. Deschaud, and F. Goulette. Fast and robust segmentation and classification for change detection in urban point clouds. In ISPRS 2016-XXIII ISPRS Congress, 2016.
- [20] X. Roynard, J.-E. Deschaud, and F. Goulette. Paris-Lille-3D: a large and high-quality ground truth urban point cloud dataset for automatic segmentation and classification. arXiv preprint arXiv:1712.00032, 2017.
- [21] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. IEEE, 2009.
- [22] A. Serna and B. Marcotegui. Detection, segmentation and classification of 3D urban objects using mathematical morphology and supervised learning. ISPRS Journal of Photogrammetry and Remote Sensing, 93:243–255, 2014.
- [23] A. Serna, B. Marcotegui, F. Goulette, and J.-E. Deschaud. Paris-rue-Madame database: a 3D mobile laser scanner dataset for benchmarking urban detection, segmentation and classification methods. In 4th International Conference on Pattern Recognition, Applications and Methods ICPRAM 2014, 2014.
- [24] R. Shapovalov, E. Velizhev, and O. Barinova. Nonassociative Markov networks for 3D point cloud classification. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XXXVIII, Part 3A. Citeseer, 2010.
- [25] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese. Segcloud: Semantic segmentation of 3D point clouds. In International Conference on 3D Vision (3DV), 2017.
- [26] B. Vallet, M. Brédif, A. Serna, B. Marcotegui, and N. Paparoditis. Terramobilita/iqmulus urban point cloud analysis benchmark. Computers & Graphics, 49:126–133, 2015.
- [27] M. Weinmann, B. Jutzi, S. Hinz, and C. Mallet. Semantic point cloud interpretation based on optimal neighborhoods, relevant features and efficient classifiers. ISPRS Journal of Photogrammetry and Remote Sensing, 105:286–304, 2015.
- [28] M. Weinmann, B. Jutzi, and C. Mallet. Feature relevance assessment for the semantic interpretation of 3D point cloud data. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 5:W2, 2013.