Lidar sensors are used in a large number of fields, providing a three-dimensional representation of the given environment. Object detection on Lidar data is widely considered to be a crucial aspect for perception in automotive active safety and autonomous driving. Hence, there is a large number of works on object detection [26, 35, 33, 18] and odometry [22, 5, 19] in the Lidar space, with many approaches reporting incremental improvements on benchmark datasets [13, 3] targeted towards automotive applications.
While many methods show inventive data processing combined with deep learning techniques to achieve very good performance on those tasks, the vast majority is heavily optimised to these benchmarks and does not reach real-time performance, even when executed on powerful GPUs. This makes their application in vehicles unfeasible today and the foreseeable future, since dedicating so much computational power towards one algorithm is impractical concerning cost, power requirements and heat dissipation.
Thus, in this work we propose a meaningful representation for object instances in Lidar sensor data to facilitate real-time capable, CPU-based classification. We accomplish this task with a small custom neural network architecture, applying methods that maintain three-dimensional information in a two-dimensional image representation together with selected object statistics.
II Related Work
There is a multitude of publications concerned with object detection and classification on Lidar data. In an attempt to provide a short overview, we roughly divide them into three categories: algorithms that work on unregularised point clouds, those that use regularised ones, and fusion approaches that combine Lidar with other sensors, mostly cameras.
A lot of methods process unregularised point clouds and usually generate per-point features. PointNet, offering global feature generation, and PointNet++, cascading instances of the former for more localised features, serve as the backbone for a large number of publications, including PointRCNN. While achieving good performance, these approaches are computationally intensive, which rules them out for embedded application.
A second group of algorithms regularises Lidar point clouds before processing them. To do so, grid structures are used in fixed [35, 28, 7] or variable sizes. To alleviate the negative performance impact of the many empty grid cells created by these methods, subsequent works established specific network layers that exploit this sparsity, providing a significant speed-up. Other works reduce grids to a two-dimensional representation, enabling the omission of costly 3D convolutions for faster runtime.
The third and last set of works adds information from camera sensors to enrich Lidar data. Some of them use camera images to create region proposals, which are refined by a Lidar algorithm in a subsequent step. Here, a popular approach is the use of a frustum for projection into the point cloud, as shown in [24, 34] for example. Comparable methods utilise full three-dimensional object proposals from image inputs or more complex deep learning fusion networks.
All of the approaches mentioned above are highly computationally intensive, with many of them not being real-time capable even on high-end GPUs. Little work can be found that concentrates on real-time application of Lidar algorithms on systems with less computational power. One approach claims such capabilities using specific optimisations for powerful CPUs, but reports sub-par results.
III Method

To facilitate real-time CPU-based Lidar object classification of segmented instances, we consider the following aspects valuable.
III-A Lidar data representation
From the raw data of a Lidar sensor, several different information aspects can be used to describe a measured point. The obvious first aspects are the X, Y and Z coordinate values of this point, usually given in Cartesian coordinates with the origin of the coordinate system in the location of the Lidar sensor. Derived from this, the distance or range of the point can be calculated, for example as Euclidean distance. Furthermore, the measured intensity of a point is normally given in the raw data, enabling indication of the surface reflectivity of an object.
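As a minimal sketch of how such per-point representations can be derived (assuming a hypothetical range-image layout of shape H × W × 3; this is not the authors' code), the Euclidean range follows directly from the Cartesian coordinates:

```python
import numpy as np

def range_image(xyz: np.ndarray) -> np.ndarray:
    """Euclidean distance per measured point.

    xyz: (H, W, 3) array of Cartesian coordinates with the
    sensor at the origin of the coordinate system.
    """
    return np.linalg.norm(xyz, axis=-1)

# Tiny example: a 1 x 2 "image" of two points.
pts = np.array([[[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]])  # shape (1, 2, 3)
rng = range_image(pts)
```

The same array layout would hold the intensity channel and, as described next, the normal vector components.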
To gain more knowledge about the surface of an object in the Lidar space, we propose to calculate an image representation of the horizontal and vertical components of the normal vector of each measured point. Such a representation is known from SLAM/odometry algorithms for Lidar data [22, 5, 19]. The normal vector for a measured Lidar point can be determined using the angle relationships shown in Figure 2: α and β are the angles between the line from the sensor origin to the given point and the lines from the given point to its respective neighbours. The angle bisector of the combined angle equals one component of the normal vector. We decided to use the scalar values of the angles instead of the more common cross-product calculation between neighbouring points in three-dimensional space. This strategy reduces the calculation of the normal vector image to a simple element-wise matrix subtraction. To compute the horizontal component, the neighbouring points left and right of the point in question are used; for the vertical component, the neighbours above and below are considered respectively. Using just one neighbouring point here would allow for a potentially larger error. An exemplary representation of both components can be seen in Figure 3, with an additional combined view for illustration purposes.
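One plausible reading of this angle-based scheme can be sketched as follows; the function, its name and the assumption that the two per-pixel angle images are already available are ours, not the paper's:

```python
import numpy as np

def normal_component(angle_to_left: np.ndarray,
                     angle_to_right: np.ndarray) -> np.ndarray:
    """Scalar normal-vector component per pixel of the range image.

    angle_to_left / angle_to_right: per-pixel angles between the
    sensor-to-point ray and the line to the left/right neighbour.
    The bisector of the combined angle is recovered (up to sign
    convention) by a single element-wise subtraction, so no 3D
    cross products between neighbouring points are needed.
    """
    return 0.5 * (angle_to_left - angle_to_right)
```

On a locally flat, fronto-parallel surface both neighbour angles are equal and the component vanishes, which matches the intuition that the normal then points straight back at the sensor.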
To identify redundant information in these seven data representations and to reduce the number of input layers that later need to be processed, we conducted an ablation experiment testing all possible input combinations (see Appendix). It showed that we can select only three layers, namely the intensity values and our horizontal and vertical normal vector component representations, and maintain performance with only a minimal accuracy decrease compared to using all seven of them.
III-B Object instance and mask
We build our classification approach on segmented object instances in the Lidar space, which facilitates a number of different applications. While integrating the proposed contributions of this work into a larger end-to-end neural network would be feasible, we opted for clustering algorithms to generate object instances. Some available methods have been optimised to provide good real-time performance for Lidar instance segmentation on automotive data [14, 6]. General-purpose clustering algorithms like DBSCAN could also be used, but are much more computationally intensive.
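For illustration only, a naive Euclidean flood-fill clustering could look like the sketch below; the real-time methods cited above [14, 6] operate on the range image and are far faster than this O(n²) toy:

```python
import numpy as np

def euclidean_clusters(points: np.ndarray, eps: float) -> list:
    """Group points whose chains of pairwise gaps stay below eps.

    points: (N, 3) array of Cartesian coordinates.
    Returns a list of clusters, each a sorted list of point indices.
    Naive flood fill, shown only to illustrate the instance
    segmentation step preceding our classifier.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = [seed], [seed]
        while queue:
            i = queue.pop()
            # Expand the cluster by all unvisited points within eps.
            near = [j for j in unvisited
                    if np.linalg.norm(points[i] - points[j]) < eps]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters
```

Two well-separated blobs yield two clusters regardless of visiting order.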
The use of the clustered objects for our classification is twofold. Primarily, we can apply the segmented instance as a mask on the different Lidar data representations described in Section III-A, emphasising the points belonging to the object and eliminating background influence. Furthermore, we can extract statistical information about each segmented object without much difficulty. While there is no way to be completely certain that an object is recognised in its full extent, such values are clearer and more deterministic than many machine learning representations and can be especially helpful in edge cases.
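Applying the instance mask to the stacked image representations amounts to an element-wise multiplication; a minimal sketch with assumed (H, W, C) layer and (H, W) mask shapes:

```python
import numpy as np

def apply_instance_mask(layers: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out every pixel that does not belong to the instance.

    layers: (H, W, C) stack of Lidar image representations,
            e.g. intensity plus the two normal vector components.
    mask:   (H, W) boolean instance mask from the clustering step.
    """
    return layers * mask[..., None]  # broadcast mask over channels
```

The broadcast keeps the channel dimension intact, so the masked stack feeds the network unchanged in shape.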
A clustering approach will segment all kinds of object instances above a separated ground plane, independent of whether they belong to a supported class relevant for automotive application. Hence, our classification will be presented with a majority of "None" objects along the roadside. For example, walls of shops and houses can be confused with trucks, street lights or other poles can resemble pedestrians, and in some cases cars might even be confused with larger bushes.
We use the width, length and height of a segmented object, the number of Lidar points belonging to it, as well as the Euclidean and X- and Y-axis distances from the sensor origin. In this way we generate a vector of size seven. Other authors use different but comparable geometric features in their approach to distinguish between main object categories in their closed-source dataset.
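A possible construction of this seven-element vector is sketched below; whether the distances refer to the instance centroid or some other reference point is not specified in the text, so the centroid is our assumption:

```python
import numpy as np

def statistics_vector(xyz: np.ndarray) -> np.ndarray:
    """Seven-element statistic vector for one segmented instance:
    width, length, height, point count, Euclidean distance and
    X-/Y-axis distances of the instance centroid from the sensor.

    xyz: (N, 3) coordinates of the instance's Lidar points.
    """
    extents = xyz.max(axis=0) - xyz.min(axis=0)  # width, length, height
    centroid = xyz.mean(axis=0)                  # assumed reference point
    return np.array([
        extents[0], extents[1], extents[2],
        float(len(xyz)),
        np.linalg.norm(centroid),
        abs(centroid[0]), abs(centroid[1]),
    ])
```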
While the influence on the overall classification performance scores is not particularly substantial, the use of our statistic vector proved to offer valuable decision support for critical edge cases in our experiments. Examples of this advantage in correcting both false positives and false negatives can be seen in Figure 5.
III-C Object instance classification architecture
Following our objective to facilitate a fast classification of Lidar instances that is capable of running on a CPU in real time, we propose a small Convolutional Neural Network (CNN) architecture. As depicted in Figure 4, it consists of two branches. The first one takes the statistic vector as input and comprises two fully connected layers. The second, larger branch processes the Lidar image representations and is characterised by two residual separable modules in between common convolution layers. Such a module features two depth-wise separable convolutions and maximum pooling in parallel to a residual connection with a filter kernel. Comparable structures have been popularised in deep learning architectures [15, 9, 31] and have shown benefits in application to small real-time networks [12, 16, 2].
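The parameter savings of such depth-wise separable convolutions, one of the reasons the network stays small, can be verified with simple arithmetic (the 3×3, 32→64 channel sizes are illustrative, not taken from the paper):

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Depth-wise k x k filter per input channel, followed by a
    1 x 1 point-wise convolution combining the channels."""
    return k * k * c_in + c_in * c_out

# 3x3 kernel, 32 -> 64 channels:
standard = conv_params(3, 32, 64)             # 18432 weights
separable = separable_conv_params(3, 32, 64)  # 2336 weights
```

The separable variant needs roughly an eighth of the weights here, which compounds over the whole network.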
The complete CNN architecture has a total of parameters, which results in a weight checkpoint file of kBytes in size. For comparison, state-of-the-art end-to-end Lidar networks have many millions of parameters and accordingly need much more memory to store their weights, which is another cost-driving factor in addition to their much higher computational requirements.
IV Experimental Evaluation
To evaluate the performance of our proposed method, we consider different aspects and metrics. While good detection/classification rates are important, we developed our approach with a focus on real-time capability for CPU-based platforms. Hence, we are not aiming to set new benchmark high scores in direct competition with orders-of-magnitude larger end-to-end Lidar networks. We rather aspire to provide reliable output under strong computational limitations.
We use the public dataset SemanticKITTI by Behley et al., which extends the original KITTI odometry benchmark with semantic annotation for all of its sequences, to make our evaluation transparent. An illustrative example of our method's performance can be seen in Figure 1. The semantically segmented point clouds, as well as the additional instance labels in the dataset, allow for multiple pointwise evaluation approaches. Accordingly, we use three different tasks to assess the performance:
IV-A Semantic Segmentation
For the semantic segmentation evaluation, we employ a general purpose clustering method to agnostically separate object instances in the three-dimensional point cloud. As described in Section III, we classify the segmented instances from these clusters.
Our approach classifies objects in five general automotive classes: “Cars”, “Trucks”, “Pedestrians”, “Bikes” and the “None” class, which embodies all static background classes such as road surface, buildings and vegetation. To achieve this mapping, we combined the SemanticKITTI classes “Bicycle”, “Bicyclist”, “Motorcycle” and “Motorcyclist” to “Bike”, as well as “Truck”, “Bus”, “On-Rails” and “Other-Vehicle” to “Trucks”. The classification network has been trained with the annotated point clouds of the available training logs. As suggested in the SemanticKITTI API documentation, we kept the 8th log separate for validation. In this manner we are able to test semantic segmentation results with the reduced class mapping on unseen data. Table I shows the results for the class-wise semantic segmentation intersection over union (IoU) metric of the combined approach of clustering the point cloud and classifying each cluster separately. The IoU or Jaccard index is defined as
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

and in practice can be described as the relation of true positives (TP) to the sum of TP, false positives (FP) and false negatives (FN).
This metric provides a good impression of pointwise segmentation quality, since the correct predictions and both types of incorrect predictions for each point are included in the equation.
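The pointwise IoU described here can be computed per class as a short sketch (integer label encodings are assumed):

```python
import numpy as np

def class_iou(pred: np.ndarray, truth: np.ndarray, cls: int) -> float:
    """Pointwise intersection over union for one class:
    IoU = TP / (TP + FP + FN)."""
    p, t = pred == cls, truth == cls
    tp = np.logical_and(p, t).sum()   # correctly predicted points
    fp = np.logical_and(p, ~t).sum()  # predicted but not ground truth
    fn = np.logical_and(~p, t).sum()  # ground truth but missed
    return tp / (tp + fp + fn) if tp + fp + fn else 0.0
```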
The score of this metric is computed for two approaches. The first approach is the classification of clustered instances as an end-to-end pipeline on point cloud data to show the performance of the proposed method on unseen samples. The second approach applies the classification directly on the annotated ground truth instances in the dataset. This allows for a comparison, on how the performance is influenced by the grade of the provided object instances. As the results show, performance of the semantic segmentation directly depends on the quality of the given object clusters.
IV-B Object Detection
The second evaluation method is meant to assess the performance of object detection. For this, the previously mentioned IoU metric is used to define bins of precision in the detection. Ten bins are defined with a pointwise overlap of the ground truth objects and proposed clusters, ranging from an IoU of 0.5 to 0.95 in steps of 0.05. The average of all 10 bins is the single metric score, the Average Precision (AP), which is shown in Table II.
Additionally, the AP for the overlap values of 0.5, 0.75 and 0.95 is listed, for which the evaluation is restricted to objects above the denoted IoU.
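A simplified sketch of this binned AP computation, assuming each ground-truth object is already matched to its best-overlapping cluster and ignoring unmatched proposals:

```python
import numpy as np

def average_precision(ious) -> float:
    """Fraction of ground-truth objects whose best pointwise IoU
    exceeds each threshold, averaged over the ten bins
    0.5, 0.55, ..., 0.95."""
    ious = np.asarray(ious)
    thresholds = np.arange(0.5, 1.0, 0.05)  # ten bins
    return float(np.mean([(ious >= t).mean() for t in thresholds]))
```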
IV-C Panoptic Segmentation
Given the nature of our method, class-less separation of object clusters and background followed by object classification, we can use it to perform the task of panoptic segmentation. This term was coined by Kirillov et al. in their work of the same name . According to the authors, this task “unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance)”.
In panoptic segmentation the differentiation between Stuff, i.e. background, and Things, in our case active road users, is as important as the separation of Things among themselves. After separation into background and clusters and classification of the latter, we use the predicted class labels to remove the instance labels from clusters that are not part of the Things, in our case all None labels such as utility boxes, road signs and vegetation.
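Stripping instance labels from predicted Stuff points can be sketched as follows (the label encoding, with 0 for the None/background class and for "no instance", is our assumption):

```python
import numpy as np

def to_panoptic(sem_pred: np.ndarray, inst_ids: np.ndarray,
                none_label: int = 0) -> np.ndarray:
    """Keep instance ids only for points classified as Things;
    points predicted as the None/background class lose their id."""
    out = inst_ids.copy()
    out[sem_pred == none_label] = 0  # 0 = no instance, by assumption
    return out
```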
We validate our approach on the panoptic segmentation benchmark of the SemanticKITTI dataset. This challenge uses the panoptic quality (PQ), originally proposed by Kirillov et al., averaged over all classes as used by Porzi et al., on the whole test set. For a single class it is defined as
$$\mathrm{PQ} = \frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$

and is composed of the segmentation quality (SQ) and the recognition quality (RQ) as follows:

$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\mathrm{RQ}}$$
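These quantities can be computed from the matched segment pairs as in the following sketch, which follows the published PQ definition rather than the authors' evaluation code:

```python
def panoptic_quality(matched_ious, num_fp: int, num_fn: int):
    """PQ, SQ and RQ for one class.

    matched_ious: IoUs of matched (IoU > 0.5) prediction/ground-truth
                  segment pairs, i.e. the true positives.
    num_fp / num_fn: unmatched predicted / ground-truth segments.
    """
    tp = len(matched_ious)
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn) \
        if tp + num_fp + num_fn else 0.0
    return sq * rq, sq, rq
```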
At the time of writing, our method is the only submitted approach for the panoptic segmentation benchmark. We expect that by the time of publishing we will have dropped in this challenge's ranking, as our approach is, as previously shown, limited to only 4 of the 19 classes and will not yield competitive results summed over all classes. The results in Table III are therefore limited to the four classes described in Section III. We note that our training set uses a reduced class mapping (see Section IV-A) while the reported official results are evaluated on the full class set; therefore our performance on "Truck" and "Bike" suffers.
IV-D Runtime

We implemented the proposed network architecture in Python with Tensorflow, devoid of any further customisation or optimisation. For input instances of object proposals from the point cloud, which is representative of a residential to inner-city scene, inference time on an Intel i7-6820HQ laptop CPU @ 2.70 GHz is ms. If execution is limited to only two threads, resembling a small embedded processor, this value rises to ms.
Timing measurements are often not published alongside high-performing methods. In exceptional cases, they are reported on powerful GPUs. To get a rough estimate of the runtime of state-of-the-art networks, we timed the inference of one point cloud from SemanticKITTI with a network closely comparable to PointPillars. Doing so took s, which is nearly times slower than our approach, on the same CPU.
V Conclusion

We have presented an algorithmic approach to facilitate real-time CPU-based classification of segmented object instances in Lidar data. Our approach uses a component-wise decomposed normal vector image, instance masks and selected statistics to create a meaningful data representation. The proposed CNN architecture processes large batches of those instances efficiently, at a frequency high enough to enable real-time application while running on a CPU, even in Python without specific optimisation. Through evaluation on public data we have shown that our method achieves good performance on automotive Lidar semantic segmentation and object detection tasks, while being orders of magnitude faster to compute than current state-of-the-art approaches.
Through the combined use of instance segmentation and classification of the separated objects, we can further provide panoptic segmentation of street scenes. There is little previous work on this task and, to the best of our knowledge, ours is the first reported method that can accomplish it in real time on a CPU. In future work, we aim to improve our detection accuracy by taking past time steps into account for refinement through causal tracking approaches.
References

- (2020) Exploiting polar grid structure and object shadows for fast object detection in point clouds. In International Conference on Machine Vision, pp. 114330G.
- (2019) Real-time convolutional neural networks for emotion and gender classification. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 221–226.
- (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In IEEE International Conference on Computer Vision, pp. 9297–9307.
- (2020) A benchmark for lidar-based panoptic segmentation based on KITTI. arXiv:2003.02371 [cs.CV].
- (2018) Efficient surfel-based SLAM using 3D laser range data in urban environments. In Robotics: Science and Systems.
- (2016) Fast range image-based segmentation of sparse 3D laser scans for online operation. In IEEE International Conference on Intelligent Robots and Systems, pp. 163–169.
- (2019) Object as hotspots: an anchor-free 3D object detection approach via firing of hotspots. arXiv:1912.12791 [cs.CV].
- (2017) Multi-view 3D object detection network for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6526–6534.
- (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800–1807.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining, pp. 226–231.
- (2018) EffNet: an efficient structure for convolutional neural networks. In IEEE International Conference on Image Processing, pp. 6–10.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
- (2020) Fast lidar clustering by density and connectivity. arXiv:2003.00575 [cs.CV].
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 [cs.CV].
- (2019) Panoptic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413.
- (2019) PointPillars: fast encoders for object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
- (2019) LO-Net: deep real-time lidar odometry. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8465–8474.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2018) LMNet: real-time multiclass object detection on CPU using 3D lidar. In Asia-Pacific Conference on Intelligent Robot Systems, pp. 28–34.
- (2011) Velodyne SLAM. In IEEE Intelligent Vehicles Symposium, pp. 393–398.
- (2019) Seamless scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8277–8286.
- (2018) Frustum PointNets for 3D object detection from RGB-D data. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 77–85.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5100–5109.
- (2019) PointRCNN: 3D object proposal generation and detection from point cloud. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779.
- (2019) PV-RCNN: point-voxel feature set abstraction for 3D object detection. arXiv:1912.13192 [cs.CV].
- (2019) RoarNet: a robust 3D object detection based on region approximation refinement. In IEEE Intelligent Vehicles Symposium, pp. 2510–2515.
- (2018) Classifying 3D objects in lidar point clouds with a back-propagation neural network. Human-centric Computing and Information Sciences 8 (1), pp. 1–12.
- (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- (2019) Frustum ConvNet: sliding frustums to aggregate local point-wise features for amodal 3D object detection. In IEEE International Conference on Intelligent Robots and Systems, pp. 1742–1749.
- (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
- (2019) 3D object detection using scale invariant and feature reweighting networks. In AAAI Conference on Artificial Intelligence, pp. 9267–9274.
- (2018) VoxelNet: end-to-end learning for point cloud based 3D object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.