Semantic scene understanding from 3D LiDAR point clouds is one of the fundamental building blocks for robust, real time 3D object detection in autonomous driving. LiDAR sensors provide range measurements by sampling the scene at fixed vertical and horizontal angular resolutions, which differs from the 3D point clouds used in indoor scenes, where points are sampled densely and uniformly on all sides. In addition, LiDAR sensors are robust under almost all lighting conditions and in foggy weather. As a result, 3D LiDAR point clouds have attracted significant research attention recently.
The major difficulty in processing LiDAR data is that the sensors provide non-Euclidean data in the form of point clouds with about 100k points over 360°, which poses a great challenge for object detection and segmentation tasks and incurs a high computational cost. Of the two tasks, 3D object detection and 3D semantic segmentation, the segmentation task gives dense predictions for scene understanding. Previous work such as SqueezeSeg proposed a lightweight convolutional neural network (CNN) backbone with a conditional random field (CRF) to achieve real time performance, but it still leaves room for accuracy improvements. Higher accuracy demands higher computational cost, which hurts real time performance.
To achieve high accuracy within a real time constraint, this paper proposes an instance-level segmentation method, denoted as RangeSeg. RangeSeg exploits the uneven range distribution of 3D LiDAR point clouds in autonomous driving, as shown in Fig. 1: far and small objects appear at the top of the range image, while large and near objects appear at the bottom. The method achieves high accuracy on these far and small objects by applying a heavy decoder only to the top of the image, and meets the real time demand by applying a light decoder to the whole image with a shared backbone encoder. The semantic segmentation results are further clustered into instances by density-based spatial clustering of applications with noise (DBSCAN) with a resolution weighted distance. The results show that the proposed method improves the detection of far and small objects while running in real time on an NVIDIA® JETSON AGX Xavier.
The rest of the paper is organized as follows. Section II first introduces the related works. Section III presents our approach. Section IV shows the experimental results and comparisons with other methods. Finally, we conclude this paper in Section V.
II Related Works
II-A Data Representation of 3D LiDAR Point Clouds
3D LiDAR point clouds have an unstructured data format. To handle such unstructured data for 3D outdoor scene understanding, one approach is to transform the point clouds into a structured format so that standard convolutional operations can be used. The other approach is to define new operations directly on the unstructured data. Current data transformations fall mainly into two types: 3D voxel grids and 2D projections. The 3D voxel grid method transforms the data into a regular 3D grid so that 3D convolutions can be applied to extract high-order feature representations. However, 3D voxel grids are sparse because point clouds are inherently sparse, which wastes much computation on empty cells. 2D projections such as the birds’ eye view (BEV) and the range image encoding are much more compact. BEV is sparse but preserves the size of objects. The range image encoding is dense but distorts objects. Graph-based neural networks can be applied directly to point clouds, but recent approaches have only been applied to 3D indoor scene understanding, in which the point clouds are uniformly sampled on surfaces with about 1k points.
II-B 3D Object Detection of Point Clouds
VoxelNet encodes point clouds into hand-crafted 3D voxel grids with a voxel feature encoding layer to extract high-order features. It uses 3D CNN layers to aggregate the voxel-wise features with expanded receptive fields. However, the 3D CNN has a high computational cost even for such a sparse representation, and its speed is limited to four frames per second (fps). PIXOR uses a 2D birds’ eye view representation to build a real time pipeline, but performs poorly on small objects. PointRCNN is the first two-stage 3D detector that uses only 3D LiDAR point clouds: it applies PointNet++ to the unstructured data to obtain preliminary bounding boxes and then a simple multi-layer neural network to obtain the final prediction. This approach performs well on small objects such as pedestrians and cyclists, but the two-stage pipeline makes it unsuitable for real time applications.
II-C 3D Semantic Segmentation of Point Clouds
SqueezeSeg, PointSeg and SqueezeSegV2 all use a lightweight SqueezeNet as their backbone with range images as input for the semantic segmentation task. They use different post-processing methods to improve the accuracy. SqueezeSeg and SqueezeSegV2 use a recurrent CRF module to sharpen blurry boundaries. SqueezeSegV2 additionally tackles the inherent problem of missing points in range images with a context aggregation module. PointSeg uses a squeeze re-weighting layer and an enlargement layer to achieve better performance. RIU-Net directly applies U-Net to range images with focal loss. These works achieve real time performance thanks to their lightweight backbones but have low accuracy. Moreover, none of them addresses instance-level segmentation.
II-D Graph Neural Networks of Point Clouds
PointNet proposed an end-to-end pipeline to learn point-wise features directly from point clouds. The follow-up work improves the performance by extracting local features. Furthermore, DGCNN and PointCNN define new convolution operations on point clouds. They succeed in 3D indoor scene understanding (1k points). However, an outdoor scene contains about 100k points, so training the above networks on it demands a large amount of memory and computation.
II-E 3D Instance Segmentation of Point Clouds
A recent paper proposed a pipeline with a feature learning network and a stacked hourglass network for instance segmentation of outdoor LiDAR point clouds, which could help localize small and far-away objects. However, the high complexity of the model makes it hard to run in real time: the elapsed time on a TESLA V100 GPU was 300 ms.
III RangeSeg Framework
In this paper, we propose a real time pipeline, RangeSeg, that obtains accurate instance-level segmentation results by exploiting 2D range image representations of LiDAR point clouds. Our range-aware framework attains fast and accurate semantic segmentation by using a heavy decoder to predict far and small objects and a light decoder to reduce the complexity of general predictions. Next, we use DBSCAN with the proposed weighted distance function to obtain instance-level segmentation results. An overview of the whole pipeline is shown in Fig. 2. In the following subsections, we introduce our input representation, the range-aware network architecture, and the use of DBSCAN as a post-processing step to obtain instance-level segmentation results.
III-A Input Representation
3D point clouds are unstructured data, whereas standard neural networks perform discrete convolution operations on grids. Thus, several methods have been proposed to encode point clouds into a suitable format. 3D voxel grids are one possible solution. However, 3D convolution operations are computationally intensive, and the sparse voxel grids lead to many unnecessary computations. The 2D birds’ eye view is another solution, but it leads to information loss. Instead, we use range images to represent the point clouds. The range image is a dense, image-like 2D data format without information loss, and it does not need hand-crafted parameters during the format conversion.
The point clouds are converted to range images as follows. First, each point is projected onto a spherical coordinate system with a grid-based representation, as given in (1), in terms of its azimuth angle and elevation angle, the angular resolutions used for the discretization, and the resulting position on the 2D spherical grid. Applying (1) to each point yields a 3D tensor. In this paper, we use the KITTI dataset [6, 5] collected by a Velodyne HDL-64E S3, which has 64 laser beams and therefore gives 64 rows. The horizontal angular resolution is 0.1728° and the annotations are only available in the 90° front view, which together set the number of columns. K is the number of features, which is 3 in this paper (K = 3): intensity, range measurement and occupancy. The occupancy channel indicates whether a grid cell contains a point. A visualization of the range image representation is shown in Fig. 2.
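As an illustration of the conversion described above, the following Python sketch bins each point by its azimuth and elevation angles and fills the three channels (intensity, range, occupancy). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `to_range_image`, the width of 512 columns, the vertical field of view (+2° to -24.8°) and the rule of keeping the closest return per cell are illustrative choices that do not come from the text.

```python
import math


def to_range_image(points, height=64, width=512,
                   h_fov_deg=90.0, v_up_deg=2.0, v_down_deg=-24.8):
    """Project (x, y, z, intensity) points onto a height x width x 3 grid.

    Channels are [intensity, range, occupancy]. The width and the vertical
    field of view are assumptions made for this sketch only.
    """
    image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    half_h_fov = math.radians(h_fov_deg) / 2
    d_azimuth = math.radians(h_fov_deg) / width
    d_elevation = math.radians(v_up_deg - v_down_deg) / height

    for x, y, z, intensity in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue
        azimuth = math.atan2(y, x)        # 0 is straight ahead
        elevation = math.asin(z / rng)    # positive is upward
        # Row 0 is the highest beam, column 0 the left edge of the front view.
        col = int((half_h_fov - azimuth) // d_azimuth)
        row = int((math.radians(v_up_deg) - elevation) // d_elevation)
        if 0 <= row < height and 0 <= col < width:
            cell = image[row][col]
            # Assumption: keep the closest return when points share a cell.
            if cell[2] == 0.0 or rng < cell[1]:
                image[row][col] = [intensity, rng, 1.0]
    return image


if __name__ == "__main__":
    img = to_range_image([(10.0, 0.0, 0.0, 0.5)])
    print(img[4][256])  # the single point straight ahead of the sensor
```

With this row convention, upper beams map to the top rows of the image, which matches the observation that far objects appear at the top of the range image.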
III-B Range Distribution on Range Images
To obtain a 360° view of the scene, LiDAR sensors are mostly placed on top of vehicles. As a consequence, the vertical view is asymmetric: only a few laser beams are emitted toward the upper part of the scene, and the lower lasers can only sample objects at a much shorter range. Fig. 1 shows the laser IDs that can detect “car” objects at different ranges. As shown in the figure, only the top lasers can detect objects over the whole range, especially far-away ones. Moreover, far objects occupy only a few pixels in range images because of the lower point density at larger distances. Based on this observation, we propose a range-aware framework that uses different decoders for different parts of the range image to obtain both higher accuracy and a speedup.
III-C Network Architecture
RangeSeg, as shown in Fig. 3, is a fully convolutional network with one backbone encoder, two decoders, a fusion layer and a simple post-processing step for the final instance-level segmentation prediction. The outputs of the network have the same size as the inputs to obtain more accurate segmentation results.
For the encoder, this paper tests two different backbones. The first is a modified version of ResNet-18. The original ResNet-18 performs two down-sampling steps at the beginning of the network; these are removed in this paper so that computation runs on feature maps at the original resolution for better accuracy, as shown in Fig. 4 (a). The other is based on LaserNet, which performs 3D object detection on range images. It also uses residual blocks to achieve better performance with deeper layers and fewer channels. Although the LaserNet backbone is much deeper, it has fewer kernel parameters because of the fewer channels, as shown in Fig. 4 (b).
For the target KITTI dataset, the input representation is the range image tensor described in Section III-A. Given the shape of this tensor, we only perform the stride and downsample operations on the vertical dimension to minimize information loss.
The range-aware decoders, one heavy and one light, share the same feature maps from the backbone network and exploit the different range distributions within range images. The heavy decoder only predicts results for the top of the image, where small and far objects are located and deeper aggregation is needed for accurate predictions. The heavy decoder uses ‘deep’ skip connections inspired by DLA, in which high-level feature maps are repeatedly upsampled and aggregated with lower-level ones. In this paper, instead of the tree-structured DLA, a much denser concatenation is used, as shown in Fig. 5 (b), since this decoder only processes the top rows of the range maps. The light decoder predicts results for the whole image at low computational cost. It uses ‘shallow’ skip connections like U-Net that concatenate the feature maps only once, as shown in Fig. 5 (a). It has a low computational cost while preserving the information of different range features.
The predictions from the light and heavy decoders are fused for the final results. However, the two predictions have different spatial sizes, which prevents a direct fusion. For a smooth fusion, the whole output of the heavy decoder is concatenated with the top part of the light decoder output that has the same size. The channel number of the concatenated result is then changed by 1×1 convolutions to match that of the bottom part from the light decoder, as shown in Fig. 3. Finally, these two parts are concatenated along the height dimension for the final results. To obtain predictions at the same resolution as the input, a transposed convolution layer is applied after the fusion layer. In our experiments, we apply the heavy decoder to the top 16 rows of the range images based on observations of the KITTI dataset.
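The fusion steps above can be sketched with plain Python lists, where a feature map is stored as `[channels][rows][columns]`. This is only a shape-level illustration under assumptions: the helper names `conv1x1` and `fuse`, the fixed example weights, and the omission of the final transposed convolution are choices made for the sketch and are not taken from the paper.

```python
def conv1x1(x, weight):
    """1x1 convolution: x is [C_in][H][W], weight is [C_out][C_in]."""
    rows, cols = len(x[0]), len(x[0][0])
    return [[[sum(w_k * x[k][i][j] for k, w_k in enumerate(w_out))
              for j in range(cols)]
             for i in range(rows)]
            for w_out in weight]


def fuse(heavy, light, weight, top_rows):
    """Fuse the heavy decoder output (top rows only) with the light one.

    heavy:  [C_h][top_rows][W]
    light:  [C_l][H][W]
    weight: [C_l][C_h + C_l], a hypothetical 1x1 kernel
    """
    light_top = [channel[:top_rows] for channel in light]
    light_bottom = [channel[top_rows:] for channel in light]
    stacked = heavy + light_top          # concatenate along channels
    mixed = conv1x1(stacked, weight)     # back to C_l channels
    # Concatenate the fused top part and the light bottom part along height.
    return [top + bottom for top, bottom in zip(mixed, light_bottom)]


if __name__ == "__main__":
    heavy = [[[1.0] * 3 for _ in range(2)]]                    # 1 x 2 x 3
    light = [[[2.0] * 3 for _ in range(4)] for _ in range(2)]  # 2 x 4 x 3
    weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
    fused = fuse(heavy, light, weight, top_rows=2)
    print(len(fused), len(fused[0]), len(fused[0][0]))  # 2 4 3
```

In a real network the 1×1 kernel is learned and the tensors carry a batch dimension; the sketch only shows why the channel counts must match before the two parts are stacked along the height.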
III-D Training Strategy
RangeSeg is trained with the common multi-class cross-entropy loss and the Lovász-softmax loss. The cross-entropy loss serves as the pixel-wise classification loss between the classification output and the target. However, the cross-entropy loss is not directly related to intersection over union (IoU). Therefore, the Lovász-softmax loss, which is a Lovász extension of the Jaccard index and is computed per class from a surrogate function of the Jaccard loss, is applied to regularize the network. The Lovász-softmax loss is IoU-aware and helps with data imbalance. The loss function is therefore the cross-entropy loss plus the Lovász-softmax loss scaled by a Lovász weighted term, and the loss of the prediction from the fusion layer in Fig. 3 takes this form.
The range-aware decoder loss applies the same loss function directly to the results of the heavy and light decoders and combines them as a regularization term, instead of computing the loss on the results of the fusion layer alone.
The total network loss is the fusion loss plus the range-aware decoder loss scaled by a range-aware weighted term. In our experiments, we set both weighted terms to 1.
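To make the structure of the loss concrete, the sketch below combines a cross-entropy term with a Lovász-softmax term and then adds the decoder losses as a weighted regularizer. It is a minimal, framework-free sketch under assumptions: the parameter names `lovasz_weight` and `range_weight` are hypothetical stand-ins for the two weighted terms, the Lovász-softmax part follows the commonly used formulation (errors sorted in decreasing order, multiplied by the gradient of the Jaccard extension, averaged over the classes present), and each prediction is a flat list of per-pixel softmax scores.

```python
import math


def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs[i][c] is the score of class c."""
    eps = 1e-12
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def lovasz_grad(fg_sorted):
    """Gradient of the Lovász extension of the Jaccard loss."""
    total_fg = sum(fg_sorted)
    grad, cum_fg, cum_bg, previous = [], 0, 0, 0.0
    for fg in fg_sorted:
        cum_fg += fg
        cum_bg += 1 - fg
        jaccard = 1.0 - (total_fg - cum_fg) / (total_fg + cum_bg)
        grad.append(jaccard - previous)
        previous = jaccard
    return grad


def lovasz_softmax(probs, labels, num_classes):
    """Lovász-softmax averaged over the classes present in labels."""
    losses = []
    for c in range(num_classes):
        fg = [1 if y == c else 0 for y in labels]
        if sum(fg) == 0:
            continue  # skip classes that do not occur
        errors = [abs(f - p[c]) for f, p in zip(fg, probs)]
        order = sorted(range(len(errors)), key=lambda i: -errors[i])
        grad = lovasz_grad([fg[i] for i in order])
        losses.append(sum(errors[i] * g for i, g in zip(order, grad)))
    return sum(losses) / len(losses) if losses else 0.0


def segmentation_loss(probs, labels, num_classes, lovasz_weight=1.0):
    """Cross-entropy plus weighted Lovász-softmax."""
    return (cross_entropy(probs, labels)
            + lovasz_weight * lovasz_softmax(probs, labels, num_classes))


def total_loss(fused, heavy, light, num_classes,
               lovasz_weight=1.0, range_weight=1.0):
    """fused, heavy and light are (probs, labels) pairs."""
    fusion = segmentation_loss(*fused, num_classes, lovasz_weight)
    range_aware = (segmentation_loss(*heavy, num_classes, lovasz_weight)
                   + segmentation_loss(*light, num_classes, lovasz_weight))
    return fusion + range_weight * range_aware
```

Setting `range_weight` to 0 reduces the total to the fusion loss alone, which corresponds to training without the range-aware regularization.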
We use the super-convergence strategy as our learning rate schedule, which uses a higher and dynamically varying learning rate for fast convergence.
III-E Resolution Weighted Distance Function for Instance Segmentation
To the best of the authors’ knowledge, no prior work has used 3D information to segment instances. Unlike objects in 2D RGB images, 3D objects do not overlap in 3D space, which makes it much easier to separate different objects. Therefore, this paper uses density-based spatial clustering of applications with noise (DBSCAN) for instance clustering, which does not require a pre-defined number of clusters; the clusters are defined by their density. In this paper, DBSCAN takes the points labeled as objects by the semantic segmentation as input. The clustering is applied once to all object points, which avoids computation on background points and multiple iterations. For the DBSCAN distance function, instead of directly using the vanilla distance function, we propose a weighted distance function that accounts for the resolution differences of the LiDAR data. The vertical resolution is about twice as coarse as the horizontal one, which leads to sparser samples vertically than horizontally. We therefore define a resolution weighted distance function, referred to as (9). This distance function also scales up the X and Y coordinates instead of only scaling down the Z coordinate, because scaling down Z alone cannot help the segmentation when the distance is dominated by the horizontal components.
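The clustering step can be illustrated with a small, self-contained DBSCAN that accepts a custom distance function. This is a sketch under assumptions, not the paper's implementation: the per-axis scale factors `(2.0, 2.0, 1.0)` are hypothetical placeholders (the actual weights of (9) are not reproduced here), the function names are invented for illustration, and the neighbor search is a brute-force scan that is only suitable for small inputs.

```python
import math


def resolution_weighted_distance(p, q, scale=(2.0, 2.0, 1.0)):
    """Euclidean distance after scaling each axis (hypothetical factors)."""
    return math.sqrt(sum((s * (a - b)) ** 2 for s, a, b in zip(scale, p, q)))


def dbscan(points, eps, min_pts, dist=resolution_weighted_distance):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force region query; the point itself is included.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)   # j is a core point, keep expanding
        cluster += 1
    return labels


if __name__ == "__main__":
    cars = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0),
            (10.0, 0.0, 0.0), (10.1, 0.0, 0.0), (10.0, 0.1, 0.0), (10.1, 0.1, 0.0),
            (5.0, 5.0, 5.0)]
    print(dbscan(cars, eps=0.7, min_pts=3))
    # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Passing the distance as a parameter keeps the clustering logic unchanged when comparing the plain Euclidean distance, Z-only scaling and a resolution weighted variant.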
IV Experimental Results
We evaluate our model on the KITTI dataset and empirically showcase the strengths and weaknesses of the proposed approach. First, we compare vanilla segmentation frameworks with our range-aware framework using different backbones on the KITTI 3D object detection benchmark. We show that RangeSeg performs better in accuracy and inference speed. Second, a comparison of different distance functions for DBSCAN shows that the proposed distance function improves the accuracy. Third, we compare RangeSeg with related works on the KITTI raw data, where the accuracy improves considerably at the same inference speed. Fourth, we implement RangeSeg on an NVIDIA® JETSON AGX Xavier, where it achieves real time performance even on such a small embedded system. Finally, the experimental results on synthetic foggy KITTI data show that our approach remains robust in foggy weather. The evaluation metric used in this paper is the mean class IoU (mIoU).
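For reference, the mean class IoU can be computed as in the short sketch below. The function name `mean_iou` is illustrative, and two details are assumptions rather than statements from the paper: classes that appear in neither the prediction nor the ground truth are skipped, and the caller decides whether the background class is part of `num_classes`.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class intersection over union for flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:                      # assumption: ignore absent classes
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```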
IV-A KITTI 3D Object Detection Benchmark
We encode the front 90° view into a tensor with 3 features: intensity, range and occupancy. The intensity is already within [0, 1], and the occupancy is either 0 or 1. Therefore, we only normalize the range feature to be within [0, 1]. In addition, a random horizontal flip with probability 0.5 is applied as augmentation. For the ground truth, we treat the points inside the bounding boxes as objects and the remaining points as background, owing to the limitations of the KITTI annotations.
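The preprocessing described above can be sketched as follows, with a range image stored as `[rows][columns][intensity, range, occupancy]`. The helper names and the `max_range` divisor are assumptions for illustration; the paper does not state how the range channel is scaled.

```python
import random


def normalize_range(image, max_range):
    """Scale the range channel (index 1) into [0, 1]; max_range is assumed."""
    return [[[intensity, min(rng / max_range, 1.0), occupancy]
             for intensity, rng, occupancy in row]
            for row in image]


def random_horizontal_flip(image, labels, p=0.5, rng=random):
    """Flip the image and its label map left to right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image], [row[::-1] for row in labels]
    return image, labels


if __name__ == "__main__":
    image = [[[0.5, 40.0, 1.0], [0.0, 0.0, 0.0]]]
    labels = [[1, 0]]
    image = normalize_range(image, max_range=80.0)
    print(random_horizontal_flip(image, labels, p=1.0))
    # ([[[0.0, 0.0, 0.0], [0.5, 0.5, 1.0]]], [[0, 1]])
```

The label map is flipped together with the image so that each pixel keeps its ground-truth class.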
Table I summarizes the comparison between the vanilla segmentation frameworks and the range-aware framework. For the ResNet-based backbone, the range-aware ResNet18 leads ResNet18-UNet by 3.9%. The improvements are more significant on small objects such as pedestrians and cyclists, at 3.2% and 6.5%, respectively. The range-aware ResNet18 even outperforms ResNet34-UNet by 0.3%. For the LaserNet-based backbone, the range-aware LaserNet leads LaserNet-DLA by 1.9%. Moreover, the results on the top 16 rows of the range images show that RangeSeg improves the detection of far and small objects, since far objects lie only at the top of the images.
Table I also shows the inference speed on the Nvidia TITAN Xp. The range-aware ResNet18 improves mIoU by 3.9% over ResNet18-UNet with only a 27% loss in fps, and by 0.3% over ResNet34-UNet with an extra 7% speedup. The range-aware LaserNet improves mIoU by 1.9% over LaserNet-DLA with an extra 42% speedup: the heavy decoder improves mIoU at the cost of computation overhead, and the low-complexity light decoder compensates for this overhead.
Table II shows the ablation study on several design parameters. For the loss function, the combination of cross-entropy loss (xent) and Lovász-softmax loss gives the best results, because the Lovász-softmax loss alone is unstable even though it directly optimizes mIoU. In addition, data normalization and augmentation both help, improving the accuracy by 3.4%. Further regularization with the range-aware loss raises the accuracy by 3.9%.
Impact of the Weighted Terms
This subsection presents the ablation study of the two weighted terms in the loss. For the range-aware weighted term, we use the range-aware LaserNet with the same training strategy to see how it affects the accuracy, training the model with different values of this term while the Lovász weighted term is held fixed. As shown in Table III, the accuracy on large objects is not affected by the range loss. However, the range loss clearly improves the accuracy on small objects compared with the baseline, and the cyclist accuracy in particular improves when compared with no range loss. Similarly, Table IV shows the tuning results for the Lovász weighted term with the range-aware weighted term held fixed, where the cyclist accuracy again improves over the baseline.
IV-B Instance-Level Segmentation
For the DBSCAN parameters, we set the minimum number of points required to form a cluster to twice the minimum number of points of an object, and the neighborhood radius to half of the average object size. From the data in Fig. 1, the minimum number of points is about 3.5. Thus, we choose 7 for the former and 0.7 for the latter. We have tested three distance functions: (A) the original Euclidean distance function, (B) scaling of the Z coordinate only, and (C) the proposed function in (9).
Table V shows the instance-level segmentation evaluation results. The Z-only scaling distance function (B) is insufficient for separating instances; its performance is almost the same as that of the original function (A). In contrast, our proposed weighted distance function (C) gives a 2% mIoU improvement. This post-processing step takes only 16.48 ms on average, with a standard deviation of 24.13 ms, on an i9-7900X. The visualization results of RangeSeg are shown in Fig. 6. The proposed method predicts small and far objects more accurately than the previous RIU-Net.
IV-C Result Comparison on the KITTI Raw Data
Table VI compares RangeSeg with other related works on the KITTI raw data. We follow the data split of SqueezeSeg for the KITTI raw data. We set the two hyperparameters to 0.05 and 0.5, respectively, which is the best setting on the KITTI raw data. RangeSeg outperforms other state-of-the-art methods. In more detail, RangeSeg shows significant improvements on small objects. RangeSeg also improves mIoU by 18.6% over PointSeg at almost the same inference speed.
IV-D Real Time Implementation on Nvidia AGX Xavier
The range-aware LaserNet is implemented on an NVIDIA® JETSON AGX Xavier for embedded system applications. In our experiment, we use TensorRT FP16 to optimize our framework. The processing times of data encoding, the model and post-processing are summarized in Table VII. The resulting frame rate is about 19 Hz with TensorRT FP16 optimization, which is much higher than the 10 Hz capture frequency of the LiDAR sensor in the KITTI dataset.
IV-E Synthetic Foggy KITTI Dataset
Foggy Dataset Details
An autonomous driving application should be robust in every environment. In this experiment, we evaluate our range-aware LaserNet on the synthetic foggy KITTI dataset with different visibilities. The visibility is defined as the maximum range at which objects can be seen by a human. However, the range at which objects can be detected by LiDAR is half of the visibility, since the points are detected from reflected pulses. The experiments use visibilities of 70 m and 40 m, which are much more extreme than the worst visibility of 150 m in SFSU. Foggy weather mostly produces false-alarm measurements at ranges within 2 meters, and low-intensity reflections for all points. Fig. 7 shows the birds’ eye view of the LiDAR point clouds in different weather conditions. The figure shows a blind zone caused by the roof-mounted LiDAR. The points with wrong range measurements form an inner circle, as in Fig. 7(b). Therefore, a simple defog method is applied that removes the points with a range shorter than 2 meters.
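The simple defog step amounts to a range threshold on the raw points, as in the sketch below. The function name `defog` and the point layout `(x, y, z, intensity)` are assumptions for illustration; the 2 meter threshold is the value stated above.

```python
import math


def defog(points, min_range=2.0):
    """Drop returns closer than min_range meters to the sensor."""
    return [p for p in points
            if math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) >= min_range]


if __name__ == "__main__":
    scan = [(1.0, 0.0, 0.0, 0.1),   # fog return inside the 2 m circle
            (3.0, 0.0, 0.0, 0.2)]   # genuine return
    print(defog(scan))  # [(3.0, 0.0, 0.0, 0.2)]
```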
Table VIII shows the evaluation results. When the model is tested directly on the foggy data, the accuracy degrades significantly to 32.0% mIoU, and the simple defogging (A) only gives an 8.1% improvement. The reason is that the distribution of the intensity channel in foggy weather differs from that in clear weather. Therefore, we train a new model with only 2 channels, without intensity (denoted as 2 channels). After defogging on the 2 channels model (B), the accuracy is 52.9%. Combining (A) and (B) gives a robust accuracy of 54.1% mIoU.
Next, we use foggy data as data augmentation. The 3 channels model achieves a higher mIoU with this augmentation. The mIoU exceeds 60% when the model is augmented and tested at the same visibility, either 70 m or 40 m. In addition, the model trained at a visibility of 70 m achieves 57.2% mIoU at a visibility of 40 m, while the model trained at a visibility of 40 m achieves 59.8% mIoU at a visibility of 70 m. This indicates that LiDAR sensors are robust even under different weather conditions.
V Conclusion
By exploiting the LiDAR data distribution in autonomous driving applications, this paper proposes a range-aware instance segmentation network that achieves high accuracy with a heavy decoder and high speed with a light decoder. The heavy decoder is applied to the top of the range image, where far and small objects lie, for accurate detection. The light decoder is applied to the whole range image for low-complexity computation. The proposed weighted distance metric helps segment instances with a simple post-processing step. Compared with previous works, our range-aware framework is simple, efficient and fast, and is well suited to autonomous driving in different weather conditions. Since we have only conducted experiments on two backbones, further experiments on state-of-the-art models are a potential area for improvement. Applying this range-aware framework to other LiDAR tasks is another interesting direction for future research.
-  (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In , pp. 4413–4421. Cited by: §III-D.
-  (2019) RIU-net: embarrassingly simple semantic segmentation of 3d lidar point cloud. arXiv preprint arXiv:1905.08748. Cited by: §II-C, TABLE VI.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §I, §II-C.
-  (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: §I, §III-E.
-  (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: §III-A, §IV-C.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §III-A, §IV.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-C.
-  (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §II-C.
-  (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §II-C.
-  (2018) Pointcnn: convolution on x-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830. Cited by: §II-D.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §II-C.
-  (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12677–12686. Cited by: §III-C.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §II-D.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §II-B, §II-D.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §II-C, §III-C.
-  (2018-09) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: §IV-E.
-  (2020) Mitigating effects of uniform fog on spad lidars. IEEE Sensors Letters 4 (9), pp. 1–4. Cited by: §IV-E, §IV.
-  (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §II-B.
-  (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100612. Cited by: §III-D.
-  (2018) Pointseg: real-time semantic segmentation based on 3d lidar point cloud. arXiv preprint arXiv:1807.06288. Cited by: §II-C, TABLE VI.
-  (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146. Cited by: §II-D.
-  (2018) Squeezeseg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893. Cited by: §I, §II-C, §IV-C, TABLE VI.
-  (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4376–4382. Cited by: §II-C, TABLE VI.
-  (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §II-B.
-  (2018) Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412. Cited by: §III-C.
-  (2020) Instance segmentation of lidar point clouds. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9448–9455. Cited by: §II-E.
-  (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §II-B.