Python / Tensorflow implementation of LU-Net
We propose LU-Net -- for LiDAR U-Net -- a new method for the semantic segmentation of a 3D LiDAR point cloud. Instead of applying a global 3D segmentation method such as PointNet, we propose an end-to-end architecture for LiDAR point cloud semantic segmentation that efficiently solves the problem as an image processing problem. We first extract high-level 3D features for each point given its 3D neighbors. Then, these features are projected into a 2D multichannel range-image by considering the topology of the sensor. Thanks to these learned features and this projection, we can finally perform the segmentation using a simple U-Net segmentation network, which performs very well while being very efficient. In this way, we can exploit both the 3D nature of the data and the specificity of the LiDAR sensor. As our experiments show, this approach outperforms the state of the art by a large margin on the KITTI dataset. Moreover, it operates at 24 fps on a single GPU, above the acquisition rate of common LiDAR sensors, which makes it suitable for real-time applications.
The recent interest in autonomous systems has motivated many computer vision works over the past years. Accurate perception models are a crucial step towards system automation, especially for mobile robots and autonomous driving. Modern systems are equipped with both optical cameras and 3D sensors, mostly LiDAR sensors. These sensors are now essential components of perception systems as they enable direct space measurements, providing an accurate 3D representation of the scene. However, for most automation-related tasks, raw LiDAR point clouds require further processing in order to be used. In particular, point clouds with accurate semantic segmentation provide a higher level of representation of the scene that can be used in various applications such as obstacle avoidance, road inventory, or object manipulation.
This paper focuses on the semantic segmentation of 3D LiDAR point clouds. Given a point cloud acquired with a LiDAR sensor, we aim at estimating a label for each point that belongs to objects of interest in urban environments (such as cars, pedestrians and cyclists). The traditional pipelines used to tackle this problem consist of ground removal, clustering of the remaining structures, and classification based on handcrafted features extracted from each cluster [8, 6]. The segmentation can be improved with variational models. These methods are often hard to tune, as handcrafted features usually require tuning many parameters, which are likely to be data-dependent and therefore hard to use in a general scenario. Finally, although the use of regularization can lead to visual and qualitative improvements, it often leads to a large increase in computational time.
Recently, deep-learning approaches have been proposed to overcome the difficulty of tuning handcrafted features. This has become possible with the arrival of large annotated 3D datasets such as the KITTI 3D object detection dataset. Many methods have been proposed to segment the point cloud by operating directly in 3D or on a voxel-based representation of the point cloud. However, these methods either require very high computational power or cannot process the number of points acquired in a single rotation of the sensor. Even more recently, faster approaches have been proposed [20, 19]. They rely on a 2D representation of the point cloud, called range-image, which can be used as the input of a convolutional neural network. Thus, the processing time as well as the required computational power can be kept low, as these range-images are low-resolution, multichannel images. Unfortunately, the choice of input channels, as well as the difficulty of processing geo-spatial information using only 2D convolutions, have limited the results of such approaches, which have not yet achieved good enough scores for practical use, especially on small object classes such as cyclists or pedestrians.
In this paper, we propose LU-Net—for LiDAR U-Net—an end-to-end model for the semantic segmentation of 3D LiDAR point clouds. LU-Net benefits from a high-level 3D feature extraction module that embeds 3D local features into 2D range-images, which are then efficiently used in a U-Net segmentation network. We demonstrate that, besides being a simple and efficient method, LU-Net largely outperforms state-of-the-art range-image methods, as shown in Figure 1.
The rest of the paper is organized as follows: We first discuss previous works on point cloud semantic segmentation, including methods designed for processing LiDAR data. We then detail our approach, and evaluate it on the KITTI dataset against state-of-the-art methods and discuss the results.
In this section, we discuss previous works on image semantic segmentation as well as on 3D point cloud semantic segmentation.
Semantic segmentation of images has been the subject of many works in the past years. Recently, deep learning methods have largely outperformed previous ones. The method presented in  was the first to propose an accurate end-to-end network for semantic segmentation. It is based on an encoder in which each scale is used to compute the final segmentation. Only a few months later, the U-Net architecture  was proposed for the semantic segmentation of medical images. This method is an encoder-decoder able to provide highly precise segmentations. These two methods have largely influenced recent works such as DeeplabV3+ , which uses dilated convolutional layers and spatial pyramid pooling modules in an encoder-decoder structure to improve the quality of the prediction. Other approaches explore multi-scale architectures to produce and fuse segmentations performed at different scales [14, 22]. Most of these methods are able to produce very accurate results on various types of images (medical, outdoor, indoor). The survey  of CNN methods for semantic segmentation provides a deep analysis of some recent techniques, and demonstrates that a combination of various components would most likely improve segmentation results on wider classes of objects.
As mentioned above, the first approaches for point cloud semantic segmentation relied on heavy pipelines composed of many successive steps, such as ground removal, point cloud clustering and feature extraction, as presented in [8, 6]. However, these methods often require many parameters and are therefore hard to tune. In , a deep-learning approach is used to extract features from the point cloud, and the segmentation is then obtained using a variational regularization. Another approach, presented in , proposes to directly feed the raw 3D LiDAR point cloud to a network composed of a succession of fully-connected layers to classify or segment the point cloud. However, due to the heavy structure of this architecture, it is only suitable for small point clouds. Moreover, processing 3D data often increases the computational time due to the dimension of the data (number of points, number of voxels) and the absence of spatial correlation. To overcome these limitations, the methods presented in  and  propose to represent the point cloud as a voxel grid, which can be used as the input of a 3D CNN. These methods achieve satisfying results for 3D detection. However, semantic segmentation would require a voxel grid of very high resolution, which would increase the computational cost as well as the memory usage.
Recently, SqueezeSeg , a novel approach for the semantic segmentation of a LiDAR point cloud represented as a spherical range-image, was proposed. This representation makes it possible to perform the segmentation using simple 2D convolutions, which lowers the computational cost while keeping good accuracy. The architecture is derived from the SqueezeNet image segmentation method . The intermediate layers are "fire layers", i.e. layers made of one squeeze module and one expansion module. Later on, the same authors improved this method in  by adding a context aggregation module and by considering focal loss and batch normalization to improve the quality of the segmentation. A similar range-image approach was proposed in , where an Atrous Spatial Pyramid Pooling module  and a squeeze reweighting layer  are added. Finally, in , the authors propose to input a range-image directly to the U-Net architecture described in . This method achieves results that are comparable to the state of the art of range-image methods with a much simpler and more intuitive architecture. All these range-image methods achieve real-time computation. However, their results often lack accuracy, which limits their use in real scenarios.
In the next section, we propose LU-Net: an end-to-end model for the accurate semantic segmentation of point clouds represented as range-images. We will show that it outperforms all other range-image methods by a large margin on the KITTI dataset, while offering a robust methodology for bridging 3D LiDAR point cloud processing and 2D image processing.
In this section, we present our end-to-end model for the semantic segmentation of LiDAR point clouds inspired by the U-Net architecture . An overview of the proposed method is available in Figure 2.
As mentioned above, processing raw LiDAR point clouds is computationally expensive. Indeed, these 3D point clouds are stored as unorganized lists of Cartesian coordinates, so processing such data often involves preprocessing steps to bring spatial structure to the data. To that end, alternative representations, such as voxel grids or 2D pinhole projections, are sometimes used, as discussed in the Related Work section. However, a high resolution is often needed in order to represent enough details, which involves heavy memory costs. Modern LiDAR sensors acquire 3D points following a strict sensor topology, from which we can build a dense 2D image, the so-called range-image. The range-image offers a lightweight, structured and dense representation of the point cloud.
Whenever the raw LiDAR data (with beam numbers) is not available, the point cloud has to be processed to extract the corresponding range-image. As 3D LiDAR sensors acquire points over a small number of scan lines with quasi-uniform angular steps between samples, the acquisition follows a grid pattern that can be used to create a 2D image. Indeed, each point is defined by two angles and a depth, with fixed steps between two consecutive angular positions. Each point of the LiDAR point cloud can thus be mapped to the coordinates of a 2D range-image, where each channel represents a modality of the measured point. A range-image is presented in Figure 3.
In perfect conditions, the resulting image is completely dense, without any missing data. However, due to the nature of the acquisition, some measurements are considered invalid by the sensor and lead to empty pixels (no-data). This happens when the laser beam is highly deviated (e.g. when going through a transparent material) or when it does not produce any echo (e.g. when the beam points towards the sky). We propose to identify such pixels using a binary mask that flags empty pixels. The analysis of multi-echo LiDAR scans is left for future work.
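As a rough sketch (not the authors' code), the spherical projection and validity mask described above could be implemented as follows. The image resolution and the vertical field of view are assumptions, modeled on the Velodyne HDL-64 sensor used in KITTI; the channel layout (x, y, z, reflectance, depth) follows the 5-channel convention discussed below.

```python
import numpy as np

def pointcloud_to_range_image(points, reflectance, h=64, w=512,
                              fov_up=np.deg2rad(2.0), fov_down=np.deg2rad(-24.8)):
    """Project an (N, 3) LiDAR point cloud onto an (h, w, 5) range-image.

    Channels: x, y, z, reflectance, depth. The angular bounds are
    assumptions (typical HDL-64 vertical field of view); adapt them
    to the actual sensor.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                            # azimuth angle
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))    # elevation angle

    # Normalize the angles to [0, 1], then scale to pixel coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * w                         # column index
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * h  # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    image = np.zeros((h, w, 5), dtype=np.float32)
    mask = np.zeros((h, w), dtype=np.float32)  # 1 where a valid point fell
    image[v, u] = np.stack([x, y, z, reflectance, depth], axis=1)
    mask[v, u] = 1.0
    return image, mask
```

Pixels where no point falls keep the zero value and a zero mask entry, matching the binary mask over empty pixels described above.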
In ,  and , the authors use a 5-channel range-image as the input of their network. These 5 channels are the 3D coordinates, the reflectance and the spherical depth. However, the analysis presented in  showed that feeding a 2-channel range-image, with only the reflectance and depth information, to a U-Net architecture achieves results comparable to the state of the art.
In all these previous works, the choice of the number of channels of the range-image appears to be empirical. For each application, a complete study or a large set of experiments must be conducted to choose the best among all possible combinations of channels, which is tedious and time-consuming. To bypass such an expensive study, we propose in this paper a feature extraction module that directly learns meaningful features adapted to the target application—here, semantic segmentation.
Moreover, processing geo-spatial information using 2D convolutional layers can cause issues in terms of data normalization, as the sampling density of LiDAR sensors typically decreases for farther points.
Inspired by the Local Point Embedder presented in , we propose a high-level 3D feature extraction module that learns meaningful high-level 3D features for each point and outputs a multichannel range-image. Contrary to , our module exploits the range-image to directly determine the neighbors of each point instead of using a preprocessing step. Moreover, our module outputs a range-image, instead of a point cloud, which can be used as the input of a CNN.
Given a point and its associated reflectance, we define the set of its neighboring points in the range-image, i.e. the points that correspond to the 8-connected neighborhood of the point in the range-image. This set is illustrated in Figure 4. We also define the set of neighbors in coordinates relative to the point. Note the special handling when either the point or one of its neighbors is an empty pixel.
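A minimal sketch of this neighbor gathering, under our assumptions: the first three channels of the range-image hold x, y, z, and neighbors involving an empty pixel are simply zeroed out (the paper's exact convention for empty pixels is not spelled out here).

```python
import numpy as np

def relative_neighbors(image, mask):
    """For each pixel of an (h, w, c) range-image whose first three channels
    are x, y, z, gather the 8-connected neighbors with their xyz expressed
    relative to the center point. Returns an (h, w, 8, c) array.

    Neighbors touching an empty pixel (or an empty center) are zeroed,
    which is our assumption for handling no-data pixels. np.roll wraps
    around the borders; this is correct horizontally for a 360-degree
    scan but is a simplification at the top and bottom rows.
    """
    h, w, c = image.shape
    out = np.zeros((h, w, 8, c), dtype=image.dtype)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    for k, (dy, dx) in enumerate(shifts):
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        shifted_mask = np.roll(mask, shift=(dy, dx), axis=(0, 1))
        rel = shifted.copy()
        rel[..., :3] -= image[..., :3]            # coordinates relative to center
        valid = (mask * shifted_mask)[..., None]  # both pixels must be valid
        out[:, :, k, :] = rel * valid
    return out
```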
Similarly to , the set of neighbors is first processed by a multi-layer perceptron (MLP), which consists of a succession of linear, ReLU and batch normalization layers. The resulting set is then max-pooled into a point feature vector, which is concatenated with the absolute coordinates and the reflectance of the point. The resulting vector is processed through another MLP that outputs a vector of 3D features for each point. This module is illustrated in Figure 5.
As linear layers can be implemented with convolutional layers, the whole point cloud can be processed at once. In this case, the output of the 3D feature extraction module is a matrix, which can then be reshaped into a range-image.
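The module above can be sketched in Keras as follows. The hidden layer sizes and the output channel count are illustrative assumptions, batch normalization is omitted for brevity, and `Dense` applied on the last axis plays the role of the shared linear layers (equivalent to 1x1 convolutions over the grid).

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_extraction_module(neighbors, center, n_features=8):
    """Sketch of the high-level 3D feature extraction module.

    neighbors: (batch, h, w, 8, 4) relative xyz + reflectance per neighbor.
    center:    (batch, h, w, 4)    absolute xyz + reflectance of the point.
    Hidden sizes (32, 64) and the default n_features are assumptions.
    """
    # First MLP, shared across all neighbors of all pixels.
    x = layers.Dense(32, activation="relu")(neighbors)
    x = layers.Dense(64, activation="relu")(x)
    # Max-pool over the 8 neighbors to get one feature vector per pixel.
    x = tf.reduce_max(x, axis=3)
    # Re-introduce the absolute coordinates and reflectance of the point.
    x = layers.Concatenate(axis=-1)([x, center])
    # Second MLP producing the learned range-image channels.
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(n_features)(x)  # (batch, h, w, n_features)
    return x
```

The output is already shaped as a multichannel range-image, ready to be fed to the segmentation network.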
The range-image produced by the feature extraction module is then fed to a U-Net segmentation network. The first half of the network is a contracting path: each block applies two convolutions, each followed by a rectified linear unit (ReLU), and a max-pooling layer that downsamples the input by a factor 2. Each time a downsampling is done, the number of features is doubled to compensate for the loss of resolution. The second half of the network consists of upsampling blocks, where the input is upsampled using up-convolutions. The upsampled feature map is then concatenated with the corresponding feature map of the first half, which allows the network to capture global context while keeping fine details. After that, two convolutions are applied, each followed by a ReLU. This block is repeated until the output of the network matches the dimension of the input. Finally, the last layer is a 1x1 convolution that outputs as many features as the desired number of possible labels, i.e. a 1-hot encoding.
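A minimal Keras sketch of such a U-Net follows. The filter counts, depth and default class count are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(64, 512, 8), n_classes=4, base_filters=32, depth=3):
    """Minimal U-Net sketch; hyperparameters are assumptions."""
    inputs = layers.Input(shape=input_shape)
    x, skips, f = inputs, [], base_filters
    for _ in range(depth):  # contracting path
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        f *= 2  # double the features after each downsampling
    x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)  # bottleneck
    for skip in reversed(skips):  # expanding path
        f //= 2
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)  # up-convolution
        x = layers.Concatenate()([x, skip])  # skip connection keeps fine details
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
    # Final 1x1 convolution: one score per label, per pixel.
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)
```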
The loss function of our model is defined as a variation of the focal loss presented in . Indeed, our model is trained on a dataset in which the number of examples for each class is largely unbalanced. Using the focal loss approach improves the average score by a few percent, as discussed later in Section 4. First, we define the pixel-wise softmax for each label $\ell$:

$$p_\ell(x) = \frac{\exp(a_\ell(x))}{\sum_{\ell'} \exp(a_{\ell'}(x))}$$

where $a_\ell(x)$ is the activation for feature $\ell$ at the pixel position $x$. We denote by $g(x)$ the groundtruth label of pixel $x$. We then compute the weighted focal loss as follows:

$$\mathcal{L} = -\sum_{x \in \Omega} m(x)\, w(x)\, \big(1 - p_{g(x)}(x)\big)^{\gamma} \log p_{g(x)}(x)$$

where $\Omega$ is the domain of definition of the image, $m(x)$ indicates the valid pixels, $\gamma$ is the focusing parameter, and $w(x)$ is a weighting function introduced to give more importance to pixels that are close to a separation between two labels, as defined in .
We train the network with the Adam stochastic gradient optimizer and a learning rate set to . We also use batch normalization with a momentum of 0.99 to ensure good convergence of the model. Finally, the batch size is set to and the training is stopped after epochs.
We trained and evaluated LU-Net using the same experimental setup as the one presented in SqueezeSeg , as the authors provide range-images with segmentation labels exported from the 3D object detection challenge of the KITTI dataset . They also provide the training/validation split used in their experiments, which contains samples for training and for validation, allowing a fair comparison between methods.
We manually tuned the number of output channels, i.e. the number of 3D features learned for each point. In all our experiments, the best semantic segmentation results were obtained by setting . This small number of channels is enough to highlight the structure of the objects that are later used by the U-Net in charge of the segmentation task. All results reported in this section use this value. Nevertheless, when using the high-level 3D feature extraction module for other applications, one should consider adapting this value.
We compare the proposed method to four state-of-the-art range-image based methods: PointSeg , SqueezeSeg , SqueezeSegV2 , and RIU-Net . RIU-Net is an earlier version of LU-Net that we developed; it is based solely on the raw reflectance and depth features instead of the 3D features learned end-to-end by LU-Net. Similarly to  and , the comparison is based on the Intersection-over-Union score:

$$\mathrm{IoU}_\ell = \frac{|P_\ell \cap G_\ell|}{|P_\ell \cup G_\ell|}$$

where $P_\ell$ and $G_\ell$ denote the sets of predicted and groundtruth points that belong to label $\ell$, respectively.
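This standard metric can be computed per class as follows (a plain numpy sketch, not the evaluation code used in the paper):

```python
import numpy as np

def per_class_iou(pred, gt, n_classes):
    """Intersection-over-Union for each label.

    pred, gt: integer label arrays of the same shape.
    Returns a list of IoU values; classes absent from both
    prediction and groundtruth yield NaN.
    """
    ious = []
    for label in range(n_classes):
        p, g = pred == label, gt == label
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious
```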
The performance comparison between LU-Net and state-of-the-art methods is displayed in Table 1. The first observation is that the proposed model outperforms existing methods in terms of average IoU by over %. In particular, the proposed model achieves better results on every class compared to PointSeg, SqueezeSeg and RIU-Net. Our method also largely outperforms SqueezeSegV2 on both pedestrians and cyclists.
Our method is very similar to RIU-Net, as both methods use a U-Net architecture with a range-image as input. While RIU-Net uses 2 channels—the reflectance and the depth—LU-Net automatically extracts N-dimensional high-level features per point thanks to the 3D feature extraction module. Table 1 demonstrates that using an additional network to automatically learn high-level features from the 3D point cloud largely improves the results, especially on classes that are less represented in the dataset.
Figure 7 presents visual results for SqueezeSegV2 and LU-Net. Visually, the results for cars are comparable. Nevertheless, looking closer, we observe that SqueezeSegV2 is more prone to false positives (Figure 7, orange rectangle). Moreover, our method provides a better segmentation of the cars in the background of the scene than SqueezeSegV2 (Figure 7, purple rectangle).
(Zooms in the following order: groundtruth, SqueezeSegV2, LU-Net.)
Table 2 presents intermediate scores in order to highlight the contribution of some model components.
First, we analyse the influence of using relative coordinates as the input of the 3D feature extraction module (Figure 5). We trained and tested the model using absolute coordinates instead; we name this version LU-Net w/o relative. As Table 2 shows, relative coordinates provide better results than neighbors in absolute coordinates. We believe that by taking relative coordinates as input, the network learns high-level features characterizing the local 3D geometry of the point cloud, independently of its absolute position in the 3D environment. These absolute positions are re-introduced once this geometry is learned, i.e. before the second multi-layer perceptron of the 3D feature extraction module.
|Model||Car||Pedestrian||Cyclist||Average|
|LU-Net w/o relative||62.8||39.6||37.5||46.6|
|LU-Net w/o FL||73.8||42.7||32.9||49.8|
For a fair comparison, we also experimented with absolute coordinates and a supplementary convolutional layer added as the first layer. Indeed, we could expect this additional layer to learn the transformation from absolute to relative coordinates. Nevertheless, this architecture brought numerical instability and did not manage to learn such a transform, ending up with an average IoU of %.
Next, we analyse the influence of the focal loss. As seen in Table 2, using the focal loss largely improves the scores on both cyclists and pedestrians. This is related to the class imbalance in the dataset, where there are times more car examples than cyclist or pedestrian examples.
Apart from being convincing in terms of IoU, the results produced by our method are also visually convincing, as demonstrated in Figures 1 and 9. Our segmentations are very close to the groundtruth. In Figure 9d), one of the pedestrians was not detected. When looking closely at the depth values in the range-image, this pedestrian is in fact hardly visible; the same holds in the reflectance image. This is also related to the resolution of the sensor, as only a few points fall on the pedestrian, and could probably be solved by adding an external modality such as an optical image.
In Figure 9e), a car in the foreground is missing from the groundtruth. This causes the IoU to drop from % when ignoring this region of the image, down to %. Thus, removing examples with wrong or missing annotations from the dataset could lead to better results for LU-Net as well as for other methods. However, given the number of examples in the dataset, obtaining a perfect annotation is practically very difficult.
Finally, LU-Net operates at 24 frames per second on a single GPU. This is a lower frame rate than some other systems, yet still above the acquisition rate of the LiDAR sensor (10 fps for the Velodyne HDL-64E). Moreover, our system uses only a few more parameters than RIU-Net for a significant improvement in terms of IoU scores.
In this paper, we have presented LU-Net, an end-to-end model for the semantic segmentation of 3D LiDAR point clouds. Our method efficiently creates a multichannel range-image using a learned 3D feature module. This range-image then serves as the input of a U-Net architecture. We show that this methodology efficiently bridges 3D point cloud processing and image processing. The resulting method is simple, yet provides very high quality results, far beyond existing state-of-the-art methods.
The current method relies on the focal loss function. We plan to study possible spatial regularization schemes within this loss function. Finally, fusion of LiDAR and optical data would probably enable reaching a higher level of accuracy.
The authors thank GEOSAT for funding part of this work. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 777826.
Conference on Computer Vision and Pattern Recognition, pages 663–672, 2018.
Fast Plane Extraction in Organized Point Clouds Using Agglomerative Hierarchical Clustering. In International Conference on Robotics and Automation, pages 6218–6225, 2014.