Being able to segment objects from point clouds is crucial for driver assistance systems, autonomous cars and other robotic perception tasks. Autonomous driving requires multiple sensors to capture all relevant information about the environment. Different types of sensors compensate for each other's individual disadvantages and ensure robust perception in challenging environments. However, fusing and leveraging all this multi-modal data is a non-trivial task.
The task of 3D perception for autonomous vehicles is usually tackled with a combination of RGB cameras and LiDAR sensors (i.e. laser range scanners). Recently, numerous architectures with diverse and often complex designs for sensor fusion have been published. However, due to the complexity of this task many methods either use only single-modal input, e.g. [lang2018pointpillars, wu2018squeezeseg, wu2018squeezesegv2] or use the benefits of multi-modalities only after single-modal proposal generation, e.g. [chen2017multi, qi2018frustum, shi2018pointrcnn]. Thus, not all available information is leveraged jointly. Objects poorly visible in one single sensor are prone to be missed.
To address this problem, we propose a simple and effective fusion method utilizing a dense native representation of laser range scanner data, such that all available information can be processed jointly by common convolutional neural network (CNN) architectures. The key idea is to warp expressive RGB features into this LiDAR representation, leveraging correspondences which can be established without any exhaustive search. In this work we focus on the task of point cloud segmentation to show the effectiveness and benefits of our fusion method.
In particular, we extend SqueezeSeg [wu2018squeezeseg] with an additional branch based on MobileNetV2 [sandler2018mobilenetv2] to leverage RGB information as well. However, naïvely warping the RGB image into range space and applying an ImageNet CNN for early fusion, e.g. [gupta2014learning], or intermediate fusion, e.g. [hazirbas2016fusenet], hampers the transfer learning benefits of CNNs, as the input image is visually distorted.
To overcome this issue, we propose to apply the ImageNet CNN on the original undistorted RGB image to better leverage the benefits of transfer learning. Next, we warp the CNN features into the range space to get a dense and powerful representation. Thereby, we leverage the RGB/LiDAR calibration to establish control points for a polyharmonic spline interpolation [fasshauer2007meshfree]. We improve SqueezeSeg’s segmentation results by a large margin without the use of any synthetic data (in contrast to [wu2018squeezeseg, wu2018squeezesegv2]).
We still perform at 50 frames per second (fps) on an NVIDIA GTX 1080Ti GPU, more than twice as fast as common LiDAR sensors dedicated to autonomous cars (typically operating at 20 Hz) and five times as fast as the LiDAR sensor used during the recording of the KITTI benchmark suite (10 Hz). Furthermore, we show that our approach performs better than state-of-the-art RGB semantic segmentation approaches.
2 Related Work
To better set our work in context, we will first consider recent approaches for 3D point clouds processing (Section 2.1) and then methods optimized for pseudo-3D/2.5D representations (Section 2.2). Finally, we will discuss works most related to ours, in particular about the fusion of depth and RGB information (Section 2.3).
2.1 3D Point Cloud Processing
Standard CNNs require dense input representations on uniform grids. Thus, vanilla CNNs cannot be applied directly to point clouds, as they are sparse in 3D space. To overcome this issue, various approaches have been proposed recently. They have been applied to various tasks, e.g. classification, 3D object detection and (part-)segmentation. These approaches can be divided into two groups, i.e. direct and grid-/graph-based methods.
Direct methods are deep architectures which are applied to the point cloud directly. One of the pioneering works in this group is PointNet by Qi et al. [qi2017pointnet]. They learn multi-layer perceptrons and linear transformations to map each point individually to an expressive feature space. Subsequently, a max pooling operation generates an order-independent global feature vector, which is utilized for classification and segmentation.
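The order independence of such a global feature can be illustrated with a minimal NumPy sketch. It is not the PointNet implementation: the two-layer shared perceptron, the layer sizes and the random weights are assumptions chosen purely for illustration.

```python
import numpy as np

def global_feature(points, w1, w2):
    """Shared per-point MLP followed by max pooling over the point axis."""
    hidden = np.maximum(points @ w1, 0.0)  # (N, H): same weights for every point
    feats = np.maximum(hidden @ w2, 0.0)   # (N, F): per-point features
    return feats.max(axis=0)               # (F,): does not depend on point order

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))         # toy point cloud (assumption)
w1 = rng.normal(size=(3, 16))              # random stand-ins for learned weights
w2 = rng.normal(size=(16, 32))

shuffled = points[rng.permutation(len(points))]
order_independent = np.allclose(global_feature(points, w1, w2),
                                global_feature(shuffled, w1, w2))
```

Because every point passes through the same weights and the maximum is taken over the point axis, shuffling the input leaves the global feature unchanged.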
PointNet lacks the ability to encode local structures with varying density. The subsequent extension PointNet++ [qi2017pointnet++] tackles this problem by introducing a hierarchical processing strategy. Multiple works [xu2018spidercnn, li2018pointcnn, wang2018deep] introduce a generalization of the classical convolution to irregular point sets. Same as PointNet++, they use a k-nearest neighbor search to overcome the lack of a strictly defined neighborhood.
These methods are only able to process a small and fixed number of points (up to a few thousand). To deal with larger point clouds, strategies such as tiling or farthest point sampling (FPS) must be applied to reduce the number of processed points. Due to the varying sparsity of LiDAR point clouds, these strategies are usually not very useful when directly applied to single sweeps, as often several samples at nearby salient regions are needed, e.g. to recover an object’s outline, instead of a few wide-spread samples. For example, FPS natively favors far distant points, which, given a LiDAR point cloud, are of little value for downstream tasks.
Grid-/graph-based methods apply established CNNs by transforming the point cloud into grid-based [riegler2017octnet, maturana2015voxnet, su2018splatnet] or graph-like [dgcnn, simonovsky2017dynamic] representations. The varying sparsity is the major issue here. Most of the covered space is empty, so naïvely convolving over a regular 3D grid would lead to a huge overhead. To enable efficient convolutions, data structures like octrees [riegler2017octnet], voxels [maturana2015voxnet] or high-dimensional lattices [su2018splatnet] are utilized. These works use sophisticated strategies to avoid redundant computations. However, the required data preprocessing can be time consuming and computationally expensive, especially for larger point clouds.
To represent and process large-scale point clouds, Landrieu and Simonovsky [landrieu2018large] introduce superpoint graphs (SPGs). They transfer the idea of superpixels [achanta2012slic] to point clouds and propose a geometric pre-partitioning of the data into simple primitives. The resulting superpoints are modeled together with derived features within the SPG and processed with [simonovsky2017dynamic].
All approaches considered so far are designed to process scenes where objects are fully described in 3D space (i.e. both the front and back of an object are reconstructed by the point cloud). However, a single LiDAR sweep just measures depth originating from the sensor center. Thus, it generates a 2.5D representation, where only the surface parts of an object facing the LiDAR are visible. While the point cloud is sparse in 3D and when projected onto the RGB image plane, a dense representation can be obtained by considering the native properties of the sensor (see Section 3.1 for details).
As common LiDARs have a nearly constant horizontal angle resolution, dense representations can be obtained via cylindrical projection [li2016vehicle, chen2017multi, minemura2018lmnet] or spherical projection [wu2018squeezeseg, wang2018pointseg, wu2018squeezesegv2]. However, in practice the vertical resolution is not constant.
For example, the Velodyne HDL-64E laser scanner (used by the KITTI benchmark) sweeps 64 beams with approximately two different angular distances. The top set of 32 beams has a higher angular distance between subsequent beams than the bottom set. Other LiDARs (e.g. Velodyne VLP-32C) sample denser near the horizon to improve long-range detections.
Our work is based on SqueezeSeg [wu2018squeezeseg] by Wu et al., an adaptation of SqueezeNet [iandola2016squeezenet] for LiDAR point cloud segmentation. It uses a spherical projection to obtain a dense representation of the LiDAR point cloud and encodes 3D coordinates, range and reflectance intensity into the channels of the input image. In [wu2018squeezeseg], they synthesize large amounts of point cloud data utilizing Grand Theft Auto V (GTA-V), a famous video game, to increase its performance on KITTI’s car class. This synthetic data, however, does not represent the other classes realistically enough, because the underlying geometry has been excessively simplified within the game. For example, the torso, head and limbs of pedestrians within GTA-V are crudely modeled as cylinders. In our work we do not rely on massively generated synthetic data and still achieve state-of-the-art results in real time.
2.3 RGB/3D Fusion
When depth information is densely available and properly registered with RGB imagery, it is an obvious choice to improve results on different vision tasks. Gupta et al. [gupta2014learning] propose three handcrafted auxiliary channels derived from depth to improve segmentation compared to a single depth channel. Hazirbas et al. [hazirbas2016fusenet] use a separate network branch for depth to improve results compared to an equivalent single-branch architecture with additional input channels. Recently, Zeng et al. [zeng2019deep] use two network branches to estimate surface normals. Similar to these approaches, we fuse the respective features at multiple layers as well. However, since depth is not densely available given a LiDAR point cloud, element-wise operations like summation are not sufficient. We introduce a progressive fusion scheme based on polyharmonic spline interpolation [fasshauer2007meshfree] to overcome this issue efficiently.
Recently, various works utilize both RGB and LiDAR data, mostly for the task of 3D object detection. For example, the Multi-View 3D network (MV3D) [chen2017multi] by Chen et al. maps the LiDAR point cloud to a bird's eye view (BEV) to generate object proposals. Given these proposals, features from the BEV, a cylindrical LiDAR projection and an RGB image branch are fused to classify an object and regress its bounding box. In Frustum PointNets [qi2018frustum], Qi et al. use Faster R-CNN [ren2015faster] to create 2D proposals from RGB imagery. The result is propagated to 3D space and refined. Except for the object class, there is no further information exchange between the RGB and the 3D detection head. Both works rely on proposals from a single data modality and are thus prone to lose objects, because they do not use all available information from the beginning. Ku et al. [ku2018joint] propose Aggregate View Object Detection (AVOD), a network based on RGB and BEV features. However, they evaluate a predefined set of 3D anchor boxes and are thus limited by their predefined choice.
Liang et al. [liang2018deep] propose a feature warping from an RGB CNN branch to a LiDAR bird's eye view (BEV). To this end, they need to perform a k-nearest neighbor search in the point cloud for each pixel in the BEV image. However, the point cloud becomes increasingly sparse with growing distance to the sensor. In [liang2019multi] they mitigate this issue utilizing an auxiliary depth completion task.
In contrast to these works, we use two native and dense representations which can be processed by standard CNNs without any further preprocessing. Thereby, we are able to densely warp and fuse the features and leverage all information jointly as early as possible.
In this section we describe the proposed feature warping module and how we extend SqueezeSeg in order to utilize RGB information. In particular, rather than warping the RGB image into the range space, we apply an ImageNet CNN directly on the undistorted input images. Consequently, we can leverage the benefits of transfer learning better, as objects are not distorted in the original RGB input. We then fuse RGB features extracted at multiple layers of the ImageNet CNN (MobileNetV2) into the segmentation architecture.
In order to align the RGB features with the range features for segmentation, we warp them by leveraging the correspondences available due to the calibrated setup. Subsequently, the warped RGB features are concatenated with features from the range image to perform segmentation.
Figure 1 schematically illustrates our network architecture and the feature warping. For efficiency, we subsample point correspondences (control points) within the different input images. In the following, we discuss the discretization of the LiDAR point cloud (Section 3.1), the foundation of our architecture SqueezeSeg (Section 3.2) and the warping procedure (Section 3.3) in more detail.
3.1 LiDAR Geometry
A common LiDAR sensor dedicated to autonomous driving sends out multiple vertically distributed beams and determines the distance to the first object hit by measuring the time of flight until the reflection is detected. A recording is usually obtained by a steady rotation of the laser transmitter itself or by a corresponding deflection of the beams, e.g. via mirrors.
SqueezeSeg processes the resulting point cloud on a spherical grid by discretizing the azimuth $\theta$ and zenith $\phi$ of each 3D point by
$$\tilde{\theta} = \lfloor \theta / \Delta\theta \rfloor, \qquad \tilde{\phi} = \lfloor \phi / \Delta\phi \rfloor,$$
where $(\Delta\theta, \Delta\phi)$ and $(\tilde{\theta}, \tilde{\phi})$ denote the discretization resolution and the coordinates on the spherical grid, respectively. The resulting spherical image constitutes a dense representation, which can be processed by a CNN. It incorporates five channels: the Cartesian point coordinates $(x, y, z)$, range and the LiDAR’s reflectance intensity measurement. Unless stated otherwise, we adopt this channel configuration.
However, in practice the vertical resolution, i.e. the angle between subsequent LiDAR beams, is not constant. Thus, we adapt the representation from [meyer2019lasernet] and utilize the beam id to assign each point to its row in the image. The beam id can be easily retrieved from the LiDAR sensor. This allows for an unambiguous vertical discretization to obtain a dense native range representation, which we use as the laser range image. This range representation is even easier to obtain than the spherical one (i.e. no need for zenith projection) and reduces holes and coordinate conflicts in the data. If (due to the horizontal discretization) multiple 3D points fall onto the same pixel in the range image, we choose the one with azimuth position nearest to the respective pixel center.
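The construction of such a range image can be sketched as follows. This is a minimal illustration rather than the original implementation: the function name, the image width of 512 columns, the 90° horizontal field of view and the column orientation are assumptions; only the row assignment via beam id, the five channels and the nearest-to-center conflict rule follow the description above.

```python
import numpy as np

def build_range_image(xyz, intensity, beam_id, n_beams=64, width=512,
                      fov=np.deg2rad(90.0)):
    """Scatter points into an (n_beams, width, 5) image: x, y, z, range, intensity.

    width and fov are illustrative assumptions, not values from the text.
    """
    image = np.zeros((n_beams, width, 5), dtype=np.float32)
    valid = np.zeros((n_beams, width), dtype=bool)
    best = np.full((n_beams, width), np.inf)       # best distance to pixel center

    azimuth = np.arctan2(xyz[:, 1], xyz[:, 0])     # 0 = straight ahead (assumed)
    col_f = (azimuth + fov / 2) / fov * width      # continuous column position
    ranges = np.linalg.norm(xyz, axis=1)

    for i in range(len(xyz)):
        col = int(np.floor(col_f[i]))
        if not 0 <= col < width:
            continue                               # outside the field of view
        row = int(beam_id[i])                      # beam id selects the row
        dist = abs(col_f[i] - (col + 0.5))
        if dist < best[row, col]:                  # keep point nearest the center
            best[row, col] = dist
            image[row, col] = (*xyz[i], ranges[i], intensity[i])
            valid[row, col] = True
    return image, valid
```

The returned `valid` mask marks pixels that received a measurement, which is convenient for later masking out missing returns.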
We base our architecture on SqueezeSeg [wu2018squeezeseg]. It is a lightweight architecture based on SqueezeNet [iandola2016squeezenet], specifically designed to segment spherical images. It adapts the FireModule layers from [iandola2016squeezenet] and introduces related FireDeconvs instead of using convolutions and transposed-convolutions in order to reduce computational effort.
Similar to [chen2017deeplab], SqueezeSeg uses a conditional random field (CRF) in order to refine the segmentation results, especially at the object borders. The CRF penalizes assigning different labels to similar points in terms of angular and Cartesian coordinates. In other words, points with nearby coordinates in the range image as well as in 3D space are encouraged to receive the same label.
Finally, it minimizes a pixel-wise cross-entropy loss. To mitigate the impact of the class imbalance, cyclists and pedestrians are weighted more strongly. Furthermore, outliers due to failed laser measurements are masked out during loss computation.
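A class-weighted, masked cross-entropy of this kind can be sketched in NumPy as follows. The function name, the normalization by the sum of weights and any concrete weight values are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def masked_weighted_cross_entropy(logits, labels, class_weights, valid_mask):
    """logits: (H, W, C); labels: (H, W) int; valid_mask: (H, W) bool."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_prob, labels[..., None], axis=-1)[..., 0]
    weights = class_weights[labels] * valid_mask             # zero for outliers
    # Normalizing by the weight sum is an assumption, not taken from the text.
    return float(-(weights * picked).sum() / max(weights.sum(), 1e-12))
```

Pixels with `valid_mask == False` contribute neither to the numerator nor to the normalization, so failed laser measurements cannot influence the loss value.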
3.3 Multi-modal Feature Fusion
In order to merge RGB features from a CNN layer with those from the laser range image, we propose to use the known calibration of LiDAR and RGB camera. We illustrate this process in Figure 2. For each valid pixel in the range image, the corresponding 3D position of the laser point is available. Given the projection matrix $\mathbf{P}$, we can project the 3D coordinates onto the image via
$$\mathbf{x} = \mathbf{P}\,\mathbf{X},$$
where $\mathbf{X}$ and $\mathbf{x}$ denote homogeneous 3D and pixel coordinates [hartley2003multiple], respectively. The projection matrix itself can be easily derived from the RGB camera calibration and the transformation from the LiDAR to the camera coordinate system.
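As a minimal sketch of this projection, the following NumPy function maps 3D points to pixel coordinates with a 3x4 projection matrix and flags which points land inside the image. The function name and the convention of marking non-visible points with -1 are assumptions; in practice the matrix would combine the camera calibration with the LiDAR-to-camera transformation.

```python
import numpy as np

def project_to_image(points_xyz, P, image_size):
    """Project (N, 3) points with a 3x4 matrix P; returns pixels and visibility."""
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4)
    uvw = homog @ P.T                                               # (N, 3)
    depth = uvw[:, 2]
    in_front = depth > 0                         # points behind the camera are invalid
    uv = np.full((len(points_xyz), 2), -1.0)     # -1 marks "no valid projection"
    uv[in_front] = uvw[in_front, :2] / depth[in_front, None]
    width, height = image_size
    visible = (in_front
               & (uv[:, 0] >= 0) & (uv[:, 0] < width)
               & (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return uv, visible
```

Points with `visible == True` are exactly those that yield a correspondence between the range image and the RGB image.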
Points visible in both the RGB and the range image constitute correspondences between the two representations. A naïve approach would be to use these correspondences to look up every 3D point’s color within the RGB image and thereby colorize the range image.
However, the comparably dense and valuable information provided by the RGB image would be left unused. Thus, we propose to fuse the intermediate feature representations extracted from the respective CNNs. We use well-studied architectures [sandler2018mobilenetv2, wu2018squeezeseg] capable of providing useful feature representations for both the RGB and the range image. We extract and warp RGB features at multiple levels of the network such that they align with their range counterparts. We map the ImageNet features from three layers of MobileNetV2 to the layers Fire2, Fire4 and Fire7 of SqueezeSeg, respectively. We choose the layers before a pooling operation in MobileNetV2 and warp into similarly sized SqueezeSeg layers whilst avoiding the ones which are passed through the skip connections. As a consequence, we exploit the RGB features with the highest representational capabilities of the respective spatial resolution and save parameters within the decoder. Using different or fewer connection points leads to slightly inferior results.
Since we warp feature tensors at different network layers (instead of raw input images), we cannot rely on a simple lookup. This is due to the fact that we do not have explicit correspondences between positions within the range feature tensor and their counterparts within the RGB feature tensor. For proper feature warping, we need sub-pixel accuracy (see green line segments in Figure 2). Additionally, we need to deal with laser measurement outliers (e.g. due to transparent surfaces or far distant objects) which cause missing range image-to-RGB correspondences.
To address these issues, we treat the range image-to-RGB correspondences and their positions as control points for a first-order polyharmonic spline interpolation [fasshauer2007meshfree]. Passing query positions $\mathbf{r}$ in the range image, we obtain the corresponding interpolated position $\mathbf{r}'$ in the RGB image with
$$\mathbf{r}' = \sum_{i=1}^{N} \mathbf{w}_i\,\psi\!\left(\lVert \mathbf{r} - \mathbf{c}_i \rVert_2\right) + \mathbf{V}^{\top} \begin{bmatrix} 1 \\ \mathbf{r} \end{bmatrix}, \qquad \psi(d) = d,$$
where $\mathbf{c}_i$ are the range pixel coordinates with valid corresponding positions in the RGB image. By solving a linear system of equations, we obtain the interpolating spline weights $\mathbf{w}_i$ and $\mathbf{V}$. Note that we need to do this computation only once for each sample and we can reuse the weights for all interleaved layers.
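The fitting and evaluation of a first-order polyharmonic spline can be sketched with NumPy as follows. This uses the standard formulation with radial basis function psi(d) = d plus an affine term; the function names and the dense solver are assumptions and not the original TensorFlow implementation. The control points must be distinct and not all collinear for the system to be solvable.

```python
import numpy as np

def fit_spline(ctrl_src, ctrl_dst):
    """Solve for radial weights w (N, 2) and affine part v (3, 2), psi(d) = d."""
    n = len(ctrl_src)
    dists = np.linalg.norm(ctrl_src[:, None] - ctrl_src[None], axis=-1)  # (N, N)
    affine = np.hstack([np.ones((n, 1)), ctrl_src])                        # (N, 3)
    lhs = np.block([[dists, affine], [affine.T, np.zeros((3, 3))]])
    rhs = np.vstack([ctrl_dst, np.zeros((3, 2))])
    sol = np.linalg.solve(lhs, rhs)          # one linear solve per sample
    return sol[:n], sol[n:]

def warp(query, ctrl_src, w, v):
    """Map (Q, 2) range-image positions to interpolated RGB-image positions."""
    dists = np.linalg.norm(query[:, None] - ctrl_src[None], axis=-1)      # (Q, N)
    affine = np.hstack([np.ones((len(query), 1)), query])                 # (Q, 3)
    return dists @ w + affine @ v
```

Once `w` and `v` are fitted, `warp` can be called repeatedly with query grids of different resolutions, which mirrors the reuse of the weights across fusion layers.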
In order to retrieve correspondences for a specific spatial resolution, we scale the pixel positions within the range features such that they are aligned with the original input image. Subsequently, we sample the corresponding position in the RGB space using the calculated spline interpolation. This yields the sub-pixel accurate position within the input RGB image for each pixel in the range feature tensor. From this, we can retrieve the corresponding position within the RGB feature tensor as shown in Figure 2.
To derive the actual value at the non-discrete position in RGB feature space, we bilinearly interpolate the four nearest neighboring features. The part of the warped feature tensor with correspondences outside the RGB image is set to zero.
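The bilinear lookup with zero filling outside the image can be sketched as follows. The function name and the (x, y) ordering of the query positions are assumptions; the feature map is assumed to be at least 2x2 in size.

```python
import numpy as np

def sample_bilinear(features, xy):
    """features: (H, W, C) floats; xy: (Q, 2) sub-pixel (x, y). Returns (Q, C)."""
    h, w, c = features.shape
    out = np.zeros((len(xy), c), dtype=float)   # positions outside stay zero
    x, y = xy[:, 0], xy[:, 1]
    inside = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    x0 = np.clip(np.floor(x[inside]).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y[inside]).astype(int), 0, h - 2)
    ax = (x[inside] - x0)[:, None]              # horizontal interpolation weight
    ay = (y[inside] - y0)[:, None]              # vertical interpolation weight
    top = (1 - ax) * features[y0, x0] + ax * features[y0, x0 + 1]
    bottom = (1 - ax) * features[y0 + 1, x0] + ax * features[y0 + 1, x0 + 1]
    out[inside] = (1 - ay) * top + ay * bottom
    return out
```

Sampling a feature map whose values are linear in the pixel position reproduces those values exactly, which makes for a simple sanity check.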
We evaluate our method on KITTI [geiger2012we, Geiger2013IJRR] and reuse the train/val-split from [wu2018squeezeseg]. We also follow their training protocol and adopt their parameters: We consider the three main classes cars, pedestrians and cyclists and add an auxiliary class to model the background. KITTI provides labels only within a limited horizontal field of view, thus we limit our consideration to this area. Additionally, our range images have the same resolution and, unless otherwise stated, the same input channels as in [wu2018squeezeseg].
|Method||car||ped||cyc||avg||rt [ms]|
|SqSeg w/o RGB †||67.2||20.2||24.1||37.2||9|
|SqSeg w/ RGB||63.7||18.8||22.8||35.1||13|
|PointSeg [wang2018pointseg] *||67.4||19.2||32.7||39.8||-|
|SqSeg [wu2018squeezeseg] *||64.6||21.8||25.1||37.2||13.5|
|SqSegV2 [wu2018squeezesegv2] *||73.2||27.8||33.6||44.9||-|
We augment the data by random horizontal flips and slight deviations in saturation, contrast and brightness of the RGB image. Based on a checkpoint trained with LiDAR features, we re-initialize the respective weights and fine-tune the network. We implement our framework in TensorFlow [abadi2016tensorflow] and use a GeForce GTX 1080Ti GPU for all runtime evaluations.
In the following, we evaluate the effect of our proposed FuseSeg method on point cloud segmentation in comparison with state-of-the-art methods (Section 4.1). Subsequently, we compare the architecture with RGB semantic segmentation networks to validate our warping-based feature fusion (Section 4.2). Finally, we show that we can reduce the number of control points and the accompanied computational cost without negatively affecting the performance (Section 4.3).
4.1 Feature Fusion
We show the merit of the fused image features by comparing our approach not only with SqueezeSeg, but also with state-of-the-art point cloud segmentation methods. Table 1 shows the results for all three relevant object classes and the respective runtime, while Figure 3 shows some qualitative results. We report the best average intersection-over-union over all three classes.
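For reference, the per-class intersection-over-union can be computed as in the following sketch. The function name and the choice to return NaN for classes absent from both prediction and ground truth are assumptions.

```python
import numpy as np

def class_iou(pred, target, num_classes):
    """Per-class intersection-over-union for integer label arrays of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious
```

The reported average is then the mean of the values for the car, pedestrian and cyclist classes.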
To provide an additional baseline, we also pass the RGB channels to SqueezeSeg (SqSeg w/ RGB). Thus, we colorize its range representation. To this end, we project each point onto the RGB image and sample the underlying pixel’s color. Note that not the entire range image is colored, only those 3D points which are visible in the RGB image.
The additional color channels even lower the performance of SqueezeSeg. The reason for this drop is that SqueezeSeg is optimized for runtime speed. Consequently, its representational capacity is insufficient to exploit the additional information. Since we utilize a separate lightweight network to process the RGB information, we introduce another baseline (FuseSeg R-RGB): We warp the RGB image to its range counterpart (see Figure 5 for an upscaled example) and pass it to our RGB branch. Note that this baseline has the same number of parameters as FuseSeg.
As we see in our experiments, using a pre-trained ImageNet CNN/MobileNetV2 for extracting features in a warped range image already benefits segmentation performance compared to using no ImageNet CNN for the RGB information. Further, by using our proposed warping method to fuse on the feature level instead of the (RGB) input level, we further significantly improve accuracy. The main reason for this is that the warped RGB input representation is heavily distorted and thus impairs the performance of ImageNet features. In contrast, with our approach the ImageNet CNN operates on an undistorted RGB input on which it better benefits from transfer learning.
FuseSeg improves segmentation, especially on the smaller classes pedestrian and cyclist, by a large margin. We increase the mean intersection-over-union (IoU) by 18% respectively 13.2% compared to SqueezeSeg. We even outperform its successor SqueezeSegV2 [wu2018squeezesegv2] on average by 3.1%, which could be improved by our approach as well.
4.2 FuseSeg vs RGB Semantic Segmentation Approaches
In order to show the effectiveness of our warping-based feature fusion, we compare our approach with semantic segmentation approaches relying solely on RGB information. More specifically, we compare FuseSeg with DeepLabv3+ [deeplabv3plus2018] in combination with two feature extraction backends, a MobileNetV2 [sandler2018mobilenetv2] and a more powerful Xception65 [chollet2017xception] feature extractor. We argue that outperforming equivalent state-of-the-art architectures validates our fusion approach. Figure 4 illustrates the process of deriving and evaluating labeled point clouds from RGB segmentation masks.
We fine-tune the pre-trained DeepLabv3+ models on CityScapes and the KITTI semantic segmentation data and ensure that no image of our validation set is used for training. We train until convergence and choose the checkpoint with the best segmentation result on the KITTI validation set. To overcome the diverging annotation policies of the two datasets, we fuse neighboring bicycle and rider regions to cyclist.
We create segmentation masks for each RGB image by passing it through the trained models and segment the 3D points by projecting them onto the masks (see Eq. 3). All classes except car, bicycle and pedestrian are considered as background. Thus, we segment the point clouds without using any depth information. For this comparison, we only evaluate the part of the range image with color information for all methods (thus, the evaluation region differs from Section 4.1).
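The transfer of labels from a 2D segmentation mask to the projected points can be sketched as follows. The function name, the nearest-pixel lookup via flooring and the assignment of the background id to non-visible points are assumptions for illustration; `uv` and `visible` are assumed to come from a prior projection step.

```python
import numpy as np

def label_points_from_mask(mask, uv, visible, background=0):
    """mask: (H, W) int class ids; uv: (N, 2) pixel (x, y); visible: (N,) bool."""
    labels = np.full(len(uv), background, dtype=mask.dtype)
    cols = np.floor(uv[visible, 0]).astype(int)
    rows = np.floor(uv[visible, 1]).astype(int)
    labels[visible] = mask[rows, cols]       # look up the class under each point
    return labels
```

Restricting the evaluation to entries with `visible == True` corresponds to evaluating only the part of the range image with color information.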
Table 2 shows the IoU on the respective classes and the runtime of each method. We clearly outperform DeepLabv3+ in terms of runtime, and we outperform the network based on MobileNetV2 on all three classes. Note that this is the same backend as used in FuseSeg for RGB information. This demonstrates that depth adds valuable information to the segmentation task and that our fusion approach is an effective and very efficient method to utilize it.
We are even better than the powerful Xception65 DeepLabv3+ on average performance, despite using the weaker backend. Our modular design allows the exchange of the RGB backend in a plug-and-play manner, but one of our research goals is real-time speed leading to the choice of MobileNetV2.
4.3 Number of Control Points
|# Ctrl Pts||car||ped||cyc||avg||rt [ms]|
In KITTI there are up to 19k point correspondences between an RGB image and the range representation. However, since the computational cost of the interpolation increases with the number of control points, a small number of control points is desirable. To reduce the number of control points while obtaining a good coverage in the target domain, we perform farthest point sampling (FPS) on the coordinates in the range image (in contrast to FPS on 3D coordinates).
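Farthest point sampling on 2D range-image coordinates can be sketched as follows. The function name and the fixed start index are assumptions; the greedy selection itself is the standard formulation of the technique.

```python
import numpy as np

def farthest_point_sampling(coords, k, start=0):
    """Greedily pick k indices; each new point is farthest from those chosen so far."""
    chosen = [start]
    min_dist = np.linalg.norm(coords - coords[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))       # point with largest distance to the set
        chosen.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(coords - coords[nxt], axis=1))
    return chosen
```

Applied to the pixel coordinates of all valid correspondences, the returned indices select a subset of control points that is spread evenly over the range image.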
We compare different configurations aiming at a reliable assessment. We vary the number of control points used by our architecture and evaluate segmentation accuracy as well as runtime. Table 3 shows the speed-vs-accuracy trade-off. Interestingly, we only need a very small number of control points, i.e. 24, to estimate a decent warping and achieve state-of-the-art results. We see that there is no notable variation of the accuracy for the car class, which can be explained by their size.
However, for smaller objects, i.e. pedestrians and cyclists, we observe a notable sensitivity regarding the control points and multiple spikes at certain point numbers. Due to the baseline between camera and LiDAR and the resulting parallax, a flawless warping is not always possible. This distortion peaks at high depth differences, e.g. at the edges of visible objects (see Figure 5). We hypothesize that certain numbers of control points favor these distortions more than others. More elaborate sampling methods, e.g. focusing on depth discontinuities within the range image, might mitigate these sensitivities, but are out of the scope of this paper.
We propose a simple and effective way to leverage RGB features for LiDAR point cloud segmentation. Utilizing the range representation of LiDAR point clouds allows us to process them with established CNN strategies. Our efficient warping-based feature fusion then enables us to use the benefits of transfer learning on the dense and rich information provided by RGB data jointly with features derived from LiDAR data. We still fulfill real-time requirements, performing at 50 fps. This is more than twice as fast as the typical recording speed of today’s LiDAR sensors. Thus, our method can easily be utilized in autonomous cars and robots.
Furthermore, the encoder of FuseSeg is applicable as feature extractor for various 3D perception tasks. Finally, our warping strategy in combination with the range representation can be used to interleave features in both directions and thus, also improve RGB-based object detection and semantic segmentation.
This project was supported by the Austrian Research Promotion Agency (FFG) project DGT (860820). This work was partially funded by the Christian Doppler Laboratory for Embedded Machine Learning.