3D Object Detection From LiDAR Data Using Distance Dependent Feature Extraction

03/02/2020 · Guus Engels, et al. · UPV/EHU, Vicomtech

This paper presents a new approach to 3D object detection that leverages the properties of the data obtained by a LiDAR sensor. State-of-the-art detectors use neural network architectures based on assumptions that are valid for camera images. However, point clouds obtained from LiDAR are fundamentally different. Most detectors use shared filter kernels to extract features, which do not account for the range-dependent nature of point cloud features. To show this, different detectors are trained on two splits of the KITTI dataset: close range (objects up to 25 meters from the LiDAR) and long range. Top view images are generated from the point clouds as input for the networks. The combined results outperform the baseline network trained on the full dataset with a single backbone. Additional research compares the effect of using different input features when converting the point cloud to an image. The results indicate that the network focuses on the shape and structure of the objects rather than on the exact values of the input. This work proposes an improvement for 3D object detectors by taking into account the properties of LiDAR point clouds over distance. Results show that training separate networks for close-range and long-range objects boosts performance for all KITTI benchmark difficulties.

I Introduction

"LiDAR is a fool’s errand and anyone relying on LiDAR is doomed." - Elon Musk, CEO of Tesla. The CEO of Tesla, a company that puts tremendous effort in the development of autonomous vehicles does not see value in using the LiDAR sensor, but a vast number of researchers and companies disagree and have shown that including LiDAR in their perception pipeline can be beneficial for advanced scene understanding.

Fig. 1: Frame from the KITTI dataset with the camera image at the bottom and two point cloud crops of the cars in the top figures. The left car in the image corresponds to the top-left point cloud crop, and the right car to the top-right crop.

LiDAR is an abbreviation for light detection and ranging. It is a sensor that uses laser light to measure the distance to objects and their reflectiveness. It sends out beams of light that diverge over distance and reflect when an object is hit. It has proven useful for many applications in areas such as archaeology and geology. However, there are some downsides that hinder its widespread adoption for autonomous driving: currently, LiDAR is an expensive, computationally demanding sensor that requires some revision of the car’s layout. Cameras are a rich source of information and very cheap in comparison. Nonetheless, there are situations where cameras struggle and LiDAR can be of significant help. Driving in low-light conditions is challenging for a camera, but makes no difference for LiDAR because it does not depend on external light sources. Overall, the role of LiDAR in a future autonomous vehicle is unclear, but it has very valuable properties that can play an important part in autonomous driving, which makes research on it worthwhile.

For a self-driving vehicle it is of the utmost importance to be aware of its surroundings. Objects therefore have to be located in the 3D space around the ego vehicle, which requires sensors such as LiDAR to capture the surrounding information. Research into 3D object detection and LiDAR have gone hand in hand. The most influential dataset for 3D object detection for autonomous applications has been the KITTI dataset [5]. It hosts leaderboards on which different methods are compared. The growing interest in 3D object detection can be observed from these leaderboards, where the state of the art is replaced quickly by newer architectures: the best performing network of 2018, Pixor [30], only just falls within the top 100 as of October 2019. All detectors on the 3D object detection leaderboards are influenced to varying degrees by 2D detectors. Advances in 2D object detection can clearly be of great help for 3D object detection. However, 2D object detection is done on images, a very different data type from point cloud data. LiDAR beams diverge, which causes objects farther away to be represented with fewer points than the exact same object nearby. This is very different from images, where an object becomes smaller when it is farther away but the underlying features and representation stay the same, as can be clearly seen in Figure 1. The effect of this has not yet received much attention in the literature.

In this paper, the differences between image and point cloud data are investigated. A new pipeline is proposed that learns different feature extractors for objects that are close by and far away. This is necessary because the features in these two ranges are very different, and one shared feature extractor would perform sub-optimally in both. Furthermore, a visualization of the changing features over distance is provided. Previous works like ZFnet [34] not only increased performance, but also showed the importance of understanding how the network operates. Inspired by this, we focus on how point cloud input data differs from image input data, to better understand the limitations of using 2D CNNs on point clouds, and we show how to alter the detection pipeline to exploit these differences.

To summarize, the contributions of this paper are the following:

  • We provide an analysis of the effects of different input features on the detection performance.

  • We show the influence of the changing representation over distance and how it affects the underlying features of objects of the car class.

  • We propose a new detection pipeline that is able to learn different feature extractors for close-by objects and far-away objects.

II Related Work

3D object detection has been tackled with many different approaches. The first networks were heavily influenced by 2D detectors: they converted point clouds to top view images, which made it possible to use conventional 2D detectors. A variety of input features and conversion methods were used. Many of these detectors were later improved by adding information from other sensors or from maps.

Currently, most of the best performing architectures use methods that directly process the point cloud data [22, 33]. They do not require handcrafted features or conversion to images as the previous methods did. Recently, new datasets have become public and will push the field of 3D object detection further forward.

II-A 2D CNN approaches

Many early and current 3D object detectors are adapted 2D object detectors such as Faster R-CNN [20] and YOLO [19]. In particular, ResNet [7] is often used as the backbone network for feature extraction. In order to detect 3D objects with 2D detectors, point clouds are compressed into image data. Two common representations are (i) a front view approach as in LaserNet [15] and (ii) a bird’s eye view (BEV) approach, where the point cloud is compressed along the height dimension [1]. Some methods use only three input channels to keep the input exactly the same as for 2D detectors; other methods use more height channels to preserve more of the point cloud’s height information [30, 3]. Which features are important to retain has been researched before, but it is still an open research topic. The BEV has proven to be such a powerful representation that methods using camera data convert it to a BEV to perform well on 3D object detection tasks [24]. How the conversion from point cloud to BEV is performed is explained in the Methods section.

II-B Multimodal approaches

A modern autonomous vehicle will have more than just a LiDAR sensor. This is why many methods use a multimodal approach that combines LiDAR with camera sensors [3, 11, 9, 17]. These methods fuse the data at different stages of the network: MTMS [11] uses an early fusion method, whereas MV3D [3] uses a deep fusion method. Besides different sensors, map information can also be employed to improve performance [29]. Some approaches track objects over frames [14, 26], which boosts performance, especially for objects represented by only a small number of LiDAR points and for partially occluded objects in a sequence.

II-C Detection directly on the point cloud

A major shift in 3D object detection came when networks started to extract features directly from the point cloud data. VoxelNet [35] introduced the VFE layer, which computes 3D features from a set of points in the point cloud and stores the extracted values in a voxel, representing the value of a volume in 3D space, similar to what a pixel does for 2D space. A second influential network that works directly with point cloud data is PointNet [18]. It operates directly on the unordered point cloud and is invariant to transformations such as translations and rotations. These two methods are the foundation of many architectures in the top 50 of the KITTI leaderboards. PointPillars [10] uses PointNet to extract features, transforms them into a BEV representation, and applies a detector with an FPN-inspired backbone [13]. Other networks follow this approach and focus on the orientation [28] or try bottom-up anchor generation [21, 32].

Fig. 2: Proposed pipeline for 3D object detection using point clouds as input. In this method, point clouds are converted to bird’s eye view (BEV) images in the "point cloud to BEV" stage. The point cloud is compressed to an image according to one of the four configurations explained in the Methods section. The architecture inside the dotted lines is a Faster R-CNN architecture with an additional rotation branch that makes it possible to output rotated bounding boxes. RPN stands for region proposal network, which searches for regions of interest for the head network, which classifies and regresses bounding boxes around these objects. ROI pooling converts differently sized regions of interest to a common size so that they can be handled by two fully connected layers called FC6 and FC7. The outputs of these layers are sent to the final branches, which predict either rotated or horizontal bounding boxes with a class label.

II-D Datasets

The KITTI dataset [5] has so far been one of the most widely used datasets for 3D object detection for autonomous driving. It is a large-scale dataset that contains annotations in the front camera field of view for camera and LiDAR data. There are three main classes, namely cars, pedestrians and cyclists; however, the car class contains more than half of all objects [5]. Recently, many new datasets have emerged. Most notable are NuScenes [2] and Honda’s H3D dataset [16]. They are much larger datasets, annotated over the full 360 degrees, and contain 1 million and 1.4 million bounding boxes respectively, compared to 200k annotations in KITTI. The NuScenes dataset was created with a 32-channel LiDAR, whereas the KITTI and H3D datasets were both created with a 64-channel LiDAR. Sensors with more channels are able to produce denser, more complete 3D maps; the beam divergence is larger when there are fewer channels, which makes it more difficult to detect far-away objects.

III Methods

In order to apply 3D object detection to LiDAR data and perform the different tests mentioned in the Introduction, a 3D object detection algorithm is implemented according to the structure in Figure 2. The point cloud is converted to images, which can then be used by a Faster R-CNN detection network. This network is used to evaluate the effect of different input features on performance and to create an architecture that uses different kernel weights for different regions of the LiDAR point cloud.

III-A Network

III-A1 Architecture

The architecture of the full baseline network is displayed in Figure 2. It takes a point cloud as input and outputs horizontal and rotated bounding boxes. A Faster R-CNN [20] with a rotation regression branch in the head network is used. The backbone is ResNet-50 [7], of which only the first three blocks are used for the RPN. Faster R-CNN is designed for camera images, where objects can have all kinds of shapes; in a BEV image of LiDAR data all objects are relatively small in comparison. To make sure that the feature map is not downsampled too quickly, the stride of the first ResNet block is changed from 2 to 1. This makes the output feature map four times as big, so less spatial information is lost. The object size in a BEV image is directly proportional to the physical size of that object, which allows the anchor boxes to be tailored exactly to the object size and serve as good priors. When cars are rotated, these anchor boxes no longer fit the objects accurately. To account for this, different orientations of the anchor boxes are considered, similarly to [31, 23]; in total, sixteen anchor orientations are used.

The head network is slightly different from Faster R-CNN because it contains an additional rotation head [8]. Region proposals from the RPN and the output feature map of the RPN are the inputs of the head network. Region of interest (ROI) pooling outputs a feature map for each object proposal, which is fed to two fully connected layers (FC6 and FC7). These layers output four bounding box variables for non-rotated objects: the center coordinates (x, y) and the dimensions (w, h). Rotated boxes have an additional parameter, θ, for the rotation angle of the bounding box. Each box has a class confidence score, and the class with the highest score is used as output.
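As an illustration of the anchoring scheme just described, the sketch below enumerates one car-sized anchor at sixteen orientations over every cell of a feature map. The function name, anchor dimensions and stride are illustrative assumptions rather than values taken from the paper.

import numpy as np

def rotated_anchors(feat_h, feat_w, stride, box_w, box_l, n_angles=16):
    # One (cx, cy, w, l, angle) anchor per orientation at every feature-map
    # cell; the dimensions and stride below are illustrative, not the paper's.
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)  # degrees
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)
    anchors = []
    for a in angles:
        const = lambda v: np.full((feat_h, feat_w, 1), float(v))
        boxes = np.concatenate(
            [centers, const(box_w), const(box_l), const(a)], axis=-1)
        anchors.append(boxes.reshape(-1, 5))
    return np.concatenate(anchors, axis=0)

# Hypothetical example: a 175 x 175 feature map (stride 4) with a car-sized
# anchor of roughly 1.8 m x 4.5 m expressed in pixels at 0.1 m resolution.
anchors = rotated_anchors(175, 175, stride=4, box_w=18, box_l=45)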

III-A2 Point cloud to BEV Image

Fig. 3: Point cloud conversion to the bird’s eye view (BEV) representation. This research uses four different configurations, visualized in the bottom dotted block. In the complete pipeline, only one of these is used for each test.

Feature extractors such as ResNet are designed for RGB images. Using this architecture unchanged requires reshaping the point cloud input into a three-channel image. A common approach is to compress the point cloud into a BEV image in which the height dimension is represented by three channels. Which information to keep remains an open problem. In the past, it was shown that the reflectiveness, or intensity value, did not contribute much when maximum height information was available [1]. Pixor [30] and MV3D [3] use multiple height maps to retain more height information: instead of storing only 3 channels, as would be the case in a regular 2D detector, they have an architecture that can handle inputs of more than 3 channels. To test the effect of different features, the inputs are pre-processed into four different input configurations. All methods share some steps. First, a 3D space of m to m in longitudinal range, m to m in lateral range and m to m in the height dimension is defined. The LiDAR in the KITTI setup is located m off the ground, so m is where the road is. All points that fall outside this box are discarded. For three of the four methods, pixel images are considered, which means that every m cube represents one pixel. The last method uses more channels and is of size . The exact specifications of the channels are as follows:

TABLE I: Statistics of the 4 cars, ordered from left to right.
Fig. 4: Representation of four cars at different ranges detected with a 64 channel LiDAR. The green arrow indicates the vertical distance between four adjacent points and the red arrow indicates the horizontal distance between four adjacent points. The distance of four points instead of the distance between adjacent points is used for visualization purposes.

Max height voxels: the cube can be divided into three voxels of m × m × m each. The first channel contains the height value of the highest point in the m to m range, the second channel the highest value from m to m, and the third from m to m. An advantage of this method is that a form of ground removal takes place automatically: points very close to the road are often not of much importance for the objects on the road. In addition, the highest points of an object are in many cases what distinguishes it, and are therefore often the most important feature [1].

Binary: for every voxel it is checked whether at least one LiDAR point is present. If so, the channel gets the value , otherwise . This value is chosen instead of to make sure the values have the same order of magnitude as the pre-trained ImageNet [4] weights. This approach is an important indicator of how the network handles the specific point cloud values and consequently serves as a baseline to check whether other features add value.

Multichannel max height voxels: an input of only three channels might not be enough, since much of the original point cloud information is lost. Using more height channels provides the network with more of the original information and should improve the scores, as was done in [3, 30]. Nine height maps are used as input to the ResNet backbone instead of the three maps used in the max height approach. To make sure this input is compatible with the ResNet feature extractor, the pre-trained weights are duplicated three times and stacked.

Height intensity density: instead of picking a subset of the points from the point cloud, it is also possible to compute potentially interesting features and feed them to the network. This configuration has been used before in BirdNet [1].

A visualization of how all the features are extracted is displayed in Figure 3. This stage represents what happens in the "Pointcloud to BEV" block in Figure 2.
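As a concrete sketch of the max height voxel configuration, the snippet below compresses a KITTI-style point cloud into a BEV image with one max-height channel per vertical slice. The crop bounds, slice edges and 0.1 m resolution in the commented example are assumptions for illustration, not the paper's exact settings.

import numpy as np

def max_height_bev(points, x_range, y_range, z_slices, resolution):
    # points: (N, 3+) array with x (forward), y (left) and z (up) in meters.
    h = int(round((x_range[1] - x_range[0]) / resolution))
    w = int(round((y_range[1] - y_range[0]) / resolution))
    bev = np.zeros((h, w, len(z_slices)), dtype=np.float32)  # empty cells stay 0
    for c, (z_lo, z_hi) in enumerate(z_slices):
        m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
             (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
             (points[:, 2] >= z_lo) & (points[:, 2] < z_hi))
        rows = ((points[m, 0] - x_range[0]) / resolution).astype(int)
        cols = ((points[m, 1] - y_range[0]) / resolution).astype(int)
        # keep the highest z value that falls into each pixel of this slice
        np.maximum.at(bev[:, :, c], (rows, cols), points[m, 2])
    return bev

# Hypothetical usage on a KITTI scan (binary file with x, y, z, intensity):
# cloud = np.fromfile("000000.bin", dtype=np.float32).reshape(-1, 4)
# bev = max_height_bev(cloud, x_range=(0, 70), y_range=(-35, 35),
#                      z_slices=[(-1.5, -0.5), (-0.5, 0.5), (0.5, 1.5)],
#                      resolution=0.1)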

III-B Loss function

Faster R-CNN is a two-stage detector for which the total loss is a combination of the losses of the individual stages. The losses of the first stage are similar to those in the original Faster R-CNN paper, but without the logarithmic scaling factors. The regression targets describe the absolute difference between the ground truth and the network prediction for the center coordinates (x, y) and the dimensions (w, h) of the object:

t_x = |x^{*} - x|  (1)
t_y = |y^{*} - y|  (2)
t_w = |w^{*} - w|  (3)
t_h = |h^{*} - h|  (4)

where (x, y, w, h) are predictions from the network and (x^{*}, y^{*}, w^{*}, h^{*}) are the ground truth values which the network aims to predict. The classification loss is a softmax cross-entropy between the background and foreground classes. Together they form the RPN loss.

The horizontal branch of the head network uses the exact same regression targets, but now has to consider multiple classes instead of just background and foreground. In this research only the car class is considered, so it turns out to be the same as in the RPN case. The rotational branch of the head network has an additional regression loss for the rotation, as shown in Figure 2. This loss is the absolute difference between the predicted and ground truth angle in degrees:

t_{\theta} = |\theta^{*} - \theta|  (5)

It is important to note that by introducing rotation it becomes possible to describe the exact same rotated bounding box in multiple ways. Consider a bounding box defined by x, y, w, h and θ. If the values of w and h are swapped and 90 degrees are added to the angle θ, the same bounding box is described. This has to be avoided, since a loss larger than zero would otherwise be possible even though the predicted bounding box fits perfectly. It is avoided by making sure that the width is always larger than the height: if this is not the case, the height and width are swapped and 90 degrees are added to the rotation, forcing a unique bounding box configuration.
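A minimal sketch of this canonicalization step, with hypothetical names, could look as follows.

def canonicalize_box(cx, cy, w, h, angle_deg):
    # Force a unique parameterization: keep the width larger than the height,
    # swapping the two and adding 90 degrees to the angle when needed.
    if h > w:
        w, h = h, w
        angle_deg += 90.0
    # Wrapping the angle into [0, 180) is an extra assumption to keep it bounded.
    angle_deg = angle_deg % 180.0
    return cx, cy, w, h, angle_deg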

Once all regression targets are calculated, a smooth L1 loss is applied according to the following equation:

\mathrm{smooth}_{L1}(t) = \begin{cases} 0.5\,(\sigma t)^2 & \text{if } |t| < 1/\sigma^2 \\ |t| - 0.5/\sigma^2 & \text{otherwise} \end{cases}  (6)

where σ is a tuning hyperparameter that controls the transition between the quadratic and the linear regime; different values of σ are used for the RPN and the head network, in line with the Faster R-CNN implementation [20].
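A sketch of the smooth L1 loss in the parameterization used by common Faster R-CNN implementations; the default σ value here is only a placeholder, not the paper's setting.

import numpy as np

def smooth_l1(diff, sigma=3.0):
    # Quadratic below the cutoff 1 / sigma^2, linear above it.
    diff = np.abs(diff)
    cutoff = 1.0 / (sigma ** 2)
    return np.where(diff < cutoff,
                    0.5 * (sigma * diff) ** 2,
                    diff - 0.5 * cutoff)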

Combining the classification and regression losses of all branches results in six losses in total: a classification and a regression loss for each of the RPN, the horizontal branch and the rotational branch. They are combined into the total loss according to the following equation:

L_{total} = L_{RPN}^{reg} + L_{RPN}^{cls} + L_{hor}^{reg} + L_{hor}^{cls} + L_{rot}^{reg} + L_{rot}^{cls}  (7)

where L_{RPN}^{reg} and L_{RPN}^{cls} are the regression and classification losses of the RPN, L_{hor}^{reg} and L_{hor}^{cls} those of the horizontal head branch, and L_{rot}^{reg} and L_{rot}^{cls} those of the rotational head branch.
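Combining the six terms could be sketched as below; the 2:1 regression-to-classification weighting follows the empirical choice mentioned later in the implementation details, and the dictionary keys are hypothetical.

def total_loss(losses, w_reg=2.0, w_cls=1.0):
    # losses: dict holding the six individual loss values (floats or tensors).
    reg = losses["rpn_reg"] + losses["hor_reg"] + losses["rot_reg"]
    cls = losses["rpn_cls"] + losses["hor_cls"] + losses["rot_cls"]
    return w_reg * reg + w_cls * cls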

III-C Working with LiDAR data

The density of points and the distance between points change with range in the point clouds. Deep learning architectures are often designed with images in mind, where this is not the case. This section shows how features change in LiDAR data and how the network architecture can be adjusted to account for this.

III-C1 LiDAR features

An important reason why convolutional neural networks work well with relatively few parameters compared to classic neural networks is parameter sharing. It leverages the idea that important features are the same across the whole image, so a single filter can be used at every position [6]. This is a valid assumption for rectilinear images, but not for LiDAR data. The representation of an object m from the sensor is different from that of the same object m from the sensor: the number of points decreases and the distance between points increases. Figure 4 shows how the representation of a car changes over distance. The number of points drastically decreases for an object of the same class and similar size, while the distance between points grows.

The sparsity and the divergence of the beams over distance can also be observed in BEV images. Figure 5 shows the gaps that appear in the side of a car. This is not because no part of the car falls on that pixel, but because the beams have diverged so much that they can no longer cover every pixel. The farther away from the LiDAR, the larger the gaps. Note that these gaps in the longitudinal and lateral dimensions are much smaller than the divergence in the height dimension. The rapid divergence in the height dimension makes it difficult to detect objects that only cover a small area, such as pedestrians. The green arrow in Figure 4 represents the height dimension and shows a bigger distance than the red arrow, which relates to the longitudinal and lateral dimensions. This is confirmed by the specifications of the LiDAR used for the KITTI dataset, which state that the angular resolution in longitudinal and lateral range is degrees and the vertical resolution is degrees [12]. The fourth car in Figure 4 is barely distinguishable at a distance of meters. The BEV range is by meters, so considering that range, meters is not even near the edge of where cars should be detected for the KITTI evaluation. The difficulties with far-away objects in the KITTI dataset were addressed quite recently by Wang et al. [25]. They used adaptive layers to enhance the performance on distant objects and were able to improve SECOND [28] and VoxelNet [35] by roughly average precision for the easy, moderate and hard categories. While the idea is similar, our work does not need an adversarial approach.
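The effect of the angular resolution can be quantified with simple geometry: the spacing between adjacent returns grows roughly linearly with range. The 0.4 degree vertical resolution used below is a typical figure for a 64-channel spinning LiDAR and serves only as an illustrative assumption.

import math

def point_spacing(distance_m, angular_res_deg):
    # Approximate spacing between adjacent returns on a surface roughly
    # perpendicular to the beams, at the given range.
    return distance_m * math.tan(math.radians(angular_res_deg))

for d in (5, 25, 55):
    print(d, "m:", round(point_spacing(d, 0.4), 3), "m between rings")
# ~0.035 m at 5 m but ~0.38 m at 55 m: the same car is hit by far fewer beams.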

Fig. 5: BEV representation of two cars in the same KITTI frame. The left car is located 20 meters from the LiDAR and the right car 53 meters from the LiDAR. It can be seen that the number of points decreases over distance and gaps start appearing between adjacent pixels.

III-C2 Combined network

With the knowledge that features are not consistent over the range of the point cloud, the network can be adjusted accordingly. Multiple instances of the baseline network are used and trained on different subsets of the training data: the first network is trained only on objects that are close by, while an identical second network is trained on objects far away. The results for the separate regions are combined and then compared to a baseline network trained on the full range. To decide whether an object falls in the category "close by" or "far away", a distance threshold has to be established. As Figure 6 shows, a radial distance from the LiDAR sensor is used.

For the inside range, the point cloud is converted to a BEV image in the same manner as before, but all pixel values outside this threshold range are set to zero. The same is done for the outside range, but with the inside set of pixels set to zero, as displayed in Figure 6. The point cloud is thus converted into two BEV images, each containing part of the point cloud and part of the objects. If the distance to the center of an object is smaller than the threshold range, it belongs to the inside network; if it is larger, it belongs to the outside network.

When objects are excluded based on their center, it could happen that objects are partially cut out of the image that is fed to the network. This should not occur, since it would degrade the training data. To avoid this, an overlap region is introduced: this region of the point cloud is used by both the inside and the outside range networks.
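A sketch of how the two masked BEV images could be produced, assuming a radial threshold with a shared overlap band; the argument names and the exact way the overlap is attached to both ranges are assumptions.

import numpy as np

def split_bev_by_range(bev, resolution, threshold_m, overlap_m, lidar_px):
    # bev: (H, W, C) image; lidar_px: (row, col) position of the sensor in pixels.
    h, w = bev.shape[:2]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # radial distance of every pixel from the LiDAR position, in meters
    dist = np.hypot(rows - lidar_px[0], cols - lidar_px[1]) * resolution
    inside = np.where((dist <= threshold_m + overlap_m)[..., None], bev, 0.0)
    outside = np.where((dist >= threshold_m - overlap_m)[..., None], bev, 0.0)
    return inside, outside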

IV Implementation Details

IV-A Network settings

All tests are done on a Faster R-CNN network with a rotational branch inspired by [27]. A learning rate of is applied, with decay steps at and and a decay factor of for both steps. The dataset, which contains samples, is split into training, validation and test sets in a fashion. For all networks, multiple checkpoints were evaluated and the one with the best score on the validation set was picked. The best scores often occurred between and steps, which means that the networks were trained for roughly to epochs. All networks are trained on a single Nvidia Tesla V100 GPU. Anchors of pixels are used to detect cars, which is in line with the average car size.

The network is optimized using stochastic gradient descent with a momentum of . Weight decay of is used to prevent overfitting. All weights of the ResNet backbone are pre-trained on ImageNet [4]. Although the underlying data is quite different from LiDAR data, we observed fast convergence of the network. A batch size of is used, with batch normalization applied with a decay factor of . The batch normalization layers are applied but not trained; the values from the pre-trained ImageNet weights are used.

Smooth L1 loss is used with thresholds of and for the RPN and head network respectively. The loss weights for the different components were found empirically. It is important to note that only one class is considered during training, which makes classification easier. Most errors occurred in the box regression, so the regression loss is weighted twice as high as the classification loss for both the RPN and the head network branches.
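For reference, a minimal training-setup sketch matching the description above (SGD with momentum, weight decay and a stepwise learning-rate decay); all numeric values are placeholders, since the paper's exact hyperparameters are not reproduced here.

import torch

model = torch.nn.Linear(8, 4)  # stand-in for the detector network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,   # placeholder values
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60000, 80000],
                                                 gamma=0.1)  # decay at two steps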

Fig. 6: Pipeline of the combined network architecture. The inside and outside range networks only consider the white part of the range visualization shown next to them. These networks have the exact same configuration, in line with Figure 2. The outputs of both networks are combined to form the final output.

IV-B Data pre-processing

Point clouds are converted to one of the four configurations described in the Methods section and displayed in Figure 3. Every pixel corresponds to meter of space in the point cloud. All tests are done on the KITTI dataset, where only the field of view (FOV) of the front camera is annotated. All pixel values outside this FOV are set to zero, which results in a region of black pixels. Objects that lie on the border of the FOV are sometimes still annotated. To generalize this situation for all images, only objects that are annotated and have at least of their surface in the FOV are taken into account. No augmentations are used for the final models because they did not improve the performance of the network.

IV-C Combined network

The combined network uses the exact same settings described in the previous sections, but applied to different ranges. One network is trained on data close to the LiDAR, while the other network is trained on data far away from the LiDAR. The final output is their combined output for the respective regions. For the inside range network, only boxes with their center closer than m to the LiDAR are taken into account, with a m overlap region between m and m to make sure no labeled objects are cut in half. For the network that trains on far-away objects, a m boundary is used, where the m region is the overlap region mentioned before. Figure 6 shows how the images are combined. The choice of threshold for a particular dataset affects the results of the network. We found that m was a good threshold, since both regions still contain enough training samples.

For inference, a more straightforward division is used. The KITTI evaluation considers a range of [m, m] laterally and [m, m] longitudinally. We simply divide this region in half to obtain a close-by and a far-away range, on which the baseline and combined networks are evaluated. All objects in the top half of the image, bounded by [m, m] laterally and [m, m] longitudinally, are detected by the inside network. The outside network detects objects in the space bounded by [m, m] laterally and [m, m] longitudinally. The results of both networks are then merged together. The full pipeline is displayed in Figure 6.
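Merging the two networks' outputs at inference can then be as simple as the sketch below, where each detection is assumed to carry its longitudinal center coordinate in meters and the 35 m split comes from the evaluation ranges in Table II.

def merge_detections(inside_dets, outside_dets, split_m=35.0):
    # Keep near detections from the inside network and far ones from the
    # outside network, then concatenate the two lists.
    near = [d for d in inside_dets if d["x"] < split_m]
    far = [d for d in outside_dets if d["x"] >= split_m]
    return near + far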

V Results

All results are based on the KITTI benchmark, which calculates average precision (AP) scores for three difficulty categories: easy, moderate and hard. AP is a commonly used object detection metric. In KITTI, equally spaced recall points are used and the precision at each point is calculated. The precision values at the different recall points are combined according to the following equation:

AP = \frac{1}{|R|} \sum_{r \in R} p(r)  (8)

where R is a set of recall values linearly spaced between 0 and 1 and p(r) is the precision at recall r; the sum of the precision values divided by |R| is the AP score.
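In code, this reduces to averaging the precision values sampled at the recall points, as in the toy example below (the precision values are made up).

import numpy as np

def average_precision(precision_at_recall):
    # Mean precision over equally spaced recall points approximates the
    # area under the precision-recall curve.
    return float(np.mean(precision_at_recall))

precisions = [1.0, 1.0, 0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.3, 0.1]  # 11 recall points
print(average_precision(precisions))  # approx. 0.69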

Different true positive thresholds are considered. The official KITTI evaluation uses a 70% intersection over union (IoU) threshold for the car class, but we also report scores at 50% overlap, as many other works do. A 70% overlap is quite challenging to achieve with rotated bounding boxes around objects that are often occluded or contain few points.

                          0-35m range              35-70m range             0-70m range
Method (IoU threshold)    Easy  Moderate  Hard     Easy  Moderate  Hard     Easy  Moderate  Hard
Baseline (0.7)            78.8  82.0      75.4     -     43.7      37.0     78.5  73.0      66.9
Combined Network (0.7)    81.5  83.4      76.5     -     46.5      42.0     81.2  74.7      68.9
Difference                2.7   1.4       1.1      -     2.8       5.0      2.7   1.7       2.0
Baseline (0.5)            89.3  89.7      89.2     -     60.9      53.0     89.0  82.9      81.2
Combined Network (0.5)    89.8  89.9      89.5     -     75.1      67.7     89.5  86.4      84.7
Difference                0.5   0.2       0.3      -     14.2      14.7     0.5   3.5       3.5
TABLE II: Baseline network compared to the combined network at 50% and 70% IoU thresholds. The average precision for the 0-70 meter range is the weighted mean average precision of the 0-35 m and the 35-70 m ranges. The reason is that the confidence scores of the two sub-networks in the combined setup do not match: the confidence scores for far-away objects are too high because that network has never seen close-by objects. Since this mismatch influences the AP calculation, a mean average precision over the two ranges is used.

V-A Feature analysis

Next, multiple input features and their impact on the results are tested. The four input configurations considered are explained in the Methods section. Table III shows the results on the KITTI benchmark for the different pre-processing methods displayed in Figure 3. The first three methods, maximum heights, binary and height/intensity/density, differ only by a small margin; only the approach that uses 9 channels performs significantly worse.

Features                      Easy   Moderate   Hard
max height                    79.5   73.1       66.6
height, intensity, density    79.2   73.1       67.0
binary                        79.4   72.9       66.4
multichannel height vox.      76.0   65.8       65.0
TABLE III: Results of the different input feature configurations.

Overall, the networks seem to learn the structure of the objects rather than the absolute values stored in the channels. This does not come as a complete surprise, since a ResNet feature extractor with batch normalization is used: the information in the absolute values is "lost" quite quickly in the network because of the normalization. The filters in the first layers mostly look for derivative-based features such as edges and corners, and from that perspective the differences between configurations are not so big. That being said, the input values are still important for the performance of the network: solely using intensity performs much worse than using only the max height value [1]. Another factor could be the pre-trained weights, which allow for fast training but may limit the overall performance of the network. This could be a problem especially for the network with nine-channel inputs.

These results highlight a limitation of 2D detectors operating on point clouds, and a possible reason for the gap between them and methods that use the point cloud directly. For LiDAR data these absolute values are of great importance, since they can give information about classes directly: an object is very unlikely to be a car if its highest points are only meter high, even though the top view almost perfectly corresponds. It is possible that methods using the point cloud directly are able to leverage this information more than 2D detectors. These networks also use batch normalization, but are able to compute more complete features that help boost their scores. Relying too heavily on these input values could be dangerous when considering different classes, noise, sensor movement, etc. With the rise of new datasets, it will hopefully become clear how robust these methods are.

V-B Range analysis

Table II shows the results of the baseline network and the combined network at 70% and 50% IoU thresholds. The easy category is not considered for the 35-70 m range due to the lack of a significant number of cars. The combined network outperforms the baseline network for all categories. The inside and outside networks are trained on subsets of the total data and are still able to perform better than the baseline network: not sharing the weights results in a better performing network, because objects in different ranges do not have to share the same feature extractors.

Comparing the results at 70% IoU with those at 50% shows that even when the networks are able to detect the cars, in many cases the precision of the bounding boxes hurts the results significantly. Cars that are closer to the LiDAR contain more points, and regressing their boxes is easier in those situations. If only a 50% threshold is used, the evaluation is more forgiving of errors in rotation, which have the largest influence on the overlap. For the 35-70 m range the improvements of the combined network are remarkable. It seems that the change in features over distance hurts the ability to detect objects when the model is trained with full-range data, and likewise the ability to estimate the orientation of the objects accurately. In automated driving this is important, since the orientation is key for further processing steps such as estimating heading angles and tracking.

Figure 7 shows qualitative results of the combined network. The green boxes come from the inside network and the red boxes from the outside network. The BEV image is adjusted for better visualization by smoothing the images and maximizing the input values. In the left image, it can be seen that the bounding box closest to the LiDAR is a correct detection, although almost all of the pixels that represent the car are black; the algorithm detects it correctly based on the small corner that is still present. The right image shows how the network performs across the full range.

Fig. 7: Detection results on the KITTI validation set. The upper row in each image is the BEV representation with the detected cars; the lower rows are the camera images corresponding to those BEVs. Green boxes are estimated by the inside network and red boxes by the outside network.

VI Discussion

A 3D object detection algorithm is implemented and trained on the KITTI dataset that outperforms many recent BEV-based networks [23, 1, 15] and obtains results similar to other networks [30, 3]. This paper shows how convolutional neural networks designed for natural images rely on assumptions that do not transfer well to point cloud data. This insight can be used to increase the performance of many networks on the KITTI benchmark, whether they process BEV images or point clouds directly, as was also shown by [25]. Furthermore, this work looks into the advantages and limitations of BEV approaches. The main advantage is robustness, since a BEV detector does not rely on the specific values of the object but mostly on its shape. This can also be seen as a limitation, since these specific values could be used directly for classification in LiDAR data. However, such values can vary between LiDAR models and are more sensitive to possible noise. The amount of data that needs to be processed is another factor to take into account, as only an image needs to be processed instead of an entire point cloud.

VII Conclusions

This work provides insights into how LiDAR data differs from natural images, which features are important for detecting cars, and how a neural network architecture for 3D object detection can be adjusted to take into account the changing object features over distance. The first tests check the effect of different handcrafted input features on performance. Point clouds are converted to BEV images that are fed to a two-stage detector. Different input configurations do not vary much in performance, which can be attributed to the fact that the filters look for features such as edges and do not rely as much on the exact values in the BEV. These exact values are of more relevance for LiDAR data than for camera images because they can be linked directly to certain classes. This is different from approaches that use the raw point clouds directly as input, and might explain why they perform better overall on the KITTI benchmark. Nevertheless, relying on those values may be a problem when dealing with sensor noise or differences between LiDAR models. With the rise of new datasets, it will hopefully become clear how robust these methods are compared to BEV-based approaches.

In addition, this work visualizes how the distance to the sensor influences an object's representation in the point cloud. Most convolutional neural networks rely on the assumption that features are consistent over the full range of the image, which allows one filter to extract features from the entire feature map. For LiDAR data this is not the case, as shown by analyzing point clouds and objects at various distances. This observation is used to change the detection pipeline so that a separate detector handles objects in the 0-35 meter range and another handles objects in the 35-70 meter range. These changes lead to improvements, most notably of 2.7% AP on the 0-35 meter range for the easy category and 5.0% AP on the 35-70 meter range for the hard category, using a 70% IoU threshold.

References

  • [1] J. Beltrán, C. Guindel, F. M. Moreno, D. Cruzado, F. García, and A. de la Escalera (2018) BirdNet: a 3d object detection framework from lidar information. CoRR abs/1805.01195. External Links: Link, 1805.01195 Cited by: §II-A, §III-A2, §III-A2, §III-A2, §V-A, §VI.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §II-D.
  • [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2016) Multi-view 3d object detection network for autonomous driving. CoRR abs/1611.07759. External Links: Link, 1611.07759 Cited by: §II-A, §II-B, §III-A2, §III-A2, §VI.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §III-A2, §IV-A.
  • [5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: §I, §II-D.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. The MIT Press. External Links: ISBN 0262035618, 9780262035613 Cited by: §III-C1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §II-A, §III-A1.
  • [8] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo (2017) R2CNN: rotational region CNN for orientation robust scene text detection. CoRR abs/1706.09579. External Links: Link, 1706.09579 Cited by: §III-A1.
  • [9] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. IROS. Cited by: §II-B.
  • [10] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2018) PointPillars: fast encoders for object detection from point clouds. CoRR abs/1812.05784. External Links: Link, 1812.05784 Cited by: §II-C.
  • [11] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019-06) Multi-task multi-sensor fusion for 3d object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II-B.
  • [12] Velodyne LiDAR. HDL-64 user's manual. External Links: Link Cited by: §III-C1.
  • [13] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144. External Links: Link, 1612.03144 Cited by: §II-C.
  • [14] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3569–3577. Cited by: §II-B.
  • [15] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. CoRR abs/1903.08701. External Links: Link, 1903.08701 Cited by: §II-A, §VI.
  • [16] A. Patil, S. Malla, H. Gang, and Y. Chen (2019) The H3D dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. CoRR abs/1903.01568. External Links: Link, 1903.01568 Cited by: §II-D.
  • [17] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2017) Frustum pointnets for 3d object detection from RGB-D data. CoRR abs/1711.08488. External Links: Link, 1711.08488 Cited by: §II-B.
  • [18] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. CoRR abs/1612.00593. External Links: Link, 1612.00593 Cited by: §II-C.
  • [19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: Link, 1506.02640 Cited by: §II-A.
  • [20] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §II-A, §III-A1, §III-B.
  • [21] S. Shi, X. Wang, and H. Li (2018) PointRCNN: 3d object proposal generation and detection from point cloud. CoRR abs/1812.04244. External Links: Link, 1812.04244 Cited by: §II-C.
  • [22] S. Shi, Z. Wang, X. Wang, and H. Li (2019) Part-a^ 2 net: 3d part-aware and aggregation neural network for object detection from point cloud. arXiv preprint arXiv:1907.03670. Cited by: §II.
  • [23] M. Simon, K. Amende, A. Kraus, J. Honer, T. Sämann, H. Kaulbersch, S. Milz, and H. Gross (2019) Complexer-yolo: real-time 3d object detection and tracking on semantic point clouds. CoRR abs/1904.07537. External Links: Link, 1904.07537 Cited by: §III-A1, §VI.
  • [24] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2018) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. CoRR abs/1812.07179. External Links: Link, 1812.07179 Cited by: §II-A.
  • [25] Z. Wang, S. Ding, Y. Li, M. Zhao, S. Roychowdhury, A. Wallin, G. Sapiro, and Q. Qiu (2019) Range adaptation for 3d object detection in lidar. External Links: 1909.12249 Cited by: §III-C1, §VI.
  • [26] X. Weng and K. Kitani (2019) A baseline for 3d multi-object tracking. CoRR abs/1907.03961. External Links: Link, 1907.03961 Cited by: §II-B.
  • [27] Y. Xue (2018) Faster r2cnn. GitHub. External Links: Link Cited by: §IV-A.
  • [28] Y. Yan, Y. Mao, and B. Li (2018-10) SECOND: sparsely embedded convolutional detection. Sensors 18, pp. 3337. External Links: Document Cited by: §II-C, §III-C1.
  • [29] B. Yang, M. Liang, and R. Urtasun (2018) HDNET: exploiting hd maps for 3d object detection. In CoRL, Cited by: §II-B.
  • [30] B. Yang, W. Luo, and R. Urtasun (2019) PIXOR: real-time 3d object detection from point clouds. CoRR abs/1902.06326. External Links: Link, 1902.06326 Cited by: §I, §II-A, §III-A2, §III-A2, §VI.
  • [31] X. Yang, K. Fu, H. Sun, J. Yang, Z. Guo, M. Yan, T. Zhang, and X. Sun (2018) R2CNN++: multi-dimensional attention based rotation invariant detector with robust anchor strategy. CoRR abs/1811.07126. External Links: Link, 1811.07126 Cited by: §III-A1.
  • [32] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2018) IPOD: intensive point-based object detector for point cloud. arXiv preprint arXiv:1812.05276. Cited by: §II-C.
  • [33] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) STD: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1951–1960. Cited by: §II.
  • [34] M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §I.
  • [35] Y. Zhou and O. Tuzel (2017) VoxelNet: end-to-end learning for point cloud based 3d object detection. CoRR abs/1711.06396. External Links: Link, 1711.06396 Cited by: §II-C, §III-C1.