Lidar-based Object Classification with Explicit Occlusion Modeling

07/09/2019 ∙ by Xiaoxiang Zhang, et al. ∙ Tencent QQ

LIDAR is one of the most important sensors for Unmanned Ground Vehicles (UGVs). Object detection and classification based on the lidar point cloud is a key technology for UGVs. In object detection and classification, the mutual occlusion between neighboring objects is an important factor affecting accuracy. In this paper, we consider occlusion as an intrinsic property of the point cloud data and propose a novel approach that explicitly models it. The occlusion property is then taken into account in the subsequent classification step. We perform experiments on the KITTI dataset. Experimental results indicate that by utilizing the occlusion property we model, the classifier obtains much better performance.


1 Introduction

LIDAR is one of the most popular sensors for unmanned vehicles due to its highly precise range measurements. Object detection and classification based on the lidar point cloud is an extremely important technology for unmanned vehicles. However, the sparseness of the lidar point cloud and the mutual occlusion between neighboring objects pose significant challenges for object detection and classification algorithms. Fig.1 shows a typical traffic scene in which many occlusions can be observed.

Figure 1: In a typical traffic scenario, it is common to see the mutual occlusion between neighboring objects. The lidar point cloud of the object to be classified is often incomplete and fragmented, which could easily result in wrong classification results.

Ideally, the lidar point cloud corresponding to an object should be relatively complete and fully reflect the spatial distribution characteristics of objects. However, due to the mutual occlusion of neighboring objects, the object point cloud is usually incomplete which may result in the wrong classification of the object.

An illustrative example is shown in Fig.2. In the training phase, many positive samples, including samples A and B shown in the top row of Fig.2, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. The classifier is then trained to adapt to this intra-class variation. In the testing phase, the classifier encounters two samples, C and D. Sample D is a true positive, while C is actually composed of two small objects, E and F. The classifier will have difficulty distinguishing C from D, and it is very likely to classify C as a false positive or D as a false negative.

Figure 2: In the training phase of the traditional approach, many positive samples, including samples A and B shown in the top row, are fed into the classifier. Sample B is occluded by another obstacle, making its point cloud incomplete. In the testing phase, the classifier encounters two samples, C and D. Sample C is likely to be classified as a false positive, while sample D is likely to be classified as a false negative. In our approach, we add a pre-processing step that computes the occlusion property of the point cloud before the classification module. As shown in the bottom row, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D, thus both the false positive rate and the false negative rate might be reduced.

In this paper, we consider occlusion as an intrinsic property of the point cloud data. The occlusion area can be accurately computed by considering the relative position between the LIDAR itself and each detected LIDAR point using the ray-casting technique [1]. Therefore, we add a pre-processing step that attaches the occlusion property to the point cloud before any further processing. As shown in the bottom row of Fig.2, the occluded area is colored in yellow. With the help of the occlusion area, the classifier can now easily distinguish object C from D. Therefore, both the false positive rate and the false negative rate might be reduced.

We test our approach on the KITTI dataset, choosing PointNet [2] as the basic classifier. We modify PointNet to enable it to utilize the occlusion property. Experimental results show that our method obtains a significant improvement over the original PointNet, both in overall classification accuracy and in per-class classification accuracy.

2 Related Work

There is a large literature on object detection approaches based on point clouds. Petrovskaya et al. proposed an object detection algorithm based on object geometry and motion models [3, 4, 5], and used Bayesian filters to estimate the model parameters. Himmelsbach et al. extracted the geometric features of the point cloud using the point feature histogram [6, 7], and then used an SVM to classify the object. Building on the work of [3, 4, 5], Wojke et al. [8] proposed an object detection algorithm based on the combination of line features and angular features. Cheng et al. proposed to use histogram features for object detection and recognition [9].

Recently, deep learning based approaches have become popular due to their outstanding performance. MV3D [10] first projects the point cloud onto the bird's eye view and then trains a region proposal network (RPN) to generate 3D bounding box proposals. However, MV3D does not perform well in detecting small objects such as pedestrians and cyclists. VoxelNet [11] is an end-to-end object detection framework. It divides the point cloud into equally spaced three-dimensional voxels and then transforms the points in each voxel into a uniform feature representation through the newly introduced Voxel Feature Encoding (VFE) layer. The point cloud is then encoded as a volumetric representation on which detection and classification are performed. Different from previous approaches that rely on a mid-level representation, such as image grids or 3D voxels, Qi et al. proposed a new type of network called PointNet [2] that works directly on the original point cloud. PointNet is a unified framework that can be applied to object classification, part segmentation and scene semantic parsing. It obtains competitive results on several 3D object classification benchmarks.

For occlusion handling, several works [12, 13, 14, 15] have tried to directly predict the occlusion mask. However, most of these are image-based approaches. There has been little lidar-based work that directly models occlusion and utilizes the occlusion property to aid classification tasks.

3 The Proposed Approach

3.1 Point Cloud Definition

A point cloud is represented as a set of three-dimensional points $\{p_i \mid i = 1, \dots, n\}$, where each point $p_i$ is a vector of its $(x, y, z)$ coordinates.

We define the point cloud within the object bounding box as the object point cloud, and the point cloud outside the object bounding box as the obstacle point cloud. The obstacle point cloud blocks the lidar ray from passing through it, thus resulting in an incomplete object point cloud. The occlusion areas generated by these two point clouds using the ray-casting technique are defined as the occluded point cloud. In Fig.3, we can see that the occluded point cloud is divided into two parts, colored in yellow and pink respectively.
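A minimal Python sketch of this split, assuming an axis-aligned bounding box for simplicity (KITTI boxes are oriented, so in practice the points would first be rotated into the box frame; the function and variable names are ours, not the paper's):

```python
import numpy as np

def split_by_bbox(points, bbox_min, bbox_max):
    """Split an (N, 3) lidar scan into the object and obstacle point clouds.

    points   : (N, 3) array of x, y, z coordinates
    bbox_min : (3,) lower corner of the object bounding box
    bbox_max : (3,) upper corner of the object bounding box
    """
    inside = np.all((points >= bbox_min) & (points <= bbox_max), axis=1)
    object_cloud = points[inside]       # points inside the object bounding box
    obstacle_cloud = points[~inside]    # everything else may occlude the object
    return object_cloud, obstacle_cloud
```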

Figure 3: Point cloud definition. The top figure is the 3D-view and the bottom figure is the corresponding birds-eye view. The gray cube represents the obstacle point cloud. The object point cloud is colored in blue. The occlusion area generated by the point cloud is colored in pink and yellow.

3.2 Occlusion Area Modeling

For each point $p_{\mathrm{obs}}$ of the obstacle point cloud and each point $p_{\mathrm{obj}}$ of the raw object point cloud, we use the ray-casting technique to model the occlusion. We define the position of the LIDAR as the origin $O$. For each point $p_{\mathrm{obs}}$ and $p_{\mathrm{obj}}$, we add occluded points along the direction from $O$ to $p_{\mathrm{obs}}$ or $p_{\mathrm{obj}}$ at a fixed step. The occluded points are added until their height is below the ground plane. The ground plane is estimated by using a block recursive Gaussian process regression algorithm [16].

For each point $p_{\mathrm{obs}}$ of the obstacle point cloud:

$\vec{u} = \dfrac{p_{\mathrm{obs}} - O}{d(p_{\mathrm{obs}}, O)}$   (1)

$p_{\mathrm{occ}}^{(m)} = p_{\mathrm{obs}} + m \cdot s \cdot \vec{u}$   (2)

For each point $p_{\mathrm{obj}}$ of the object point cloud:

$\vec{v} = \dfrac{p_{\mathrm{obj}} - O}{d(p_{\mathrm{obj}}, O)}$   (3)

$p_{\mathrm{occ}}^{(n)} = p_{\mathrm{obj}} + n \cdot s \cdot \vec{v}$   (4)

where $m$ and $n$ are positive integers and $s$ is the fixed step size used in our experiment. The function $d(\cdot, O)$ represents the distance from a point to the origin $O$.
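A minimal Python sketch of this occluded-point generation, assuming a locally flat ground plane at a known height rather than the Gaussian-process ground estimate of [16]; the step size, ground height and iteration limit below are illustrative values, not the paper's:

```python
import numpy as np

def add_occluded_points(points, origin=np.zeros(3), step=0.1,
                        ground_z=0.0, max_steps=200):
    """Generate occluded points behind each lidar point by ray casting.

    For every point p, march along the ray from the sensor origin O through p
    at a fixed step (cf. Eqs. 1-4), stopping once the new point drops below
    the ground plane or the iteration limit is reached.
    """
    occluded = []
    for p in points:
        direction = (p - origin) / np.linalg.norm(p - origin)  # unit ray O -> p
        for k in range(1, max_steps + 1):
            q = p + k * step * direction   # k-th point behind p along the ray
            if q[2] < ground_z:            # below the (assumed flat) ground
                break
            occluded.append(q)
    return np.asarray(occluded).reshape(-1, 3)
```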

Figure 4: The comparison of the object point cloud with and without the occluded points. The first row and the third row are the raw object point cloud without occluded points. The second and the fourth row are the new object point cloud with occluded points. We can clearly see that the object point cloud with occluded points is more complete compared with the original one.

To distinguish the added occluded point cloud from the original point cloud, we add a new dimension named ‘occluded’ to the original point cloud data, expanding each point from the three-dimensional $(x, y, z)$ to the four-dimensional $(x, y, z, occluded)$. We set the occlusion property of the original points to 0 and that of the newly added occluded points to 1. We apply this procedure to both the object point cloud and the obstacle point cloud.
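A short sketch of this expansion, under the same assumptions as the sketches above:

```python
import numpy as np

def attach_occlusion_flag(original_points, occluded_points):
    """Expand (x, y, z) points to (x, y, z, occluded).

    Original lidar returns get flag 0; ray-cast occluded points get flag 1.
    """
    orig = np.hstack([original_points, np.zeros((len(original_points), 1))])
    occ = np.hstack([occluded_points, np.ones((len(occluded_points), 1))])
    return np.vstack([orig, occ])  # (N + M, 4) occlusion-aware point cloud
```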

We show the comparison of the object point cloud with and without the occluded points in Fig.4. The first row and the third row are the raw object point cloud without occluded points. The second and the fourth row are the new object point cloud with occluded points. It is obvious that the object point cloud with occluded points is more complete than the raw object point cloud.

3.3 Deep Learning Based Point Cloud Classification Approach

We choose PointNet as the classification approach. PointNet [2], proposed by Qi et al., is a method that directly processes the original point cloud. PointNet mainly consists of several transformation layers and several Multi-Layer Perceptron (MLP) blocks. The first layer of PointNet takes $n$ points as input and learns a $k \times k$ transformation matrix through the T-Net, where $k$ represents the feature dimension.

The transformed data then goes through several Multi-Layer Perceptron (MLP) blocks shared across points, an intermediate max pooling layer, a spatial transformation layer and two fully connected layers. The initial value of the spatial transformation matrix is set to an identity matrix. Except for the last layer, ReLU and Batch Normalization are applied to all other layers.

The MLP of PointNet is implemented as convolutions with shared weights. The convolution kernel of the first layer is $1 \times 3$, and the subsequent convolution kernels are of size $1 \times 1$.
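A simplified PyTorch sketch of this shared MLP, written with 2D convolutions so the kernel sizes match the description above; this is an illustration of the layer shapes rather than the authors' implementation, and the T-Nets and the exact number of MLP layers are omitted:

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """Per-point feature extraction: a (1, 3) kernel in the first layer,
    (1, 1) kernels afterwards, followed by max pooling over points."""

    def __init__(self, in_dim=3):
        super().__init__()
        # First layer: a (1, in_dim) kernel mixes the coordinates of each point.
        self.conv1 = nn.Conv2d(1, 64, kernel_size=(1, in_dim))
        # Later layers: (1, 1) kernels, i.e. per-point fully connected layers.
        self.conv2 = nn.Conv2d(64, 128, kernel_size=(1, 1))
        self.conv3 = nn.Conv2d(128, 1024, kernel_size=(1, 1))
        self.bn1 = nn.BatchNorm2d(64)
        self.bn2 = nn.BatchNorm2d(128)
        self.bn3 = nn.BatchNorm2d(1024)

    def forward(self, x):
        # x: (batch, num_points, in_dim) -> (batch, 1, num_points, in_dim)
        x = x.unsqueeze(1)
        x = torch.relu(self.bn1(self.conv1(x)))    # (batch, 64, num_points, 1)
        x = torch.relu(self.bn2(self.conv2(x)))    # (batch, 128, num_points, 1)
        x = torch.relu(self.bn3(self.conv3(x)))    # (batch, 1024, num_points, 1)
        # Symmetric max pooling over points yields the global feature vector.
        return torch.max(x, dim=2)[0].squeeze(-1)  # (batch, 1024)
```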

3.4 Deep Learning Based Point Cloud Classification Approach With Occlusion Modeling

Based on the original PointNet network, we make some modifications to utilize the occlusion property proposed in this paper. We expand the input data from three dimensions to four, i.e. from $(x, y, z)$ to $(x, y, z, occluded)$, so that PointNet can process the new format of point cloud data. We also modify the transformation matrix obtained by the T-Net so that the feature dimension of the new input transformation matrix becomes 4.

In the subsequent modules, we also make appropriate modifications to the network. The size of the convolution kernel of the first MLP layer is modified to $1 \times 4$ according to the input data dimension, and the output dimension of the last layer is set to the number of classes.
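Under the same simplified PyTorch sketch as above, the occlusion-aware variant only needs the following changes; layer sizes other than the input width (e.g. the T-Net hidden size of 256) are assumptions:

```python
import torch
import torch.nn as nn

in_dim = 4  # (x, y, z, occluded) instead of (x, y, z)

# First shared-MLP layer: a (1, 4) kernel instead of (1, 3).
conv1 = nn.Conv2d(1, 64, kernel_size=(1, in_dim))

# The input T-Net now regresses a 4x4 transformation matrix,
# initialized to the identity as in the original network.
tnet_head = nn.Linear(256, in_dim * in_dim)
nn.init.zeros_(tnet_head.weight)
with torch.no_grad():
    tnet_head.bias.copy_(torch.eye(in_dim).flatten())

# Classifier head: one output per class (7 or 5 categories in the experiments).
num_classes = 7
fc_out = nn.Linear(256, num_classes)
```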

Figure 5: Here we show the main structure of PointNet's classification network and the difference between the original PointNet and ours. It can be seen that we do not need to make many changes to the structure of the network itself.

In Fig.5, we show the comparison of PointNet and our modified PointNet. The top figure is the original PointNet. The bottom figure is our modified PointNet. Changed parts are shown in the bottom bounding box. We can see that we do not need to make many changes to the structure of the network itself. Our approach can be applied to any network that directly processes the raw lidar point cloud data.

4 Experimental Results

We perform our experiments on the KITTI dataset and divide them into two parts. We first conduct experiments on the seven categories (‘car’, ‘van’, ‘truck’, ‘pedestrian’, ‘cyclist’, ‘tram’ and ‘misc’) of the KITTI dataset. As ‘car’, ‘van’ and ‘truck’ share many similarities and in fact all belong to the ‘vehicle’ category, we then merge them into a single category and perform experiments on the resulting five categories.

4.1 Classification Results on the 7 Categories

We separately train the PointNet network on the original point cloud and on the point cloud with occluded points. The classification results are shown in Table 1 and Fig.6. Experimental results show that both the overall accuracy and the per-class accuracy of our approach improve significantly compared with the original PointNet.
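For reference, both metrics can be computed from a confusion matrix: overall accuracy is the fraction of all samples classified correctly, while average class accuracy is the mean of the per-class recalls, which weights rare classes such as ‘tram’ equally with ‘car’. A generic sketch (not tied to the evaluation code used here):

```python
import numpy as np

def accuracies(confusion):
    """confusion[i, j] = number of class-i samples predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    overall = np.trace(confusion) / confusion.sum()
    avg_class = np.mean(np.diag(confusion) / confusion.sum(axis=1))
    return overall, avg_class
```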

       dataset    accuracy (avg. class)    accuracy (overall)
Ours   KITTI      0.784                    0.920
Table 1: Classification results on the KITTI 7 categories dataset.
Figure 6: Classification results on the KITTI 7 categories dataset.

In Fig.7, we show the confusion matrices of the original PointNet and of our approach. In Fig.8, we show the comparison between the point cloud with and without the added points. For many samples occluded by obstacles, the incomplete point cloud often results in wrong classifications, such as sample C in Fig.8. Due to the incompleteness of its point cloud, sample C is classified into the ‘misc’ category by the original PointNet. In our approach, with the help of the added occluded points, it is correctly classified as a ‘car’.

(a) PointNet
(b) Our Approach
Figure 7: Confusion matrix on the 7 categories using the original PointNet and our approach.
Figure 8: The original point cloud is colored in blue. The added occluded points are colored in red. The original point cloud is mostly occluded and may easily lead to a wrong classification result. With the help of the occluded points, these samples are now correctly classified.

4.2 Classification Results on the 5 Categories

We merge car, van and truck into a single class and perform the experiments on the resulting five categories. We believe that these three categories all belong to the ‘vehicle’ class and are equally important to self-driving cars. The classification results are shown in Table 3 and Fig.10.

              car     van     truck   pedestrian   cyclist   tram    misc
Testing data  0.626   0.108   0.038   0.142        0.036     0.026   0.024
Table 2: The percentage of samples in each category.
Figure 9: The object point clouds of a van and of cars with and without occluded points. Sample A is a van. Samples B and C are cars.
       dataset    accuracy (avg. class)    accuracy (overall)
Ours   KITTI      0.808                    0.962
Table 3: Classification results on the KITTI 5 categories dataset. The overall accuracy of our modified PointNet is better than that of the original PointNet.

In Fig.10, it is easily seen that the classification accuracy of every category is improved with our approach. Some qualitative examples are shown in Fig.9. The confusion matrices are shown in Fig.11.

Figure 10: Classification results on the KITTI 5 categories dataset.
(a) PointNet
(b) Our Approach
Figure 11: Confusion matrix on the KITTI 5 categories using the original PointNet and our approach.

5 Concluding Remarks

In this paper, we investigate the lidar classification problem in occluded scenarios. We model occlusion as an intrinsic property of the lidar point cloud and add a pre-processing step to the lidar point cloud processing pipeline. It is important to emphasize that our approach is not limited to enhancing PointNet's classification performance. We believe that our occlusion modeling is an important pre-processing step that can enhance any classification approach.

References

  • [1] Scott D Roth. Ray casting for modeling solids. Computer graphics and image processing, 18(2):109–144, 1982.
  • [2] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [3] Anna Petrovskaya and Sebastian Thrun. Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2-3):123–139, 2009.
  • [4] Anna Petrovskaya and Sebastian Thrun. Model based vehicle tracking in urban environments. In IEEE International Conference on Robotics and Automation, Workshop on Safe Navigation, volume 1, pages 1–8, 2009.
  • [5] Anna Petrovskaya and Sebastian Thrun. Efficient techniques for dynamic vehicle detection. In Experimental Robotics, pages 79–91. Springer, 2009.
  • [6] Michael Himmelsbach, Thorsten Luettel, and H-J Wuensche. Real-time object classification in 3d point clouds using point feature histograms. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 994–1000. IEEE, 2009.
  • [7] Chieh-Chih Wang, Charles Thorpe, and Sebastian Thrun. Online simultaneous localization and mapping with detection and tracking of moving objects: Theory and results from a ground vehicle in crowded urban areas. In 2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), volume 1, pages 842–849. IEEE, 2003.
  • [8] Nicolai Wojke and Marcel Häselich. Moving vehicle detection and tracking in unstructured environments. In 2012 IEEE International Conference on Robotics and Automation, pages 3082–3087. IEEE, 2012.
  • [9] Jian Cheng, Zhiyu Xiang, Teng Cao, and Jilin Liu. Robust vehicle detection using 3d lidar under complex urban environment. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 691–696. IEEE, 2014.
  • [10] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
  • [11] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
  • [12] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [13] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware r-cnn: Detecting pedestrians in a crowd. In The European Conference on Computer Vision (ECCV), September 2018.
  • [14] Pierre Baque, Francois Fleuret, and Pascal Fua. Deep occlusion reasoning for multi-camera multi-target detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [15] Edward Hsiao and Martial Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1803–1815, 2014.
  • [16] 3D LIDAR-based Dynamic Vehicle Detection and Tracking. PhD thesis, 2016.