Fast Object Classification and Meaningful Data Representation of Segmented Lidar Instances

06/17/2020 ∙ by Lukas Hahn, et al. ∙ Aptiv ∙ Bergische Universität Wuppertal

Object detection algorithms for Lidar data have seen numerous publications in recent years, reporting good results on dataset benchmarks oriented towards automotive requirements. Nevertheless, many of these are not deployable to embedded vehicle systems, as they require immense computational power to be executed close to real time. In this work, we propose a way to facilitate real-time Lidar object classification on CPU. We show how our approach uses segmented object instances to extract important features, enabling a computationally efficient batch-wise classification. For this, we introduce a data representation which translates three-dimensional information into small image patches, using decomposed normal vector images. We couple this with dedicated object statistics to handle edge cases. We apply our method on the tasks of object detection and semantic segmentation, as well as the relatively new challenge of panoptic segmentation. Through evaluation, we show that our algorithm is capable of producing good results on public data, while running in real time on CPU without using specific optimisation.


I Introduction

Lidar sensors are used in a large number of fields, providing a three-dimensional representation of the given environment. Object detection on Lidar data is widely considered to be a crucial aspect for perception in automotive active safety and autonomous driving. Hence, there is a large number of works on object detection [26, 35, 33, 18] and odometry [22, 5, 19] in the Lidar space, with many approaches reporting incremental improvements on benchmark datasets [13, 3] targeted towards automotive applications.

While many methods show inventive data processing combined with deep learning techniques to achieve very good performance on those tasks, the vast majority are heavily optimised to these benchmarks and do not reach real-time performance, even when executed on powerful GPUs. This makes their application in vehicles infeasible today and in the foreseeable future, since dedicating so much computational power towards one algorithm is impractical in terms of cost, power requirements and heat dissipation.


Fig. 1: Exemplary street scene from the SemanticKITTI dataset [3]. Ground truth annotations (top), all segmented object instances after clustering with the algorithm of [14] (middle) and after classification with our approach (bottom). The proposed method is capable of effectively suppressing "None" object clusters and correctly classifying the relevant objects. Please note that the clustering output is agnostic to object classes and the colours here are only used to differentiate instances.

We see promising results in the recent development of methods for real-time object instance segmentation of Lidar point clouds [14, 6]. Thus, in this work we propose a meaningful representation for object instances in Lidar sensor data to facilitate a real-time capable, CPU-based classification. We accomplish this task with a small custom neural network architecture, applying methods to maintain three-dimensional image information in a two-dimensional representation together with selective object statistics.

II Related Work

There is a multitude of publications concerned with object detection and classification on Lidar data. In an attempt to provide a short overview, we roughly divide them into three categories: algorithms that work on unregularised point clouds, those that use regularised ones, and algorithms that use fusion approaches combining Lidar with other sensors, mostly camera data.
A lot of methods process unregularised point clouds and usually generate per-point features, with PointNet [25], offering more global feature generation, and PointNet++ [26], cascading instances of the aforementioned approach for more localised features, being the backbone for a larger number of publications including PointRCNN [27]. While facilitating good performance, these approaches are computationally intensive, which rules them out for embedded application.
A second group of algorithms regularises Lidar point clouds before processing them. To do so, grid structures are used in fixed [35, 28, 7] or variable sizes [1]. To alleviate the negative performance influence of the many empty grid cells created by these methods, works like [33] established specific network layers to exploit this sparsity, providing a significant speed-up. The authors of [18] reduce grids to a two-dimensional representation, enabling the omission of costly 3D convolutions for faster runtime.
The third and last set of works adds information from camera sensors to enrich Lidar data. Some of them use camera images to create region proposals, which are refined by a Lidar algorithm in a subsequent step. Here, a popular approach is the use of a frustum for projection into the point cloud, as shown in [24, 34] and [32] for example. Comparable methods utilise full three-dimensional object proposals from image inputs [29] or more complex deep learning fusion networks [8].
All of the approaches mentioned above are highly computationally intensive, with many of them not being real-time capable even on high-end GPUs. Little work can be found that concentrates on real-time application of Lidar algorithms on systems with less computational power. The authors of [21] claim such capabilities using specific optimisations for powerful CPUs, but report sub-par results.

III Methods

To facilitate real-time CPU-based Lidar object classification of segmented instances, we consider the following aspects as valuable.

III-A Lidar data representation

From the raw data of a Lidar sensor, several different information aspects can be used to describe a measured point. The obvious first aspects are the X, Y and Z coordinate values of this point, usually given in Cartesian coordinates with the origin of the coordinate system in the location of the Lidar sensor. Derived from this, the distance or range of the point can be calculated, for example as Euclidean distance. Furthermore, the measured intensity of a point is normally given in the raw data, enabling indication of the surface reflectivity of an object.
To gain more knowledge about the surface of an object in the Lidar space, we propose to calculate an image representation of the horizontal and vertical component of the normal vector of each measured point. Such a representation is known from SLAM/odometry algorithms for Lidar data [22, 5, 19]. The normal vector for a measured Lidar point can be determined using the angle relationships shown in Figure 2. α and β are the angles between the line from the sensor origin to the given point and the lines from the given point to its respective neighbours. The angle bisector of the combined angle equals one component of the normal vector. We decided to use the scalar values of the angles instead of the more common cross product calculation between neighbouring points in three-dimensional space. This strategy reduces the calculation of the normal vector image to a simple element-wise matrix subtraction. To compute the horizontal component, the neighbouring points left and right of the point in question are used. For the vertical component, the neighbours above and below are considered respectively. Using just one neighbouring point here would allow for a potentially larger error. An exemplary representation of both components can be seen in Figure 3, with an additional combined view for illustration purposes.
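A minimal NumPy sketch of this angle-based construction, under assumptions the text leaves open: a dense range image, known per-column beam angles, a planar approximation for each component, and half the difference of the two neighbour angles as the signed bisector value (the paper's exact convention may differ):

```python
import numpy as np

def normal_component_image(ranges, beam_angles):
    """Angle-based normal vector component for every pixel of a range image.

    ranges      : (H, W) range image in metres.
    beam_angles : (W,) per-column beam angles in radians (azimuth for the
                  horizontal component; pass ranges.T and the per-row
                  elevation angles to obtain the vertical component).
    """
    # Planar point positions along the chosen image axis.
    x = ranges * np.cos(beam_angles)
    y = ranges * np.sin(beam_angles)
    pts = np.stack([x, y], axis=-1)                      # (H, W, 2)

    to_origin = -pts                                     # point -> sensor origin
    to_prev = np.roll(pts, 1, axis=1) - pts              # point -> previous neighbour
    to_next = np.roll(pts, -1, axis=1) - pts             # point -> next neighbour

    def angle_between(a, b):
        dot = np.sum(a * b, axis=-1)
        norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-9
        return np.arccos(np.clip(dot / norms, -1.0, 1.0))

    alpha = angle_between(to_origin, to_prev)            # angle towards one neighbour
    beta = angle_between(to_origin, to_next)             # angle towards the other

    # Signed bisector value: the whole image is obtained by one element-wise
    # subtraction of the two scalar angle images.
    component = 0.5 * (alpha - beta)
    component[:, 0] = 0.0                                # border columns treated as invalid
    component[:, -1] = 0.0
    return component
```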
To identify redundant information in these seven data representations and to reduce the number of input layers which later need to be processed, we conducted an ablation experiment testing all possible input combinations (see Appendix). It showed that we can select only three layers, namely the intensity values and our horizontal and vertical normal vector component representations, and maintain performance with only a minimal decrease in accuracy compared to using all seven.

Fig. 2: Relationship of angles between the lines connecting adjacent Lidar measurement points and the line from the respective point to the sensor origin for calculating a normal vector component. With the combined angle α + β, the angle bisector can be determined as (α + β)/2.
Fig. 3: Visualisation of the horizontal (top) and vertical (middle) normal vector component image, as well as a combined view of both components (bottom) for illustration purposes.
Fig. 4: Architecture of the proposed CNN for fast Lidar object instance classification.

III-B Object instance and mask

We build our classification approach on segmented object instances in the Lidar space. This facilitates a number of different possible applications. While integrating the proposed contributions of this work into a larger end-to-end neural network would be feasible, we considered the use of clustering algorithms to generate object instances. There are methods available which have been optimised to provide good real-time Lidar instance segmentation performance on automotive data [14, 6]. General-purpose clustering algorithms like DBSCAN [11] could also be used, but are much more computationally intensive; a usage sketch is given below.
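Purely as an illustration of that general-purpose alternative, a DBSCAN-based proposal step with scikit-learn could look as follows; the eps and min_points values are placeholders, not parameters from the paper:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(points_xyz, eps=0.5, min_points=10):
    """Class-agnostic instance proposals from a non-ground point cloud.

    points_xyz : (N, 3) array of points above the separated ground plane.
    Returns an (N,) array of instance ids; -1 marks unclustered noise points.
    """
    return DBSCAN(eps=eps, min_samples=min_points).fit_predict(np.asarray(points_xyz))
```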
The use of the clustered objects for our classification is twofold. Primarily, we can apply the segmented instance as a mask on the different Lidar data representations described in Section III-A, emphasising the points belonging to the object and eliminating background influence. Furthermore, we can extract statistical information about each segmented object without much difficulty. While there is no way to be completely certain that an object is recognised in its full extent, such values are clearer and more deterministic than many machine learning representations and can be especially helpful in edge cases.
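A sketch of the masking step, assuming the channel images and the boolean instance mask share the same range-image grid; how the resulting patch is resized or padded to the fixed network input is not specified in the text:

```python
import numpy as np

def masked_instance_patch(channels, mask):
    """Crop one segmented instance out of the stacked Lidar channel images.

    channels : (H, W, C) stack, e.g. intensity and the horizontal and
               vertical normal vector component images.
    mask     : (H, W) boolean instance mask from the clustering stage.
    Returns an (h, w, C) patch with all background pixels set to zero.
    """
    masked = channels * mask[..., None]                 # suppress background influence
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    patch = masked[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    return patch                                        # later resized/padded to the CNN input size
```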


A clustering approach will segment all kinds of object instances above a separated ground plane, independent of whether they belong to a supported class relevant for automotive application. Hence, our classification will be presented with a majority of "None" objects along the roadside. For example, walls of shops and houses can be confused with trucks, street lights or other poles can resemble pedestrians, and in some cases cars might even be confused with larger bushes.
We use the width, length and height of a segmented object, the number of Lidar points belonging to it, as well as the Euclidean, X-axis and Y-axis distance from the sensor origin. In this way we generate a vector of size seven. The authors of [30] use different but comparable geometric features in their approach to distinguish between main object categories in their closed-source dataset.
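A possible computation of this seven-value vector; the axis-aligned extents and the use of the cluster centroid for the distances are assumptions, since the text does not specify how extent and distance are measured:

```python
import numpy as np

def statistic_vector(points_xyz):
    """Seven-value statistics vector for one segmented object instance.

    points_xyz : (N, 3) cluster points in sensor coordinates.
    """
    extents = points_xyz.max(axis=0) - points_xyz.min(axis=0)   # axis-aligned extent
    centroid = points_xyz.mean(axis=0)
    return np.array([
        extents[0],                   # length (extent along X)
        extents[1],                   # width  (extent along Y)
        extents[2],                   # height (extent along Z)
        points_xyz.shape[0],          # number of Lidar points in the cluster
        np.linalg.norm(centroid),     # Euclidean distance to the sensor origin
        abs(centroid[0]),             # X-axis distance
        abs(centroid[1]),             # Y-axis distance
    ], dtype=np.float32)
```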
While the influence on the overall classification performance scores is not particularly substantial, the use of our statistic vector proved to offer valuable decision support for critical edge cases in our experiments. Examples of this advantage in correcting both false positives and false negatives can be seen in Figure 5.

III-C Object instance classification architecture

Following our objective to facilitate a fast classification of Lidar instances which is capable of running on CPU in real time, we propose a small Convolutional Neural Network (CNN) architecture. As depicted in Figure 4, it consists of two branches. The first one takes the statistic vector as input and comprises two fully connected layers. The second, larger branch processes the Lidar image representations and is characterised by two residual separable modules in between common convolution layers. Such a module features two depth-wise separable convolutions and maximum pooling in parallel to a convolutional residual connection. Comparable structures have been popularised in deep learning architectures [15, 9, 31] and have shown benefits in application to small real-time networks [12, 16, 2].
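A hedged Keras sketch of this two-branch structure; filter counts, kernel sizes, patch size and the stride of the residual path are placeholder choices, as the text only fixes the overall layout:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_separable_module(x, filters):
    """Two depth-wise separable convolutions and max pooling,
    in parallel to a strided convolutional residual connection."""
    res = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(y)
    return layers.Add()([y, res])

def build_classifier(patch_shape=(32, 32, 3), num_stats=7, num_classes=5):
    # Image branch: common convolutions around two residual separable modules.
    img_in = layers.Input(shape=patch_shape)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(img_in)
    x = residual_separable_module(x, 32)
    x = residual_separable_module(x, 64)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Statistics branch: two fully connected layers on the seven-value vector.
    stat_in = layers.Input(shape=(num_stats,))
    s = layers.Dense(16, activation="relu")(stat_in)
    s = layers.Dense(16, activation="relu")(s)

    out = layers.Dense(num_classes, activation="softmax")(layers.Concatenate()([x, s]))
    return tf.keras.Model(inputs=[img_in, stat_in], outputs=out)
```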
The complete CNN architecture has only a small number of parameters, resulting in a correspondingly small weight checkpoint file. For comparison, state-of-the-art end-to-end Lidar networks have many millions of parameters and accordingly need much more memory to store their weights, which is another cost-driving factor in addition to their much higher computational requirements.

IV Experimental Evaluation

To evaluate the performance of our proposed method, we consider different aspects and metrics. While good detection/classification rates are important, we developed our approach with a focus on real-time capability for CPU-based platforms. Hence, we are not aiming to set new benchmark high scores in direct competition with end-to-end Lidar networks that are orders of magnitude larger. We rather aspire to provide reliable output under strong computational limitations.
To make our evaluation transparent, we use the public SemanticKITTI dataset by Behley et al. [3], an extension of the original KITTI [13] which provides semantic annotation for all sequences of its odometry benchmark. An illustrative example of our method's performance can be seen in Figure 1. The semantically segmented point clouds, as well as the additional instance labels in the dataset, allow for multiple pointwise evaluation approaches. Accordingly, we use three different tasks to assess the performance:

IV-A Semantic Segmentation

For the semantic segmentation evaluation, we employ a general purpose clustering method to agnostically separate object instances in the three-dimensional point cloud. As described in Section III, we classify the segmented instances from these clusters.
Our approach classifies objects in five general automotive classes: “Cars”, “Trucks”, “Pedestrians”, “Bikes” and the “None” class, which embodies all static background classes such as road surface, buildings and vegetation. To achieve this mapping, we combined the SemanticKITTI classes “Bicycle”, “Bicyclist”, “Motorcycle” and “Motorcyclist” to “Bike”, as well as “Truck”, “Bus”, “On-Rails” and “Other-Vehicle” to “Trucks”. The classification network has been trained with the annotated point clouds of the available training logs. As suggested in the SemanticKITTI API documentation, we kept the 8th log separate for validation. In this manner we are able to test semantic segmentation results with the reduced class mapping on unseen data. Table I shows the results for the class-wise semantic segmentation intersection over union (IoU) metric of the combined approach of clustering the point cloud and classifying each cluster separately. The IoU or Jaccard index is defined as

IoU = TP / (TP + FP + FN)

and in practice can be described as the relation of true positives (TP) to the sum of TP, false positives (FP) and false negatives (FN). This metric provides a good impression of pointwise segmentation quality, since the correct predictions and both types of incorrect predictions for each point are included in the equation.
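A direct transcription of this definition for one class (not the official evaluation script used for Table I):

```python
import numpy as np

def class_iou(pred, gt, class_id):
    """Pointwise intersection over union for one class.

    pred, gt : integer label arrays of equal length, one label per Lidar point.
    """
    p, g = pred == class_id, gt == class_id
    tp = np.count_nonzero(p & g)
    fp = np.count_nonzero(p & ~g)
    fn = np.count_nonzero(~p & g)
    return tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")
```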
The score of this metric is computed for two approaches. The first is the classification of clustered instances as an end-to-end pipeline on point cloud data, showing the performance of the proposed method on unseen samples. The second applies the classification directly to the annotated ground truth instances of the dataset. This allows for a comparison of how the performance is influenced by the quality of the provided object instances. As the results show, the performance of the semantic segmentation directly depends on the quality of the given object clusters.

Method None Car Truck Bike Pedestrian
Clustered Instances 0.954 0.750 0.472 0.265 0.282
GT Instances 0.994 0.926 0.732 0.525 0.558
TABLE I: Semantic Segmentation Results as Intersection over Union (IoU) using clustered instances or ground truth instances as input for our classification approach.

IV-B Object Detection

The second evaluation method is meant to assess object detection performance. For this, the previously mentioned IoU metric is used to define bins of detection precision. Ten bins are defined by the pointwise overlap of the ground truth objects and proposed clusters, ranging from an IoU of 0.5 to 0.95 in steps of 0.05. The average over all ten bins gives the single metric score, the Average Precision (AP), which is shown in Table II. Additionally, the AP for the overlap thresholds of 0.5, 0.75 and 0.95 is listed, for which the evaluation is restricted to objects above the denoted IoU.

Method AP AP@0.5 AP@0.75 AP@0.95
Clustered Instances 0.407 0.441 0.419 0.314
GT Instances 0.554 - - -
TABLE II: Object Detection Results as Average Precision (AP) on clustered instances and provided ground truth instances.

We adopt this metric definition from established benchmarks [20, 10] for two- and three-dimensional bounding box object detection.
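A simplified computation of this score from already matched clusters; the matching procedure itself and any per-class averaging are assumptions not detailed in the text:

```python
import numpy as np

def average_precision(matched_ious, num_unmatched_predictions):
    """AP averaged over the ten IoU thresholds 0.5, 0.55, ..., 0.95.

    matched_ious : pointwise IoU of each predicted cluster with its matched
                   ground truth object (one value per matched prediction).
    num_unmatched_predictions : number of predictions without any match.
    """
    matched_ious = np.asarray(matched_ious)
    precisions = []
    for t in np.arange(0.5, 0.96, 0.05):
        tp = np.count_nonzero(matched_ious >= t)
        fp = np.count_nonzero(matched_ious < t) + num_unmatched_predictions
        precisions.append(tp / (tp + fp) if (tp + fp) else 0.0)
    return float(np.mean(precisions)), precisions
```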

IV-C Panoptic Segmentation

Given the nature of our method, class-less separation of object clusters and background followed by object classification, we can use it to perform the task of panoptic segmentation. This term was coined by Kirillov et al. in their work of the same name [17]. According to the authors, this task “unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance)”.
In panoptic segmentation, the differentiation between Stuff, i.e. background, and Things, in our case active road users, is as important as the separation of Things among themselves. After separation into background and clusters and classification of the latter, we use the predicted class labels to remove the instance labels from clusters which are not part of the Things, in our case all None labels such as utility boxes, road signs and vegetation.
We validate our approach on the panoptic segmentation benchmark of the SemanticKITTI dataset [4]. This challenge uses the panoptic quality (PQ), originally proposed by [17] and averaged over all classes as used by Porzi et al. [23], on the whole test set. For a single class it is defined as

PQ = Σ_{(p,g) ∈ TP} IoU(p,g) / ( |TP| + ½ |FP| + ½ |FN| )

and is composed from the segmentation quality (SQ) and the recognition quality (RQ) as follows:

PQ = SQ · RQ, with SQ = Σ_{(p,g) ∈ TP} IoU(p,g) / |TP| and RQ = |TP| / ( |TP| + ½ |FP| + ½ |FN| )

At the time of writing, our method is the only submitted approach for the panoptic segmentation benchmark. We expect that by the time of publication we will have dropped down this leaderboard, as our approach is, as previously shown, limited to only 4 of the 19 classes and will not yield competitive results summed over all classes. The results in Table III are therefore limited to the four object classes described in Section IV-A. We would like to note that our training uses a reduced class mapping (see Section IV-A), whereas the reported official results are evaluated on the full class set. Therefore our performance on "Truck" and "Bike" suffers.
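The per-class quantities can be transcribed directly from the definition above (again not the official benchmark script):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ and RQ for a single class.

    matched_ious   : IoU of every matched (prediction, ground truth) segment pair.
    num_fp, num_fn : numbers of unmatched predicted and ground truth segments.
    """
    tp = len(matched_ious)
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn) if (tp + num_fp + num_fn) else 0.0
    return sq * rq, sq, rq
```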

Class PQ SQ RQ IoU
Car 0.754 0.866 0.87 0.792
Truck 0.0534 0.888 0.0602 0.0371
Bike 0.0822 0.723 0.114 0.0462
Pedestrian 0.377 0.905 0.417 0.161
TABLE III: Panoptic Segmentation Results as panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ) and IoU using classification of clustered instances.

IV-D Timings

We implemented the proposed network architecture in Python with TensorFlow, devoid of any further customisation or optimisation. For the batch of object proposal instances clustered from one point cloud, which is representative of a residential to inner-city scene, inference time on an Intel i7-6820HQ laptop CPU @ 2.70 GHz stays well within a real-time budget. If execution is limited to only two threads, resembling a small embedded processor, this value rises accordingly.
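The two-thread setting can be reproduced in TensorFlow as below; the model path, batch size and input shapes are placeholders for illustration:

```python
import time
import numpy as np
import tensorflow as tf

# Restrict TensorFlow to two CPU threads; must be set before any other TF operation.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

model = tf.keras.models.load_model("lidar_instance_classifier.h5")  # hypothetical checkpoint name
patches = np.random.rand(128, 32, 32, 3).astype("float32")          # placeholder instance patches
stats = np.random.rand(128, 7).astype("float32")                    # placeholder statistic vectors

start = time.perf_counter()
model.predict([patches, stats], batch_size=128, verbose=0)
print(f"batch inference: {(time.perf_counter() - start) * 1000:.1f} ms")
```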

Timing measurements are often not published alongside high-performing publications; in exceptional cases, they are reported on powerful GPUs. To get a rough estimate of the runtime of state-of-the-art networks, we timed the inference of one point cloud from SemanticKITTI with a network closely comparable to PointPillars [18]. On the same CPU, this was orders of magnitude slower than our approach.

Fig. 5: Influence of the additional statistic vector described in Section III-B. The top left and middle left figures show examples of false negatives and the bottom left figure an example of false positives. The additional statistic vector prevents some of these errors (right side).

V Conclusion

We have presented an algorithmic approach to facilitate real-time, CPU-based classification of segmented object instances in Lidar data. Our approach uses a component-wise decomposed normal vector image, instance masks and selected statistics to create a meaningful data representation. The proposed CNN architecture processes large batches of these instances efficiently, at a frequency high enough to enable real-time application while running on CPU, even in Python without specific optimisation. Through evaluation on public data we have shown that our method achieves good performance on automotive Lidar semantic segmentation and object detection tasks, while being orders of magnitude faster to compute than current state-of-the-art approaches.
By combining instance segmentation with classification of the separated objects, we can further provide panoptic segmentation of street scenes. There is little previous work on this task and, to the best of our knowledge, ours is the first method reported to accomplish it in real time on CPU. In future work, we aim to improve our detection accuracy by taking past time steps into account for refinement through causal tracking approaches.

Appendix

Channel Configuration
X Y Z I D HNV VNV Test Acc.
0.835
0.875
0.892
0.879
0.895
0.882
0.900
0.890
0.902
0.887
0.861
0.891
0.896
0.881
0.894
0.873
TABLE IV: Results of the input channel ablation experiment. Configuration options include the Cartesian coordinates (X, Y, Z), intensity (I) and depth (D), as well as the horizontal and vertical normal vector component images (HNV, VNV). A binary mask of reflected lidar points is applied at all times.

References

  • [1] M. Alsfasser, J. Siegemund, J. Kurian, and A. Kummert (2020) Exploiting polar grid structure and object shadows for fast object detection in point clouds. In International Conference on Machine Vision, pp. 114330G. Cited by: §II.
  • [2] O. Arriaga, M. Valdenegro-Toro, and P. G. Plöger (2019) Real-time convolutional neural networks for emotion and gender classification. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 221–226. Cited by: §III-C.
  • [3] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In IEEE International Conference on Computer Vision, pp. 9297–9307. Cited by: Fig. 1, §I, §IV.
  • [4] J. Behley, A. Milioto, and C. Stachniss (2020) A benchmark for lidar-based panoptic segmentation based on KITTI. arXiv:2003.02371 [cs.CV]. External Links: Link, 2003.02371 Cited by: §IV-C.
  • [5] J. Behley and C. Stachniss (2018) Efficient surfel-based slam using 3d laser range data in urban environments.. In Robotics: Science and Systems, Cited by: §I, §III-A.
  • [6] I. Bogoslavskyi and C. Stachniss (2016) Fast range image-based segmentation of sparse 3d laser scans for online operation. In IEEE International Conference on Intelligent Robots and Systems, pp. 163–169. External Links: ISSN 21530866, Document Cited by: §I, §III-B.
  • [7] Q. Chen, L. Sun, Z. Wang, K. Jia, and A. Yuille (2019) Object as hotspots: an anchor-free 3d object detection approach via firing of hotspots. arXiv: 1912.12791 [cs.CV]. External Links: 1912.12791 Cited by: §II.
  • [8] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6526–6534. External Links: Document Cited by: §II.
  • [9] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800–1807. External Links: Document Cited by: §III-C.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §IV-B.
  • [11] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining, pp. 226–231. Cited by: §III-B.
  • [12] I. Freeman, L. Roese-Koerner, and A. Kummert (2018) Effnet: an efficient structure for convolutional neural networks. In IEEE International Conference on Image Processing, Vol. , pp. 6–10. External Links: Document, ISSN 2381-8549 Cited by: §III-C.
  • [13] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. External Links: ISSN 10636919, Document Cited by: §I, §IV.
  • [14] F. Hasecke, L. Hahn, and A. Kummert (2020) Fast lidar clustering by density and connectivity. arXiv:2003.00575 [cs.CV]. Cited by: Fig. 1, §I, §III-B.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. External Links: ISSN 10636919, Document Cited by: §III-C.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 [cs.CV]. External Links: 1704.04861 Cited by: §III-C.
  • [17] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413. Cited by: §IV-C.
  • [18] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §I, §II, §IV-D.
  • [19] Q. Li, S. Chen, C. Wang, X. Li, C. Wen, M. Cheng, and J. Li (2019) LO-net: deep real-time lidar odometry. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 8465–8474. External Links: Document, ISSN 1063-6919 Cited by: §I, §III-A.
  • [20] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §IV-B.
  • [21] K. Minemura, H. Liau, A. Monrroy, and S. Kato (2018) LMNet: real-time multiclass object detection on cpu using 3d lidar. In Asia-Pacific Conference on Intelligent Robot Systems, pp. 28–34. Cited by: §II.
  • [22] F. Moosmann and C. Stiller (2011) Velodyne slam. In IEEE Intelligent Vehicles Symposium, pp. 393–398. Cited by: §I, §III-A.
  • [23] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder (2019) Seamless scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8277–8286. Cited by: §IV-C.
  • [24] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 918–927. External Links: Document, ISSN 1063-6919 Cited by: §II.
  • [25] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 77–85. External Links: Document Cited by: §II.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5100–5109. External Links: ISSN 10495258 Cited by: §I, §II.
  • [27] S. Shi, X. Wang, and H. Li (2019) PointRCNN: 3d object proposal generation and detection from point cloud. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 770–779. External Links: Document, ISSN 1063-6919 Cited by: §II.
  • [28] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2019) PV-rcnn: point-voxel feature set abstraction for 3d object detection. arXiv: 1912.13192 [cs.CV]. External Links: 1912.13192 Cited by: §II.
  • [29] K. Shin, Y. P. Kwon, and M. Tomizuka (2019) Roarnet: a robust 3d object detection based on region approximation refinement. In IEEE Intelligent Vehicles Symposium, pp. 2510–2515. Cited by: §II.
  • [30] W. Song, S. Zou, Y. Tian, S. Fong, and K. Cho (2018) Classifying 3d objects in lidar point clouds with a back-propagation neural network. Human-centric Computing and Information Sciences 8 (1), pp. 1–12. Cited by: §III-B.
  • [31] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1–9. External Links: Document, ISSN 1063-6919 Cited by: §III-C.
  • [32] Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IEEE International Conference on Intelligent Robots and Systems, pp. 1742–1749. Cited by: §II.
  • [33] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §I, §II.
  • [34] X. Zhao, Z. Liu, R. Hu, and K. Huang (2019) 3D object detection using scale invariant and feature reweighting networks. In AAAI Conference on Artificial Intelligence, pp. 9267–9274. Cited by: §II.
  • [35] Y. Zhou and O. Tuzel (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. External Links: ISSN 10636919, Document Cited by: §I, §II.